File docs/query_generation/pipeline.md artifact 51ff513b69 on branch docs
S: circle "start" fit arrow MDB: box "generate-db" fit arrow P: box "synthetic-queries" fit arrow F1: box "filter-synthetic" fit arrow LLM: box "extensions-and-llm" fit arrow circle "dataset" "ready" fit H: box "make-histograms" fit at (P + (0,-2)) arrow dashed from S.s to H.nw "for non" aligned "precomputed DB" aligned F: file "histogram" at ( P + (1.5,-1.2)) fit arrow from H.n to F.s "generates" aligned "" arrow <- from P.s to F.n "uses" aligned "" spline from S.n up 0.5 then right until even with P.nw then to P.nw dashed -> text "If database is not supported" at (MDB + (0,0.74)) "" "for generation" DB: cylinder "DuckDB Database" fit at (P + (0,-1)) arrow from P.s to DB.n "run" aligned "queries" aligned arrow from MDB.s to DB.nw "generates" aligned "DuckDB database" aligned arrow from H.n to DB.s "queries" aligned "" spline <- from DB.ne right 0.5 then to LLM.sw dashed "Syntax Check" aligned PARQUET: file "cardinalities" fit at (P + (0.5,1.3)) arrow from P.n to PARQUET.sw "generates" aligned "parquet" aligned arrow from PARQUET.se to F1.n "uses" aligned "cardinalities" aligned→ /pikchrshow
Example Full run
We follow a small example for explaining this steps. We invite the reader
to run pixi run main --help to get documentation of the existing endpoints
and to run pixi run main {endpoint} --help to get documentation of each
endpoint.
- Generate TPC-DS
- Make histograms of the generated database in step 1
- Generate queries with the database and the histograms of the previous steps.
- Filter the generated synthetic queries.
- Augment them using llm and unions.
Summary
Just in case you just want the commands to run the examples:
pixi run main generate-db -c params_config/generate_db/tpcds_dev.toml
pixi run main make-histograms -c params_config/make_histograms/tpcds_dev.toml
pixi run main synthetic-queries -c params_config/synthetic_queries/tpcds_dev.toml
pixi run main filter-synthetic -c params_config/filter_synthetic/filter_tpcds_dev.toml
pixi run main extensions-and-llm -c params_config/extensions_and_llms/tpcds_dev.toml
Generate TPCDS
We choose the generate-db endpoint to generate the data, and we pass
the params_config/generate_db/tpcds_dev.toml configuration to
generate a small TPCDS of scale factor 0.1. The toml contains all the
information of input. If you want to generate a different size of TPC-DS
you can just change the toml scale factor and run the same command.
pixi run main generate-db -c params_config/generate_db/tpcds_dev.toml
The toml used was:
dataset = "TPCDS"
scale_factor = 0.1
db_path = "tmp/database_TPCDS_0.1.duckdb"
Make histograms
Same as before we run
pixi run main make-histograms -c params_config/make_histograms/tpcds_dev.toml
with the toml file:
output_folder = "tmp/histograms/"
database_path = "tmp/database_TPCDS_0.1.duckdb"
histogram_size = 51
common_values_size = 10
include_mcv = true
Make Synthetic Queries
We now run the query generation with
pixi run main synthetic-queries -c params_config/synthetic_queries/tpcds_dev.toml
duckdb_database = "tmp/database_TPCDS_0.1.duckdb"
dataset = "TPCDS"
output_folder = "tmp/synthetic_queries"
max_hops = [1]
extra_predicates = [5]
row_retention_probability = [0.2, 0.9]
unique_joins = true
max_queries_per_fact_table = 1
max_queries_per_signature = 2
keep_edge_probability = [0.2]
equality_lower_bound_probability = [0,0.1]
extra_values_for_in = 3
[operator_weights]
operator_in = 1
operator_range = 3
operator_equal = 3
Filtering the queries
To filter the queries we use
pixi run main filter-synthetic -c params_config/filter_synthetic/filter_tpcds_dev.toml
This will also split the queries by their query signature.
input_parquet = "tmp/synthetic_queries/output.parquet"
destination_folder = "tmp/filtered_queries"
filter_null = true
cherry_pick = false
Extension and LLM
Finally we can do extensions for extra relational algebra operators. This extension takes as input the filtered queries. To run
pixi run main extensions-and-llm -c params_config/extensions_and_llms/tpcds_dev.toml
This will generate the union and llm extension, the provided toml is:
llm_extension = true
union_extension = true
database_path = "data/duckdb/TPCDS/0.db"
queries_parquet = "tmp/filtered_queries/filtered.parquet"
destination_folder = "tmp/extended_queries"
[union_params]
max_queries = 5
probability = 0.7
[llm_params]
retry = 1
total_queries = 5
llm_model = "deepseek-r1:1.5b"
llm_base_prompt = """
You are writing queries for a markdown text using \
the format:sql for correct formatting in markdown. the schema
is: ...
"""
[llm_params.llm_prompts.self_join]
prompt = "write this query with a self join"
weight = 30
[llm_params.llm_prompts.outer_join]
prompt = "write this query with an outer join"
weight = 30