Index
- Snowflake - Our simplest method
- Running multiple Snowflake - the search-params endpoint
- Cherry pick diverse data.
- Filter null.
- LLM extension
- Union queries
Snowflake
Our simplest method of generating queries is based solely on statistics.
For in-depth documentation, see the snowflake wiki.
To generate the files, you only need to run:
pixi run main snowflake -c params_config/snowflake/tpcds.toml
The files will be saved in
./data/generated_queries/SNOWFLAKE/TPCDS
To use any other dataset, you must first compute its statistics with the
make-histograms endpoint.
S: circle "Start" fit
arrow
P: box "snowflake" fit
arrow
circle "dataset" "ready" fit
H: box "make-histograms" fit at (P + (0,-2))
arrow from S.s to H.nw "optional" aligned ""
F: file "histogram" at ( P + (0,-1)) fit
arrow from H.n to F.s "generates" aligned ""
arrow from P.s to F.n "uses" aligned ""
Parameter searching - Running and validating multiple Snowflake
Parameter searching runs snowflake repeatedly and executes the generated
queries on a database to obtain their cardinalities. It writes a file
containing these cardinalities. Subsequent endpoints can use this file to
sample a different subset of the queries generated by param-search.
Param search works by running the snowflake endpoint with multiple
parameter sets. Running search-params is therefore equivalent to running
snowflake several times with different parameters, executing the resulting
queries against a database, and saving the results.
S: circle "Start" fit
arrow
P: box "param-search" fit
arrow
circle "dataset" "ready" fit
H: box "make-histograms" fit at (P + (0,-2))
arrow from S.s to H.nw "optional" aligned ""
F: file "histogram" at ( P + (0,-1)) fit
arrow from H.n to F.s "generates" aligned ""
arrow from P.s to F.n "uses" aligned ""
DB: cylinder "DuckDB Database" fit at (P + (0,1))
arrow from P.n to DB.s "run" aligned "queries" aligned
PARQUET: file "cardinalities" fit at (P + (1.5,1))
arrow from P.ne to PARQUET.sw "generates" aligned "parquet" aligned
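The loop described above can be sketched as follows. This is only an illustration under stated assumptions, not the project's implementation: `generate_queries` is a hypothetical stand-in for one snowflake run, the `sales` table is made up, and Python's built-in `sqlite3` replaces DuckDB so that the sketch runs without extra dependencies. The real endpoint writes the cardinalities to a parquet file; here they are kept in a list.

```python
import itertools
import sqlite3


def param_grid(options):
    """Yield one dict per combination of parameter values."""
    keys = sorted(options)
    for values in itertools.product(*(options[k] for k in keys)):
        yield dict(zip(keys, values))


def generate_queries(params):
    """Hypothetical stand-in for a single snowflake run."""
    return [
        "SELECT * FROM sales "
        f"WHERE amount >= {params['min_amount']} AND region = '{params['region']}'"
    ]


def search_params(conn, options):
    """Run every generated query and record its cardinality."""
    rows = []
    for params in param_grid(options):
        for sql in generate_queries(params):
            # Count the rows of the answer set instead of fetching them.
            count = conn.execute(f"SELECT COUNT(*) FROM ({sql})").fetchone()[0]
            rows.append({"params": params, "query": sql, "cardinality": count})
    return rows


# Tiny made-up dataset (assumption, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amount INTEGER, region TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(10, "EU"), (50, "EU"), (70, "US"), (90, "US")],
)

results = search_params(conn, {"min_amount": [0, 60, 100], "region": ["EU", "US"]})
for row in results:
    print(row["params"], row["cardinality"])
```

Each entry in `results` pairs a query with its parameters and cardinality, which is the kind of information the later filters need.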
Cherry pick
Cherry pick is only a filter applied to the output of the search-params
endpoint. It samples queries according to their cardinality.
S: circle "Start" fit
arrow
P: box "param-search" fit
arrow color red
F1: box "cherry-pick" fit color red
arrow
circle "dataset" "ready" fit
H: box "make-histograms" fit at (P + (0,-2))
arrow from S.s to H.nw "optional" aligned ""
F: file "histogram" at ( P + (0,-1)) fit
arrow from H.n to F.s "generates" aligned ""
arrow from P.s to F.n "uses" aligned ""
DB: cylinder "DuckDB Database" fit at (P + (0,1))
arrow from P.n to DB.s "run" aligned "queries" aligned
PARQUET: file "cardinalities" fit at (P + (1.5,1))
arrow from P.ne to PARQUET.sw "generates" aligned "parquet" aligned
arrow from PARQUET.s to F1.n
line to PARQUET.s from F1.n "input" aligned "for filter" aligned
With cherry-pick, the queries are first run with search-params. The
cardinalities of their answer sets are then used to sample queries from
equi-width bins.
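Sampling from equi-width bins could look roughly like the sketch below. The function name `cherry_pick`, its parameters, and the rule of taking at most `per_bin` queries from each bin are assumptions made for illustration; the actual endpoint may choose bins and sample sizes differently.

```python
import random


def cherry_pick(cardinalities, n_bins, per_bin, seed=0):
    """Return indices of sampled queries, at most per_bin from each equi-width bin."""
    if not cardinalities:
        return []
    rng = random.Random(seed)
    lo, hi = min(cardinalities), max(cardinalities)
    # Every bin spans the same cardinality range; fall back to 1 if all values are equal.
    width = (hi - lo) / n_bins or 1
    bins = [[] for _ in range(n_bins)]
    for idx, card in enumerate(cardinalities):
        # The maximum value would land in bin n_bins, so clamp it into the last bin.
        b = min(int((card - lo) / width), n_bins - 1)
        bins[b].append(idx)
    picked = []
    for members in bins:
        picked.extend(rng.sample(members, min(per_bin, len(members))))
    return sorted(picked)


# Skewed example: most queries return very few rows.
cardinalities = [1, 2, 3, 4, 5, 500, 900, 1000]
picked = cherry_pick(cardinalities, n_bins=4, per_bin=1)
print([cardinalities[i] for i in picked])
```

With this input, the five small cardinalities share one bin, so only one of them is kept, while the rare large cardinalities are retained. This is how binning by cardinality leads to a more diverse sample than uniform sampling over the queries.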
Filter null
Filter null is a filter applied to the output of the search-params
endpoint. It selects queries according to their cardinality:
it deletes the queries that return an empty result set.
S: circle "Start" fit
arrow
P: box "param-search" fit
arrow color red
F1: box "filter-null" fit color red
arrow
circle "dataset" "ready" fit
H: box "make-histograms" fit at (P + (0,-2))
arrow from S.s to H.nw "optional" aligned ""
F: file "histogram" at ( P + (0,-1)) fit
arrow from H.n to F.s "generates" aligned ""
arrow from P.s to F.n "uses" aligned ""
DB: cylinder "DuckDB Database" fit at (P + (0,1))
arrow from P.n to DB.s "run" aligned "queries" aligned
PARQUET: file "cardinalities" fit at (P + (1.5,1))
arrow from P.ne to PARQUET.sw "generates" aligned "parquet" aligned
arrow from PARQUET.s to F1.n
line to PARQUET.s from F1.n "input" aligned "for filter" aligned
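In code, the filter amounts to dropping every record whose cardinality is zero. The record layout below (a dict with `query` and `cardinality` keys) is an assumption for illustration and may not match the columns of the real cardinalities file.

```python
def filter_null(records):
    """Keep only the queries whose answer set is not empty."""
    return [r for r in records if r["cardinality"] > 0]


# Hypothetical records, as they might come out of the parameter search.
records = [
    {"query": "q1", "cardinality": 0},
    {"query": "q2", "cardinality": 12},
    {"query": "q3", "cardinality": 0},
]
kept = filter_null(records)
print([r["query"] for r in kept])
```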
LLM pipeline
Once we have a set of queries that we want to augment using LLMs, we can use
the LLM endpoint add-complex-queries.
For a more detailed description, see the LLM wiki.
S: circle "Start" fit
arrow
P: box "param-search" fit
arrow
D: diamond "Choose" "filter" fit
F1: box "cherry-pick" fit at (D + (2,1))
arrow from D.e to F1.w
F2: box "filter-null" fit at (D + (2,0))
arrow from D.e to F2.w
F3: box "no filter" fit at (D + (2,-1))
arrow from D.e to F3.w
LLM: box "add-complex-queries" fit at (F2 + (2,0))
arrow
circle "dataset" "ready"
arrow from F1.e to LLM.w
arrow from F2.e to LLM.w
arrow from F3.e to LLM.w
H: box "make-histograms" fit at (P + (0,-2))
arrow from S.s to H.nw "optional" aligned ""
F: file "histogram" at ( P + (0,-1)) fit
arrow from H.n to F.s "generates" aligned ""
arrow from P.s to F.n "uses" aligned ""
DB: cylinder "DuckDB Database" fit at (P + (0,1))
arrow from P.n to DB.s "run" aligned "queries" aligned
PARQUET: file "cardinalities" fit at (P + (1.5,1))
arrow from P.ne to PARQUET.sw "generates" aligned "parquet" aligned
arrow dashed from PARQUET.e to F1.w "input" ""
arrow dashed from PARQUET.e to F2.w "input " aligned ""
Union pipeline
Union runs after the filters and generates a union dataset.
S: circle "Start" fit
arrow
P: box "param-search" fit
arrow
D: diamond "Choose" "filter" fit
F1: box "cherry-pick" fit at (D + (2,1))
arrow from D.e to F1.w
F2: box "filter-null" fit at (D + (2,0))
arrow from D.e to F2.w
F3: box "no filter" fit at (D + (2,-1))
arrow from D.e to F3.w
LLM: box "union-queries" fit at (F2 + (2,0))
arrow
circle "dataset" "ready"
arrow from F1.e to LLM.w
arrow from F2.e to LLM.w
arrow from F3.e to LLM.w
H: box "make-histograms" fit at (P + (0,-2))
arrow from S.s to H.nw "optional" aligned ""
F: file "histogram" at ( P + (0,-1)) fit
arrow from H.n to F.s "generates" aligned ""
arrow from P.s to F.n "uses" aligned ""
DB: cylinder "DuckDB Database" fit at (P + (0,1))
arrow from P.n to DB.s "run" aligned "queries" aligned
PARQUET: file "cardinalities" fit at (P + (1.5,1))
arrow from P.ne to PARQUET.sw "generates" aligned "parquet" aligned
arrow dashed from PARQUET.e to F1.w "input" ""
arrow dashed from PARQUET.e to F2.w "input " aligned ""