Query generation using snowflake
In this wiki we explain the algorithm to collect a random subgraph to create a unique join signature and adding predicates.
Index
Input toml
An input TOML for snowflake looks like:
# Dataset for accesssing the histograms
dataset = "TPCDS"
# Max ammounts of hops for the subgraph generation
max_hops = 3
# Total amount of queries for each fact table in each batch
max_queries_per_fact_table = 100
# Total amount of query signature for each fact table in each batch
max_queries_per_signature = 1
# probability to keep an edge when generating a subgraph
keep_edge_probability = 0.2
[predicate_parameters]
# minimum probability to aim for when doing a range predicate
# using histograms to estimate the selectivity
row_retention_probability = 0.2
# minimum probability to keep an element in the IN-statement
# or = statement
equality_lower_bound_probability = 0.01
# number of values in the IN-statement
extra_values_for_in = 3
# number of extra predicates to add
extra_predicates = 3
# weights to choose operators when selecting a random operator
[predicate_parameters.operator_weights]
operator_in = 1
operator_range = 3
operator_equal = 3
The first global part relates to the subgraph generation. The
predicate_parameters
control how we add predicates.
Subgraph
To generate a subgraph we start with a fact table and then do a
BFS where we select to proceed with an edge by following the
keep_edge_probability
, then we repeat this up to the hops
allowed by max_hops
.
Subgraph generator example
Let us have the following schema
define node {circle "Dim." $1 at ($2,$3) radius 0.3} FT: circle "Fact" "Table" at (0,0) radius 0.3 DT1: node("Table 1",1,0) DT11: node("Table 1_1",2,0) DT2: node("Table 2",-1,0) DT3: node("Table 3",0,1) DT4: node("Table 4",0,-1) DT41: node("Table 4_1", -1, -2) DT42: node("Table 4_2", 0, -2) DT43: node("Table 4_3", 1, -2) arrow from FT.e to DT1.w arrow from DT1.e to DT11.w arrow from FT.w to DT2.e arrow from FT.n to DT3.s arrow from FT.s to DT4.n arrow from DT4.s to DT41.n arrow from DT4.s to DT42.n arrow from DT4.s to DT43.n→ /pikchrshow
We want to select a random subgraph. Our starting point is always the fact table.
- In gray we will mark the nodes that are not in the graph.
- In red the nodes we already processed and will not visit in the future
- In dash lines the node that we are currently processing
- In solid black lines the nodes and edges that we will keep for the subgraph
We start with the fact table,
define node {circle "Dim." $1 at ($2,$3) radius 0.3 color gray} FT: circle "Fact" "Table" at (0,0) radius 0.3 dashed DT1: node("Table 1",1,0) DT11: node("Table 1_1",2,0) DT2: node("Table 2",-1,0) DT3: node("Table 3",0,1) DT4: node("Table 4",0,-1) DT41: node("Table 4_1", -1, -2) DT42: node("Table 4_2", 0, -2) DT43: node("Table 4_3", 1, -2) arrow from FT.e to DT1.w color gray arrow from DT1.e to DT11.w color gray arrow from FT.w to DT2.e color gray arrow from FT.n to DT3.s color gray arrow from FT.s to DT4.n color gray arrow from DT4.s to DT41.n color gray arrow from DT4.s to DT42.n color gray arrow from DT4.s to DT43.n color gray→ /pikchrshow
We follow with visiting all the neighbours of the current table. Since we always add the fact table we keep it. We start with Dimension table 2, we now want to decide if we keep or we throw away our node. We decide randomly with a probability of keep_edge_prob
to keep or not a node. Let us suppose that we don't keep this node and go to the following node which is node 2
define node {circle "Dim." $1 at ($2,$3) radius 0.3 color gray} define curr_node{circle "Dim." $1 at ($2,$3) radius 0.3 dashed} FT: circle "Fact" "Table" at (0,0) radius 0.3 DT1: curr_node("Table 1",1,0) DT11: node("Table 1_1",2,0) DT2: node("Table 2",-1,0) DT3: node("Table 3",0,1) DT4: node("Table 4",0,-1) DT41: node("Table 4_1", -1, -2) DT42: node("Table 4_2", 0, -2) DT43: node("Table 4_3", 1, -2) arrow from FT.e to DT1.w dashed arrow from DT1.e to DT11.w color gray arrow from FT.w to DT2.e color gray arrow from FT.n to DT3.s color gray arrow from FT.s to DT4.n color gray arrow from DT4.s to DT41.n color gray arrow from DT4.s to DT42.n color gray arrow from DT4.s to DT43.n color gray→ /pikchrshow
Now we are with dimension table 2. And we decide to keep it or not. Lets say we keep it
define node {circle "Dim." $1 at ($2,$3) radius 0.3 color gray} define curr_node{circle "Dim." $1 at ($2,$3) radius 0.3 dashed} FT: circle "Fact" "Table" at (0,0) radius 0.3 DT1: node("Table 1",1,0) DT11: node("Table 1_1",2,0) DT2: curr_node("Table 2",-1,0) DT3: node("Table 3",0,1) DT4: node("Table 4",0,-1) DT41: node("Table 4_1", -1, -2) DT42: node("Table 4_2", 0, -2) DT43: node("Table 4_3", 1, -2) arrow from FT.e to DT1.w color red arrow from DT1.e to DT11.w color gray arrow from FT.w to DT2.e dashed arrow from FT.n to DT3.s color gray arrow from FT.s to DT4.n color gray arrow from DT4.s to DT41.n color gray arrow from DT4.s to DT42.n color gray arrow from DT4.s to DT43.n color gray→ /pikchrshow
We now explore Dimension table 2, since there are no more edges spawning from Dim. Table 2 we keep going with Dim. Table 3. Lets say we don't keep it:
define node {circle "Dim." $1 at ($2,$3) radius 0.3 color gray} define curr_node{circle "Dim." $1 at ($2,$3) radius 0.3 dashed} define good_node{circle "Dim." $1 at ($2,$3) radius 0.3} FT: circle "Fact" "Table" at (0,0) radius 0.3 DT1: node("Table 1",1,0) DT11: node("Table 1_1",2,0) DT2: good_node("Table 2",-1,0) DT3: curr_node("Table 3",0,1) DT4: node("Table 4",0,-1) DT41: node("Table 4_1", -1, -2) DT42: node("Table 4_2", 0, -2) DT43: node("Table 4_3", 1, -2) arrow from FT.e to DT1.w color red arrow from DT1.e to DT11.w color gray arrow from FT.w to DT2.e arrow from FT.n to DT3.s dashed arrow from FT.s to DT4.n color gray arrow from DT4.s to DT41.n color gray arrow from DT4.s to DT42.n color gray arrow from DT4.s to DT43.n color gray→ /pikchrshow
We now go on to Dimension table 4 and lets say we keep it.
define node {circle "Dim." $1 at ($2,$3) radius 0.3 color gray} define curr_node{circle "Dim." $1 at ($2,$3) radius 0.3 dashed} define good_node{circle "Dim." $1 at ($2,$3) radius 0.3} FT: circle "Fact" "Table" at (0,0) radius 0.3 DT1: node("Table 1",1,0) DT11: node("Table 1_1",2,0) DT2: good_node("Table 2",-1,0) DT3: node("Table 3",0,1) DT4: curr_node("Table 4",0,-1) DT41: node("Table 4_1", -1, -2) DT42: node("Table 4_2", 0, -2) DT43: node("Table 4_3", 1, -2) arrow from FT.e to DT1.w color red arrow from DT1.e to DT11.w color gray arrow from FT.w to DT2.e arrow from FT.n to DT3.s color red arrow from FT.s to DT4.n dashed arrow from DT4.s to DT41.n color gray arrow from DT4.s to DT42.n color gray arrow from DT4.s to DT43.n color gray→ /pikchrshow
Now we have to iterate over the nodes of Dim. Table 4. So we first add Dim. Table 4 and start looping from Dim. Table 4_1 and so on
If we had a max. limits of hops of 1, we would stop here since only one hope is allowed, But lets keep going to the children of Dim. Table 4
define node {circle "Dim." $1 at ($2,$3) radius 0.3 color gray} define curr_node{circle "Dim." $1 at ($2,$3) radius 0.3 dashed} define good_node{circle "Dim." $1 at ($2,$3) radius 0.3} FT: circle "Fact" "Table" at (0,0) radius 0.3 DT1: node("Table 1",1,0) DT11: node("Table 1_1",2,0) DT2: good_node("Table 2",-1,0) DT3: node("Table 3",0,1) DT4: good_node("Table 4",0,-1) DT41: curr_node("Table 4_1", -1, -2) DT42: node("Table 4_2", 0, -2) DT43: node("Table 4_3", 1, -2) arrow from FT.e to DT1.w color red arrow from DT1.e to DT11.w color gray arrow from FT.w to DT2.e arrow from FT.n to DT3.s color red arrow from FT.s to DT4.n arrow from DT4.s to DT41.n dashed arrow from DT4.s to DT42.n color gray arrow from DT4.s to DT43.n color gray→ /pikchrshow
Now we do the same thing. Lets say we keep Dim table 4_1, and Dim. table 4_3 while not keeping dim_table 4_2. We end up with this
define node {circle "Dim." $1 at ($2,$3) radius 0.3 color gray} define curr_node{circle "Dim." $1 at ($2,$3) radius 0.3 dashed} define good_node{circle "Dim." $1 at ($2,$3) radius 0.3} FT: circle "Fact" "Table" at (0,0) radius 0.3 DT1: node("Table 1",1,0) DT11: node("Table 1_1",2,0) DT2: good_node("Table 2",-1,0) DT3: node("Table 3",0,1) DT4: good_node("Table 4",0,-1) DT41: good_node("Table 4_1", -1, -2) DT42: node("Table 4_2", 0, -2) DT43: good_node("Table 4_3", 1, -2) arrow from FT.e to DT1.w color red arrow from DT1.e to DT11.w color gray arrow from FT.w to DT2.e arrow from FT.n to DT3.s color red arrow from FT.s to DT4.n arrow from DT4.s to DT41.n arrow from DT4.s to DT42.n color red arrow from DT4.s to DT43.n→ /pikchrshow
Which translates to this subgraph
define node {circle "Dim." $1 at ($2,$3) radius 0.3 color gray} define curr_node{circle "Dim." $1 at ($2,$3) radius 0.3 dashed} define good_node{circle "Dim." $1 at ($2,$3) radius 0.3} FT: circle "Fact" "Table" at (0,0) radius 0.3 DT2: good_node("Table 2",-1,0) DT4: good_node("Table 4",0,-1) DT41: good_node("Table 4_1", -1, -2) DT43: good_node("Table 4_3", 1, -2) arrow from FT.w to DT2.e arrow from FT.s to DT4.n arrow from DT4.s to DT41.n arrow from DT4.s to DT43.n→ /pikchrshow
Right now we can transform our subgraph to a SQL query by using:
SELECT COUNT(*)
FROM
fact_table ft,
dim_table_2 dt2,
dim_table_4 dt4,
dim_table_4_1 dt41,
dim_table_4_3 dt43
WHERE
ft.fk_dt4 = dt4.ref_fk AND
ft.fk_dt2 = dt2.ref_fk AND
dt4.fk_dt41 = dt41.ref_fk AND
dt4.fk_dt43 = dt43.ref_fk AND
Adding predicates
In our example we have added the subgraph which translates to this query:
SELECT COUNT(*)
FROM
fact_table ft,
dim_table_2 dt2,
dim_table_4 dt4,
dim_table_4_1 dt41,
dim_table_4_3 dt43
WHERE
ft.fk_dt4 = dt4.ref_fk AND
ft.fk_dt2 = dt2.ref_fk AND
dt4.fk_dt41 = dt41.ref_fk AND
dt4.fk_dt43 = dt43.ref_fk AND
We now want to add some predicates that make sense. We want to use the equidepth histogram of predicates.
We select extra_predicates
number of predicates to add using histograms with row_retention_probability
If extra_predicates
is 2, this means we will choose 2 random columns to do the procedure.
The variable row_retention_probability
is the filter by which we will select a histogram.
Lets say we have a histogram h1 = [0,10,20,30,40,50,60,70,80,90,100]
and we want to select 20%
Thus we select a random position. Say 70, and we add to positions. thus we end up with a range of [70,90]
Which transforms our query to:
SELECT COUNT(*)
FROM
fact_table ft,
dim_table_2 dt2,
dim_table_4 dt4,
dim_table_4_1 dt41,
dim_table_4_3 dt43
WHERE
ft.fk_dt4 = dt4.ref_fk AND
ft.fk_dt2 = dt2.ref_fk AND
dt4.fk_dt41 = dt41.ref_fk AND
dt4.fk_dt43 = dt43.ref_fk AND
70 <= ft.col1 AND
90 <= ft.col1
If we select another column it will be like:
SELECT COUNT(*)
FROM
fact_table ft,
dim_table_2 dt2,
dim_table_4 dt4,
dim_table_4_1 dt41,
dim_table_4_3 dt43
WHERE
ft.fk_dt4 = dt4.ref_fk AND
ft.fk_dt2 = dt2.ref_fk AND
dt4.fk_dt41 = dt41.ref_fk AND
dt4.fk_dt43 = dt43.ref_fk AND
70 <= ft.col1 AND
90 <= ft.col1 AND
6.5 <= dt4.col2 AND
9 <= dt4.col2
We also want to add other predicates like the IN-predicate and the =
predicate. For these predicates we use Most-common values (MCV). We know
some variables and their frequency. We can use MCV to select values that
fulfill the minimum lower bound for = and IN-statements set by
equality_lower_bound_probability
. Say we know that the column
ft.col2 has the following frequencies (normalized)
- 'a' 50%
- 'b' 20%
- 'c' 10%
Then if our equality_lower_bound_probability
is 0.2 we will include b
and a
but not c
. Thus we could create the following query:
SELECT COUNT(*)
FROM
fact_table ft,
dim_table_2 dt2,
dim_table_4 dt4,
dim_table_4_1 dt41,
dim_table_4_3 dt43
WHERE
ft.fk_dt4 = dt4.ref_fk AND
ft.fk_dt2 = dt2.ref_fk AND
dt4.fk_dt41 = dt41.ref_fk AND
dt4.fk_dt43 = dt43.ref_fk AND
ft.col1 = 'a' AND
ft.col1 IN ('b', 'f', 't')
Where there rest of the elements of the IN-statement are taken from the histogram bounds.
only the first one has to fulfill the condition set by equality_lower_bound_probability
.
Overall pipeline
The Snowflake algorithm in broad terms looks like this
down circle "start" fit arrow box "choose fact" "table randomly" fit arrow box "generate subgraph" fit arrow LS: circle "Begin" "predicate" "generation" fit arrow box "Pick random column." fit arrow D: diamond "Choose" "Operator" fit arrow "Range" aligned above box "Generate a range" "predicate according to" "row retention probability" fit arrow LE: circle "End" "predicate" "generation" fit arrow box "Query Builder" "makes the SQL" "statement" fit arrow circle "End" L1: line from LE.e to (LE + (3,0)) L2: line from (LS + (3,0)) to L1.e "Repeats " aligned "Up to extra_predicates times" aligned L3: arrow from L2.n to LS.e E: box at (D + (1.5,-1.2)) "Generate Equal" "With one MCV" fit arrow from D.e to E.n "Equal" aligned above arrow from E.s to LE.n IN: box at (D + (-1.9,-1.2)) "Generate IN" "With one MCV" "and random elements" fit line from IN.n to D.w "IN" aligned above arrow from D.w to IN.n arrow from IN.s to LE.n→ /pikchrshow