Query generation using snowflake

In this wiki we explain the algorithm to collect a random subgraph to create a unique join signature and adding predicates.

Index

Input Toml explained
Subgraph generation
Subgraph example
Adding predicates

Input toml

An input TOML for snowflake looks like:

# Dataset for accesssing the histograms
dataset = "TPCDS"
# Max ammounts of hops for the subgraph generation
max_hops = 3
# Total amount of queries for each fact table in each batch
max_queries_per_fact_table = 100
# Total amount of query signature for each fact table in each batch
max_queries_per_signature = 1
# probability to keep an edge when generating a subgraph
keep_edge_probability = 0.2



[predicate_parameters]
# minimum probability to aim for when doing a range predicate
# using histograms to estimate the selectivity
row_retention_probability = 0.2
# minimum probability to keep an element in the IN-statement
# or = statement
equality_lower_bound_probability = 0.01
# number of values in the IN-statement
extra_values_for_in = 3
# number of extra predicates to add
extra_predicates = 3

# weights to choose operators when selecting a random operator
[predicate_parameters.operator_weights]
operator_in = 1
operator_range = 3
operator_equal = 3

The first global part relates to the subgraph generation. The predicate_parameters control how we add predicates.

Subgraph

To generate a subgraph we start with a fact table and then do a BFS where we select to proceed with an edge by following the keep_edge_probability, then we repeat this up to the hops allowed by max_hops.

Subgraph generator example

Let us have the following schema

define node {circle "Dim." $1 at ($2,$3) radius 0.3}

FT: circle "Fact" "Table" at (0,0) radius 0.3

DT1: node("Table 1",1,0)
DT11: node("Table 1_1",2,0)
DT2: node("Table 2",-1,0)
DT3: node("Table 3",0,1)
DT4: node("Table 4",0,-1)
DT41: node("Table 4_1", -1, -2)
DT42: node("Table 4_2", 0, -2)
DT43: node("Table 4_3", 1, -2)


arrow from FT.e to DT1.w
arrow from DT1.e to DT11.w
arrow from FT.w to DT2.e
arrow from FT.n to DT3.s
arrow from FT.s to DT4.n
arrow from DT4.s to DT41.n
arrow from DT4.s to DT42.n
arrow from DT4.s to DT43.n

We want to select a random subgraph. Our starting point is always the fact table.

In gray we will mark the nodes that are not in the graph.
In red the nodes we already processed and will not visit in the future
In dash lines the node that we are currently processing
In solid black lines the nodes and edges that we will keep for the subgraph

We start with the fact table,

define node {circle "Dim." $1 at ($2,$3) radius 0.3 color gray}

FT: circle "Fact" "Table" at (0,0) radius 0.3 dashed

DT1: node("Table 1",1,0)
DT11: node("Table 1_1",2,0)
DT2: node("Table 2",-1,0)
DT3: node("Table 3",0,1)
DT4: node("Table 4",0,-1)
DT41: node("Table 4_1", -1, -2)
DT42: node("Table 4_2", 0, -2)
DT43: node("Table 4_3", 1, -2)

arrow from FT.e to DT1.w color gray
arrow from DT1.e to DT11.w color gray
arrow from FT.w to DT2.e color gray
arrow from FT.n to DT3.s color gray
arrow from FT.s to DT4.n color gray
arrow from DT4.s to DT41.n color gray
arrow from DT4.s to DT42.n color gray
arrow from DT4.s to DT43.n color gray

We follow with visiting all the neighbours of the current table. Since we always add the fact table we keep it. We start with Dimension table 2, we now want to decide if we keep or we throw away our node. We decide randomly with a probability of keep_edge_prob to keep or not a node. Let us suppose that we don't keep this node and go to the following node which is node 2

define node {circle "Dim." $1 at ($2,$3) radius 0.3 color gray}
define curr_node{circle "Dim." $1 at ($2,$3) radius 0.3 dashed}

FT: circle "Fact" "Table" at (0,0) radius 0.3 

DT1: curr_node("Table 1",1,0)
DT11: node("Table 1_1",2,0)
DT2: node("Table 2",-1,0) 
DT3: node("Table 3",0,1)
DT4: node("Table 4",0,-1)
DT41: node("Table 4_1", -1, -2)
DT42: node("Table 4_2", 0, -2)
DT43: node("Table 4_3", 1, -2)

arrow from FT.e to DT1.w dashed
arrow from DT1.e to DT11.w color gray
arrow from FT.w to DT2.e color gray
arrow from FT.n to DT3.s color gray
arrow from FT.s to DT4.n color gray
arrow from DT4.s to DT41.n color gray
arrow from DT4.s to DT42.n color gray
arrow from DT4.s to DT43.n color gray

Now we are with dimension table 2. And we decide to keep it or not. Lets say we keep it

define node {circle "Dim." $1 at ($2,$3) radius 0.3 color gray}
define curr_node{circle "Dim." $1 at ($2,$3) radius 0.3 dashed}

FT: circle "Fact" "Table" at (0,0) radius 0.3 

DT1: node("Table 1",1,0)
DT11: node("Table 1_1",2,0)
DT2: curr_node("Table 2",-1,0) 
DT3: node("Table 3",0,1)
DT4: node("Table 4",0,-1)
DT41: node("Table 4_1", -1, -2)
DT42: node("Table 4_2", 0, -2)
DT43: node("Table 4_3", 1, -2)

arrow from FT.e to DT1.w color red
arrow from DT1.e to DT11.w color gray
arrow from FT.w to DT2.e dashed
arrow from FT.n to DT3.s color gray
arrow from FT.s to DT4.n color gray
arrow from DT4.s to DT41.n color gray
arrow from DT4.s to DT42.n color gray
arrow from DT4.s to DT43.n color gray

We now explore Dimension table 2, since there are no more edges spawning from Dim. Table 2 we keep going with Dim. Table 3. Lets say we don't keep it:

define node {circle "Dim." $1 at ($2,$3) radius 0.3 color gray}
define curr_node{circle "Dim." $1 at ($2,$3) radius 0.3 dashed}
define good_node{circle "Dim." $1 at ($2,$3) radius 0.3}

FT: circle "Fact" "Table" at (0,0) radius 0.3 

DT1: node("Table 1",1,0)
DT11: node("Table 1_1",2,0)
DT2: good_node("Table 2",-1,0) 
DT3: curr_node("Table 3",0,1)
DT4: node("Table 4",0,-1)
DT41: node("Table 4_1", -1, -2)
DT42: node("Table 4_2", 0, -2)
DT43: node("Table 4_3", 1, -2)

arrow from FT.e to DT1.w color red
arrow from DT1.e to DT11.w color gray
arrow from FT.w to DT2.e 
arrow from FT.n to DT3.s dashed
arrow from FT.s to DT4.n color gray
arrow from DT4.s to DT41.n color gray
arrow from DT4.s to DT42.n color gray
arrow from DT4.s to DT43.n color gray

We now go on to Dimension table 4 and lets say we keep it.

define node {circle "Dim." $1 at ($2,$3) radius 0.3 color gray}
define curr_node{circle "Dim." $1 at ($2,$3) radius 0.3 dashed}
define good_node{circle "Dim." $1 at ($2,$3) radius 0.3}

FT: circle "Fact" "Table" at (0,0) radius 0.3 

DT1: node("Table 1",1,0)
DT11: node("Table 1_1",2,0)
DT2: good_node("Table 2",-1,0) 
DT3: node("Table 3",0,1)
DT4: curr_node("Table 4",0,-1)
DT41: node("Table 4_1", -1, -2)
DT42: node("Table 4_2", 0, -2)
DT43: node("Table 4_3", 1, -2)

arrow from FT.e to DT1.w color red
arrow from DT1.e to DT11.w color gray
arrow from FT.w to DT2.e 
arrow from FT.n to DT3.s color red
arrow from FT.s to DT4.n dashed
arrow from DT4.s to DT41.n color gray
arrow from DT4.s to DT42.n color gray
arrow from DT4.s to DT43.n color gray

Now we have to iterate over the nodes of Dim. Table 4. So we first add Dim. Table 4 and start looping from Dim. Table 4_1 and so on

If we had a max. limits of hops of 1, we would stop here since only one hope is allowed, But lets keep going to the children of Dim. Table 4

define node {circle "Dim." $1 at ($2,$3) radius 0.3 color gray}
define curr_node{circle "Dim." $1 at ($2,$3) radius 0.3 dashed}
define good_node{circle "Dim." $1 at ($2,$3) radius 0.3}

FT: circle "Fact" "Table" at (0,0) radius 0.3 

DT1: node("Table 1",1,0)
DT11: node("Table 1_1",2,0)
DT2: good_node("Table 2",-1,0) 
DT3: node("Table 3",0,1)
DT4: good_node("Table 4",0,-1)
DT41: curr_node("Table 4_1", -1, -2)
DT42: node("Table 4_2", 0, -2)
DT43: node("Table 4_3", 1, -2)

arrow from FT.e to DT1.w color red
arrow from DT1.e to DT11.w color gray
arrow from FT.w to DT2.e 
arrow from FT.n to DT3.s color red
arrow from FT.s to DT4.n
arrow from DT4.s to DT41.n dashed
arrow from DT4.s to DT42.n color gray
arrow from DT4.s to DT43.n color gray

Now we do the same thing. Lets say we keep Dim table 4_1, and Dim. table 4_3 while not keeping dim_table 4_2. We end up with this

define node {circle "Dim." $1 at ($2,$3) radius 0.3 color gray}
define curr_node{circle "Dim." $1 at ($2,$3) radius 0.3 dashed}
define good_node{circle "Dim." $1 at ($2,$3) radius 0.3}

FT: circle "Fact" "Table" at (0,0) radius 0.3 

DT1: node("Table 1",1,0)
DT11: node("Table 1_1",2,0)
DT2: good_node("Table 2",-1,0) 
DT3: node("Table 3",0,1)
DT4: good_node("Table 4",0,-1)
DT41: good_node("Table 4_1", -1, -2)
DT42: node("Table 4_2", 0, -2)
DT43: good_node("Table 4_3", 1, -2)

arrow from FT.e to DT1.w color red
arrow from DT1.e to DT11.w color gray
arrow from FT.w to DT2.e 
arrow from FT.n to DT3.s color red
arrow from FT.s to DT4.n
arrow from DT4.s to DT41.n 
arrow from DT4.s to DT42.n color red
arrow from DT4.s to DT43.n

Which translates to this subgraph

define node {circle "Dim." $1 at ($2,$3) radius 0.3 color gray}
define curr_node{circle "Dim." $1 at ($2,$3) radius 0.3 dashed}
define good_node{circle "Dim." $1 at ($2,$3) radius 0.3}

FT: circle "Fact" "Table" at (0,0) radius 0.3 

DT2: good_node("Table 2",-1,0) 
DT4: good_node("Table 4",0,-1)
DT41: good_node("Table 4_1", -1, -2)
DT43: good_node("Table 4_3", 1, -2)

arrow from FT.w to DT2.e 
arrow from FT.s to DT4.n
arrow from DT4.s to DT41.n 
arrow from DT4.s to DT43.n

Right now we can transform our subgraph to a SQL query by using:

SELECT COUNT(*)
FROM 
  fact_table ft,
  dim_table_2 dt2,
  dim_table_4 dt4,
  dim_table_4_1 dt41,
  dim_table_4_3 dt43
WHERE
  ft.fk_dt4 = dt4.ref_fk AND
  ft.fk_dt2 = dt2.ref_fk AND
  dt4.fk_dt41 = dt41.ref_fk AND
  dt4.fk_dt43 = dt43.ref_fk AND

Adding predicates

In our example we have added the subgraph which translates to this query:

SELECT COUNT(*)
FROM 
  fact_table ft,
  dim_table_2 dt2,
  dim_table_4 dt4,
  dim_table_4_1 dt41,
  dim_table_4_3 dt43
WHERE
  ft.fk_dt4 = dt4.ref_fk AND
  ft.fk_dt2 = dt2.ref_fk AND
  dt4.fk_dt41 = dt41.ref_fk AND
  dt4.fk_dt43 = dt43.ref_fk AND

We now want to add some predicates that make sense. We want to use the equidepth histogram of predicates. We select extra_predicates number of predicates to add using histograms with row_retention_probability

If extra_predicates is 2, this means we will choose 2 random columns to do the procedure. The variable row_retention_probability is the filter by which we will select a histogram.

Lets say we have a histogram h1 = [0,10,20,30,40,50,60,70,80,90,100] and we want to select 20% Thus we select a random position. Say 70, and we add to positions. thus we end up with a range of [70,90]

Which transforms our query to:

SELECT COUNT(*)
FROM 
  fact_table ft,
  dim_table_2 dt2,
  dim_table_4 dt4,
  dim_table_4_1 dt41,
  dim_table_4_3 dt43
WHERE
  ft.fk_dt4 = dt4.ref_fk AND
  ft.fk_dt2 = dt2.ref_fk AND
  dt4.fk_dt41 = dt41.ref_fk AND
  dt4.fk_dt43 = dt43.ref_fk AND
  70 <= ft.col1 AND
  90 <= ft.col1

If we select another column it will be like:

SELECT COUNT(*)
FROM 
  fact_table ft,
  dim_table_2 dt2,
  dim_table_4 dt4,
  dim_table_4_1 dt41,
  dim_table_4_3 dt43
WHERE
  ft.fk_dt4 = dt4.ref_fk AND
  ft.fk_dt2 = dt2.ref_fk AND
  dt4.fk_dt41 = dt41.ref_fk AND
  dt4.fk_dt43 = dt43.ref_fk AND
  70 <= ft.col1 AND
  90 <= ft.col1 AND
  6.5 <= dt4.col2 AND
  9 <= dt4.col2

We also want to add other predicates like the IN-predicate and the = predicate. For these predicates we use Most-common values (MCV). We know some variables and their frequency. We can use MCV to select values that fulfill the minimum lower bound for = and IN-statements set by
equality_lower_bound_probability. Say we know that the column ft.col2 has the following frequencies (normalized)

'a' 50%
'b' 20%
'c' 10%

Then if our equality_lower_bound_probability is 0.2 we will include b and a but not c. Thus we could create the following query:

SELECT COUNT(*)
FROM 
  fact_table ft,
  dim_table_2 dt2,
  dim_table_4 dt4,
  dim_table_4_1 dt41,
  dim_table_4_3 dt43
WHERE
  ft.fk_dt4 = dt4.ref_fk AND
  ft.fk_dt2 = dt2.ref_fk AND
  dt4.fk_dt41 = dt41.ref_fk AND
  dt4.fk_dt43 = dt43.ref_fk AND
  ft.col1 = 'a' AND
  ft.col1 IN ('b', 'f', 't')

Where there rest of the elements of the IN-statement are taken from the histogram bounds. only the first one has to fulfill the condition set by equality_lower_bound_probability.

Overall pipeline

The Snowflake algorithm in broad terms looks like this

down
circle "start" fit
arrow
box "choose fact" "table randomly" fit
arrow
box "generate subgraph" fit
arrow
LS: circle "Begin" "predicate" "generation" fit
arrow
box "Pick random column." fit
arrow
D: diamond "Choose" "Operator" fit
arrow "Range" aligned above
box "Generate a range" "predicate according to" "row retention probability" fit
arrow
LE: circle "End" "predicate" "generation" fit
arrow
box "Query Builder" "makes the SQL" "statement" fit
arrow
circle "End"


L1: line from LE.e to (LE + (3,0))
L2: line from (LS + (3,0)) to L1.e "Repeats " aligned "Up to extra_predicates times" aligned
L3: arrow from L2.n to LS.e


E: box at (D + (1.5,-1.2)) "Generate Equal" "With one MCV" fit 
arrow from D.e to E.n "Equal" aligned above 
arrow from E.s to LE.n 

IN: box at (D + (-1.9,-1.2)) "Generate IN" "With one MCV" "and random elements" fit 
line from IN.n to D.w  "IN" aligned above 
arrow from D.w to IN.n 
arrow from IN.s to LE.n