How does our input works?
We start with a TOML file
retry = 1
llm_model = "gemma3:27b"
llm_base_prompt = """
You are a database expert. And you are writing \
queries to test your current database with the TPCDS \
dataset with challenging and diverse queries. You are \
currently re-writing new queries that should not be \
trivial or equivalent to the input query.\
When answering, only answer with the query in markdown,\
for example:\
`sql select * from table`
"""
queries_path = "/home/gabriel/query_generation/query_generation/data/generated_queries/FORMATTED_TPCDS"
total_queries = 5000
seed = 425
dataset = "TPCDS"
destination_folder = "/home/gabriel/query_generation/query_generation/data/generated_queries/LLM_QUERIES_gemma"
[llm_prompts.self_join]
prompt = "Your task is modify this query to add a self-join while keeping the predicates"
weight = 10
[llm_prompts.outer_join]
prompt = "Your task is to modify one join to add an outer-join while keeping the predicates"
weight = 10
[llm_prompts.nested_join_in]
prompt = """Your task is to take this query and add a nested query in the type of \
an IN nested query. Make sure to add a column of outside the nested query\
to make it a challenging query that has to be computed everytime \
otherwise it will just be a constant. Keep the predicates."""
weight = 5
[llm_prompts.nested_join_exists]
prompt = """Your task is to take this query and add a nested query in the type of \
an EXISTS nested query. Make sure to add a column of outside the nested query\
to make it a challenging query that has to be computed everytime \
otherwise it will just be a constant. Keep the predicates."""
weight = 5
[llm_prompts.window_function]
prompt = """Your task is to add a window function to this query \
It must be done in a way that works in most SQL dialects. \
You should keep the predicates of the query mostly intact."""
weight = 10
[llm_prompts.inequality_join]
prompt = """Your task is to modify one join to make it an inequality join \
while keeping the predicates of the following query"""
weight = 10
[llm_prompts.group_by]
prompt = """Your task is to keep most of the query while adding a group by"""
weight = 20
[llm_prompts.group_by_order_by]
prompt = """Your task is to keep most of the query while adding a group by and order by"""
weight = 10
[llm_prompts.group_by_rollup]
prompt = """Your task is to keep most of the query while adding a group by with rollup"""
weight = 5
Where we have several parameters:
retry: how many times we will try to regenerate the query if DuckDB finds a syntax error.llm_model: the LLM model to use from OLLAMAllm_base_promptthe prompt to give the LLM with the role ofsystemqueries_pathqueries path that we will augment with the LLMtotal_queriestotal queries to processdatasetdataset to validate the queries.destination_folderfolder to save newly generated queries.
Finally for each new type of query that we generate we have the following instruction
[llm_prompts.self_join]
prompt = "Your task is modify this query to add a self-join while keeping the predicates"
weight = 10
Where self_join can be any string, meaning that we can add an arbitary type of queries.
promptis the prompt that goes before showing a query.weightis the weight of this type of prompt to be sampled from all prompts.
Diagram
# BPMN-like LTR layout (no overlaps)
boxrad = 10px
gap = 0.5
right
Start: circle "Start" fit
arrow gap
box "Load LLM" fit
arrow gap
box "Load Database" fit
arrow gap
Pick: box "Pick a" "SQL query" fit
arrow gap
box "LLM gets `llm_base_prompt`" "as `system`" fit
arrow gap
box "Choose one `llm_prompts`" "based on weights" fit
arrow gap
box "LLM gets `prompt`" "and query" fit
arrow gap
Gval: diamond wid 0.8 ht 0.7 "Valid?" fit
# Yes path
Append: box "Append query" "to dataset" fit with .w at (Gval.e + (gap,0))
arrow from Gval.e to Append.w "yes" above
Gmore: diamond wid 1.0 ht 0.8 "More queries?" fit with .w at (Append.e + (gap,0))
arrow from Append.e to Gmore.w
End: circle "End" fit thickness 3px with .w at (Gmore.e + (gap,0))
arrow from Gmore.e to End.w "no" above
# No (retry) path
DE: diamond "More retries?" at (Gval.s + (0,-0.8))
Reg: box "Regenerate query" "using DuckDB error" fit \
with .n at (Gval.s + (0,-1.7))
arrow from Gval.s to DE.n "no" aligned ""
arrow from DE.s to Reg.n "yes" aligned ""
arrow from DE.e right until even with Gmore then to Gmore.s
line from DE.e "no" "" right until even with Gmore
arrow from Reg.w left 0.1 then up until even with Gval.w then to Gval.w
# Outer loop back to Pick (yes branch)
arrow from Gmore.n up 1 then left until even with Pick.n then to Pick.n
line from Gmore.n up 1 "yes" aligned ""
→ /pikchrshow