How does our input works?
We start with a TOML file
retry = 1
llm_model = "gemma3:27b"
llm_base_prompt = """
You are a database expert. And you are writing \
queries to test your current database with the TPCDS \
dataset with challenging and diverse queries. You are \
currently re-writing new queries that should not be \
trivial or equivalent to the input query.\
When answering, only answer with the query in markdown,\
for example:\
`sql select * from table`
"""
queries_path = "/home/gabriel/query_generation/query_generation/data/generated_queries/FORMATTED_TPCDS"
total_queries = 5000
seed = 425
dataset = "TPCDS"
destination_folder = "/home/gabriel/query_generation/query_generation/data/generated_queries/LLM_QUERIES_gemma"
[llm_prompts.self_join]
prompt = "Your task is modify this query to add a self-join while keeping the predicates"
weight = 10
[llm_prompts.outer_join]
prompt = "Your task is to modify one join to add an outer-join while keeping the predicates"
weight = 10
[llm_prompts.nested_join_in]
prompt = """Your task is to take this query and add a nested query in the type of \
an IN nested query. Make sure to add a column of outside the nested query\
to make it a challenging query that has to be computed everytime \
otherwise it will just be a constant. Keep the predicates."""
weight = 5
[llm_prompts.nested_join_exists]
prompt = """Your task is to take this query and add a nested query in the type of \
an EXISTS nested query. Make sure to add a column of outside the nested query\
to make it a challenging query that has to be computed everytime \
otherwise it will just be a constant. Keep the predicates."""
weight = 5
[llm_prompts.window_function]
prompt = """Your task is to add a window function to this query \
It must be done in a way that works in most SQL dialects. \
You should keep the predicates of the query mostly intact."""
weight = 10
[llm_prompts.inequality_join]
prompt = """Your task is to modify one join to make it an inequality join \
while keeping the predicates of the following query"""
weight = 10
[llm_prompts.group_by]
prompt = """Your task is to keep most of the query while adding a group by"""
weight = 20
[llm_prompts.group_by_order_by]
prompt = """Your task is to keep most of the query while adding a group by and order by"""
weight = 10
[llm_prompts.group_by_rollup]
prompt = """Your task is to keep most of the query while adding a group by with rollup"""
weight = 5
Where we have several parameters:
retry
: how many times we will try to regenerate the query if DuckDB finds a syntax error.llm_model
: the LLM model to use from OLLAMAllm_base_prompt
the prompt to give the LLM with the role ofsystem
queries_path
queries path that we will augment with the LLMtotal_queries
total queries to processdataset
dataset to validate the queries.destination_folder
folder to save newly generated queries.
Finally for each new type of query that we generate we have the following instruction
[llm_prompts.self_join]
prompt = "Your task is modify this query to add a self-join while keeping the predicates"
weight = 10
Where self_join
can be any string, meaning that we can add an arbitary type of queries.
prompt
is the prompt that goes before showing a query.weight
is the weight of this type of prompt to be sampled from all prompts.
Diagram
down circle "start" fit arrow 0.3 box "Load LLM" fit arrow 0.3 box "Load Database" fit arrow 0.3 L1: circle "" fit arrow 0.3 box "Pick a " "SQL query" fit arrow 0.3 box "LLM gets `llm_base_prompt`" "as `system`" fit arrow 0.3 box "Choose one" "llm_prompts" "based on weights" fit arrow 0.3 box "LLM gets `prompt`" " and query" fit arrow 0.3 W1: circle "while" "the query" "is not valid" fit arrow 0.3 box "LLM regenerates the query" "using the error from DuckDB" fit arrow 0.3 W2: circle "" fit arrow 0.3 L2: circle "" fit arrow 0.3 circle "dataset" "generated" fit line from W2.w \ left 1.5 \ then up until even with W1 arrow from last.end to W1.w "repeat up to" aligned "retry times" aligned line from L2.e \ right 1.5 \ then up until even with L1 arrow from L1.e to last.end "repeat" aligned "total_queries times" aligned <-→ /pikchrshow