Query-generation

LLM augmentation

How does our input work?

We start with a TOML file:

retry = 1
llm_model = "gemma3:27b"
llm_base_prompt = """
    You are a database expert writing challenging and \
    diverse queries to test your current database with the \
    TPCDS dataset. You are currently re-writing new queries \
    that should not be trivial or equivalent to the input query. \
    When answering, only answer with the query in markdown, \
    for example: \
    `sql select * from table`
    """
queries_path = "/home/gabriel/query_generation/query_generation/data/generated_queries/FORMATTED_TPCDS"
total_queries = 5000
seed = 425
dataset = "TPCDS"
destination_folder = "/home/gabriel/query_generation/query_generation/data/generated_queries/LLM_QUERIES_gemma"

[llm_prompts.self_join]
prompt = "Your task is modify this query to add a self-join while keeping the predicates"
weight = 10

[llm_prompts.outer_join]
prompt = "Your task is to modify one join to add an outer-join while keeping the predicates"
weight = 10

[llm_prompts.nested_join_in]
prompt = """Your task is to take this query and add a nested query in the type of \
    an IN nested query. Make sure to add a column of outside the nested query\
    to make it a challenging query that has to be computed everytime \
    otherwise it will just be a constant. Keep the predicates."""
weight = 5

[llm_prompts.nested_join_exists]
prompt = """Your task is to take this query and add a nested query in the type of \
    an EXISTS nested query. Make sure to add a column of outside the nested query\
    to make it a challenging query that has to be computed everytime \
    otherwise it will just be a constant. Keep the predicates."""
weight = 5

[llm_prompts.window_function]
prompt = """Your task is to add a window function to this query \
    It must be done in a way that works in most SQL dialects. \
    You should keep the predicates of the query mostly intact."""
weight = 10

[llm_prompts.inequality_join]
prompt = """Your task is to modify one join to make it an inequality join \
    while keeping the predicates of the following query"""
weight = 10

[llm_prompts.group_by]
prompt = """Your task is to keep most of the query while adding a group by"""
weight = 20

[llm_prompts.group_by_order_by]
prompt = """Your task is to keep most of the query while adding a group by and order by"""
weight = 10

[llm_prompts.group_by_rollup]
prompt = """Your task is to keep most of the query while adding a group by with rollup"""
weight = 5

Here we have several parameters (a loading sketch follows the list):

  1. retry: how many times we will try to regenerate the query if DuckDB finds a syntax error.
  2. llm_model: the LLM model to use from Ollama.
  3. llm_base_prompt: the prompt given to the LLM with the system role.
  4. queries_path: the path of the queries that we will augment with the LLM.
  5. total_queries: the total number of queries to process.
  6. dataset: the dataset used to validate the queries.
  7. destination_folder: the folder where the newly generated queries are saved.
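As a rough illustration, a file like this can be read with Python's standard tomllib module (3.11+). The helper name load_config and the file name are hypothetical, not the project's actual code; only the field names come from the TOML above, and the seed field is presumably there to make the sampling reproducible.

import random
import tomllib  # standard-library TOML parser, Python 3.11+

def load_config(path: str) -> dict:
    """Hypothetical helper: read the TOML above and seed the RNG for reproducible sampling."""
    with open(path, "rb") as f:
        cfg = tomllib.load(f)
    random.seed(cfg["seed"])  # 425 in the example file
    return cfg

cfg = load_config("llm_augmentation.toml")  # assumed file name
prompts = cfg["llm_prompts"]                # {"self_join": {...}, "outer_join": {...}, ...}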

Finally, for each new type of query that we generate, we have the following instruction:

[llm_prompts.self_join]
prompt = "Your task is modify this query to add a self-join while keeping the predicates"
weight = 10

Where self_join can be any string, meaning that we can add arbitrary types of queries.

  1. prompt is the prompt that goes before the query shown to the LLM.
  2. weight is the relative weight with which this prompt is sampled from all prompts (see the sampling sketch below).
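Since every prompt carries a weight, picking one is a weighted random draw. Below is a minimal sketch assuming the [llm_prompts] tables have been loaded into a dict as above; pick_prompt is an illustrative name, not the project's actual function.

import random

def pick_prompt(llm_prompts: dict) -> tuple[str, str]:
    """Hypothetical helper: sample one prompt name proportionally to its weight."""
    names = list(llm_prompts)
    weights = [llm_prompts[n]["weight"] for n in names]
    chosen = random.choices(names, weights=weights, k=1)[0]
    return chosen, llm_prompts[chosen]["prompt"]

With the weights above, group_by (weight 20) is drawn twice as often as self_join (weight 10) and four times as often as nested_join_in (weight 5).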

Diagram

down
circle "start" fit
arrow 0.3
box "Load LLM" fit
arrow 0.3
box "Load Database" fit
arrow 0.3
L1: circle "" fit
arrow 0.3
box "Pick a " "SQL query" fit
arrow 0.3
box "LLM gets `llm_base_prompt`" "as `system`" fit
arrow 0.3
box "Choose one" "llm_prompts" "based on weights" fit
arrow 0.3
box "LLM gets `prompt`" " and query" fit
arrow 0.3
W1: circle "while" "the query" "is not valid" fit
arrow 0.3
box "LLM regenerates the query" "using the error from DuckDB" fit
arrow 0.3
W2: circle ""  fit
arrow 0.3
L2: circle "" fit
arrow 0.3
circle "dataset" "generated" fit

# Retry loop: feed the DuckDB error back to the LLM, up to `retry` times.
line from W2.w \
     left 1.5 \
     then up until even with W1
arrow from last.end to W1.w  "repeat up to" aligned "retry times" aligned

# Outer loop: repeat the whole pipeline total_queries times.
line from L2.e \
     right 1.5 \
     then up until even with L1
arrow from L1.e to last.end "repeat" aligned "total_queries times" aligned <-
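
Putting the diagram into code, a minimal sketch of the inner loop could look as follows. It assumes the ollama Python client and the duckdb package; the names augment_query and extract_sql, and the exact wording of the retry message, are illustrative assumptions rather than the actual implementation.

import re
import duckdb
import ollama

def extract_sql(answer: str) -> str:
    """Hypothetical helper: pull the SQL out of a markdown answer."""
    match = re.search(r"```sql\s*(.*?)```", answer, re.DOTALL)
    return match.group(1).strip() if match else answer.strip("` \n")

def augment_query(cfg: dict, query: str, prompt: str, con) -> str | None:
    """Hypothetical sketch: ask the LLM, validate with DuckDB, retry on errors."""
    messages = [
        {"role": "system", "content": cfg["llm_base_prompt"]},
        {"role": "user", "content": f"{prompt}\n\n{query}"},
    ]
    # One initial attempt plus up to `retry` regenerations.
    for _ in range(cfg["retry"] + 1):
        response = ollama.chat(model=cfg["llm_model"], messages=messages)
        answer = response["message"]["content"]
        sql = extract_sql(answer)
        try:
            con.execute("EXPLAIN " + sql)  # parse and bind against the loaded schema without running it
            return sql
        except duckdb.Error as err:
            # Feed the DuckDB error back so the LLM can regenerate the query.
            messages.append({"role": "assistant", "content": answer})
            messages.append({"role": "user", "content": f"DuckDB rejected the query: {err}. Please fix it."})
    return None  # give up on this query after exhausting the retries

A query is only written to destination_folder once DuckDB accepts it; otherwise the error is appended to the conversation and the LLM gets another attempt, up to retry times per query.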