Query-generation

LLM augmentation

How does our input work?

We start with a TOML file:

retry = 1
llm_model = "gemma3:27b"
llm_base_prompt = """
    You are a database expert writing challenging and \
    diverse queries to test your current database with the \
    TPCDS dataset. You are currently re-writing new queries \
    that should not be trivial or equivalent to the input query. \
    When answering, only answer with the query in markdown, \
    for example: \
    `sql select * from table`
    """
queries_path = "/home/gabriel/query_generation/query_generation/data/generated_queries/FORMATTED_TPCDS"
total_queries = 5000
seed = 425
dataset = "TPCDS"
destination_folder = "/home/gabriel/query_generation/query_generation/data/generated_queries/LLM_QUERIES_gemma"

[llm_prompts.self_join]
prompt = "Your task is modify this query to add a self-join while keeping the predicates"
weight = 10

[llm_prompts.outer_join]
prompt = "Your task is to modify one join to add an outer-join while keeping the predicates"
weight = 10

[llm_prompts.nested_join_in]
prompt = """Your task is to take this query and add a nested query in the type of \
    an IN nested query. Make sure to add a column of outside the nested query\
    to make it a challenging query that has to be computed everytime \
    otherwise it will just be a constant. Keep the predicates."""
weight = 5

[llm_prompts.nested_join_exists]
prompt = """Your task is to take this query and add a nested query in the type of \
    an EXISTS nested query. Make sure to add a column of outside the nested query\
    to make it a challenging query that has to be computed everytime \
    otherwise it will just be a constant. Keep the predicates."""
weight = 5

[llm_prompts.window_function]
prompt = """Your task is to add a window function to this query \
    It must be done in a way that works in most SQL dialects. \
    You should keep the predicates of the query mostly intact."""
weight = 10

[llm_prompts.inequality_join]
prompt = """Your task is to modify one join to make it an inequality join \
    while keeping the predicates of the following query"""
weight = 10

[llm_prompts.group_by]
prompt = """Your task is to keep most of the query while adding a group by"""
weight = 20

[llm_prompts.group_by_order_by]
prompt = """Your task is to keep most of the query while adding a group by and order by"""
weight = 10

[llm_prompts.group_by_rollup]
prompt = """Your task is to keep most of the query while adding a group by with rollup"""
weight = 5

Here we have several parameters (a loading sketch follows the list):

  1. retry: how many times we will try to regenerate the query if DuckDB finds a syntax error.
  2. llm_model: the LLM model to use from Ollama.
  3. llm_base_prompt: the prompt given to the LLM with the system role.
  4. queries_path: the path of the queries that we will augment with the LLM.
  5. total_queries: the total number of queries to process.
  6. dataset: the dataset used to validate the queries.
  7. destination_folder: the folder where the newly generated queries are saved.
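As a rough illustration, a file like this can be read with Python's standard tomllib module (3.11+). The helper name load_config and the file name are hypothetical, not the project's actual code; only the field names come from the TOML above, and the seed field is presumably there to make the sampling reproducible.

import random
import tomllib  # standard-library TOML parser, Python 3.11+

def load_config(path: str) -> dict:
    """Hypothetical helper: read the TOML above and seed the RNG for reproducible sampling."""
    with open(path, "rb") as f:
        cfg = tomllib.load(f)
    random.seed(cfg["seed"])  # 425 in the example file
    return cfg

cfg = load_config("llm_augmentation.toml")  # assumed file name
prompts = cfg["llm_prompts"]                # {"self_join": {...}, "outer_join": {...}, ...}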

Finally, for each new type of query that we generate, we have the following instruction:

[llm_prompts.self_join]
prompt = "Your task is modify this query to add a self-join while keeping the predicates"
weight = 10

Where self_join can be any string, meaning that we can add arbitrary types of queries.

  1. prompt is the prompt that goes before the query shown to the LLM.
  2. weight is the relative weight with which this prompt is sampled from all prompts (see the sampling sketch below).
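Since every prompt carries a weight, picking one is a weighted random draw. Below is a minimal sketch assuming the [llm_prompts] tables have been loaded into a dict as above; pick_prompt is an illustrative name, not the project's actual function.

import random

def pick_prompt(llm_prompts: dict) -> tuple[str, str]:
    """Hypothetical helper: sample one prompt name proportionally to its weight."""
    names = list(llm_prompts)
    weights = [llm_prompts[n]["weight"] for n in names]
    chosen = random.choices(names, weights=weights, k=1)[0]
    return chosen, llm_prompts[chosen]["prompt"]

With the weights above, group_by (weight 20) is drawn twice as often as self_join (weight 10) and four times as often as nested_join_in (weight 5).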

Diagram

down
circle "start" fit
arrow 0.3
box "Load LLM" fit
arrow 0.3
box "Load Database" fit
arrow 0.3
L1: circle "" fit
arrow 0.3
box "Pick a " "SQL query" fit
arrow 0.3
box "LLM gets `llm_base_prompt`" "as `system`" fit
arrow 0.3
box "Choose one" "llm_prompts" "based on weights" fit
arrow 0.3
box "LLM gets `prompt`" " and query" fit
arrow 0.3
W1: circle "while" "the query" "is not valid" fit
arrow 0.3
box "LLM regenerates the query" "using the error from DuckDB" fit
arrow 0.3
W2: circle ""  fit
arrow 0.3
L2: circle "" fit
arrow 0.3
circle "dataset" "generated" fit

# Retry loop: feed the DuckDB error back to the LLM, up to `retry` times.
line from W2.w \
     left 1.5 \
     then up until even with W1
arrow from last.end to W1.w  "repeat up to" aligned "retry times" aligned

# Outer loop: repeat the whole pipeline total_queries times.
line from L2.e \
     right 1.5 \
     then up until even with L1
arrow from L1.e to last.end "repeat" aligned "total_queries times" aligned <-
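
Putting the diagram into code, a minimal sketch of the inner loop could look as follows. It assumes the ollama Python client and the duckdb package; the names augment_query and extract_sql, and the exact wording of the retry message, are illustrative assumptions rather than the actual implementation.

import re
import duckdb
import ollama

def extract_sql(answer: str) -> str:
    """Hypothetical helper: pull the SQL out of a markdown answer."""
    match = re.search(r"```sql\s*(.*?)```", answer, re.DOTALL)
    return match.group(1).strip() if match else answer.strip("` \n")

def augment_query(cfg: dict, query: str, prompt: str, con) -> str | None:
    """Hypothetical sketch: ask the LLM, validate with DuckDB, retry on errors."""
    messages = [
        {"role": "system", "content": cfg["llm_base_prompt"]},
        {"role": "user", "content": f"{prompt}\n\n{query}"},
    ]
    # One initial attempt plus up to `retry` regenerations.
    for _ in range(cfg["retry"] + 1):
        response = ollama.chat(model=cfg["llm_model"], messages=messages)
        answer = response["message"]["content"]
        sql = extract_sql(answer)
        try:
            con.execute("EXPLAIN " + sql)  # parse and bind against the loaded schema without running it
            return sql
        except duckdb.Error as err:
            # Feed the DuckDB error back so the LLM can regenerate the query.
            messages.append({"role": "assistant", "content": answer})
            messages.append({"role": "user", "content": f"DuckDB rejected the query: {err}. Please fix it."})
    return None  # give up on this query after exhausting the retries

A query is only written to destination_folder once DuckDB accepts it; otherwise the error is appended to the conversation and the LLM gets another attempt, up to retry times per query.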