Part 2: Guide to building high-accuracy Text-to-SQL BI agents
In Part 1, we explored the common challenges enterprise data teams face—like juggling ad hoc business requests alongside more strategic initiatives—and unpacked some of the blockers to building high-accuracy text-to-SQL BI agents.
The first two steps in the workflow focus on business alignment, which is foundational to everything that follows. To recap:
- Schema review and alignment – Deep-dive into your schema to understand table structures and relationships. Work with stakeholders to map real business questions to the right fields and tables.
- Create a glossary file – Codify your company’s unique terminology and business concepts to help the model understand your domain.
Steps 3 through 6 are where the real lift (and the magic) happens, and where you’ll spend the bulk of your time, because high-quality training and evaluation data are the backbone of high-accuracy AI systems. We’ll walk through the next steps:
- Build your evaluation set (gold test set) – Define your training objective and create a reliable benchmark to measure model accuracy.
- Develop synthetic training data – Start with 20 good examples, then scale your dataset using synthetic data generation pipelines.
- Validate SQL queries – Generate and validate SQL outputs for syntactic correctness, schema alignment, and accuracy.
- Memory-tune your model – The fun part: fine-tune your model using the training data you've developed so far.
Let’s dive in.
Step 3: Create your evaluation (gold) set
Your evaluation (or “gold”) test set is how you’ll know if your model is doing its job. Think of it as both compass and North Star, so define what “good” means up front. For text-to-SQL, a good answer is SQL that is syntactically correct (the query executes without error) and semantically correct (it returns the expected result), with that result ultimately presented to the user in plain language.
For your initial eval set, we’ll use a JSON file with ~20 human-reviewed question-answer pairs that reflect actual end-user queries. The more examples you include, the more precise your accuracy signal, but 20 solid examples are enough to start. You can scale from there using our agentic pipelines to generate more test questions as you iterate.
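To make the format concrete, here’s a minimal sketch of what that gold set and a simple scoring loop might look like. The field names (`question`, `gold_sql`), the table, and the `generate_sql` callable are illustrative assumptions rather than a prescribed format, and it uses SQLite for simplicity; swap in your own database driver.

```python
import sqlite3

# Hypothetical gold set: ~20 human-reviewed question/SQL pairs, typically
# stored in a JSON file and loaded with json.load(). Field names are
# illustrative, not a required format.
GOLD_SET = [
    {
        "question": "Show me the inventory in Georgia for January 2025.",
        "gold_sql": "SELECT SUM(inventory) FROM housing_inventory "
                    "WHERE state = 'Georgia' AND month = '2025-01';",
    },
    # ... ~19 more human-reviewed examples ...
]

def execution_accuracy(conn: sqlite3.Connection, generate_sql) -> float:
    """Score a text-to-SQL model by comparing execution results.

    `generate_sql` is a placeholder for your model call: it takes a
    question string and returns a SQL string.
    """
    passed = 0
    for example in GOLD_SET:
        expected = conn.execute(example["gold_sql"]).fetchall()
        try:
            predicted = conn.execute(generate_sql(example["question"])).fetchall()
        except sqlite3.Error:
            continue  # syntactically invalid SQL counts as a failure
        if sorted(predicted) == sorted(expected):  # order-insensitive compare
            passed += 1
    return passed / len(GOLD_SET)
```

Comparing execution results rather than SQL strings lets you credit semantically correct queries even when they’re written differently from the gold SQL.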
Step 4: Generate synthetic training data
High-quality training data is essential for building specialized and accurate LLM applications, but curating it can be a daunting task.
Common challenges our customers face include:
- Diverse sample data – Ensuring variety to improve generalization and handle ambiguous natural language queries.
- Schema alignment – Mapping real-world business queries to structured database schemas.
- Teaching business concepts – Helping models understand domain-specific knowledge.
- Handling complex SQL queries – Generating accurate SQL for advanced query structures.
Contrary to popular belief, thousands of data points aren’t necessary to get started. We typically begin with 20-50 human-reviewed examples and expand from there. To accelerate expansion, customers use our agentic data pipelines to generate additional questions. Our data generation agents make it easy to transform messy, unlabeled data into clean, structured datasets, and they ensure higher coverage by systematically finding patterns and edge cases, resulting in a more robust text-to-SQL dataset.
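To give a rough sense of what such a pipeline does, the sketch below asks an LLM to produce variations of each seed question. The `llm_complete` callable, the prompt, and the table names are illustrative assumptions, not our actual pipeline.

```python
# Sketch of seed-question expansion. `llm_complete` is a placeholder for your
# LLM client: it takes a prompt string and returns the completion text.
VARIATION_PROMPT = """You are generating training data for a text-to-SQL model
over a real estate database (tables: housing_inventory, rentals, sales).

Write 5 natural-language questions a business user might ask that are
variations of: "{seed}"

Vary the filters, time ranges, and phrasing. Return one question per line."""

def expand_seed_questions(seeds: list[str], llm_complete) -> list[str]:
    generated: list[str] = []
    for seed in seeds:
        completion = llm_complete(VARIATION_PROMPT.format(seed=seed))
        generated.extend(line.strip() for line in completion.splitlines()
                         if line.strip())
    # Crude exact-match dedup; in practice you'd also filter near-duplicates.
    return sorted(set(generated))
```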

We then refine these questions, ensuring they are realistic and aligned with business objectives. This may involve rewording, clarifying, or expanding questions for better coverage.
There are many strategies for structuring questions. Here are two that have worked well for our customers, with a small coverage-tracking sketch after the list:
- By Complexity Level:
  - Level 1: Simple filtering (e.g., “Show me the inventory in Georgia for January 2025.”)
  - Level 2: Aggregation (e.g., “How has the inventory of homes for sale in the top three fastest-growing states changed over the past year?”)
  - Level 3: Trend analysis (e.g., “How do inventory levels correlate with rental prices for multi-family homes across major metros?”)
- By Theme or Topic:
  - In a real estate database, questions might be grouped by rental markets, property valuations, or regional sales trends.
  - Broader groupings could include questions about time-series data, comparisons across regions, or queries that join multiple tables.
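As promised above, here’s one lightweight way to tag examples along both axes so you can track coverage. The field names and values are illustrative assumptions.

```python
from collections import Counter

# Illustrative tagging of training questions by complexity level and theme.
questions = [
    {"question": "Show me the inventory in Georgia for January 2025.",
     "complexity": 1, "theme": "inventory"},
    {"question": "How do inventory levels correlate with rental prices for "
                 "multi-family homes across major metros?",
     "complexity": 3, "theme": "rental markets"},
]

# Count examples per (complexity, theme) bucket to spot coverage gaps.
coverage = Counter((q["complexity"], q["theme"]) for q in questions)
print(coverage)
```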
To prevent overfitting, we ensure semantic diversity in the training dataset. In our real estate example, we initially had multiple questions about Georgia but none about Tennessee. To address this, we duplicated the existing questions while modifying the state, effectively doubling the dataset size. This simple adjustment introduced variability, helping the model generalize better and reducing the risk of overfitting to specific inputs.
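A minimal version of that state-swap augmentation is shown below. The field names and the Georgia-to-Tennessee substitution come from our real estate example, and the abbreviation handling is an illustrative assumption.

```python
# Duplicate each Georgia-specific example with the state swapped to Tennessee
# so the model doesn't overfit to one state. Field names are illustrative.
def augment_by_state(examples: list[dict],
                     old: str = "Georgia", new: str = "Tennessee") -> list[dict]:
    augmented = list(examples)
    for ex in examples:
        if old in ex["question"]:
            augmented.append({
                "question": ex["question"].replace(old, new),
                # Gold SQL may filter on the full name or an abbreviation;
                # swap both forms (abbreviations assumed here).
                "sql": ex["sql"].replace(old, new).replace("'GA'", "'TN'"),
            })
    return augmented
```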
Alongside well-formed questions, include edge cases or ambiguous queries (e.g., “Show sales by region” without specifying time or product scope). These help diagnose how the model handles incomplete or vague prompts and guide fallback logic development.
By structuring questions thoughtfully, we create a solid foundation for improving LLM accuracy and ensuring meaningful query results.
Step 5: SQL query validation
To ensure the training data produces valid SQL queries, we run the generated queries against the database. If a query returns a valid response, we keep it in the training dataset. Failed queries are collected and analyzed for further troubleshooting.
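In practice, this can be a small loop like the following, shown against SQLite for simplicity; the record fields are illustrative assumptions.

```python
import sqlite3

def validate_queries(conn: sqlite3.Connection, candidates: list[dict]):
    """Split generated question/SQL pairs into keepers and failures.

    A candidate is kept only if the database executes its SQL without
    error; failures are collected with the error message for analysis.
    """
    keep, failed = [], []
    for cand in candidates:
        try:
            conn.execute(cand["sql"]).fetchall()
            keep.append(cand)
        except sqlite3.Error as err:
            failed.append({**cand, "error": str(err)})
    return keep, failed
```

Grouping the `failed` list by error message is a quick way to surface the error patterns discussed next.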


To troubleshoot, we identify error patterns and prioritize fixing low-hanging fruit before tackling broader issues. Examples include:
- Prompt engineering enhancements: Add metadata to prompts for precise joins, constraints, and SQL functions.
- Normalization of inputs: Standardize date formats, numeric values, and location-based queries (see the sketch below).
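For example, here’s a minimal sketch of input normalization for dates and state names; the accepted formats and the alias table are illustrative assumptions.

```python
from datetime import datetime

# Try a few common date spellings and rewrite them to a canonical
# YYYY-MM-DD form. The accepted formats are illustrative assumptions.
DATE_FORMATS = ["%m/%d/%Y", "%B %d, %Y", "%Y-%m-%d"]

def normalize_date(text: str) -> str | None:
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # leave unrecognized values for manual review

# Canonicalize location references so "GA", "Ga.", and "Georgia" all map to
# the same filter value. The alias table is a tiny illustrative sample.
STATE_ALIASES = {"ga": "Georgia", "ga.": "Georgia",
                 "tn": "Tennessee", "tn.": "Tennessee"}

def normalize_state(token: str) -> str:
    return STATE_ALIASES.get(token.strip().lower(), token)
```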
Step 6: Memory Tune your model
Memory Tuning is a fine-tuning algorithm that reduces hallucinations by using a million-way Mixture of Memory Experts (MoME) to specialize models with your proprietary data. You can learn more about Memory Tuning here. This is the easy part! With a few lines of code, you can run a tuning job and produce a highly accurate text-to-SQL model.
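A sketch of what that tuning job might look like is below. The import path, class name, and method signature are assumptions based on Lamini’s Python client; consult the Memory Tuning documentation for the exact, current API.

```python
# Hypothetical sketch of kicking off a Memory Tuning job. The client and
# method names below are assumptions, not the definitive API.
from lamini import Lamini  # assumed client import

def run_tuning_job(validated_examples: list[dict]):
    # Assumed base model identifier; use whichever open model you're specializing.
    llm = Lamini(model_name="meta-llama/Meta-Llama-3.1-8B-Instruct")

    # Pair each natural-language question with its validated SQL, i.e., the
    # `keep` list produced by the validation step in Step 5.
    data = [{"input": ex["question"], "output": ex["sql"]}
            for ex in validated_examples]

    return llm.tune(data_or_dataset_id=data)  # assumed method name and argument
```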
In Part 3, we’ll dig into the most important—and most often overlooked—step in building high-accuracy AI systems: Evaluation.
We'd love to chat about your Text-to-SQL use case. To get started with a customized demo, contact us here.