Part 3: Building high-accuracy Text-to-SQL BI agents
In Part 1 and Part 2, we walked through the first 6 steps of our text-to-SQL Memory Tuning workflow. Here’s a quick recap:
- Schema review and alignment – Deep-dive into your schema to understand table structures and relationships. Work with stakeholders to map real business questions to the right fields and tables.
- Create a glossary file – Codify your company’s unique terminology and business concepts to help the model understand your domain.
- Build your evaluation set (gold test set) – Define your training objective and create a reliable benchmark to measure model accuracy.
- Develop synthetic training data – Start with 20 good examples, then scale your dataset using synthetic data generation pipelines.
- Validate SQL queries – Generate and validate SQL outputs for syntactic correctness, schema alignment, and accuracy.
- Memory-tune your model – The fun part: fine-tune your model using the training data you've developed so far.
In Part 3, we’ll walk through the final two steps: Evaluation and Iteration—where you measure performance, learn from failure modes, and continuously improve your model.
Step 7: Evaluation
Evaluations should be both your compass and your North Star—guiding you toward your definition of “good.” A robust evaluation framework replaces “vibe checks” with confidence that your model is production-ready. It shows whether you’re meeting your objectives, if the model is improving or regressing, and where it needs improvement.
Evals should be tailored to your specific use case and business objectives. For text-to-SQL, that means checking whether SQL queries are both semantically and syntactically correct and whether they return the right answer in natural language. You may also need to account for more nuanced objectives, like compliance with internal policies or query explainability.
A text-to-SQL eval set includes human-reviewed user inputs paired with validated SQL outputs. We recommend running evals first against a base model to benchmark performance, then running them against your tuned model so you can track improvements.
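For illustration, here's what a couple of eval records might look like as plain Python data. The questions, field names, and schema are hypothetical; your exact format will depend on your tooling.

```python
# Hypothetical eval records: each pairs a human-reviewed question with a
# validated reference SQL query. Field names and schema are illustrative.
eval_set = [
    {
        "question": "What was total revenue by region in Q4 2024?",
        "reference_sql": (
            "SELECT region, SUM(revenue) AS total_revenue "
            "FROM sales "
            "WHERE order_date BETWEEN '2024-10-01' AND '2024-12-31' "
            "GROUP BY region;"
        ),
    },
    {
        "question": "How many distinct customers placed an order last month?",
        "reference_sql": (
            "SELECT COUNT(DISTINCT customer_id) "
            "FROM sales "
            "WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '1 month' "
            "  AND order_date < DATE_TRUNC('month', CURRENT_DATE);"
        ),
    },
]
```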
There are three criteria for a good evaluation. It should be:
- Quantitative – Track progress with a small set of metrics. This is how you will know if your model is actually improving over time.
- Diagnostic – Identify failure patterns, like hallucinations or incorrect logic, so you know where to focus next.
- Scalable – It should be fast, cost-effective, and able to grow with your model, incorporating LLM-based scoring as well as human feedback.
Once the eval set is ready, we can run it against the tuned text-to-SQL model.
For quantitative scoring, we use a two-step evaluation process (a minimal sketch follows the list below):
- First, we validate whether the generated SQL query is executable by running it against the database. This ensures the query is syntactically correct and can be executed.
- If the query is valid, we then compare the results of the reference query and the generated query. We use an LLM-as-judge method to evaluate whether the data returned by the two queries matches, producing a score from 0 to 100 that represents the percentage of matching results.
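Here's a minimal sketch of that two-step check, assuming a SQLite connection and a `judge` callable that wraps whatever LLM-as-judge call your stack uses; both are placeholders rather than a specific API.

```python
import sqlite3

def execute_sql(conn: sqlite3.Connection, sql: str):
    """Step 1: try to run the query; returns (ok, rows)."""
    try:
        return True, conn.execute(sql).fetchall()
    except sqlite3.Error:
        return False, None

def score_example(conn, generated_sql, reference_sql, judge):
    """Two-step check: executability first, then result comparison via a judge."""
    ok, generated_rows = execute_sql(conn, generated_sql)
    if not ok:
        return 0  # not executable against this schema -> automatic zero
    _, reference_rows = execute_sql(conn, reference_sql)
    # Step 2: the judge (your LLM-as-judge call) returns 0-100 for how closely
    # the generated results match the reference results.
    return judge(generated_rows, reference_rows)
```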
Hint: Don’t aim for 100% on your eval—if you hit it, your test is probably too easy.
We can use these errors to generate new training data (see the sketch after this list). We look for:
- Error patterns: Identify and address recurring errors in WHERE clauses, joins, and aggregations.
- Edge cases: Rigorously test complex queries with nested conditions and multiple filters.
- Hallucinations in joins: Queries may reference nonexistent tables; resolve with explicit schema constraints.
- Misinterpreted date filtering: Some queries apply incorrect date filters; refine prompt logic to address this.
- Incorrect aggregations: Queries omit needed GROUP BY clauses; address with improved SQL generation rules.
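As a rough sketch, low-scoring eval cases can be collected into candidate training examples like this. The `eval_results` fields are assumptions based on the scoring step above, and every candidate should still get human review before it enters the training set.

```python
def harvest_failures(eval_results, score_threshold=90):
    """Collect low-scoring eval cases as candidate training examples.

    `eval_results` is assumed to be a list of dicts produced by the scoring
    step above; corrected examples should be human-reviewed before use.
    """
    candidates = []
    for result in eval_results:
        if result["score"] < score_threshold:
            candidates.append({
                "question": result["question"],
                "bad_sql": result["generated_sql"],   # keep for error analysis
                "sql": result["reference_sql"],       # corrected target output
                "error_type": result.get("error_type", "unlabeled"),
            })
    return candidates
```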
Then we repeat the cycle by modifying the training data set, tuning the model again, and running eval until we reach the desired level of accuracy.
Step 8: Keep iterating
Evaluation isn’t a one-time task—it’s an ongoing, iterative process. While this may seem daunting, it’s actually a powerful way to build confidence in your system. Frequent small refinements drive rapid improvements, allowing you to see measurable progress in real time.
Here’s how the iteration cycle works (a rough sketch of the loop follows the list):
- Start with a high-quality eval set – Begin with 20-50 carefully curated examples. Quality matters more than quantity.
- Run, evaluate, and diagnose – Test your model, analyze results, and identify failure modes such as hallucinations. Look for patterns in errors rather than trying to fix everything at once.
- Expand for depth and complexity – Gradually increase the evaluation dataset to cover more nuanced queries and edge cases.
- Rinse and repeat – Continue this cycle until you reach the desired level of accuracy.
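Put together, the cycle looks roughly like the loop below. The `tune`, `run_eval`, and `harvest_failures` functions are stand-ins for your own pipeline steps, and the 90% threshold is just an example target, not a universal bar.

```python
TARGET_ACCURACY = 0.90  # example target, adjust to your use case

def improve(train_set, eval_set, tune, run_eval, harvest_failures, max_rounds=5):
    """Tune -> evaluate -> augment loop; stops when the target accuracy is hit."""
    model = None
    for round_num in range(max_rounds):
        model = tune(train_set)              # memory-tune on the current data
        results = run_eval(model, eval_set)  # returns per-example scores
        accuracy = sum(r["score"] >= 90 for r in results) / len(results)
        print(f"Round {round_num}: accuracy = {accuracy:.0%}")
        if accuracy >= TARGET_ACCURACY:
            break
        # Fold failure-derived examples back into the training set for the next round.
        train_set = train_set + harvest_failures(results)
    return model
```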
By continuously iterating, you systematically improve model performance and ensure long-term reliability as your data drifts.

Conclusion
The fastest path to data-driven decisions isn’t another dashboard; it’s enabling users to access data and get answers in real time.
This guide gives you the blueprint we’ve used with some of the world’s most data-intensive enterprises to hit 90%+ accuracy in production-grade text-to-SQL systems. It works because it’s built for your schema, your metrics, your real-world edge cases—not generic benchmarks.
If you’re ready to unlock real-time insights for your business users without drowning your data team, let’s talk. We’ll show you what high accuracy looks like on your data, inside your environment, in a matter of weeks.
We'd love to chat about your text-to-SQL use case. To get started with a customized demo, contact us here.