Technical Overview

A better model is not enough.
Data needs a better harness.

By Will Bunting and Forest Fang · April 2026 · 6 min read

Frontier models can produce fluent SQL and confident prose. That still does not guarantee correct, reproducible analysis for business-critical questions.

Across every model generation we've tested, the same pattern shows up: output on real business data is often only mostly right. Data work is less deterministic than code work. In software, compilers, type checkers, linters, and tests expose mistakes quickly. In analytics, there is no equivalent default harness to catch subtle metric and logic errors before they are trusted.

That harness is what we've been building. It is not just rich context stores — it's what makes a confident answer a correct one.

What's changing

Two shifts are changing how serious teams think about AI in data work.

Models

The model itself has stopped being the bottleneck. The agentic moves data analysis runs on (discovering tables, sequencing tool calls, recovering from failures) used to give the frontier a clear edge. That edge is gone. Picking one model over another is no longer where the next correctness gain comes from.

Context

The industry has shifted its attention from prompts and models to the context the agent works against. The argument goes that prompts and models are commodities; the rich context stores a team builds out — the semantic layer, the metric catalog, the dimensional model — are what determine whether an agent gives the right answer.

Where it breaks down

The metric catalog can be populated and the dimensional model curated, and the agent can still write a query that ignores both. The catalog might define "active customer" a specific way; the agent can hand-roll its own definition with a slightly different threshold and return a confident, runnable, wrong number. The reason is mechanical: the catalog is information, not a constraint. It sits in the prompt and competes with the schema, the wording of the question, and the model's training in writing SQL from scratch. Information loses by default. Context populates the prompt. It does not make the agent use it.

Why it matters

Most leading teams are now investing real effort in giving their AI agents the right context: semantic layers, governed metric catalogs, curated dimensional models. The intuition behind that investment is correct. The work is necessary. What we keep measuring is how it combines with the rest of the system to produce a correct answer.

On 20 reviewed analytics questions against a real customer warehouse, a frontier model with raw schema access answered 45% correctly (9 of 20). Adding a governed metric catalog and dimensional model brought it to 50% (10 of 20). The same model running inside Vuon's harness, with the same catalog, the same warehouse, and the same questions, answered 65% correctly (13 of 20). Of the 20-point lift from 45% to 65%, the catalog earned a quarter (5 points). The harness around it earned the other three-quarters (15 points).

Correctness across 20 reviewed analytics questions

Claude + schema: 45%
Claude + catalog: 50%
Vuon: 65%

Same frontier model and warehouse across all three. The catalog earned a quarter of the available lift; the harness earned the rest.
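The split of the lift can be checked with a few lines of arithmetic. This is a minimal sketch: the counts of correct answers are derived from the reported percentages on 20 questions, and the variable names are illustrative.

```python
# Counts of correct answers, derived from the reported 45%, 50%, and 65% of 20 questions.
TOTAL_QUESTIONS = 20
correct = {"schema": 9, "catalog": 10, "harness": 13}

total_lift = correct["harness"] - correct["schema"]      # questions gained overall
catalog_lift = correct["catalog"] - correct["schema"]    # gained by adding the catalog
harness_lift = correct["harness"] - correct["catalog"]   # gained by adding the harness

catalog_share = catalog_lift / total_lift
harness_share = harness_lift / total_lift

for name, n in correct.items():
    print(f"{name}: {n}/{TOTAL_QUESTIONS} = {n / TOTAL_QUESTIONS:.0%}")
print(f"catalog share of lift: {catalog_share:.0%}")
print(f"harness share of lift: {harness_share:.0%}")
```

In absolute terms, the catalog accounts for one additional correct answer out of 20 and the harness for three.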

AI answers that compile and look right are not the same as AI answers that are right. The difference is invisible from a demo and visible in the numbers that get acted on. The next problem the industry needs to solve isn't a catalog problem. It's an enforcement problem. The semantic layer didn't fail. It just didn't get used.

What enforcement looks like in practice

A team asks: "How concentrated is dashboard creation among users?"

Both systems have the same metric catalog and dimensional model in their context. Here's what each does next.

Claude + catalog

Reads the question, scans the schema for tables with dashboards in them, finds app_dashboards. That's where the action is, so it starts there:

SELECT
    creator_user_id,
    COUNT(*) AS dashboards_created
FROM app_dashboards
WHERE deleted_at IS NULL
GROUP BY 1

The query returns 25,630 rows, one for every user with at least one non-deleted dashboard. Users who never created a dashboard are not in the result at all. Conclusion: moderately concentrated.

Vuon

Reads the same question. Before any SQL gets written, the harness classifies it: a distribution across users. The semantic graph confirms dim_users as the canonical user list. The agent anchors there, and the SQL compiler verifies the anchoring before the query runs:

SELECT
    u.user_id,
    COUNT(DISTINCT d.dashboard_id) AS dashboards_created
FROM dim_users u
LEFT JOIN app_dashboards d
    ON d.creator_user_id = u.user_id
    AND d.deleted_at IS NULL
GROUP BY 1

133,146 total users. 80.8% of them created zero dashboards. Conclusion: extremely concentrated.

Two opposite conclusions, same question. The 80% of users who created nothing are right there in Vuon's result — each with a count of zero. They never showed up in Claude's at all, because they aren't in app_dashboards to begin with. The dimensional model was in both systems' context. Only Vuon's harness made the agent reach for it.

The harness didn't make the model smarter. It enforced the move the model was free to skip. Not a smarter answer, not a richer context — a query the agent could not have written any other way.
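The difference between the two queries can be reproduced on toy data. The sketch below uses Python's built-in sqlite3 module and an invented ten-user dataset; the table and column names follow the example above, but the rows are hypothetical and the real warehouse is not SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_users (user_id INTEGER PRIMARY KEY);
    CREATE TABLE app_dashboards (
        dashboard_id INTEGER PRIMARY KEY,
        creator_user_id INTEGER,
        deleted_at TEXT
    );
    -- Toy data: 10 users; only users 1 and 2 have live dashboards.
    INSERT INTO dim_users (user_id)
        VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
    INSERT INTO app_dashboards VALUES
        (101, 1, NULL), (102, 1, NULL), (103, 1, NULL),
        (104, 2, NULL),
        (105, 3, '2026-01-15');  -- deleted, so user 3 ends up with zero
""")

# Starts from the fact table: users with no live dashboards never appear.
UNANCHORED = """
    SELECT creator_user_id, COUNT(*) AS dashboards_created
    FROM app_dashboards
    WHERE deleted_at IS NULL
    GROUP BY 1
"""

# Starts from the canonical user list: every user gets a row, possibly zero.
ANCHORED = """
    SELECT u.user_id, COUNT(DISTINCT d.dashboard_id) AS dashboards_created
    FROM dim_users u
    LEFT JOIN app_dashboards d
      ON d.creator_user_id = u.user_id
     AND d.deleted_at IS NULL
    GROUP BY 1
"""

unanchored_rows = conn.execute(UNANCHORED).fetchall()
anchored_rows = conn.execute(ANCHORED).fetchall()

zero_creators = sum(1 for _, n in anchored_rows if n == 0)
zero_share = zero_creators / len(anchored_rows)

print(len(unanchored_rows), "rows without the anchor")
print(len(anchored_rows), "rows with the anchor")
print(f"{zero_share:.0%} of users created zero dashboards")
```

On this toy data the unanchored query sees two users and the anchored query sees ten, eight of them with a count of zero. The population the first query silently drops is exactly the one that decides the answer.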

What this means for leading data teams

If context is necessary but not sufficient, the work is making the context stick. Four mechanisms turn it from a passive artifact into an active constraint. They close the gap between "the agent has access to the right definition" and "the agent's answer reflects that definition." Any production data agent stack worth evaluating needs all four.

Contextually aware SQL compilation

A SQL compilation layer that understands warehouse structure and business definitions. It flags misaligned queries in the same spirit that a code compiler flags undefined variables or invalid types.
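A minimal sketch of one such pre-execution check follows. It assumes a hypothetical rule that "distribution across users" questions must start from dim_users, and it finds the first FROM table with a naive regular expression. A production layer would use a real SQL parser and handle CTEs and subqueries; AnchorRule, RULES, and check_anchor are illustrative names, not Vuon's API.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AnchorRule:
    question_type: str   # e.g. "distribution_across_users"
    required_table: str  # canonical dimension the query must start from


# Hypothetical rule; a real system would derive this from its semantic graph.
RULES = {
    "distribution_across_users": AnchorRule("distribution_across_users", "dim_users"),
}

# Naive: captures the table name after the first FROM keyword.
FROM_TABLE = re.compile(r"\bFROM\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def check_anchor(sql: str, question_type: str) -> list[str]:
    """Return a list of violations; an empty list means the query may run."""
    rule = RULES.get(question_type)
    if rule is None:
        return []  # no anchoring rule for this kind of question
    match = FROM_TABLE.search(sql)
    if match is None:
        return ["query has no FROM clause"]
    table = match.group(1).lower()
    if table != rule.required_table:
        return [
            f"{question_type} must anchor on {rule.required_table}, "
            f"but the query starts from {table}"
        ]
    return []


UNANCHORED_SQL = (
    "SELECT creator_user_id, COUNT(*) FROM app_dashboards "
    "WHERE deleted_at IS NULL GROUP BY 1"
)
ANCHORED_SQL = (
    "SELECT u.user_id, COUNT(DISTINCT d.dashboard_id) FROM dim_users u "
    "LEFT JOIN app_dashboards d ON d.creator_user_id = u.user_id "
    "AND d.deleted_at IS NULL GROUP BY 1"
)

print(check_anchor(UNANCHORED_SQL, "distribution_across_users"))
print(check_anchor(ANCHORED_SQL, "distribution_across_users"))
```

The point of the design is that the check returns violations before anything reaches the warehouse, so a misanchored query is rejected rather than answered.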

Semantic and policy graph

Metric definitions, ownership, exclusions, and approved logic are represented as operational context rather than as loose prose stuffed into a prompt. That turns semantic drift into something the agent can detect and correct.
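One way to picture "operational context" is a definition the system can compare against, rather than text the model may or may not read. The sketch below is an assumption-heavy illustration: the metric, owner, table, and predicates are invented, and predicates are compared as normalized strings where a real system would compare parsed expressions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    owner: str
    entity_table: str
    approved_filters: frozenset[str]
    exclusions: frozenset[str] = frozenset()


# Hypothetical catalog entry for illustration only.
CATALOG = {
    "active_customer": MetricDefinition(
        name="active_customer",
        owner="growth-analytics",
        entity_table="dim_customers",
        approved_filters=frozenset({"orders_last_30d >= 1"}),
        exclusions=frozenset({"is_internal = FALSE"}),
    )
}


def normalize(predicate: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(predicate.lower().split())


def detect_drift(metric_name: str, used_filters: list[str]) -> list[str]:
    """Compare the predicates an agent used against the approved definition."""
    definition = CATALOG[metric_name]
    required = {normalize(p) for p in definition.approved_filters | definition.exclusions}
    used = {normalize(p) for p in used_filters}
    findings = [f"missing approved predicate: {p}" for p in sorted(required - used)]
    findings += [f"unapproved predicate: {p}" for p in sorted(used - required)]
    return findings


# The agent hand-rolls a slightly different threshold for "active".
drifted = detect_drift("active_customer", ["orders_last_30d >= 3", "is_internal = FALSE"])
conforming = detect_drift("active_customer", ["orders_last_30d >= 1", "is_internal = false"])
print(drifted)
print(conforming)
```

With the definition held as data, a changed threshold shows up as a concrete finding that can block the query or prompt a correction.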

Post-execution validation

After execution, the system evaluates outputs across multiple statistical and structural dimensions. It checks for impossible values, denominator drift, discontinuities, and other signatures that the query or the underlying data may be wrong.
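As a rough illustration, three of these checks can be sketched in a few lines. The function names, the Finding type, and the jump threshold are assumptions made for this example, not a description of the actual validators; the row counts in the usage lines are the ones from the dashboard example above.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    check: str
    detail: str


def validate_counts(rows, expected_population):
    """rows: (key, count) tuples from a per-entity count query."""
    findings = []
    # Impossible values: a count can be zero but never negative or null.
    bad = [key for key, n in rows if n is None or n < 0]
    if bad:
        findings.append(Finding("impossible_value", f"{len(bad)} rows with negative or null counts"))
    # Denominator drift: a per-entity result should cover the canonical population.
    if len(rows) != expected_population:
        findings.append(Finding(
            "denominator_drift",
            f"result has {len(rows)} rows, canonical population is {expected_population}",
        ))
    return findings


def find_discontinuities(series, max_ratio=5.0):
    """series: ordered (period, value) pairs. Flags period-over-period jumps beyond max_ratio."""
    findings = []
    for (_, prev), (period, cur) in zip(series, series[1:]):
        if prev > 0 and (cur / prev > max_ratio or cur / prev < 1 / max_ratio):
            findings.append(Finding("discontinuity", f"{period}: {prev} -> {cur}"))
    return findings


# 25,630 result rows against a canonical population of 133,146 users.
count_findings = validate_counts([(i, 1) for i in range(25_630)], expected_population=133_146)
jump_findings = find_discontinuities([("2026-01", 100), ("2026-02", 110), ("2026-03", 900)])

for f in count_findings + jump_findings:
    print(f.check, "|", f.detail)
```

A denominator check like this would flag the unanchored dashboard query even if it had slipped past the pre-execution stage, which is why the two layers are complementary.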

Tracked and versioned calculations

Every calculation the agent makes is tracked, versioned, and available to rerun as the analysis develops. That gives the system a reproducibility loop and lets it catch its own inconsistencies before the answer is delivered.
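A reproducibility loop of this kind can be sketched as a ledger that stores each calculation's SQL with a digest of its result, then reruns the SQL and compares digests. CalcLedger, CalcVersion, and the run_query callable are hypothetical names for this sketch, and it assumes result rows hold simple, sortable, JSON-serializable values.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CalcVersion:
    calc_id: str
    version: int
    sql: str
    result_digest: str


def digest(rows) -> str:
    """Order-insensitive SHA-256 digest of a result set."""
    payload = json.dumps(sorted(list(r) for r in rows)).encode()
    return hashlib.sha256(payload).hexdigest()


class CalcLedger:
    def __init__(self):
        self._history: dict[str, list[CalcVersion]] = {}

    def record(self, calc_id: str, sql: str, rows) -> CalcVersion:
        """Append a new version of a calculation along with its result digest."""
        versions = self._history.setdefault(calc_id, [])
        entry = CalcVersion(calc_id, len(versions) + 1, sql, digest(rows))
        versions.append(entry)
        return entry

    def rerun_matches(self, calc_id: str, run_query) -> bool:
        """Re-execute the latest version and compare against the recorded digest."""
        latest = self._history[calc_id][-1]
        return digest(run_query(latest.sql)) == latest.result_digest


ledger = CalcLedger()
v1 = ledger.record("dashboards_per_user", "SELECT ...", [(1, 3), (2, 0)])

# Same rows in a different order still match; a changed value does not.
same = ledger.rerun_matches("dashboards_per_user", lambda sql: [(2, 0), (1, 3)])
changed = ledger.rerun_matches("dashboards_per_user", lambda sql: [(1, 4), (2, 0)])
print(v1.version, same, changed)
```

Storing a digest rather than the full result keeps the ledger small while still letting a rerun reveal that a number has shifted between the first calculation and the final answer.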

Buying or building one or two of these and skipping the others produces a system that fails in fewer places but in the same ways. The four reinforce each other: the semantic graph specifies the definitions; pre-execution checks catch queries that violate them; post-execution checks catch what still slips through; versioning makes the whole loop reproducible and auditable.

Bottom line

Same context. Same model. Different answers. What decides which answer you get is the system between the model and the warehouse: whether it enforces the semantic layer's definitions, challenges the agent's outputs, and replays the analytical path. The gap between a confident answer and a correct one won't close until that system does the work the model is free to skip.