
Architecture · Long-form

Dual-model routing in production — why the cost classifier isn't enough

Tier 1 + Tier 2 isn't a cost lever. It's four classifiers wearing one mask.

Published · June 2026 · ~12 min read

1. Cost is the only reason this architecture exists

Single-model AI at enterprise scale is unsustainable. A flagship model on every user intent — including the boring ones, the redundant ones, the ones that resolve to a two-field database lookup — produces a per-request cost that a serious enterprise product cannot absorb. The math is stark: a typical conversational query against a flagship model can land around three to four cents. Multiply by tens of thousands of daily interactions across a customer base and the inference line item becomes the single largest variable cost on the platform.

The fix every team converges on is some flavour of the same idea: don't use the flagship model for queries a cheaper one can handle. The cheap one routes; the expensive one executes only when it has to. The published versions of this pattern range from rule-based routing tables to learned classifiers to chained model cascades. The published versions all sell it as a cost lever.

What the published versions tend to leave out is what it actually looks like in production once you depend on it. The cost-lever framing is true. The cost-lever framing is also dangerously incomplete. The classifier isn't one classifier. The lever isn't one lever.

2. The Tier 1 + Tier 2 topology, as designed

The version we built and run inside HONO Zero UI is the standard shape — lightweight classifier in front of a heavyweight executor, with prompt caching pulling the cost of the classifier down to nearly zero:

  • Tier 1. A Haiku-class model wrapped with Anthropic prompt caching. The cached prompt holds the entire intent vocabulary (the things a user might be trying to do), so the marginal cost per classification call is a small fraction of a cent. The Tier 1 model emits a structured decision: which intent, with what confidence, and whether to escalate.
  • Tier 2. A Sonnet-class executor with the full tool repertoire, conversational reasoning, and access to the MCP surface. It's the one that actually does the work when the work is genuinely complex. Expensive per call, but fired only on the queries that earn it.
  • Traffic split, target. Roughly 60% of queries resolve at Tier 1 only (clear intent, simple path). 30% escalate cleanly to Tier 2 (ambiguity, complexity, or known multi-step work). The remaining 10% bounce — Tier 1 attempts, fails, escalates mid-conversation.
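The structured decision that Tier 1 emits can be sketched as a small typed record plus a parser. This is a minimal illustration in Python, not the production schema: the field names, the JSON wire format, the `KNOWN_INTENTS` vocabulary, and the fail-closed fallback (escalate whenever the output cannot be trusted) are all assumptions.

```python
import json
from dataclasses import dataclass

# Hypothetical intent vocabulary; the real one lives in the cached Tier 1 prompt.
KNOWN_INTENTS = {"leave_query", "performance_query", "attendance_query", "expense_query"}


@dataclass(frozen=True)
class Tier1Decision:
    intent: str
    confidence: float  # expected range: 0.0 to 1.0
    escalate: bool


# Used whenever the classifier output is malformed: hand the query to Tier 2.
FALLBACK = Tier1Decision(intent="unknown", confidence=0.0, escalate=True)


def parse_tier1_decision(raw: str) -> Tier1Decision:
    """Parse the classifier's JSON output, escalating on anything unexpected."""
    try:
        data = json.loads(raw)
        intent = data["intent"]
        confidence = float(data["confidence"])
        escalate = data["escalate"]
    except (ValueError, KeyError, TypeError):
        return FALLBACK
    if not isinstance(intent, str) or intent not in KNOWN_INTENTS:
        return FALLBACK
    if not isinstance(escalate, bool) or not 0.0 <= confidence <= 1.0:
        return FALLBACK
    return Tier1Decision(intent, confidence, escalate)
```

Failing closed costs a Tier 2 call on malformed output, which seems the safer default when the alternative is acting on a decision that could not be parsed.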

On paper, the math is great. The headline figure from this pattern in our deployment is a per-request cost that drops from around 3¢ on a naive single-model architecture to roughly 0.5¢ on the routed one. That's the published story. The numbers are real.

Figure: Several classifiers · two model tiers · one async track for follow-ups

  • User query. “Show me who applied for the most leaves this year.” Looks simple on the surface. Implies fifty employees, twelve months, aggregation, ranking.
  • Routing layer (classifier). Several classifiers, each on a different axis: Intent (what is the user trying to do?) · Scope (cardinality, temporal range, aggregation?) · Conversation (where does this turn sit in the arc?) · Confidence (how sure are we?). Each is emitted independently, then combined into a routing decision.
  • Route A · tool path (aggregate tool). High cardinality OR aggregation required → one bulk call to a domain aggregate tool. SQL does the work.
  • Route B · Tier 1 only (cheap). Clear, single-entity intent, high confidence → cheap classifier-class model with simple tools. ~60% of traffic.
  • Route C · Tier 2 escalation (capable). Ambiguous intent OR low confidence OR multi-step reasoning required → heavy executor with full tool repertoire.
  • Response surface. The primary content arrives from whichever tier handled the query: at Tier 1 speed when Tier 1 was right, at Tier 2 quality when Tier 2 was warranted. The user sees one stream.
  • Async track · suggestion pills. Decoupled follow-up generation, always at Tier 2 quality. Runs alongside the answer, never blocking it. Suggestion pills arrive a beat after the primary response. The cost classifier optimises answers; suggestions are a separate cognitive task on their own track.

The published version of dual-model routing is one classifier choosing between two model tiers. The shipping version is several cheap classifiers feeding two model tiers and at least one async generator, with each dimension catching what the others miss.
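The three routes can be read as a small decision function over the classifier outputs. The sketch below is one possible reading in Python, not the production router: the `Signals` fields, the `Route` names, the 0.8 default threshold, and the choice to check scope before confidence are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AGGREGATE_TOOL = "A"  # one bulk call to a domain aggregate tool
    TIER_1 = "B"          # cheap classifier-class model
    TIER_2 = "C"          # heavy executor with the full tool repertoire


@dataclass(frozen=True)
class Signals:
    """One field per classifier output; all names are hypothetical."""
    intent_clear: bool
    confidence: float
    high_cardinality: bool
    needs_aggregation: bool
    multi_step: bool


def choose_route(s: Signals, threshold: float = 0.8) -> Route:
    # Scope is checked first (an assumption): a simple-looking intent must
    # not pull a many-entity question onto the per-entity path.
    if s.high_cardinality or s.needs_aggregation:
        return Route.AGGREGATE_TOOL
    # Any doubt about intent, confidence, or complexity escalates.
    if not s.intent_clear or s.confidence < threshold or s.multi_step:
        return Route.TIER_2
    return Route.TIER_1
```

Under these assumptions the leave-ranking query from the figure takes Route A even at high intent confidence, because the scope signals are evaluated before the confidence check.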

The published story is also the part that worked first try. The four things below are what we found out after we shipped.

3. Where it breaks

Four classes of failure showed up in production, in roughly the order of severity we hit them. Three are now solved or being solved. The fourth is still open work.

3.1 The aggregation trap

The first one is the one every team will hit, and it's the one that taught us the deepest lesson about what routing actually means.

A manager with fifty direct reports asks the system: “Show me who has applied for the most leaves over the last year.”

On the surface, this is a clean intent. Leave is a known category. Applied is a known verb. Last year is a known temporal qualifier. The Tier 1 classifier looks at the sentence and sees nothing conceptually unusual. It tags it as a leave-query and routes to the standard leave path — the same path that handles “what's my leave balance?” and “did my December leave application get approved?”

That path is built around a per-employee tool. It expects to look up one person's leave records. The executor, handed this aggregation query, does the most literal thing a model can do — it tries to iterate. Fifty employees, fifty tool calls, fifty leave records to read and summarise. Two things happen, neither of them survivable:

  • Wrong totals. Asking a language model to consume fifty partial JSON responses and emit a correctly ranked aggregation produces a result that looks plausible but is wrong. Numbers drop, names get reattributed, ranks drift. The model doesn't know it's hallucinating a total because the inputs technically exist.
  • Rate limits. Fifty parallel tool calls against the same HR backend hit throttling and back-pressure mechanisms that exist precisely to stop this. The first few queries succeed; the rest fail or get queued long enough that the user has given up on the answer.

The same shape appears, unchanged, in performance: “Who has been the lowest performer in the last three years combined?” The classifier sees performance; the executor tries to iterate; the answer is wrong and slow. We saw it in attendance. We saw it in expenses. Anywhere a query implicitly says many employees or long time window or aggregate, the per-entity routing path collapses.

Four ways out — A, B, C, D

When we sat down to fix this, we found four solution shapes. The first three are real architecture; the fourth is the BI escape hatch. We ended up using a combination, and the choice of which is interesting in itself.

Option A — Expose aggregate tools. Add tools that return pre-aggregated answers in one call: a getTeamLeaveAggregate that takes a manager ID, a date range, a group-by and an order-by; a getPerformanceRanking with the same shape. The executor calls one tool. The HRMS does the aggregation in SQL where it belongs. The model never sees fifty raw records.
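A minimal sketch of what such an aggregate tool could look like, using Python's built-in `sqlite3` so it runs on its own. The table and column names, the demo data, and the fixed group-by (per employee, ranked by number of applications) are assumptions for illustration; the real tool's schema and its group-by and order-by parameters are not shown in this article.

```python
import sqlite3


def get_team_leave_aggregate(conn, manager_id, start, end, limit=10):
    """Rank a manager's reports by leave applications in [start, end).

    The counting, grouping, and ordering all happen in SQL, so the caller
    receives one small, already-ranked result instead of raw records.
    """
    return conn.execute(
        """
        SELECT e.name, COUNT(l.id) AS applications
        FROM employees AS e
        LEFT JOIN leave_applications AS l
          ON l.employee_id = e.id
         AND l.start_date >= ? AND l.start_date < ?
        WHERE e.manager_id = ?
        GROUP BY e.id, e.name
        ORDER BY applications DESC, e.name ASC
        LIMIT ?
        """,
        (start, end, manager_id, limit),
    ).fetchall()


def build_demo_db():
    """In-memory database with a hypothetical schema and made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
        CREATE TABLE leave_applications (
            id INTEGER PRIMARY KEY, employee_id INTEGER, start_date TEXT);
        INSERT INTO employees VALUES (1, 'Asha', 100), (2, 'Ben', 100), (3, 'Chen', 200);
        INSERT INTO leave_applications (employee_id, start_date) VALUES
            (1, '2025-02-10'), (1, '2025-07-01'),
            (2, '2025-03-15'), (2, '2024-12-20'),
            (3, '2025-05-05');
        """
    )
    return conn
```

Calling `get_team_leave_aggregate(build_demo_db(), 100, "2025-01-01", "2026-01-01")` returns one ranked row per report of manager 100. The executor makes a single tool call no matter how many reports the manager has.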

Option B — Text-to-GraphQL on aggregations. Let the executor compose a single GraphQL query against the existing schema that expresses the aggregation. The platform exposes around 1,429 GraphQL operations; teaching the executor to compose aggregate queries against them is a natural extension. The upside is full expressivity. The downside is harder safety-gating, because now you're executing model-composed queries against a real transactional database.

Option C — Add a scope classifier. This is the option that fits the routing narrative most directly, and it's the architectural insight underneath the whole article. The original mistake was treating intent as the only axis the classifier needed to reason about. Scope is a second axis, distinct from intent. A classifier that emits not just an intent label but also a cardinality (one entity / a team / the org), a temporal range (now / a window / many years), and an aggregation requirement (none / count / rank / compare) can refuse to send a high-scope query down the per-entity loop path regardless of how simple the intent looked.
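The three scope dimensions map naturally onto enumerations, with a guard the router can consult before choosing the per-entity path. This is a sketch in Python under assumed names (`Scope`, `allows_per_entity_path`). Treating a many-year window as high scope, even for a single entity, is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Cardinality(Enum):
    ONE_ENTITY = "one_entity"
    TEAM = "team"
    ORG = "org"


class TemporalRange(Enum):
    NOW = "now"
    WINDOW = "window"
    MANY_YEARS = "many_years"


class Aggregation(Enum):
    NONE = "none"
    COUNT = "count"
    RANK = "rank"
    COMPARE = "compare"


@dataclass(frozen=True)
class Scope:
    cardinality: Cardinality
    temporal_range: TemporalRange
    aggregation: Aggregation


def allows_per_entity_path(scope: Scope) -> bool:
    """True only when a per-entity tool lookup is safe for this scope."""
    return (
        scope.cardinality is Cardinality.ONE_ENTITY
        and scope.temporal_range is not TemporalRange.MANY_YEARS
        and scope.aggregation is Aggregation.NONE
    )
```

With this guard, “what's my leave balance?” (one entity, now, no aggregation) passes, while the fifty-report leave ranking (team, window, rank) is refused regardless of how the intent was labelled.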

Option D — Pre-computed analytics views. For known analytical queries — top ten by leave usage, lowest performers, expense outliers — pre-compute nightly and serve from cache. This is the BI-style fallback. Cheap to serve, expensive to build initially, only viable for queries that are common and predictable.

What we actually shipped

We're rolling out Option A, one domain at a time. Leave aggregation tools are live in production. Performance aggregation tools are in flight. Attendance and expenses are queued behind them. Each one collapses an entire class of wrong-total + rate-limit failure into a single deterministic tool call that the database is happy to serve.

Option C, the scope classifier, is the change we wish we had made on day one. We're retrofitting it now. The aggregate tools work without it, but having the router actively refuse to send a fifty-report question down a per-employee path is the architectural fix, not just a per-domain workaround.

Option B sits as the long-tail extension once the safety model around model-composed queries is settled. Option D is interesting to a BI product, not to a conversational one.

The lesson generalises. The cost classifier was trying to answer “is this simple or complex?” The real question turned out to be “what shape does the answer have?” Routing on intent without routing on scope is the architectural bug that produced every aggregation incident we've seen.

3.2 The threshold problem

The classifier emits a confidence score. Above some threshold, the query stays at Tier 1; below it, escalation. The temptation is to find the right number once and ship it.

No such number exists across tenants. The right threshold drifts with the customer's query mix. A client whose employees mostly ask balance-and-status questions wants the threshold low: let Tier 1 handle nearly everything and escalate only the genuinely strange. A client with managers who routinely ask multi-entity analytical questions wants the threshold higher: escalate aggressively and eat the Tier 2 cost, because the wrong answer is worse than the expensive answer. A client onboarding their first cohort tends toward a broader, less predictable query distribution and benefits from being conservative (a higher threshold) until the patterns stabilise.

In practice the threshold is per-tenant, observed rather than chosen. The data path that matters is a feedback loop on classifier confidence histograms versus downstream outcomes — when did the Tier 1 path return an answer that was correct, when did it return an answer that needed escalation later in the conversation. That signal drifts slowly enough that quarterly tuning is enough; fast enough that it can't be set-and-forget.
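One way to turn that feedback loop into a number is to replay a tenant's logged outcomes against a few candidate thresholds and keep the lowest one that still meets an accuracy target. The sketch below assumes a hypothetical `Outcome` record, a 95% target, and a fixed candidate list. It follows the convention above that confidence at or above the threshold stays at Tier 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    confidence: float    # Tier 1 confidence at routing time
    tier1_correct: bool  # did the Tier 1 answer hold up downstream?


def pick_threshold(outcomes, target_accuracy=0.95,
                   candidates=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Lowest candidate whose at-or-above-threshold traffic meets the target.

    Returns None when no candidate qualifies, which a caller might treat
    as "escalate everything for this tenant until more data arrives".
    """
    for threshold in sorted(candidates):
        kept = [o for o in outcomes if o.confidence >= threshold]
        if not kept:
            continue  # no traffic in this band, so nothing to measure
        accuracy = sum(o.tier1_correct for o in kept) / len(kept)
        if accuracy >= target_accuracy:
            return threshold
    return None
```

A tenant whose Tier 1 answers hold up across the whole confidence range lands on the lowest candidate. A tenant whose low-confidence Tier 1 answers keep needing later escalation is pushed to a higher one. A real version would also want a minimum sample size per band before trusting the measured accuracy.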

3.3 Multi-turn context drift

The classifier is stateless on each turn. Turn one comes in, gets classified as simple, gets answered by Tier 1. Turn two comes in as a follow-up: “what if I took five days next month instead?” The classifier looks at the second turn in isolation, sees vague conditional language, decides it's probably simple, and routes to Tier 1 again. Tier 1 doesn't have the conversational context that established what the user was asking about. The answer is bad in a way the user cannot diagnose.

The naive fix is to pass full conversation history into every classification. That works and is slow. The cleaner fix is to pass routing history — a compact summary of what tier handled which turn and what intent family the conversation has been about. The classifier then has the signal it needs without re-reading the conversation every turn.
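A routing history of that kind can be as small as a capped list of (tier, intent family) pairs rendered into one line of classifier input. The Python sketch below is illustrative only: the record fields, the five-turn cap, and the summary string format are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TurnRecord:
    tier: int           # which tier answered the turn: 1 or 2
    intent_family: str  # e.g. "leave"


@dataclass
class RoutingHistory:
    turns: list = field(default_factory=list)
    max_turns: int = 5  # keep the summary compact

    def record(self, tier: int, intent_family: str) -> None:
        self.turns.append(TurnRecord(tier, intent_family))
        self.turns = self.turns[-self.max_turns:]

    def summary(self) -> str:
        """One short line, oldest retained turn first."""
        if not self.turns:
            return "routing_history: none"
        parts = [
            f"t{i}:tier{t.tier}/{t.intent_family}"
            for i, t in enumerate(self.turns, start=1)
        ]
        return "routing_history: " + " ".join(parts)


def build_classifier_input(history: RoutingHistory, user_turn: str) -> str:
    # The classifier sees the summary plus the new turn, never the transcript.
    return f"{history.summary()}\nuser: {user_turn}"
```

The summary grows by a few tokens per turn rather than by the full text of each exchange, which is what keeps the classification call cheap.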

Multi-turn drift is the failure mode that's easiest to underestimate, because it doesn't produce spectacular incidents — it produces a quiet erosion of answer quality over the third, fourth, fifth turn of a serious conversation. The cost classifier was succeeding turn-by-turn. The conversation, as a whole, was losing resolution.

3.4 The suggestion-pill gap (open problem)

The fourth failure is the one we haven't shipped a fix for yet, and it's the one most worth describing honestly because it has implications for any team building this pattern.

Every answer in the Zero UI surface is paired with a row of suggestion pills: the contextual “you might want to ask next” follow-ups under each response. They're a substantial part of the user experience; people lean on them to discover what the system can do, and they shape the next turn of the conversation.

The cost classifier optimises the answer path. It does not optimise the suggestion path. So a query that gets handled at Tier 1 — because the classifier said it was simple — produces a perfectly good answer and visibly weaker suggestions. The suggestions are weaker because the model generating them had less context, less reasoning depth, less of the rich semantic frame the question deserved. The user gets a correct answer surrounded by follow-ups that don't feel as relevant as the ones they get on the same screen a moment later when a different query happens to escalate to Tier 2.

This is the cost lever paying part of its cost in downstream UX rather than in money. It's subtle and hard to spot until you sit in front of the product for an afternoon and watch the suggestion quality flicker between turns.

The pattern we're converging on

The approach we're designing — not yet shipped — is to decouple suggestion generation from the answer path entirely. The primary content streams from whichever tier handled the query, fast. A separate async pipeline generates the suggestion pills, always through Tier 2 (or, eventually, a dedicated small model trained specifically on the follow-up generation task).
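Since this design is not yet shipped, the following is only a sketch of the shape it could take, using Python's `asyncio`. The two model calls are stand-in coroutines with made-up delays, and the `emit` callback, the event names, and the two-second timeout are assumptions.

```python
import asyncio


async def answer_query(query: str, tier: int) -> str:
    """Stand-in for the answer path at whichever tier the router chose."""
    await asyncio.sleep(0.01)
    return f"[tier {tier}] answer to: {query}"


async def generate_suggestions(query: str) -> list:
    """Stand-in for the follow-up generator, assumed to always use Tier 2."""
    await asyncio.sleep(0.05)
    return [f"Follow-up about: {query}"]


async def respond(query: str, tier: int, emit) -> None:
    # Start suggestion generation immediately so it runs alongside the answer.
    suggestions_task = asyncio.create_task(generate_suggestions(query))

    answer = await answer_query(query, tier)
    emit(("answer", answer))  # the answer never waits on the suggestions

    try:
        pills = await asyncio.wait_for(suggestions_task, timeout=2.0)
    except Exception:
        pills = []  # a slow or failed generator degrades to no pills
    emit(("suggestions", pills))
```

Running `asyncio.run(respond("leave balance", 1, print))` emits the answer event first and the suggestions event afterwards. The answer's latency depends only on the answer path.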

The cost economics still work. Suggestion generation uses a fraction of the output tokens that answer generation does, so running it through Tier 2 even when the answer came from Tier 1 stays cheap. The UX economics improve meaningfully: users get the answer at Tier 1 speed and the follow-ups at Tier 2 quality, regardless of which tier handled the original query.

The architectural lesson, before we ship it, is the one worth naming. Cost routing is a decision about answering. Suggestion generation is a different cognitive task. Forcing them to share a routing decision makes one of them worse than it needs to be. The shipping version of the architecture is one where the two run on parallel tracks and reunite at the response surface.

4. What this taught us

The headline thing we believed when we shipped this was that dual-model routing was one classifier doing one job. The headline thing we believe now is that cost routing in production is a family of classifiers, each operating on a different dimension, and that getting any one of them wrong produces a class of failure the other classifiers cannot catch.

The dimensions, in the order they bit us:

  • Intent. What is the user trying to do? This is the classifier that gets the press.
  • Scope. How many entities, over what time window, with what aggregation requirement? This is the classifier we didn't know we needed.
  • Conversational context. Where is this turn in the arc of a multi-turn exchange? What did earlier turns establish?
  • Downstream UX. What does the answer need to be surrounded by? The cost decision on the answer is not a cost decision on the suggestions, and pretending it is degrades the experience.

Stacking classifiers is the part of this that's actually hard. Each one is cheap on its own. Each one catches a class of failure the others miss. The architectural shape that emerges is not Tier 1 → Tier 2. It's a small mesh of cheap classifiers feeding into two model tiers and at least one async generator, coordinated by a routing layer that treats each classifier's output as a separate dimension of the decision.

5. What we'd build differently

If we were starting again, three things would be in the first version, not retrofitted later.

A scope classifier from day one. Cardinality, temporal range, and aggregation requirement emitted alongside the intent. The router treating high-scope queries as a distinct path whose default destination is an aggregate tool, not a per-entity loop.

Per-tenant threshold tuning observability. A dashboard from launch that shows the confidence distribution per tenant and the downstream success rate at each threshold band. The instinct to find one number that works is a false economy; the threshold needs to be observed, not chosen.

Suggestion generation on its own track. Decoupled from the answer path. Always at Tier 2 quality. Async to the primary response. The cost saved on the answer should not become a cost paid by the follow-up surface.

On paper, cost routing is one classifier doing one job. In production, it is several classifiers, each catching what the others miss.

The cost-lever framing was true. It was also dangerously incomplete. The version of dual-model routing that ships cleanly into an enterprise product is the version that stops pretending one classifier can carry the architecture on its own.

