
Architecture · Agentic AI · Long-form

Seven agents, one timesheet

How a 7-agent pipeline replaced a back-office workflow for a global energy & engineering workforce firm operating across 45+ countries.

May 2026 · ~12 min read

“Tens of thousands of contractors. 45+ countries. A new timesheet email every minute. The back office could not scale another headcount — so we built seven agents instead.”

1. The math that forces agents

Workforce firms in energy & engineering live on a paradox. Their margin comes from the contractors they place. Their cost comes from the back office that pays them.

A timesheet in this industry is rarely a clean web form. It arrives as an email attachment — often a PDF, sometimes a scanned photo from a rig, occasionally a sideways fax with a signature in pen. The numbers on it determine real money: a week of work at a UOM-specific rate code, multiplied by tens of thousands of active contractors, multiplied by the dozens of countries the firm operates in.

Process this manually and the back office becomes a perpetual hire. Each timesheet costs minutes; each error costs hours of downstream reconciliation. Volume scales linearly with the business, and so does back-office headcount unless you stop processing timesheets by hand.

A single LLM call against an email attachment is not the answer. One prompt that does “read the document and book the hours” gives you confident, plausible, sometimes catastrophically wrong output. Hallucinated contractor names. Hours on a rate code the contractor isn't entitled to. A rotated page that the model read upside down and never flagged.

The answer is to break the work into specialists. Each agent does one thing. Each agent knows what good and bad look like for its own step. Each agent emits a structured result that the next agent — and a human reviewer — can inspect. That's the shape of the system below.

2. The pipeline, end to end

Seven agents process every timesheet — but only four of them are LLMs. The other three are deterministic on purpose.

Pipeline · one timesheet, seven agents

Email inbox: a contractor PDF lands in the shared mailbox. Webhook with HMAC-signed payload → BullMQ email-ingestion queue.

  • Agent 1 · Email classifier (GPT-4o). Is this even a timesheet? Filters noise: auto-replies, bounce notifications, signature receipts, accidental forwards.
  • Agent 2 · Vision extractor (GPT-4o). OCR + vision read of the attachment. Tesseract OSD detects page orientation and auto-rotates before the LLM sees it, so a sideways faxed PDF no longer produces hallucinated output. On failure: short-circuit → downstream agents stamped 'skipped' (no false-green).
  • Agent 3 · Structured parser (GPT-4o). Raw text → typed TimesheetData. Schema-driven extraction into rows, dates, hours, rate codes, project codes, contractor identifiers.
  • Agent 4 · Contractor resolver (Postgres). Name → contractor_id. Deterministic Postgres lookup with trigram fuzzy match. Not an LLM. The model can hallucinate a name; it can't hallucinate a primary key.
  • Agent 5 · Business-rule validator (Logic). Tenant rules, dates, anomalies. Pure Python: working-day counts, weekend flags, missing-day detection, hours-vs-rate-code consistency.
  • Agent 6 · Rate-code mapper (GPT-4o-mini). Map each contractor rate code to a row. A cheap second pass: the hard pattern matching the LLM is actually good at, but at 1/15th the cost of the main model. Floors weak matches at confidence 0.6 → surfaces ambiguities instead of guessing.
  • Agent 7 · Confidence router (Logic). Approve · Review · Reject. Per-tenant thresholds. The agent that decides what humans see.

Outcomes:

  • Approved → Auto-export. Goes straight to Navision on the next scheduled run.
  • Manual review → Human queue. Verification screen with per-agent click-to-detail drawer.
  • Rejected → Sender notified. Templated reply with the failing reason.

Built on LangGraph (Python 3.12), Azure OpenAI GPT-4o for vision + structured extraction, GPT-4o-mini for the rate-code second pass, Fastify 5 + BullMQ for queue durability, PostgreSQL 16 for state and audit, OpenTelemetry traces across every hop.

A timesheet enters via an HMAC-signed webhook (5-minute replay window, rolling two-secret rotation). The webhook persists an EmailIngestion record and enqueues a BullMQ job. Durability is non-negotiable: between the inbox and the contractor's payslip, nothing can be lost.
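A minimal sketch of what a replay-window check with two-secret rotation could look like. It is written in Python for consistency with the agent tier, although the stack description places the webhook in the Fastify service. The signed message layout (`<timestamp>.<body>`), the function names, and the use of HMAC-SHA256 are assumptions for illustration, not details taken from the production system.

```python
import hashlib
import hmac
import time

REPLAY_WINDOW_SECONDS = 5 * 60  # the 5-minute replay window


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Hex HMAC-SHA256 over "<timestamp>.<body>" (assumed message layout)."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_webhook(body, timestamp, signature, secrets, now=None) -> bool:
    """Accept a request only if it is fresh and signed by a known secret.

    `secrets` holds the current and the previous secret, so the sender can
    rotate without a gap in which valid requests would be rejected.
    """
    now = time.time() if now is None else now
    if abs(now - timestamp) > REPLAY_WINDOW_SECONDS:
        return False  # outside the window: treat as a possible replay
    return any(
        # compare_digest avoids leaking the match position through timing
        hmac.compare_digest(sign(s, timestamp, body).encode(), signature.encode())
        for s in secrets
    )
```

Binding the timestamp into the signed message is what makes the window meaningful: an attacker who replays an old body cannot simply attach a fresh timestamp without invalidating the signature.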

3. The agent that isn't an LLM

The fourth agent in the pipeline is the most important one for accuracy. It resolves a contractor name into a contractor primary key. The temptation, especially early in any agentic project, is to let the LLM do it. It's text matching. The model is great at text matching.

Don't. The model can hallucinate a name. It cannot hallucinate a primary key.

The contractor resolver is a deterministic PostgreSQL query with a pg_trgm fuzzy-match. Country-scoped. Returns a confidence score based on similarity, not LLM self-reporting. If the score is below a threshold, the agent fails the step explicitly and the next agents see step_status: failed — they don't paper over the gap.
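The resolver's behaviour can be sketched roughly as follows. The SQL string shows the kind of country-scoped `pg_trgm` query the text describes; the table name, column names, placeholder style, and the 0.5 threshold are hypothetical. The Python functions approximate trigram similarity in memory (splitting on whitespace only, which is simpler than what `pg_trgm` does) so the fail-explicitly logic can be run without a database.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative query only; table and column names are assumptions.
# similarity() is provided by the pg_trgm extension.
RESOLVE_SQL = """
SELECT id, similarity(full_name, :name) AS score
FROM contractors
WHERE country_id = :country_id
ORDER BY score DESC
LIMIT 1
"""


def trigrams(text: str) -> set:
    """Approximate pg_trgm: lowercase, pad each word, take 3-char windows."""
    grams = set()
    for word in text.lower().split():
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


@dataclass
class Resolution:
    step_status: str              # "ok" or "failed"
    contractor_id: Optional[int]
    confidence: float             # similarity score, not LLM self-report


def resolve_contractor(name, country_id, register, threshold=0.5):
    """register: iterable of (contractor_id, country_id, full_name) tuples."""
    best_id, best_score = None, 0.0
    for cid, country, full_name in register:
        if country != country_id:
            continue  # country-scoped: other registers are never considered
        score = similarity(name, full_name)
        if score > best_score:
            best_id, best_score = cid, score
    if best_id is None or best_score < threshold:
        return Resolution("failed", None, best_score)
    return Resolution("ok", best_id, best_score)
```

Note that a perfect match in another country's register is ignored by design; the step fails rather than borrowing an identity from the wrong tenant.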

Rule of thumb: if a task has a ground-truth table, route to the table. Reserve the LLM for the parts where there isn't one.

4. The cheap second pass

The sixth agent does something interesting: it runs a cheaper model on a narrower task. After the main pipeline has extracted the structured timesheet, a second pass (gpt-4o-mini) matches each contractor rate code to at most one row in the timesheet.

Rate-code matching is the part of the workflow where LLMs actually shine: it's soft pattern matching across labels that humans write inconsistently. But it's also the part where running the most expensive model in the chain would be a waste. The mini model handles it at roughly 1/15th the cost of the main model.

Two safety nets sit on top of the LLM:

  • A confidence floor. The model self-reports a confidence for each match. Anything below 0.6 is dropped to null and surfaced as an ambiguity in the result envelope. The downstream router will route the timesheet to manual review rather than guess.
  • Per-UOM business rules. Once matched, the row's unit of measure (UOM) determines the final number: BASICM → 1 unconditionally, BasicD → working-day count, MONTH → 1 if the row has data, HOUR/AMOUNT → row total verbatim. The LLM proposes the match; deterministic rules pick the number.
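The two safety nets can be sketched as a pair of small functions. The dictionary shape of a match, the function names, the case-insensitive comparison of UOM codes, and returning 0 for a MONTH row without data are assumptions made for the example; only the 0.6 floor and the four UOM rules come from the description above.

```python
CONFIDENCE_FLOOR = 0.6  # matches below this are dropped to null


def apply_floor(matches):
    """Split model-proposed matches into kept matches and ambiguities.

    matches: list of dicts such as
        {"rate_code": "DAY-STD", "row": 3, "confidence": 0.82}
    """
    kept, ambiguities = [], []
    for match in matches:
        if match["confidence"] < CONFIDENCE_FLOOR:
            # Keep the record, but null the row so nothing is booked on it.
            ambiguities.append({**match, "row": None})
        else:
            kept.append(match)
    return kept, ambiguities


def quantity_for(uom, row_total, working_days, has_data=True):
    """Deterministic quantity for a row that has already been matched."""
    uom = uom.upper()  # assumption: UOM codes compare case-insensitively
    if uom == "BASICM":
        return 1
    if uom == "BASICD":
        return working_days
    if uom == "MONTH":
        return 1 if has_data else 0  # assumption: no data means 0
    if uom in ("HOUR", "AMOUNT"):
        return row_total
    raise ValueError(f"no rule for UOM {uom!r}")
```

Raising on an unknown UOM, rather than falling back to the row total, keeps the arithmetic from silently guessing when a new code appears in a country's rate library.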

Pattern to steal: small model for the soft match; deterministic rules for the arithmetic that pays the contractor.

5. The confidence router — where the savings live

If you only remember one piece of architecture from this post, make it the seventh agent.

The confidence router takes the assembled result — extractedData plus every agent's confidence and status — and produces one of three outcomes:

  • Approved. Confidence above the tenant threshold across every step. Goes straight to Navision on the next scheduled export run. The human only sees the audit trail.
  • Manual review. One or more steps flagged warn/ambiguous. Lands in the verification queue. The reviewer sees a 4-state indicator per agent (ok / warn / failed / skipped); clicking opens a per-agent drawer with the inputs and the reasoning.
  • Rejected. Pipeline detected a hard failure: non-timesheet email, unreadable scan, contractor missing from the country's register. Templated reply goes back to the sender; no human touches it.

Crucially, the thresholds are tenant-configured — stored in system_configs rather than hardcoded — so a strict country can tighten the bar without a redeploy. Tune one number, ship a measurable change in the auto-approve rate the next day.
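A rough sketch of three-way routing with per-tenant thresholds. The `StepResult` shape, the step names, the 0.9 fallback threshold, and the rule that any failed or skipped step means rejection are assumptions chosen to match the three outcomes described above; the real router may weigh steps differently.

```python
from dataclasses import dataclass

HARD_FAILURES = {"failed", "skipped"}


@dataclass
class StepResult:
    step: str
    status: str        # "ok" | "warn" | "failed" | "skipped"
    confidence: float  # 0.0 to 1.0; deterministic steps can report 1.0


def route(results, thresholds, default_threshold=0.9):
    """Return "approved", "manual_review" or "rejected".

    thresholds: per-step minimum confidence for one tenant, for example
    loaded from system_configs. default_threshold is a hypothetical
    fallback for steps the tenant has not configured.
    """
    if any(r.status in HARD_FAILURES for r in results):
        return "rejected"
    for r in results:
        minimum = thresholds.get(r.step, default_threshold)
        if r.status == "warn" or r.confidence < minimum:
            return "manual_review"
    return "approved"
```

Because the thresholds arrive as data, the same results can be approved for one tenant and sent to review for a stricter one without touching the routing code.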

The savings don't come from the agents being smart. They come from the router knowing what the agents don't know.

6. Observability isn't a feature. It's the product.

A reviewer in the back office does not trust a system that shows a green checkmark and a number. They trust one that shows why it produced the number.

Every agent in the pipeline emits an AgentResult into a LangGraph state reducer: step name, status, inputs, decisions, reasoning. The verification screen reads that reducer and renders one indicator per step. Click an indicator and a drawer slides in with everything the agent saw and everything it decided.

This is enforced by a small piece of architecture nobody celebrates: a short-circuit node. When the vision extractor fails (for example, on a sideways image with no readable text), the node skips the remaining agents and stamps them skipped rather than letting them fire on empty input and silently succeed. The bug it fixes is “cascading green”: downstream LLMs returning confident outputs based on nothing.
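The short-circuit behaviour can be illustrated without LangGraph as a plain loop. This is a simplified sketch, not the production graph: the `AgentResult` fields follow the list above, while `run_pipeline`, the step signature, and the choice to short-circuit on any failed step (not only the vision extractor) are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class AgentResult:
    step: str
    status: str                                   # "ok" | "warn" | "failed" | "skipped"
    inputs: dict = field(default_factory=dict)
    decisions: dict = field(default_factory=dict)
    reasoning: str = ""


def run_pipeline(steps, payload):
    """Run (name, fn) steps in order and stop executing after a failure.

    Each fn takes (payload, results_so_far) and returns an AgentResult.
    Steps after a failed one are never called; they are recorded as
    skipped so the reviewer sees a gap, not a false green.
    """
    results = []
    failed_step = None
    for name, fn in steps:
        if failed_step is not None:
            results.append(AgentResult(
                name, "skipped",
                reasoning=f"upstream step {failed_step!r} failed",
            ))
            continue
        result = fn(payload, results)
        results.append(result)
        if result.status == "failed":
            failed_step = name
    return results
```

The important property is that a skipped step produces a result record of its own, so the verification screen still renders one indicator per agent.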

Across the stack, OpenTelemetry traces propagate from the Fastify API through W3C traceparent headers into the FastAPI AI service. A single trace renders end-to-end in Jaeger — from webhook receipt to Navision export. When something goes wrong at 2am, the trace already has the answer.
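For readers unfamiliar with what actually crosses the service boundary, here is a small sketch of the W3C `traceparent` header (version `00`): a 32-hex trace id, a 16-hex parent span id, and a flags byte. In practice the OpenTelemetry SDK propagators read and write this header; the helper names below are hypothetical and exist only to show the format.

```python
import re
import secrets

# Version "00": 32 hex trace-id, 16 hex parent-id, 2 hex flags, lowercase.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header):
    """Return (trace_id, parent_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # all-zero ids are invalid
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def child_traceparent(header):
    """Keep the incoming trace id and mint a new span id for the next hop."""
    parsed = parse_traceparent(header)
    if parsed is None:
        # Assumption for the sketch: start a fresh, sampled trace.
        trace_id, sampled = secrets.token_hex(16), True
    else:
        trace_id, _, sampled = parsed
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
```

The trace id staying constant across hops is what lets a single trace render end to end; only the span id changes at each service.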

Most agentic systems hide the agents. The product is the opposite — it surfaces them.

7. One platform, many countries

The workforce firm operates across 45+ countries. Each one has its own rate code library, its own labour law, its own working week (Sunday-start vs Monday-start), its own export target inside Navision, and its own threshold for what auto-approve should mean.

A tenant in this system is, semantically, a country. Every authenticated request carries two values: the user's home country (where they administratively sit) and an active country (which one they're operating in right now, persisted in local storage and attached as an X-Country-Id header). Every tenant-scoped database query filters by the active value, not the home one. PostgreSQL row-level security enforces it at the DB tier: defence in depth against a missed WHERE clause.
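One way the two layers could fit together is sketched below. The SQL strings use real PostgreSQL features (`CREATE POLICY`, `current_setting`, `set_config`), but the table name, the `app.country_id` setting, integer country ids, the allowed-country set, and the fallback to the home country when the header is absent are all assumptions for illustration.

```python
# Illustrative DDL; table, column and setting names are assumptions.
RLS_POLICY_SQL = """
ALTER TABLE timesheets ENABLE ROW LEVEL SECURITY;
CREATE POLICY country_isolation ON timesheets
    USING (country_id = current_setting('app.country_id')::int);
"""

# Run at the start of each transaction with str(country_id) as the parameter;
# the third argument (is_local = true) scopes the setting to that transaction.
SET_COUNTRY_SQL = "SELECT set_config('app.country_id', %s, true)"


class CountryAccessError(Exception):
    """The request named a country the user may not operate in."""


def active_country(headers, home_country_id, allowed_country_ids):
    """Pick the country a request operates in, from X-Country-Id.

    Falls back to the user's home country when the header is absent
    (an assumption) and rejects countries outside the user's allowed set.
    """
    normalised = {key.lower(): value for key, value in headers.items()}
    raw = normalised.get("x-country-id")
    if raw is None:
        return home_country_id
    try:
        country_id = int(raw)
    except ValueError:
        raise CountryAccessError(f"invalid X-Country-Id: {raw!r}")
    if country_id not in allowed_country_ids:
        raise CountryAccessError(f"country {country_id} not permitted")
    return country_id
```

Validating the header in the application and filtering again in the policy is deliberate redundancy. Note that PostgreSQL does not apply row security to the table owner unless `FORCE ROW LEVEL SECURITY` is set, so the application role should not own the tables.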

Onboarding a new country is a database row + a configuration file + a connection test. No code change. Two countries went live first; the pattern scales to the next forty.

8. What we learned

  • Specialised > general. Seven agents that each do one thing well outperform one prompt that tries to do everything, even when “the everything prompt” uses the smartest model available. Bound the problem, bound the failure.
  • Not every agent is an LLM. Of the seven, three are deterministic. The most accuracy-critical one — contractor resolution — is a Postgres lookup. The cheapest one to scale linearly with volume is also non-LLM. The LLMs are the soft layer; the spine is deterministic.
  • Cheap second pass beats expensive one shot. Pulling out one narrow task and giving it to a smaller model routinely saves an order of magnitude. The “route-by-task” pattern compounds.
  • The router is the system. Confidence-based three-way routing (approve / review / reject) is what makes the economics work. Tune one threshold, change the auto-approve rate, change the back-office headcount.
  • Show your work. Per-agent observability — click an indicator, see the reasoning — converts “trust the AI” into “verify the AI in five seconds.” Reviewers process more, faster, with less anxiety.
  • No silent failures. Short-circuit on the failing step. Stamp downstream agents skipped. Cascading green is the failure mode that kills trust fastest.
  • Queue durability is the unsexy invariant. HMAC-signed webhook + BullMQ retries + stale-job detector means nothing in flight is lost. The customer doesn't see this; their finance team would see the absence of it.

9. Outcomes

Projected savings: $1M+ annually from compressed back-office processing alone, before the downstream reconciliation savings. Two countries live; the pipeline is identical for every additional one.

The deeper outcome is structural. The firm now processes contractor timesheets as a software workload, not a staffing workload. The line on their cost sheet that scales with revenue is engineering, not headcount.

10. What's next

  • Country fan-out. Extend from the two countries currently live to the full operating footprint. Most of the engineering is configuration; some is per-country rate-code library population.
  • Anomaly detection on the export side. A learned baseline of “normal” per-contractor hours so a 3-sigma deviation triggers manual review even when every individual step is confident.
  • Per-country model routing. Some countries see workloads where a cheaper model is obviously sufficient; the rate-code-mapper pattern extends to the vision step too.
  • Conversational verification. A reviewer who finds an error doesn't just fix the row — they tell the system why, and the next batch benefits from the correction. This is where this platform starts to look like its sibling HRMS product.

The agent is not the product. The router is the product. The reviewer's trust in what the router did is the product.

Build the spine deterministic. Reserve the LLM for the soft middle. Make the seams visible. Let the threshold do the work.

