AI · Enterprise · 2025

An evaluated AI-agent platform handling 3.4 million ops decisions per month.

Replaced 14 legacy operational decision systems with a single agent platform. Confidence routing keeps humans in the loop where it matters.

Client: Halcyon Operations
Vertical: Enterprise AI / operations
Region: United States
Duration: 11 months
Team: 3 engineers · 1 AI researcher · 1 designer
Role: AI platform · evaluation infra · workflow design
Overview

What we walked into.

Halcyon ran an operations group making 3.4 million decisions per month across 14 distinct rule-based systems: claims triage, fraud detection, refund eligibility, vendor approval, and more. Each system had its own team, its own thresholds, and its own inscrutable failure modes.

The brief: replace fourteen ageing decision systems with a single agent-based platform, without losing the institutional knowledge encoded in their rules and without the platform becoming an equally opaque black box at a higher cost. Auditability and consistency were non-negotiable.

Approach

What we built.

01

Evaluation infrastructure first

Before we shipped a single agent, we built an evaluation harness against the existing systems' historical decisions. Every model and prompt change was scored against twelve months of real labelled data, with results reported for accuracy, drift, and fairness slices.
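To make the idea concrete, here is a minimal sketch of slice-level scoring against labelled historical decisions. It is an illustration only, not the harness built for Halcyon: the `LabelledCase` fields, the `slice_key` grouping, and the `score_by_slice` function are hypothetical names, and the sketch covers accuracy only, not drift.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable


@dataclass(frozen=True)
class LabelledCase:
    """One historical decision (all field names are hypothetical)."""
    features: Dict[str, Any]  # input the legacy system saw
    expected: str             # decision the legacy system recorded
    slice_key: str            # assumed grouping for a fairness slice


def score_by_slice(
    cases: Iterable[LabelledCase],
    decide: Callable[[Dict[str, Any]], str],
) -> Dict[str, float]:
    """Return accuracy overall and per slice for a candidate decision function."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for case in cases:
        hit = decide(case.features) == case.expected
        # Count each case once overall and once in its own slice.
        for key in ("overall", case.slice_key):
            totals[key] += 1
            correct[key] += int(hit)
    return {key: correct[key] / totals[key] for key in totals}


if __name__ == "__main__":
    history = [
        LabelledCase({"amount": 50}, "approve", "retail"),
        LabelledCase({"amount": 500}, "review", "retail"),
        LabelledCase({"amount": 20}, "approve", "wholesale"),
        LabelledCase({"amount": 900}, "approve", "wholesale"),
    ]
    candidate = lambda f: "approve" if f["amount"] < 100 else "review"
    print(score_by_slice(history, candidate))
```

Running the same scoring on every model or prompt change is what lets a regression in one slice show up even when the overall number holds steady.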

02

Confidence routing

Each decision request runs through a small ensemble. When models agree above a configurable threshold, the decision is auto-applied. Below it, the request routes to a human reviewer with the model's reasoning attached.
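The routing rule can be sketched as follows. This is an assumption-laden illustration rather than the production logic: it measures agreement as the share of ensemble members backing the most common decision, treats a score exactly at the threshold as passing, and uses hypothetical names (`Vote`, `Routed`, `route`) and a hypothetical default threshold of 0.8.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Vote:
    """One ensemble member's answer (hypothetical shape)."""
    model: str
    decision: str
    reasoning: str


@dataclass(frozen=True)
class Routed:
    action: str                      # "auto_apply" or "human_review"
    decision: Optional[str]          # set only when auto-applied
    agreement: float                 # share of votes for the top decision
    reasoning: List[str] = field(default_factory=list)


def route(votes: Sequence[Vote], threshold: float = 0.8) -> Routed:
    """Auto-apply when agreement meets the threshold, else send to a reviewer."""
    if not votes:
        raise ValueError("at least one vote is required")
    top_decision, count = Counter(v.decision for v in votes).most_common(1)[0]
    agreement = count / len(votes)
    if agreement >= threshold:
        return Routed("auto_apply", top_decision, agreement)
    # Below the threshold: attach every model's reasoning for the reviewer.
    notes = [f"{v.model}: {v.decision} ({v.reasoning})" for v in votes]
    return Routed("human_review", None, agreement, notes)


if __name__ == "__main__":
    split = [
        Vote("model_a", "approve", "matches refund policy"),
        Vote("model_b", "approve", "low amount"),
        Vote("model_c", "deny", "duplicate claim suspected"),
    ]
    print(route(split))
```

Keeping the threshold as a parameter means each decision type can trade automation rate against reviewer load without a code change.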

03

Auditable trace

Every decision, automated or routed, is stored with its full input, prompt, model output, confidence, and any reviewer override. The operations dashboard supports full replay, blame, and trend analysis over those records.
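A trace record along these lines might look like the sketch below. The fields mirror the ones named above; everything else is an assumption for illustration, including the `DecisionTrace` name, the `decision_id` and `routed_to_human` fields, and the JSON serialisation (the storage schema actually used is not described here).

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class DecisionTrace:
    """Everything needed to replay one decision (hypothetical schema)."""
    decision_id: str
    input: Dict[str, Any]
    prompt: str
    model_output: str
    confidence: float
    routed_to_human: bool
    reviewer_override: Optional[str] = None

    def final_decision(self) -> str:
        """A reviewer override, when present, wins over the model output."""
        if self.reviewer_override is not None:
            return self.reviewer_override
        return self.model_output


def to_row(trace: DecisionTrace) -> str:
    """Serialise to a stable JSON string, e.g. for a JSON column."""
    return json.dumps(asdict(trace), sort_keys=True)


def from_row(row: str) -> DecisionTrace:
    """Rebuild the trace exactly as stored, for replay or audit."""
    return DecisionTrace(**json.loads(row))


if __name__ == "__main__":
    trace = DecisionTrace(
        decision_id="d-001",
        input={"claim_amount": 120},
        prompt="Decide refund eligibility.",
        model_output="approve",
        confidence=0.62,
        routed_to_human=True,
        reviewer_override="deny",
    )
    restored = from_row(to_row(trace))
    print(restored.final_decision())
```

Storing the model output and the override as separate fields, rather than overwriting one with the other, is what makes disagreement between model and reviewer visible in later trend analysis.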

Outcome

What shipped.

Halcyon achieved 99.7% eval accuracy against the held-out historical set, with 78% of decisions automated and the remaining 22% routed to a much smaller human-review team. The fourteen legacy systems were retired in the eight months following launch.

3.4M decisions per month
14 systems replaced
99.7% eval accuracy
78% automated rate
What surprised me most is that the platform got more accurate after launch — because the eval harness made every change measurable. We'd never had that before.
Aditi Krishnan, Director of Operations, Halcyon
Stack
  • TypeScript
  • Python
  • OpenAI
  • Anthropic
  • PostgreSQL
  • Temporal
  • OpenTelemetry
