Cinnabar / Labs

Applied AI. Methodology, not buzzwords.

Labs is the practice that puts AI inside real systems. Custom agent platforms, evaluation infrastructure, retrieval systems, and AI-native internal tooling — built with the same rigour we bring to the rest of your stack, with confidence routing and human-in-the-loop guardrails by default.

What Labs is

The practice that earns AI its keep inside real systems.

Labs is our applied-AI practice. We build agent platforms, evaluation infrastructure, RAG and retrieval systems, and the internal tooling that keeps an AI system honest. The shape of the work is closer to platform engineering than to prompt engineering — most of the value lives outside the model call.

We treat AI features the way we treat any other piece of the stack: with evals on the inputs, traces on the outputs, guardrails between, and a clear escalation path to a human when confidence drops. The version of AI that's worth shipping is the version that lets your team sleep through the night.

Our Labs engagements are typically smaller than our Build ones — three to five contributors, six to twelve months — and they often run alongside an active Build engagement rather than instead of one.

Services

What a Labs engagement covers.

  • Custom agent platforms

    End-to-end agent systems for high-stakes workflows — claims triage, fraud detection, vendor approval, knowledge work. Confidence routing, human review queues, and a real audit trail behind every decision.

  • Evaluation infrastructure

    The harness that turns AI from a demo into a system you can ship. Held-out historical data, slice analysis, automated regression on every prompt and model change, and a dashboard your team can trust before launch.

  • RAG & retrieval systems

    Document and knowledge retrieval that survives contact with real corpora. Chunking strategies tuned to the domain, hybrid search where it earns its keep, and citation-grade provenance on every answer. A sketch of the hybrid-search pattern follows this list.

  • AI-native internal tools

    The internal surfaces that make AI useful to your team — review queues, prompt-and-response inspectors, prompt versioning, drift dashboards. Designed with the same care as the product the AI lives behind.

  • Model fine-tuning

    When base models aren't enough — supervised fine-tuning, preference tuning, and distillation onto smaller models. Always justified by an eval, never by a vibe.

  • AI advisory

    Short, paid engagements when a partner needs a sober second opinion on architecture, vendor choice, or eval strategy. We often say no to AI features that don't earn their keep.
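
For a flavour of the retrieval work, here is a minimal sketch of hybrid search over PostgreSQL + pgvector from the tools list: lexical and vector ranks fused with reciprocal rank fusion. The `chunks` table, its columns, and the fusion constant are assumptions for the example, not a schema we prescribe.

```typescript
import { Pool } from "pg";

// Illustrative hybrid retrieval: full-text rank and vector similarity
// fused with reciprocal rank fusion (RRF). Table and column names
// (`chunks`, `tsv`, `embedding`, `content`, `source_url`) are assumptions.
const pool = new Pool();

export async function hybridSearch(query: string, embedding: number[], k = 10) {
  const { rows } = await pool.query(
    `WITH lexical AS (
       SELECT id, RANK() OVER (ORDER BY ts_rank(tsv, plainto_tsquery($1)) DESC) AS r
       FROM chunks
       WHERE tsv @@ plainto_tsquery($1)
       ORDER BY r LIMIT 50
     ),
     semantic AS (
       SELECT id, RANK() OVER (ORDER BY embedding <=> $2::vector) AS r
       FROM chunks
       ORDER BY embedding <=> $2::vector LIMIT 50
     )
     SELECT c.id, c.content, c.source_url, -- provenance travels with the chunk
            COALESCE(1.0 / (60 + lexical.r), 0) +
            COALESCE(1.0 / (60 + semantic.r), 0) AS score
     FROM chunks c
     LEFT JOIN lexical  ON lexical.id  = c.id
     LEFT JOIN semantic ON semantic.id = c.id
     WHERE lexical.id IS NOT NULL OR semantic.id IS NOT NULL
     ORDER BY score DESC
     LIMIT $3`,
    [query, JSON.stringify(embedding), k],
  );
  return rows;
}
```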

How we approach it

Five principles that shape every Labs engagement.

01

Evals before agents.

Before we ship a single agent, we build an evaluation harness against the existing system's historical decisions. Every model and prompt change is scored against real labelled data — accuracy, drift, and fairness slices. The eval set is the ground truth, not the demo.
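
A miniature of that gate, to make it concrete: exact-match scoring, per-slice accuracy, and a hard failure on any regression. The case shape, slice keys, and threshold here are illustrative assumptions, not the harness we'd ship.

```typescript
// Minimal sketch of an eval gate run on every prompt or model change.
// Labelled cases, slice keys, and the pass threshold are illustrative;
// `callModel` stands in for whichever model endpoint is under test.
type LabelledCase = { input: string; expected: string; slice: string };

async function evalGate(
  cases: LabelledCase[],
  callModel: (input: string) => Promise<string>,
  minAccuracyPerSlice = 0.95,
) {
  const bySlice = new Map<string, { pass: number; total: number }>();
  for (const c of cases) {
    const output = await callModel(c.input);
    const s = bySlice.get(c.slice) ?? { pass: 0, total: 0 };
    s.total += 1;
    if (output.trim() === c.expected) s.pass += 1; // exact match keeps the sketch honest
    bySlice.set(c.slice, s);
  }
  // Fail the change if any slice regresses below threshold: the eval set
  // is the ground truth, not the demo.
  for (const [slice, { pass, total }] of bySlice) {
    const acc = pass / total;
    console.log(`${slice}: ${(acc * 100).toFixed(1)}% (${pass}/${total})`);
    if (acc < minAccuracyPerSlice) throw new Error(`slice "${slice}" below threshold`);
  }
}
```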

02

Confidence routing by default.

Every decision request runs through an ensemble. When models agree above a configurable threshold, the decision is auto-applied. Below it, the request routes to a human reviewer with the model's reasoning attached. The threshold is yours to set, and it changes with the system.
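
In miniature, that routing logic looks something like the sketch below. The vote shape and the raw majority-agreement measure are illustrative assumptions; a production ensemble would weight models and calibrate confidence rather than count votes.

```typescript
// Illustrative sketch of confidence routing: fan a request out to an
// ensemble, auto-apply on agreement above a configurable threshold,
// otherwise queue for a human with each model's reasoning attached.
type Vote = { decision: string; reasoning: string };

async function route(
  requestId: string,
  ensemble: Array<() => Promise<Vote>>,
  threshold = 0.8, // configurable, and expected to change with the system
) {
  const votes = await Promise.all(ensemble.map((model) => model()));

  // Tally agreement on the most common decision.
  const tally = new Map<string, number>();
  for (const v of votes) tally.set(v.decision, (tally.get(v.decision) ?? 0) + 1);
  const [top, count] = [...tally.entries()].sort((a, b) => b[1] - a[1])[0];
  const agreement = count / votes.length;

  if (agreement >= threshold) {
    return { requestId, decision: top, agreement, route: "auto" as const };
  }
  // Below threshold: the request lands in a review queue with the
  // models' reasoning attached for the human making the call.
  return { requestId, votes, agreement, route: "human-review" as const };
}
```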

03

Auditable trace, end to end.

Every decision — automated or routed — is stored with its full input, prompt, model output, confidence, and any reviewer override. Full replay, blame, and trend analysis from the operations dashboard. AI doesn't get to be the inscrutable black box at the end of your stack.
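
As a sketch, the stored record might carry fields like these. The names are illustrative assumptions; the point is that every decision keeps enough context to replay it.

```typescript
// Illustrative shape of the audit record stored for every decision,
// automated or routed. Field names are assumptions for the sketch.
interface DecisionRecord {
  requestId: string;
  decidedAt: string;              // ISO 8601 timestamp
  input: unknown;                 // full request payload, as received
  prompt: string;                 // the exact prompt sent to the model
  promptVersion: string;
  model: string;
  modelOutput: string;
  confidence: number;             // ensemble agreement at decision time
  route: "auto" | "human-review";
  reviewerOverride?: {            // present only when a human changed the call
    reviewerId: string;
    decision: string;
    note: string;
  };
}
```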

04

Buy where it makes sense, build where it doesn't.

Frontier models from Anthropic, OpenAI, and the open-weight families — we don't have a vendor we're trying to sell. The choice is driven by the evals, the latency budget, the cost per decision, and where the data is allowed to live.

05

AI as a tool, not a pitch.

We use AI extensively in our own work and we build it into our partners' systems where it earns its keep. We're equally happy telling you that an LLM is the wrong answer for your problem.

Tools we reach for

  • TypeScript
  • Python
  • Anthropic Claude
  • OpenAI
  • Open-weight models
  • PostgreSQL + pgvector
  • Temporal
  • OpenTelemetry
  • Eval harnesses
  • vLLM
  • Ray

How it starts

An honest conversation about whether AI is the right answer.

We say no to AI features more often than we say yes. The first conversation is usually about whether the problem you're describing is actually a model problem, a UX problem, or a data problem in disguise. If it is a model problem, we'll tell you what the eval would have to look like before we'd be willing to ship it.

If we're still on the same page by the end of that conversation, we follow up with a paid one-week diagnostic — a real eval against your historical data, an architecture recommendation, and an honest read on cost and latency.

Ready when you are.

A 30-minute conversation. We'll listen. If we're a fit, we'll say so. If not, we'll point you to someone who is.

No discovery decks · No sales calls · One conversation