Cinnabar / Labs

Applied AI. Methodology, not buzzwords.

Labs is the practice that puts AI inside real systems. Custom agent platforms, evaluation infrastructure, retrieval systems, and AI-native internal tooling — built with the same rigour we bring to the rest of your stack, with confidence routing and human-in-the-loop guardrails by default.

What Labs is

The practice that earns AI its keep inside real systems.

Labs is our applied-AI practice. We build agent platforms, evaluation infrastructure, RAG and retrieval systems, and the internal tooling that keeps an AI system honest. The shape of the work is closer to platform engineering than to prompt engineering — most of the value lives outside the model call.

We treat AI features the way we treat any other piece of the stack: with evals on the inputs, traces on the outputs, guardrails between, and a clear escalation path to a human when confidence drops. The version of AI that's worth shipping is the version that lets your team sleep through the night.

Our Labs engagements are typically smaller than our Build ones — three to five contributors, six to twelve months — and they often run alongside an active Build engagement rather than instead of one.

Services

What a Labs engagement covers.

  • Custom agent platforms

    End-to-end agent systems for high-stakes workflows — claims triage, fraud detection, vendor approval, knowledge work. Confidence routing, human review queues, and a real audit trail behind every decision.

  • Evaluation infrastructure

    The harness that turns AI from a demo into a system you can ship. Held-out historical data, slice analysis, automated regression on every prompt and model change, and a dashboard your team can trust before launch.

  • RAG & retrieval systems

    Document and knowledge retrieval that survives contact with real corpora. Chunking strategies tuned to the domain, hybrid search where it earns its keep, and citation-grade provenance on every answer. A sketch of the hybrid-search pattern follows this list.

  • AI-native internal tools

    The internal surfaces that make AI useful to your team — review queues, prompt-and-response inspectors, prompt versioning, drift dashboards. Designed with the same care as the product the AI lives behind.

  • Model fine-tuning

    When base models aren't enough — supervised fine-tuning, preference tuning, and distillation onto smaller models. Always justified by an eval, never by a vibe.

  • AI advisory

    Short, paid engagements when a partner needs a sober second opinion on architecture, vendor choice, or eval strategy. We often say no to AI features that don't earn their keep.
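
For a flavour of the retrieval work, here is a minimal sketch of hybrid search over PostgreSQL + pgvector from the tools list: lexical and vector ranks fused with reciprocal rank fusion. The `chunks` table, its columns, and the fusion constant are assumptions for the example, not a schema we prescribe.

```typescript
import { Pool } from "pg";

// Illustrative hybrid retrieval: full-text rank and vector similarity
// fused with reciprocal rank fusion (RRF). Table and column names
// (`chunks`, `tsv`, `embedding`, `content`, `source_url`) are assumptions.
const pool = new Pool();

export async function hybridSearch(query: string, embedding: number[], k = 10) {
  const { rows } = await pool.query(
    `WITH lexical AS (
       SELECT id, RANK() OVER (ORDER BY ts_rank(tsv, plainto_tsquery($1)) DESC) AS r
       FROM chunks
       WHERE tsv @@ plainto_tsquery($1)
       ORDER BY r LIMIT 50
     ),
     semantic AS (
       SELECT id, RANK() OVER (ORDER BY embedding <=> $2::vector) AS r
       FROM chunks
       ORDER BY embedding <=> $2::vector LIMIT 50
     )
     SELECT c.id, c.content, c.source_url, -- provenance travels with the chunk
            COALESCE(1.0 / (60 + lexical.r), 0) +
            COALESCE(1.0 / (60 + semantic.r), 0) AS score
     FROM chunks c
     LEFT JOIN lexical  ON lexical.id  = c.id
     LEFT JOIN semantic ON semantic.id = c.id
     WHERE lexical.id IS NOT NULL OR semantic.id IS NOT NULL
     ORDER BY score DESC
     LIMIT $3`,
    [query, JSON.stringify(embedding), k],
  );
  return rows;
}
```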

How we approach it

Five principles that shape every Labs engagement.

01

Evals before agents.

Before we ship a single agent, we build an evaluation harness against the existing system's historical decisions. Every model and prompt change is scored against real labelled data — accuracy, drift, and fairness slices. The eval set is the ground truth, not the demo.
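
A miniature of that gate, to make it concrete: exact-match scoring, per-slice accuracy, and a hard failure on any regression. The case shape, slice keys, and threshold here are illustrative assumptions, not the harness we'd ship.

```typescript
// Minimal sketch of an eval gate run on every prompt or model change.
// Labelled cases, slice keys, and the pass threshold are illustrative;
// `callModel` stands in for whichever model endpoint is under test.
type LabelledCase = { input: string; expected: string; slice: string };

async function evalGate(
  cases: LabelledCase[],
  callModel: (input: string) => Promise<string>,
  minAccuracyPerSlice = 0.95,
) {
  const bySlice = new Map<string, { pass: number; total: number }>();
  for (const c of cases) {
    const output = await callModel(c.input);
    const s = bySlice.get(c.slice) ?? { pass: 0, total: 0 };
    s.total += 1;
    if (output.trim() === c.expected) s.pass += 1; // exact match keeps the sketch honest
    bySlice.set(c.slice, s);
  }
  // Fail the change if any slice regresses below threshold: the eval set
  // is the ground truth, not the demo.
  for (const [slice, { pass, total }] of bySlice) {
    const acc = pass / total;
    console.log(`${slice}: ${(acc * 100).toFixed(1)}% (${pass}/${total})`);
    if (acc < minAccuracyPerSlice) throw new Error(`slice "${slice}" below threshold`);
  }
}
```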

02

Confidence routing by default.

Every decision request runs through an ensemble. When models agree above a configurable threshold, the decision is auto-applied. Below it, the request routes to a human reviewer with the model's reasoning attached. The threshold is yours to set, and it changes with the system.
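
In miniature, that routing logic looks something like the sketch below. The vote shape and the raw majority-agreement measure are illustrative assumptions; a production ensemble would weight models and calibrate confidence rather than count votes.

```typescript
// Illustrative sketch of confidence routing: fan a request out to an
// ensemble, auto-apply on agreement above a configurable threshold,
// otherwise queue for a human with each model's reasoning attached.
type Vote = { decision: string; reasoning: string };

async function route(
  requestId: string,
  ensemble: Array<() => Promise<Vote>>,
  threshold = 0.8, // configurable, and expected to change with the system
) {
  const votes = await Promise.all(ensemble.map((model) => model()));

  // Tally agreement on the most common decision.
  const tally = new Map<string, number>();
  for (const v of votes) tally.set(v.decision, (tally.get(v.decision) ?? 0) + 1);
  const [top, count] = [...tally.entries()].sort((a, b) => b[1] - a[1])[0];
  const agreement = count / votes.length;

  if (agreement >= threshold) {
    return { requestId, decision: top, agreement, route: "auto" as const };
  }
  // Below threshold: the request lands in a review queue with the
  // models' reasoning attached for the human making the call.
  return { requestId, votes, agreement, route: "human-review" as const };
}
```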

03

Auditable trace, end to end.

Every decision — automated or routed — is stored with its full input, prompt, model output, confidence, and any reviewer override. Full replay, blame, and trend analysis from the operations dashboard. AI doesn't get to be the inscrutable black box at the end of your stack.
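
As a sketch, the stored record might carry fields like these. The names are illustrative assumptions; the point is that every decision keeps enough context to replay it.

```typescript
// Illustrative shape of the audit record stored for every decision,
// automated or routed. Field names are assumptions for the sketch.
interface DecisionRecord {
  requestId: string;
  decidedAt: string;              // ISO 8601 timestamp
  input: unknown;                 // full request payload, as received
  prompt: string;                 // the exact prompt sent to the model
  promptVersion: string;
  model: string;
  modelOutput: string;
  confidence: number;             // ensemble agreement at decision time
  route: "auto" | "human-review";
  reviewerOverride?: {            // present only when a human changed the call
    reviewerId: string;
    decision: string;
    note: string;
  };
}
```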

04

Buy where it makes sense, build where it doesn't.

Frontier models from Anthropic, OpenAI, and the open-weight families — we don't have a vendor we're trying to sell. The choice is driven by the evals, the latency budget, the cost per decision, and where the data is allowed to live.

05

AI as a tool, not a pitch.

We use AI extensively in our own work and we build it into our partners' systems where it earns its keep. We're equally happy telling you that an LLM is the wrong answer for your problem.

Tools we reach for

  • TypeScript
  • Python
  • Anthropic Claude
  • OpenAI
  • Open-weight models
  • PostgreSQL + pgvector
  • Temporal
  • OpenTelemetry
  • Eval harnesses
  • vLLM
  • Ray

How it starts

An honest conversation about whether AI is the right answer.

We say no to AI features more often than we say yes. The first conversation is usually about whether the problem you're describing is actually a model problem, a UX problem, or a data problem in disguise. If it is a model problem, we'll tell you what the eval would have to look like before we'd be willing to ship it.

If we're still on the same page by the end of that conversation, we follow up with a paid one-week diagnostic — a real eval against your historical data, an architecture recommendation, and an honest read on cost and latency.

Ready when you are.

A 30-minute conversation. We'll listen. If we're a fit, we'll say so. If not, we'll point you to someone who is.

No discovery decks · No sales calls · One conversation