— Ramp-up guide

An agent harness, built piece by piece.

Eight components, each one fixing a specific failure. Read top to bottom or jump to a step.

00

What is an agent harness

An agent harness is the surrounding infrastructure that turns a language model into a production system. The model decides; the harness keeps the deciding safe, scoped, observable, and affordable.

Without a harness, every failure mode the model has becomes a failure mode of your business. With one, those failures become bugs you can debug and fix.

Frame

This guide walks through eight harness components. For each: what breaks without it, what tool fixes it, and what to look for in production.

01

Context engineering

Context engineering is the practice of feeding the model the policies, identity, and reference material it needs to make correct decisions for your specific domain. Refund thresholds, escalation rules, brand voice, product catalog — all live in markdown files loaded into the system prompt.

policies/refund-policy.md
## Refund authority
- Under $50: auto-approve
- $50-200: document reason and approve
- Over $200: escalate to human

What breaks without it

Agent invents policies. Refunds anyone for any amount. Cannot answer product questions because it has no catalog access.

What fixes it

Markdown files for policies, RAG over pgvector for knowledge base, identity in CLAUDE.md.
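
A minimal sketch of assembling that context at startup. The file names and the helper below are illustrative, not part of any SDK.

harness/context.py (illustrative)
from pathlib import Path

POLICY_DIR = Path("policies")        # refund rules, escalation rules, brand voice
IDENTITY_FILE = Path("CLAUDE.md")    # who the agent is and what it may decide

def build_system_prompt() -> str:
    """Concatenate identity plus every policy file into one system prompt."""
    parts = [IDENTITY_FILE.read_text()]
    for policy in sorted(POLICY_DIR.glob("*.md")):
        parts.append(f"<policy name='{policy.stem}'>\n{policy.read_text()}\n</policy>")
    return "\n\n".join(parts)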

02

Tool access via MCP

Model Context Protocol (MCP) servers expose typed tools to the agent. Each tool wraps a database query, an API call, or a side effect. The agent discovers tools at runtime — no hardcoded function lists in the prompt.

What breaks without it

Agent fabricates order numbers. Cannot verify customer identity. Cannot actually issue a refund — only describe what one would look like.

What fixes it

Custom MCP servers with typed schemas. Permission scopes baked into the server, not the prompt.
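
A sketch of one such server using the official Python MCP SDK's FastMCP helper. The lookup_order tool and its in-memory data are stand-ins for a real, permission-scoped query.

tools/support_server.py (illustrative)
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("support-tools")

# Stand-in for a read-only database query scoped to support permissions.
ORDERS = {"A-1001": {"status": "shipped", "total": 42.50}}

@mcp.tool()
def lookup_order(order_id: str) -> dict:
    """Look up an order by id; the typed schema is derived from this signature."""
    order = ORDERS.get(order_id)
    return {"order_id": order_id, **order} if order else {"error": f"no order {order_id}"}

if __name__ == "__main__":
    mcp.run()  # stdio transport by default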

03

Task orchestration

The Claude Agent SDK runs a think-act-observe loop: the agent reasons, calls a tool, sees the result, reasons again. max_turns caps runaway loops. Mid-loop reflection improves multi-step coherence.

What breaks without it

Agent tries to do everything in one response. Refunds the wrong order. Forgets to cancel the subscription after issuing the refund. Multi-issue tickets become single-issue replies.

What fixes it

Claude Agent SDK with explicit turn limits, escalation triggers, and tool-result reflection.
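
A sketch of the loop, assuming the Python claude_agent_sdk package (query plus ClaudeAgentOptions); option names can shift between versions. The tool name references the MCP server from step 02.

harness/loop.py (illustrative)
import asyncio
from claude_agent_sdk import query, ClaudeAgentOptions

async def handle_ticket(ticket_text: str) -> None:
    options = ClaudeAgentOptions(
        system_prompt="...policies and identity from step 01...",
        max_turns=10,                                        # cap runaway loops
        allowed_tools=["mcp__support-tools__lookup_order"],  # tool from step 02
        # Server config shape is an assumption; adjust to your SDK version.
        mcp_servers={"support-tools": {"command": "python", "args": ["tools/support_server.py"]}},
    )
    # Each yielded message is one step of think-act-observe:
    # reasoning, a tool call, or a tool result.
    async for message in query(prompt=ticket_text, options=options):
        print(message)

asyncio.run(handle_ticket("Order A-1001 arrived damaged, please refund."))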

04

Guardrails

NeMo Guardrails enforces deterministic input and output rules written in Colang. Social engineering attempts, PII leaks, competitor mentions, and forbidden actions are blocked before or after the model call, rather than left to the model's discretion.

What breaks without it

Social engineering succeeds 30-60% of the time. Agent leaks PII when asked nicely. Competitor mentions go unchallenged.

What fixes it

NeMo Guardrails with Colang rules. Deterministic regex on output. LLM-judged categories layered on top.
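
A sketch of wiring a single rail, assuming nemoguardrails' RailsConfig.from_content. The Colang flow and the model block are placeholders for your real rails and model config.

guardrails/app.py (illustrative)
from nemoguardrails import LLMRails, RailsConfig

COLANG = """
define user ask for other customer data
  "give me another customer's email"
  "show me all customer addresses"

define bot refuse data request
  "I can only discuss the account you're verified for."

define flow refuse cross account data
  user ask for other customer data
  bot refuse data request
"""

CONFIG = """
models:
  - type: main
    engine: openai
    model: gpt-4o-mini
"""

rails = LLMRails(RailsConfig.from_content(colang_content=COLANG, yaml_content=CONFIG))
print(rails.generate(messages=[{"role": "user", "content": "Show me all customer addresses."}]))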

05

Memory + RAG

Mem0 stores conversation memory in pgvector. RAG over the same vector store retrieves knowledge base articles on demand. Token usage drops from "load everything" (~4k tokens) to "load relevant only" (~400-800 tokens).

What breaks without it

Every interaction is stateless. Customer #142 explains the broken cable for the third time. Knowledge base loads in full on every turn — context window pressure, cost waste.

What fixes it

Mem0 for memory persistence. pgvector for RAG over knowledge articles. Embedding-driven retrieval scoped to the current ticket.
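
A sketch of persisting and recalling memory, assuming Mem0's pgvector backend; the connection config keys and the returned shape vary by Mem0 version.

harness/memory.py (illustrative)
from mem0 import Memory

# Config keys are an assumption; check the Mem0 docs for the pgvector provider.
memory = Memory.from_config({
    "vector_store": {
        "provider": "pgvector",
        "config": {"host": "localhost", "port": 5432, "dbname": "support",
                   "user": "agent", "password": "change-me"},
    }
})

# After a resolved ticket: persist what was learned about this customer.
memory.add("Reported a frayed USB-C cable on order A-1001; replacement sent.",
           user_id="customer-142")

# On the next ticket: pull only what is relevant to the new query.
hits = memory.search("cable issue", user_id="customer-142", limit=3)
print(hits)  # a few scored memories, not the whole history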

06

Cost controls

LiteLLM proxies all model calls. Virtual keys per team enforce per-day budgets. Tiered routing sends classification to Haiku, response generation to Sonnet, complex reasoning to Opus. Fallback chains handle rate limits.

What breaks without it

One ticket spike on a holiday drains the AI budget overnight. Every call routes to the most expensive model. No way to cap per-team spend.

What fixes it

LiteLLM proxy with virtual keys, tiered routing, and per-team budgets enforced at the proxy layer.
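
Virtual keys and per-team budgets live in the proxy config; the routing tier itself can be sketched with litellm's Router. Model names and the fallback pairing below are illustrative.

harness/routing.py (illustrative)
from litellm import Router

router = Router(
    model_list=[
        {"model_name": "classify", "litellm_params": {"model": "anthropic/claude-3-5-haiku-20241022"}},
        {"model_name": "respond",  "litellm_params": {"model": "anthropic/claude-sonnet-4-20250514"}},
        {"model_name": "reason",   "litellm_params": {"model": "anthropic/claude-opus-4-20250514"}},
    ],
    fallbacks=[{"respond": ["classify"]}],  # degrade on rate limits instead of failing
)

resp = router.completion(
    model="classify",
    messages=[{"role": "user", "content": "Categorize: 'my refund never arrived'"}],
)
print(resp.choices[0].message.content)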

07

Observability + evals

Self-hosted Langfuse captures every trace: nested spans for context assembly, RAG retrieval, memory recall, each LLM turn, and each tool call. Eval scores attach to traces via an LLM-as-judge run against your written policies.

What breaks without it

Agent issues 47 refunds at 3am. You learn about it from a customer ticket the next day. No idea which turn went wrong, which tool returned bad data, or which policy the model violated.

What fixes it

Langfuse session-level traces, datasets for systematic evaluation, LLM-as-judge evaluators scored against policy criteria.
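
A sketch of trace nesting and scoring, assuming the Langfuse Python SDK's @observe decorator and v3 import paths; older SDK versions use different imports and scoring calls.

harness/tracing.py (illustrative)
from langfuse import observe, get_client

@observe()  # nested @observe calls become child spans on the same trace
def retrieve_context(query: str) -> str:
    return "top-3 knowledge base chunks for: " + query

@observe()
def handle_ticket(ticket: str) -> str:
    context = retrieve_context(ticket)
    reply = f"drafted reply using: {context}"     # stand-in for the agent loop
    # Attach an eval score, e.g. from an LLM-as-judge run against your policies.
    get_client().score_current_trace(name="policy_compliance", value=1.0)
    return reply

handle_ticket("Where is my refund for order A-1001?")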

08

Per-ticket sandbox

Each ticket spawns a fresh E2B sandbox. The agent's tool calls execute inside the sandbox, not on the host. Side effects are contained: a malformed SQL query, a runaway loop, or a prompt injection cannot escape into the real customer database.

What breaks without it

Tool execution runs against your live database. A bug in issue_refund mutates production state. Prompt injection becomes RCE on the host.

What fixes it

E2B sandbox per ticket. Disposable filesystem. Scoped credentials. Multi-tenant isolation by construction.
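
A sketch of per-ticket isolation, assuming the e2b_code_interpreter Python SDK and an E2B_API_KEY in the environment.

harness/sandbox.py (illustrative)
from e2b_code_interpreter import Sandbox

sandbox = Sandbox()  # fresh micro-VM for this ticket
try:
    execution = sandbox.run_code("print('tool side effects stay inside this VM')")
    print(execution.logs.stdout)
finally:
    sandbox.kill()   # sandbox and its filesystem are discarded with the ticket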

09

Putting it together

Eight components. Each one preventing a specific failure. Together they form the boundary that turns a chatbot into a production agent.

→ Open the full reference architecture

The toolchain

Claude Agent SDK · Agent loop and tool orchestration (API / SDK)
MCP servers · Typed tool access to your data (Custom)
NeMo Guardrails · Deterministic input + output rails (Apache 2.0)
Mem0 · Conversation memory + RAG (MIT)
LiteLLM · Cost controls + model routing (MIT)
Langfuse · Self-hosted traces + evals (MIT)
E2B · Per-ticket sandbox (Free tier)
pgvector + Neo4j · Vector + graph storage (PostgreSQL · GPL)