Eight components, each one fixing a specific failure. Read top to bottom or jump to a step.
A model is not a product.
An agent harness is the surrounding infrastructure that turns a language model into a production system. The model decides; the harness keeps those decisions safe, scoped, observable, and affordable.
Without a harness, every failure mode the model has becomes a failure mode of your business. With one, those failures become bugs you can debug and fix.
This guide walks through eight harness components. For each: what breaks without it, what tool fixes it, and what to look for in production.
The model knows nothing about your business until you tell it.
Context engineering is the practice of feeding the model the policies, identity, and reference material it needs to make correct decisions for your specific domain. Refund thresholds, escalation rules, brand voice, product catalog — all live in markdown files loaded into the system prompt.
Static policies always loaded. Knowledge base retrieved on demand via RAG.
```markdown
## Refund authority

- Under $50: auto-approve
- $50-200: document reason and approve
- Over $200: escalate to human
```
Agent invents policies. Refunds anyone for any amount. Cannot answer product questions because it has no catalog access.
Markdown files for policies, RAG over pgvector for knowledge base, identity in CLAUDE.md.
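A minimal sketch of the assembly step, assuming a policies/ directory of markdown files and a CLAUDE.md identity file (the layout is illustrative, not a fixed convention):

```python
from pathlib import Path

POLICY_DIR = Path("policies")  # assumed layout: one markdown file per policy

def build_system_prompt() -> str:
    # Identity first, then every static policy, concatenated on each call.
    identity = Path("CLAUDE.md").read_text()
    policies = "\n\n".join(
        p.read_text() for p in sorted(POLICY_DIR.glob("*.md"))
    )
    # The knowledge base is NOT loaded here; it is retrieved on demand
    # via RAG (see the memory section below).
    return f"{identity}\n\n# Policies\n\n{policies}"
```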
A model that cannot read your data is a chatbot, not an agent.
Model Context Protocol (MCP) servers expose typed tools to the agent. Each tool wraps a database query, an API call, or a side effect. The agent discovers tools at runtime — no hardcoded function lists in the prompt.
Tools live behind MCP servers. Agent talks to MCP; MCP talks to the world.
Agent fabricates order numbers. Cannot verify customer identity. Cannot actually issue a refund — only describe what one would look like.
Custom MCP servers with typed schemas. Permission scopes baked into the server, not the prompt.
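A sketch of such a server using the official MCP Python SDK's FastMCP helper; the tool names, the stub data layer, and the $200 cap mirror the policy example above but are otherwise assumptions:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("order-tools")

def db_lookup(order_id: str) -> dict:
    # Stub standing in for your real database query.
    return {"order_id": order_id, "status": "shipped", "total": 89.00}

@mcp.tool()
def get_order(order_id: str) -> dict:
    """Fetch a real order so the agent cannot fabricate one."""
    return db_lookup(order_id)

@mcp.tool()
def issue_refund(order_id: str, amount: float, reason: str) -> dict:
    """Refund with the permission scope baked into the server, not the prompt."""
    if amount > 200:
        raise ValueError("Refunds over $200 require human escalation")
    return {"refunded": amount, "order_id": order_id, "reason": reason}

if __name__ == "__main__":
    mcp.run()  # serves the tools over stdio
```

FastMCP derives each tool's JSON schema from the Python type hints, which is what makes the tools "typed" from the agent's point of view.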
One model call cannot solve a multi-issue ticket.
The Claude Agent SDK runs a think-act-observe loop: the agent reasons, calls a tool, sees the result, reasons again. max_turns caps runaway loops. Mid-loop reflection improves multi-step coherence.
Agent tries to do everything in one response. Refunds the wrong order. Forgets to cancel the subscription after issuing the refund. Multi-issue tickets become single-issue replies.
Claude Agent SDK with explicit turn limits, escalation triggers, and tool-result reflection.
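A hedged sketch of that loop with the Python Claude Agent SDK; the option values and the prompt are illustrative:

```python
import asyncio

from claude_agent_sdk import ClaudeAgentOptions, query

async def handle_ticket(ticket_text: str) -> None:
    options = ClaudeAgentOptions(
        system_prompt="You are a support agent. Escalate when unsure.",
        max_turns=12,  # hard cap on think-act-observe iterations
    )
    # Each yielded message is one step of the loop: reasoning,
    # a tool call, or a tool result the agent reflects on.
    async for message in query(prompt=ticket_text, options=options):
        print(message)

asyncio.run(handle_ticket("Order 4821 arrived broken. Refund it and cancel my plan."))
```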
Probabilistic safety is not safety.
NeMo Guardrails enforces deterministic input and output rules via Colang. Social engineering, PII leaks, competitor mentions, and forbidden actions are blocked before input reaches the model or before output leaves it, by rules rather than by the model's judgment.
Social engineering succeeds 30-60% of the time. Agent leaks PII when asked nicely. Competitor mentions go unchallenged.
NeMo Guardrails with Colang rules. Deterministic regex on output. LLM-judged categories layered on top.
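A minimal runnable sketch, assuming Colang 1.0 syntax and an OpenAI-backed main model (swap in whatever engine you run); the utterances are illustrative:

```python
from nemoguardrails import LLMRails, RailsConfig

colang = """
define user ask about competitors
  "what do you think of AcmeCorp"
  "is CompetitorX cheaper"

define bot refuse competitor talk
  "I can help with our products, but I can't discuss competitors."

define flow competitor mentions
  user ask about competitors
  bot refuse competitor talk
"""

yaml = """
models:
  - type: main
    engine: openai
    model: gpt-4o-mini
"""

rails = LLMRails(RailsConfig.from_content(colang_content=colang, yaml_content=yaml))
reply = rails.generate(messages=[{"role": "user", "content": "Is AcmeCorp better than you?"}])
print(reply["content"])  # the canned refusal fires regardless of model mood
```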
A customer should not have to re-explain.
Mem0 stores conversation memory in pgvector. RAG over the same vector store retrieves knowledge base articles on demand. Token usage drops from "load everything" (~4k tokens) to "load relevant only" (~400-800 tokens).
Every interaction is stateless. Customer #142 explains the broken cable for the third time. Knowledge base loads in full on every turn — context window pressure, cost waste.
Mem0 for memory persistence. pgvector for RAG over knowledge articles. Embedding-driven retrieval scoped to the current ticket.
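A sketch of the memory side, assuming Mem0's pgvector provider and the v1.1-style result shape; connection details are placeholders:

```python
from mem0 import Memory

memory = Memory.from_config({
    "vector_store": {
        "provider": "pgvector",
        "config": {
            "host": "localhost",   # placeholder credentials throughout
            "port": 5432,
            "dbname": "support",
            "user": "agent",
            "password": "secret",
        },
    }
})

# Persist what customer #142 already told us once...
memory.add("My charging cable arrived broken", user_id="customer-142")

# ...and on the next ticket, recall only the relevant slice.
for hit in memory.search("cable issue", user_id="customer-142", limit=3)["results"]:
    print(hit["memory"])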
A spike should not drain your budget.
LiteLLM proxies all model calls. Virtual keys per team enforce per-day budgets. Tiered routing sends classification to Haiku, response generation to Sonnet, complex reasoning to Opus. Fallback chains handle rate limits.
One ticket spike on a holiday drains the AI budget overnight. Every call routes to the most expensive model. No way to cap per-team spend.
LiteLLM proxy with virtual keys, tiered routing, and per-team budgets enforced at the proxy layer.
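A sketch of the routing tier with LiteLLM's Router (virtual keys and per-team budgets live in the proxy config, not shown here); the tier names and model IDs are assumptions:

```python
from litellm import Router

router = Router(
    model_list=[
        {"model_name": "tier-classify", "litellm_params": {"model": "anthropic/claude-3-5-haiku-latest"}},
        {"model_name": "tier-respond", "litellm_params": {"model": "anthropic/claude-sonnet-4-20250514"}},
        {"model_name": "tier-reason", "litellm_params": {"model": "anthropic/claude-opus-4-20250514"}},
    ],
    # On rate limits or errors, degrade the response tier to the cheap tier.
    fallbacks=[{"tier-respond": ["tier-classify"]}],
)

resp = router.completion(
    model="tier-classify",  # cheap model for classification
    messages=[{"role": "user", "content": "Categorize: 'refund for order 4821'"}],
)
print(resp.choices[0].message.content)
```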
You cannot debug what you cannot see.
Self-hosted Langfuse captures every trace: nested spans for context assembly, RAG retrieval, memory recall, each LLM turn, and each tool call. Eval scores attach to traces via LLM-as-judge evaluators run against your written policies.
Agent issues 47 refunds at 3am. You learn about it from a customer ticket the next day. No idea which turn went wrong, which tool returned bad data, or which policy the model violated.
Langfuse session-level traces, datasets for systematic evaluation, LLM-as-judge evaluators scored against policy criteria.
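A sketch of nested spans with the Langfuse Python SDK's @observe decorator (v3-style client); the stage functions are placeholders for your pipeline:

```python
from langfuse import get_client, observe

@observe()
def retrieve_context(ticket: str) -> str:
    return "relevant policy and KB chunks"  # stand-in for RAG retrieval

@observe()
def handle_ticket(ticket: str) -> str:
    context = retrieve_context(ticket)  # appears as a nested span
    return f"Drafted reply using: {context}"  # stand-in for the LLM turn

handle_ticket("Where is my refund for order 4821?")
get_client().flush()  # ship the trace before the process exits
```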
A bug in a tool should not touch production.
Each ticket spawns a fresh E2B sandbox. The agent's tool calls execute inside the sandbox, not on the host. Side effects are contained: a malformed SQL query, a runaway loop, or a prompt injection cannot escape into the real customer database.
Tool execution runs against your live database. A bug in issue_refund mutates production state. Prompt injection becomes RCE on the host.
E2B sandbox per ticket. Disposable filesystem. Scoped credentials. Multi-tenant isolation by construction.
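A sketch assuming the e2b-code-interpreter Python SDK; the executed snippet stands in for whatever code the agent's tool call emits:

```python
from e2b_code_interpreter import Sandbox

# One disposable sandbox per ticket; credentials scoped to it alone.
with Sandbox() as sandbox:
    execution = sandbox.run_code("2 + 2")  # agent-generated tool code goes here
    print(execution.text)  # "4"
# Past this line the sandbox and its filesystem are gone;
# nothing it did could touch the host or the production database.
```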
The full harness, one diagram.
Eight components. Each one preventing a specific failure. Together they form the boundary that turns a chatbot into a production agent.
→ Open the full reference architecture
- Claude Agent SDK: agent loop and tool orchestration.
- MCP servers: typed tool access to your data.
- NeMo Guardrails: deterministic input + output rails.
- Mem0: conversation memory + RAG.
- LiteLLM: cost controls + model routing.
- Langfuse: self-hosted traces + evals.
- E2B: per-ticket sandbox.
- Storage: vector + graph.