— Reference architecture

The agent harness, fully wired.

Eight components. Containerized. Production-shaped.

— Components

What each piece does

ComponentToolWhat it solvesContainer / Port
Context engineeringMarkdown policies + pgvector RAGAgent does not know your business rulespgvector :5434
Tool accessMCP serversAgent cannot read or write your datalocal stdio
OrchestrationClaude Agent SDKMulti-step reasoning across ticketsin-process
GuardrailsNeMo GuardrailsProbabilistic safety is not enoughin-process
MemoryMem0 + pgvectorCustomer re-explains every interactionpgvector :5434
Cost controlsLiteLLM proxyTicket spike drains AI budget:4000 (PG :5435)
Observability + evalsLangfuse self-hostedYou cannot debug what you cannot see:3000 (PG :5436)
SandboxE2BTool execution touches productionper-ticket VM
— Data flow

How a ticket moves through

01

Ingest + input guardrails

Ticket text passes through NeMo input rails. Social engineering and PII leak attempts blocked deterministically before the model sees them.

02

Context assembly

Policies (markdown), retrieved knowledge base articles (RAG over pgvector), and conversation memory (Mem0) merged into the system prompt.

03

Agent loop

Claude Agent SDK runs the think-act-observe cycle. Each tool call goes through MCP into the per-ticket E2B sandbox.

04

Cost + observability cross-cuts

Every model call routes through LiteLLM (model selection, budget enforcement, fallback). Every span streams to Langfuse for trace + eval scoring.

05

Output guardrails + reply

Final response passes through output rails (regex + LLM-judged). Approved reply ships to the customer; rejected output triggers escalation.

— Workshop deployment

What runs on your laptop

Each workshop step is additive. Step 00–04 is in-process Python only. Step 05 introduces containers (pgvector for RAG + Mem0). Step 06 adds a model-routing proxy. Step 07 adds observability infra. Step 08 reaches into E2B's cloud for per-ticket sandboxes. Total local footprint at the end: ~10 containers, ~16 GB RAM recommended.

Container reference

ContainerPortStepPurpose
pgvector543405PostgreSQL + pgvector for RAG embeddings + Mem0 conversation memory
Neo4j7474 / 768705Graph store for Mem0 — currently idle (proxy tool_choice mismatch)
LiteLLM400006Model-routing proxy with virtual keys, budgets, fallbacks. Admin UI at /ui
LiteLLM PG543506PostgreSQL backing LiteLLM (keys, budgets, usage)
Langfuse web300007Self-hosted observability UI — session traces, datasets, LLM-as-judge evaluators
Langfuse PG543607PostgreSQL backing Langfuse (projects, users, prompts, scores)
ClickHouseinternal07High-cardinality columnar store for traces / spans
Redisinternal07Langfuse queue + cache
MinIO909007S3-compatible blob storage for trace payloads
Langfuse workerinternal07Async eval-job runner
E2B sandboxcloud08Per-ticket VM for tool execution (not yet built)
Footprint

~10 containers running simultaneously by step 07. Recommend 16 GB RAM. If memory pressure shows up, stop Neo4j (idle): docker compose -f steps/05-memory/docker-compose.yml stop neo4j.

Step-by-step bring-up. You don't need everything at once. Each step's docker-compose.yml brings up only that step's containers; prior steps run from previously-started containers. Reset a step with make reset S=05.

— AWS reference

If you built this on AWS

Same eight components, mapped to canonical AWS services. Synthesized from production deployments (Robinhood, Epsilon, Rede Mater Dei) and AWS official guidance (Bedrock AgentCore, Strands SDK, Well-Architected GenAI Lens).

Component mapping

Harness pieceAWS serviceWhy this
Context engineeringBedrock Knowledge Bases + AgentCore GatewayKB handles full RAG (ingest → embed → retrieve). Gateway's semantic tool discovery prevents context bloat across hundreds of APIs.
Tool accessAgentCore Gateway (OpenAPI → MCP) + Lambda action groupsGateway converts any OpenAPI-spec API to MCP without code. Lambda is the escape hatch for custom logic.
OrchestrationAgentCore Runtime + Strands Agents SDKRuntime is the framework-agnostic substrate (Firecracker microVMs, A2A protocol). Strands is AWS's recommended SDK for new projects (14M+ downloads, GA 1.0).
GuardrailsBedrock Guardrails + AgentCore PolicyTwo enforcement planes: content safety (what the model says) + behavioral rules in Cedar (what the agent does with tools).
MemoryAgentCore MemoryShort-term session events (TTL up to 365d) + long-term async-extracted facts. KMS-encrypted. No DIY vector DB.
Cost controlsStep Functions + DynamoDB + CloudWatch + Budgets + prompt cachingNo native budget knob. Step Functions gates each inference against DynamoDB token budgets. Claude prompt caching ≈ 90% savings on re-used context.
Observability + evalsAgentCore Observability + AgentCore EvaluationsOTEL traces in CloudWatch Transaction Search. 13 built-in evaluators (correctness, safety, goal success). Optional sinks: Datadog, Langfuse.
Sandbox / isolationAgentCore Code Interpreter + VPC + PrivateLinkFirecracker microVM. Use VPC mode (not Sandbox mode — DNS exfil risk per Unit 42 research). Bedrock PrivateLink keeps inference off public internet.

How a request moves through AWS

01

Ingress: CloudFront + WAF + API Gateway

Request hits the edge: CloudFront for global delivery and DDoS shielding, AWS WAF for rate-based and content-rule filtering. Then Amazon API Gateway handles auth (Cognito / IAM / Lambda authorizer), routing, and per-API-key usage plans for request-count throttling. This is the AWS analogue of Azure APIM — except request-count throttling, not token-aware throttling, lives here.

02

Cost gate + input guardrails (the GenAI Gateway pattern)

Because API Gateway has no native token-aware quota policy, AWS production architectures compose a "GenAI Gateway": a Lambda Authorizer (or Step Functions step) that reads per-tenant token budgets from DynamoDB, rejects over-budget requests with 429 + Retry-After before any inference cost, and tracks consumption async. Approved requests then pass through Bedrock Guardrails for content filters, denied-topics, prompt-attack detection, and PII redaction. CloudWatch Alarms + AWS Budgets fire on threshold breach. Bedrock prompt caching cuts re-used context cost ≈ 90%.

03

Agent boots in AgentCore Runtime

Firecracker microVM, session-isolated. Strands SDK starts the think-act-observe loop. AgentCore Identity propagates the user's OAuth token (or M2M token) to every downstream call so the audit trail stays coherent end-to-end.

04

Context assembly: RAG + tool catalog + memory

Bedrock Knowledge Bases retrieves relevant chunks via S3 Vectors. AgentCore Gateway's semantic tool discovery surfaces only the tools relevant to this task — not the full catalog — so the context window stays lean. AgentCore Memory loads session events plus extracted long-term facts (preferences, prior summaries).

05

Tool calls + behavioral policy + code execution

Every tool call routes through AgentCore Gateway (OpenAPI → MCP). AgentCore Policy evaluates the call against Cedar rules in real time — allow, deny, or escalate. Code blocks execute inside AgentCore Code Interpreter (Firecracker microVM, VPC mode for full network isolation).

06

Output guardrails + observability

The agent's response passes back through Bedrock Guardrails for output-side filtering. Every span (LLM call, tool call, policy decision) streams via OpenTelemetry to AgentCore Observability → CloudWatch Transaction Search. AgentCore Evaluations asynchronously samples live traffic and scores it on 13 built-in evaluators (correctness, helpfulness, goal success, safety, context relevance).

Watch for

No APIM-equivalent on AWS. Azure APIM bundles auth, request throttling, and the AI-aware azure-openai-token-limit policy in one service. AWS has no single equivalent — production teams compose API Gateway (request count) + Lambda Authorizer + DynamoDB (token count) + Bedrock Guardrails. This is the "GenAI Gateway" pattern AWS publishes for multi-tenant deployments.

Sandbox mode ≠ network isolation. AgentCore Code Interpreter Sandbox Mode allows DNS resolution by design — confirmed exfil vector. Use VPC Mode for any regulated workload.

No native offline eval registry. AgentCore Evaluations samples live traffic only. For dataset-driven offline evals, teams reach for Langfuse, Braintrust, or S3 + Athena.

AgentCore · Strands 1.0 · Well-Architected GenAI Lens · Proactive cost management

— Azure reference

If you built this on Azure

Same eight components, mapped to canonical Azure services. Synthesized from Microsoft Foundry Agent Service production guidance, Azure Architecture Center reference architectures, and Microsoft product team blog posts (April 2026).

Component mapping

Harness pieceAzure serviceWhy this
Context engineeringAzure AI Search (Agentic Retrieval) + Foundry IQNative vector + hybrid + semantic ranking. Agentic Retrieval (GA 2025) breaks compound questions into sub-queries — built for agent consumption.
Tool accessFoundry Toolbox + Azure Functions (MCP webhook)Toolbox centralizes versioned tool definitions on a single MCP endpoint. Functions exposes custom tools via /runtime/webhooks/mcp. Managed identity + OBO auth native.
OrchestrationFoundry Agent Service + Microsoft Agent FrameworkFoundry is the managed runtime (hosting, scaling, threads). Agent Framework (preview Oct 2025) merges Semantic Kernel + AutoGen — sequential, concurrent, group-chat, magentic patterns.
GuardrailsFoundry Guardrails (Content Safety + Prompt Shields)Four explicit intervention points: input, tool call, tool response, output. Prompt Shields handles direct + indirect prompt injection. PII detection still preview.
MemoryCosmos DB (BYO) + Foundry Managed MemoryCosmos containers: thread-message-store, system-thread-message-store, agent-entity-store. Foundry's managed layer (preview) is user-scoped, zero-infra.
Cost controlsAPIM azure-openai-token-limit + Cost Management Budgets + Automation runbooksTwo-tier: APIM hard rate limits per consumer (TPM/quota → 429); Cost Management triggers Action Groups → runbook to disable a deployment when spend breaches budget.
Observability + evalsFoundry Observability (GA) + Application InsightsOTEL traces, agent monitoring dashboard, built-in evaluators (groundedness, relevance, tool-call accuracy), continuous prod evaluation, AI red-teaming agent (PyRIT). CI/CD quality gates.
Sandbox / isolationACA Dynamic Sessions + VNet + Private EndpointsHyper-V per session (stronger than container-only), millisecond cold start, MCP endpoint native (isMCPServerEnabled: true). Confidential Computing (TEE) in preview.

How a request moves through Azure

01

Ingress through APIM AI Gateway (optionally fronted by Front Door + WAF)

Production deployments commonly add Azure Front Door (global edge, DDoS) or Application Gateway (regional) with WAF in front of APIM for content-rule filtering. APIM is where the AI-aware concerns live: the azure-openai-token-limit policy enforces per-consumer TPM and token-count quotas natively (the unique AWS gap — see below), authentication via managed identity / OAuth2, OpenAPI-based routing. Over-budget consumers get 429 + Retry-After before any agent code runs.

02

Foundry Guardrails — input intervention (point 1 of 4)

Content Safety classifiers run on the user input. Prompt Shields catches direct jailbreak attempts and indirect prompt injection (XPIA) embedded in documents or tool responses. Harm categories (Hate, Sexual, Violence, Self-harm), task adherence, and protected-material checks fire first.

03

Agent runs in Foundry Agent Service

Managed runtime — hosting, scaling, identity, thread management. Microsoft Agent Framework (Semantic Kernel + AutoGen merged) drives the loop and supports sequential, concurrent, group-chat, handoff, and magentic patterns. Agent managed identity + OBO auth applied for downstream calls.

04

Context assembly: AI Search + Toolbox + memory

Foundry Toolbox surfaces versioned tools through a single MCP-compatible endpoint. Azure AI Search Agentic Retrieval breaks compound questions into sub-queries, runs vector + hybrid + semantic ranking, and merges results — built for agent consumption. Cosmos DB thread-message-store and agent-entity-store load session memory; Foundry Managed Memory adds user-scoped long-term context.

05

Tool calls re-intercepted + code in ACA Dynamic Sessions

Each tool call (intervention point 2) and tool response (intervention point 3) is screened by Foundry Guardrails — Azure's two extra interception points beyond AWS. Code execution lands in Azure Container Apps Dynamic Sessions: Hyper-V-isolated per-session sandbox, started in milliseconds from a pre-warmed pool, exposed as a remote MCP server.

06

Output intervention + observability

Final response passes through Foundry Guardrails one more time (intervention point 4). OpenTelemetry traces stream to Foundry Observability + Application Insights — agent monitoring dashboard, distributed traces, built-in evaluators (groundedness, relevance, tool-call accuracy, task completion), and continuous production evaluation. AI red-teaming agent (PyRIT) runs scheduled adversarial tests; Cost Management Budgets can trigger Action Groups → Automation runbooks to disable a deployment when spend breaches threshold.

Watch for

Hosted agents are North Central US only in preview as of April 2026. EMEA production deployments need prompt-agent type with private networking — plan around the residency gap.

Guardrails differ materially from AWS. Azure intercepts at four points (input · tool call · tool response · output) — more agent-native than Bedrock's two-plane model. But Azure's PII detection is preview and bias detection is absent.

Sandbox is purpose-built. ACA Dynamic Sessions is the closest thing to E2B in any major cloud — Hyper-V isolated, MCP-native, sub-second startup. AWS has no first-class equivalent.

Foundry Agent Service · Foundry Guardrails · ACA Dynamic Sessions · Foundry Observability