Why can't content moderation alone protect agentic AI systems?

Agentic systems take actions—they call tools, read untrusted content, write to systems of record, and chain steps together autonomously. A content-safety classifier can catch toxic output but will not prevent an agent from exfiltrating customer records via a logging tool, executing a hallucinated financial transaction, or following malicious instructions embedded in a retrieved document. The risk surface of an agent extends well beyond what content moderation is designed to cover.

What is indirect prompt injection and why is it the top agentic AI risk in 2026?

Indirect prompt injection occurs when malicious instructions arrive inside content the agent retrieves—such as a support ticket, email, calendar invite, PDF, or MCP tool description—and the agent then carries those instructions out as if they came from the user. The OWASP Top 10 for Agentic Applications 2026 names this as the leading attack class against agents, alongside agent goal hijacking, tool misuse, and data exfiltration.

What are the five control points in an agentic guardrails architecture?

A defense-in-depth guardrail architecture places controls at five points in the agent loop: input (user prompts and retrieved content), output (model responses before they reach a tool or user), action (each tool invocation, checked for authorization, argument validity, rate, and cost), retrieval (the vector store and RAG layer, including ACLs and injection detection), and orchestration (caps on iteration count, total tokens, total spend, and wall-clock time to catch scope creep that step-level checks miss).

Which open-source frameworks should I use to build a guardrail stack?

A practical starting stack includes Llama Prompt Guard 2 and an LLM judge for input-level prompt injection detection, Microsoft Presidio for PII and PHI detection and redaction, Llama Guard 4 12B for output content safety, and Pydantic with Instructor for structured tool-call argument validation. For orchestration-level controls and centralized policy enforcement, NeMo Guardrails and proxy layers such as Invariant Gateway or agentgateway are recommended additions. No single framework covers the full risk surface; the approach is to compose from multiple tools.

When should a guardrail be hosted as a separate service rather than run in-process?

Host a guardrail as a separate service when you need centralized policy management across multiple agents—for example, a single PII redaction service, a shared content moderation model, or a policy-as-prompt evaluator that multiple teams call. This approach lets one team own the policy, update it without redeploying every agent, and see breach rates across the entire fleet in one place. Cheap, latency-critical checks such as schema validation and regex-based secrets detection should remain in-process to avoid unnecessary network overhead.

How does Domino's Governance Center support guardrail compliance and audit?

Domino's Governance Center provides an immutable audit trail of every guardrail outcome, versioned policy storage alongside agent code in the project repo, and deployment gating that can block production releases until guardrail thresholds are met and required approvals are granted. Scheduled evaluation jobs run adversarial and functional tests against production traces continuously, with results visible in the agent's Performance tab. This produces the record-keeping, human oversight, and adversarial robustness evidence required by frameworks such as the EU AI Act, NIST AI RMF, and ISO 42001.

Composable guardrails for agentic AI

Overview and goals

The challenge

Picture an AI-powered customer support operation: the agent reads an incoming ticket, fetches the attached PDF for context, calls a refund tool with an erroneous $50,000 argument, and moves on to the next ticket. No hallucination, no content-safety violation. The agent did exactly what the retrieved document instructed it to do, and the audit trail has nothing useful to say about why.

This is the failure mode that content moderation alone cannot catch. Agentic systems take actions. They call tools, read untrusted content, write to systems of record, spend money, and chain steps together without checking back in. This pattern widens the risk surface beyond what a content-safety classifier can cover. The dominant attack class against agents in 2026 is indirect prompt injection, in which malicious instructions arrive inside a retrieved document, a calendar invite, a support ticket, or an MCP (Model Context Protocol) tool description, and the agent then carries them out as if they came from the user. The OWASP Top 10 for Agentic Applications 2026 names this attack class at the top of the list, alongside agent goal hijacking, tool misuse, and data exfiltration.

The dominant operational failure is tool misuse: a hallucinated argument, a permission boundary the model did not know existed, a loop that exhausts a token budget overnight. A content guardrail catches toxic output. The same guardrail will let an agent exfiltrate customer records via a logging tool or schedule 10,000 support emails because a retrieved document told it to.

The solution

Domino deliberately does not ship a runtime guardrails framework. A clinical agent under HIPAA, a payments agent under PCI, and a defense agent under ITAR need guardrails that actively contradict one another, and the threat picture moves faster than any platform vendor's release schedule. Domino's job is the governance foundation, including the immutable audit trail, lineage from agent back to its data and code, and scheduled evaluations against production traffic. The policy content is yours to compose, using the right combination of input, output, action, retrieval, and orchestration guardrails for your risk surface. Every guardrail you compose is a traced function that enforces policy at runtime and produces a labeled span in the audit trail using the same instrumentation pattern.

The defense-in-depth approach is composed from open-source frameworks. Instrument each guardrail as a traced evaluation, following the pattern from the Automated GenAI Tracing Blueprint, and host heavy or central-policy guardrails as their own Domino agents that other agents call over A2A (agent-to-agent protocol).

The Governance Center sits atop all this, gating production deployments of agents and guardrails against defined thresholds. Versioned policies, lineage from the production agent back to the experiment run that validated it, and scheduled adversarial evaluations alongside functional metrics all live there too.

When to consider guardrails and governance in Domino

Your agent takes irreversible actions. It writes to a system of record, sends communications, moves money, or schedules work with downstream consequences.
Your agent reads untrusted content. Email, web pages, customer support tickets, retrieved documents, MCP tool descriptions. These are the entry points for indirect prompt injection.
You operate in a regulated environment. Financial services, life sciences, defense, or EU AI Act high-risk categories. The audit trail is a regulatory artifact.
You run more than one agent in production. Per-team policies diverge quickly; centralizing guardrails pays off the moment you have more than one.
Your agent handles sensitive data. PII, PHI, PCI, ITAR, or secrets.
You need to defend a deployment decision. Model risk management, internal validation, regulator audit, or vendor security review. Without versioned policies and traced guardrail outcomes, the defense is anecdotal.

How to design and deploy guardrails in Domino

Step 1. Map your risk surface

Name what you are defending against before choosing tools. The OWASP Top 10 for Agentic Applications 2026 is the practitioner's checklist; it differs from the LLM Top 10 because agentic risk comes from autonomous action. The failure modes worth designing against include agent goal hijacking, direct and indirect prompt injection, tool misuse (including hallucinated arguments), data exfiltration via tool calls, scope creep, memory poisoning, and resource exhaustion.

Walk through your agent’s loop and identify five places where a control point can sit:

Input. The user prompt, plus any retrieved documents or prior tool results that will enter the model context, including content from sources you don’t fully control.
Output. Each model response, evaluated before it reaches a downstream tool, agent, or user.
Action. Each tool invocation, evaluated for authorization, argument validity, rate, and cost before it executes.
Retrieval. The vector store, the search API, and the RAG (retrieval-augmented generation) layer. Access control lists (ACLs), source whitelisting, and citation enforcement live here.
Orchestration. The loop itself. Cap iteration count, total tokens, total spend, and total wall-clock time. This is where you catch scope creep that step-level checks miss.
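
The orchestration caps can be sketched as a small budget object that the agent loop charges after every step. This is a minimal illustration, not a Domino or framework API: `LoopBudget`, `BudgetExceeded`, and the default limits are hypothetical names and values chosen for the example.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when the agent loop crosses any orchestration cap."""


@dataclass
class LoopBudget:
    # Hypothetical limits for illustration; tune them to your own risk profile.
    max_iterations: int = 10
    max_tokens: int = 50_000
    max_spend_usd: float = 2.00
    max_seconds: float = 120.0
    iterations: int = 0
    tokens: int = 0
    spend_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int, cost_usd: float) -> None:
        """Record one loop step, then raise if any cap has been crossed."""
        self.iterations += 1
        self.tokens += tokens
        self.spend_usd += cost_usd
        elapsed = time.monotonic() - self.started_at
        checks = [
            ("iterations", self.iterations, self.max_iterations),
            ("tokens", self.tokens, self.max_tokens),
            ("spend_usd", self.spend_usd, self.max_spend_usd),
            ("seconds", elapsed, self.max_seconds),
        ]
        for name, used, limit in checks:
            if used > limit:
                raise BudgetExceeded(f"{name} cap exceeded: {used} > {limit}")
```

Calling `budget.charge(...)` once per loop iteration, after the model or tool call returns, means a runaway loop stops at the cap even when every individual step passes its own checks.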

Two design principles run through the rest of this blueprint. First, least-agency, the agentic extension of least-privilege, means giving the agent the minimum tools and permissions it needs and no more. Second, the Swiss Cheese model holds that no single guardrail is sufficient; defense comes from layering independent guardrails so their holes do not align.
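
The arithmetic behind the Swiss Cheese model is worth seeing once. The sketch below assumes the layers fail independently, which is the idealized case, and uses made-up per-layer miss rates; `combined_miss_rate` is a hypothetical helper, not part of any framework.

```python
import math


def combined_miss_rate(miss_rates: list[float]) -> float:
    """Probability that an attack slips past every layer, assuming independence."""
    return math.prod(miss_rates)


# Hypothetical per-layer miss rates: input classifier, action check, output check.
layers = [0.20, 0.30, 0.10]
overall = combined_miss_rate(layers)
```

With these illustrative numbers, three layers that each miss 10 to 30 percent of attacks together miss about 0.6 percent. Layers whose failures are correlated, such as two detectors built on the same base model, would do worse than the product suggests, which is why the independence of the layers matters as much as their count.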

Here's a practical tip: name your acceptable-risk profile up front. There is a real tradeoff between safety and utility, and stronger defenses cost capability. A clinical decision-support agent under HIPAA sits at the strict end, with human review on every irreversible action and zero tolerance for hallucinated drug names. A sell-side financial research assistant sits in the middle, with no execution authority and controls on outbound queries to prevent retrieved private content from leaking as arguments to external services. An internal IT helpdesk agent sits at the looser end, with schema-validated tool calls and soft fallbacks rather than hard stops when a guardrail is triggered. Pick a posture that matches your threat tier, document it, and revisit it when the threat picture changes.

Step 2. Choose your guardrail stack

Compose guardrails from frameworks that already exist. None of them solve the whole problem, and you should not expect them to. Domino's tracing works across any agent framework, with built-in support for the popular ones (OpenAI Agents SDK, LangChain, Pydantic AI, LlamaIndex, and more) and manual instrumentation for the rest, so you can mix detectors from multiple frameworks in the same agent without losing the unified trace view.

If you have no existing constraints, start here:

Input: Llama Prompt Guard 2 (22M) plus an LLM judge for prompt injection, Presidio for PII
Output: Llama Guard 4 12B for content safety, Pydantic with Instructor for tool-call argument validation
Action: tool guardrails wrapping every tool call, with authorization, schema validation, rate limits, and cost meters implemented inside (OpenAI Agents SDK has the clearest pattern, with pydantic-ai-guardrails and LangGraph as equivalents)
Retrieval: chunk-level ACLs, citation requirements, abstention rules, prompt-injection detectors on every retrieved document
Orchestration: caps on iteration count, total tokens, total spend, total wall-clock time

Add NeMo Guardrails if you need Colang-style dialog policies. Add a proxy layer (Invariant Gateway, agentgateway) if you want centralized policy enforcement across agents. The rest of this section covers each category in depth, including alternatives where the default does not fit.

For input guardrails, Llama Prompt Guard 2 (22M and 86M classifier variants from Meta) is the standard drop-in detector for direct and indirect prompt injection; the 22M variant trades a small accuracy delta for roughly 75 percent lower latency and compute costs. Microsoft Presidio is the open-source default for detecting personally identifiable information (PII), protected health information (PHI), payment card industry data (PCI), and secrets, with separate detection (named-entity recognition plus pattern matching) and transformation (mask, redact, hash, encrypt) layers. NeMo Guardrails from NVIDIA covers denied topics, off-domain queries, and dialog rails through Colang, its domain-specific language for dialog policies.

For output guardrails, Llama Guard 4 12B handles content safety on both text and images in a single classifier. Pydantic schemas and Instructor turn structured-output validation into a contract with automatic retry on failure, the cheapest mitigation for hallucinated tool arguments. NeMo Guardrails and Amazon Bedrock Guardrails both ship grounding detectors for hallucination on RAG outputs.

For action guardrails, the framework matters less than the discipline. Wrap every tool call with an authorization check, schema validation on arguments, a rate limit, a cost meter, and a kill switch. OpenAI Agents SDK Guardrails has first-class input, output, and tool-call guardrails with optimistic concurrent execution: checks run in parallel with the agent so latency stays low when checks pass. LangGraph has native tool scoping and human-in-the-loop primitives. For Pydantic AI users, pydantic-ai-guardrails mirrors OpenAI's design. Invariant Gateway and the open-source agentgateway project run as proxies in front of model and tool APIs. MCP-scan from Invariant Labs is the right tool for auditing installed MCP servers for tool poisoning. Domino is also building an LLM Gateway in this category; reach out for early access.
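
The wrapping discipline described above can be sketched in plain Python, independent of any agent framework. Everything here is an assumption made for illustration: `ToolPolicy`, `guarded_call`, `validate_refund`, `issue_refund`, the role name, and the 500-dollar ceiling are hypothetical, and a real deployment would use the guardrail hooks of whichever framework it runs on.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


class ActionBlocked(Exception):
    """Raised when a tool call fails any action-level check."""


@dataclass
class ToolPolicy:
    allowed_roles: set[str]
    validate: Callable[[dict], None]   # raises ValueError on bad arguments
    max_calls_per_minute: int = 5
    max_total_cost_usd: float = 1.00
    enabled: bool = True               # kill switch: set False to stop all calls
    spent_usd: float = 0.0
    recent_calls: list[float] = field(default_factory=list)


def guarded_call(tool: Callable[..., Any], policy: ToolPolicy, role: str,
                 args: dict, cost_usd: float, now: Optional[float] = None) -> Any:
    """Run every action-level check, then execute the tool only if all pass."""
    now = time.monotonic() if now is None else now
    if not policy.enabled:
        raise ActionBlocked("kill switch engaged")
    if role not in policy.allowed_roles:
        raise ActionBlocked(f"role {role!r} is not authorized")
    try:
        policy.validate(args)
    except ValueError as exc:
        raise ActionBlocked(f"invalid arguments: {exc}") from exc
    # Sliding one-minute window for the rate limit.
    policy.recent_calls[:] = [t for t in policy.recent_calls if now - t < 60.0]
    if len(policy.recent_calls) >= policy.max_calls_per_minute:
        raise ActionBlocked("rate limit exceeded")
    if policy.spent_usd + cost_usd > policy.max_total_cost_usd:
        raise ActionBlocked("cost budget exceeded")
    policy.recent_calls.append(now)
    policy.spent_usd += cost_usd
    return tool(**args)


def validate_refund(args: dict) -> None:
    amount = args.get("amount_usd")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        raise ValueError("amount_usd must be a number")
    if not 0 < amount <= 500:   # hypothetical per-refund ceiling
        raise ValueError("amount_usd must be between 0 and 500")


def issue_refund(amount_usd: float) -> str:
    return f"refunded {amount_usd:.2f}"   # stand-in for the real tool
```

Under this sketch, the $50,000 refund from the opening scenario fails argument validation and never reaches the tool, regardless of what the retrieved document told the model to do.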

For retrieval guardrails, enforce ACLs at the chunk level in your vector store. Add citation requirements to your prompts and an abstention rule when evidence is insufficient. Treat retrieved documents as untrusted input and run them through the same injection detectors you apply to user prompts. Also treat retrieved content as a source of leaks: filter or rewrite tool-call arguments so private retrieval context does not flow to external services as query parameters.
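
A rough sketch of the first and third retrieval controls, chunk-level ACLs and injection screening of retrieved text, might look like the following. The `Chunk` fields and `admit_chunks` are hypothetical, and the regular expression is a deliberately crude stand-in for a real injection classifier such as the one you run on user prompts; in practice the ACL filter would usually be pushed into the vector store query itself.

```python
import re
from dataclasses import dataclass

# Crude heuristic used only for illustration; a trained classifier belongs here.
INJECTION_HINTS = re.compile(
    r"ignore (?:all |any )?(?:previous|prior) instructions"
    r"|disregard the system prompt",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    allowed_groups: frozenset[str]


def admit_chunks(chunks: list[Chunk], user_groups: set[str]):
    """Split retrieved chunks into those safe to add to context and those quarantined."""
    admitted, quarantined = [], []
    for chunk in chunks:
        if not (chunk.allowed_groups & user_groups):
            continue  # ACL: the caller may not see this chunk at all
        if INJECTION_HINTS.search(chunk.text):
            quarantined.append(chunk)  # keep for review, never pass to the model
        else:
            admitted.append(chunk)
    return admitted, quarantined
```

Quarantining rather than silently dropping suspicious chunks keeps the evidence available for the audit trail and for building an eval set later.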

The pattern most mature teams converge on is hybrid. Run cheap, latency-critical guardrails (schema validation, regex secrets, denied-topic lookups) in process. Run heavy, semantic, or central-policy guardrails (Llama Guard, hallucination detectors, policy-as-prompt evaluators) as services or as agents your primary agent calls. The model-based detectors in that second group are typically open-weight; the Deploying Self-Hosted LLMs Blueprint covers sizing and deployment on Domino. The hybrid setup trades a network hop for centralized policy management and a single auditable surface.

Step 3. Build your own where off-the-shelf falls short

Off-the-shelf detectors cover the common cases. Your industry, your data, and your tools likely produce cases that the common detectors miss. Build the missing guardrails with the same discipline you use for any other model.

Start with the cheapest mechanism that works. A regex catches obvious account-number leakage, internal project codes, or banned URLs faster than any model. A keyword list catches in-policy versus out-of-policy topics. These run in microseconds, cost nothing to operate, and explain themselves at audit.
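
As a minimal sketch of the rule-based tier, the snippet below scans text against a few named patterns. The account-number and project-code formats and the banned domain are invented for the example; substitute the formats your organization actually uses.

```python
import re

# Hypothetical formats; replace with your real identifiers and blocklist.
RULES = {
    "account_number": re.compile(r"\bACCT-\d{8}\b"),
    "internal_project_code": re.compile(r"\bPRJ-[A-Z]{3}-\d{3}\b"),
    "banned_url": re.compile(
        r"https?://(?:[\w-]+\.)*leaks\.example\.com\b", re.IGNORECASE
    ),
}


def scan(text: str) -> list[tuple[str, str]]:
    """Return (rule_name, matched_text) for every rule hit, in rule order."""
    hits = []
    for name, pattern in RULES.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)))
    return hits
```

Because each hit carries the name of the rule that fired, the result explains itself at audit without any further interpretation.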

Move to a small classifier model when rules are too brittle. Detoxify (PyTorch, from Unitary) wraps Jigsaw's Toxic Comment Classification work into a Python import and runs on CPU. Any Hugging Face transformer classifier fine-tuned on toxic, biased, or off-policy examples works the same way. Sub-300M-parameter classifiers ship as standard packages and slot into an @add_tracing function like any other guardrail.

Fine-tune a domain-specific classifier when general detectors stall. Start from a small base (DistilBERT, RoBERTa, or ModernBERT) and train on the labeled examples your incident response team has already produced. The data you already have on prior failures serves as the guardrail against future failures.

Use an LLM as a judge for the semantic checks that deterministic rules miss. A smaller open-weight model (Llama 3.1 8B Instruct, Mistral 7B Instruct, Qwen 2.5 7B) hosted on Domino can evaluate "does this response match the policy" or "does this tool call match the user's intent" with structured output. See the Deploying Self-Hosted LLMs Blueprint for sizing and deployment of the judge model itself.
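
One way to structure a judge with structured output is sketched below. The model call is injected as a plain callable, `call_model`, which is a hypothetical stand-in for whatever client you use to reach the hosted judge; the prompt wording and the `judge_response` helper are likewise illustrative. The sketch fails closed: if the judge's reply cannot be parsed, the response is treated as non-compliant.

```python
import json
from typing import Callable

JUDGE_PROMPT = """You are a policy judge. Policy:
{policy}

Agent response:
{response}

Reply with JSON only: {{"compliant": true or false, "reason": "<one sentence>"}}"""


def judge_response(policy: str, response: str,
                   call_model: Callable[[str], str]) -> dict:
    """Ask the judge model for a verdict and return a validated dict."""
    raw = call_model(JUDGE_PROMPT.format(policy=policy, response=response))
    try:
        verdict = json.loads(raw)
        compliant = verdict["compliant"]
        if not isinstance(compliant, bool):
            raise ValueError("compliant must be a boolean")
        reason = str(verdict.get("reason", ""))
    except (KeyError, TypeError, ValueError):
        # Fail closed: an unparseable verdict is treated as a breach.
        return {"compliant": False, "reason": "unparseable judge output"}
    return {"compliant": compliant, "reason": reason}
```

Injecting the model call keeps the judge testable against canned replies, which is what makes it possible to include the judge itself in an eval set.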

Whatever you build, build the eval (evaluation) set first. A guardrail without an eval set is a guess. The eval set lives in the project repo next to the guardrail code, gets versioned with it, and runs every time the guardrail changes. Without this discipline, you cannot tell whether tightening a threshold helped or hurt.
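
A minimal sketch of such an eval harness follows. It assumes the guardrail can be reduced to a function that returns True when it would block, and that the eval set is a list of (text, should_block) pairs; `evaluate_guardrail` and the toy examples are hypothetical.

```python
from typing import Callable


def evaluate_guardrail(guardrail: Callable[[str], bool],
                       eval_set: list[tuple[str, bool]]) -> dict:
    """Score a guardrail against labeled examples of (text, should_block)."""
    tp = fp = fn = tn = 0
    for text, should_block in eval_set:
        blocked = guardrail(text)
        if blocked and should_block:
            tp += 1
        elif blocked and not should_block:
            fp += 1
        elif not blocked and should_block:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}


# Toy guardrail and eval set, purely for illustration.
def naive_ssn_guardrail(text: str) -> bool:
    return "ssn" in text.lower()


EVAL_SET = [
    ("My SSN is 123-45-6789", True),
    ("What is an SSN?", False),
    ("Card 4111 1111 1111 1111", True),
    ("Reset my password", False),
]
report = evaluate_guardrail(naive_ssn_guardrail, EVAL_SET)
```

Running the same harness before and after a threshold change gives a direct comparison of precision and recall, which is the evidence needed to say whether the change helped or hurt.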

Step 4. Instrument guardrails as traced evaluations

The Domino-specific move is to put every guardrail inside a traced function. Use the same @add_tracing decorator you use to instrument the agent itself. There is no separate guardrails framework on Domino; the traced function does dual duty, enforcing policy at runtime and producing a labeled audit-trail span from the same call. Evaluations get logged against the resulting traces out of band. The example below shows the in-process pattern. Step 5 covers when to promote a guardrail to a hosted A2A service.

import asyncio
from domino.agents.tracing import add_tracing
from presidio_analyzer import AnalyzerEngine

# Tune PII_THRESHOLD per entity type and language using a labeled eval set.
# 0.7 is a reasonable starting point for English-language PII; lower values
# catch more, including false positives. The companion repo loads this from
# config.yaml so the threshold versions with the project.
PII_THRESHOLD = 0.7


class GuardrailBreach(Exception):
    """Raised when an input or output violates policy.

    Keep `details` keys stable. Downstream evaluation Jobs filter on them.
    """
    def __init__(self, message: str, rail: str = "", details: dict | None = None):
        super().__init__(message)
        self.rail = rail
        self.details = details or {}


analyzer = AnalyzerEngine()


@add_tracing(name="input_pii_check")
async def input_pii_check(user_prompt: str) -> dict:
    # Presidio's analyze() is synchronous and CPU-bound. Run it in a
    # worker thread so the agent's event loop is not blocked.
    findings = await asyncio.to_thread(
        analyzer.analyze, text=user_prompt, language="en"
    )
    over = [f for f in findings if f.score > PII_THRESHOLD]

    if over:
        raise GuardrailBreach(
            "PII detected in user prompt",
            rail="input_pii_check",
            details={
                "entity_types": sorted({f.entity_type for f in over}),
                "max_score": max(f.score for f in over),
                "count": len(over),
            },
        )
    return {"findings": findings}


# call_underlying_agent() and safe_fallback_response() are placeholders for
# your own agent entry point and fallback handler; they are not defined here.
@add_tracing(name="run_agent")
async def run_agent(user_prompt: str) -> str:
    try:
        await input_pii_check(user_prompt)
        return await call_underlying_agent(user_prompt)
    except GuardrailBreach:
        # The failed span is already in the trace. Route to a safe fallback
        # or queue for human review depending on policy.
        return safe_fallback_response()

Three things to notice:

The decorator is the standard one documented in Develop agentic systems. There is no separate guardrail tracing API; guardrails are traced functions that just happen to enforce policy.
The breach raises an exception. The agent's parent trace shows a failed span at that step, which run_agent catches and routes to a safe fallback. The trace data is written either way, so the audit trail captures both the catch and the outcome. Queuing the request for a human reviewer (the human-in-the-loop pattern) is the other common route from the same except clause.
The guardrail does not call log_evaluation() inline, so user-facing latency stays predictable.

For the evaluation logging itself, schedule a Domino Job that walks recent production traces with search_agent_traces() and attaches scores and labels via log_evaluation() (which lives in domino.agents.logging). Once a day is typical for compliance reporting; more often if you need near-real-time alerts.

Wrap action and output guardrails the same way. A tool call wrapper that checks authorization and validates arguments is a traced function. An output guardrail running Llama Guard on the model's response is a traced function. By the time your agent is built, every control point in Step 1 has its own span in the trace, ready for evaluation logging.

Here's a practical tip: in your evaluation Job, log labels you will actually filter on later. A useful starting taxonomy:

breach_category: pii, prompt_injection, off_policy, hallucination, tool_misuse
breach_severity: high, medium, low
enforcement_action: block, fallback, human_review, allow
entity_types_detected: list of specific types found, e.g., email, phone, ssn
policy_version: the guardrail policy active when the breach was logged

Six months later, a validation reviewer can answer "show me every high-severity PII breach since the v3 policy rollout" in a single query. A generic policy_violation=true cannot.
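
To make the taxonomy concrete, here is a sketch of a helper that turns a breach's `rail` and `details` into the label set above. Only `input_pii_check` comes from the earlier example; the other rail names, the severity rule, and `build_labels` itself are assumptions for illustration. The entity names are Presidio's built-in identifiers, lowercased for consistency. How the resulting dictionary is passed to log_evaluation() is not shown, since that depends on the Domino SDK's signature.

```python
# Mapping from guardrail name to breach category. Only "input_pii_check"
# appears in the example above; the other rail names are hypothetical.
RAIL_TO_CATEGORY = {
    "input_pii_check": "pii",
    "input_injection_check": "prompt_injection",
    "output_policy_check": "off_policy",
    "output_grounding_check": "hallucination",
    "tool_call_check": "tool_misuse",
}
HIGH_RISK_ENTITIES = {"US_SSN", "CREDIT_CARD"}   # illustrative severity rule
ENFORCEMENT_ACTIONS = {"block", "fallback", "human_review", "allow"}


def build_labels(rail: str, details: dict, enforcement_action: str,
                 policy_version: str) -> dict:
    """Build a filterable label set for one guardrail breach."""
    if enforcement_action not in ENFORCEMENT_ACTIONS:
        raise ValueError(f"unknown enforcement action: {enforcement_action}")
    entity_types = sorted(details.get("entity_types", []))
    if HIGH_RISK_ENTITIES & set(entity_types):
        severity = "high"
    elif details.get("max_score", 0.0) >= 0.85:
        severity = "medium"
    else:
        severity = "low"
    return {
        # A KeyError on an unmapped rail is deliberate: new guardrails
        # should not be silently mislabeled.
        "breach_category": RAIL_TO_CATEGORY[rail],
        "breach_severity": severity,
        "enforcement_action": enforcement_action,
        "entity_types_detected": [e.lower() for e in entity_types],
        "policy_version": policy_version,
    }
```

Keeping the mapping in one function means the label vocabulary changes in a single place, under version control, rather than drifting across evaluation Jobs.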

Step 5. Host high-stakes guardrails as agents

Some guardrails belong inside your agent process and some belong as separate services, each with different tradeoffs. The case for hosting a guardrail as a service or as its own agent is centralization. One PII redactor, one moderation model, one policy-as-prompt evaluator, called by every agent in the organization, lets a single team own policy and a single dashboard show breach rates across the fleet. The cost is a network hop (typically 30 to 200 milliseconds) and less context (the service sees the request but not the calling agent's full trajectory).

The pattern uses the same A2A primitives covered in the simple_agent_api_only repo. Each guardrail is a pydantic_ai agent exposed over the A2A protocol with agent.to_a2a(...), deployed on Domino with bearer-token auth, including token rotation, per-agent scope, and audit logging on every call.

Three guardrails worth hosting as separate services:

A PII redaction service running Presidio behind an A2A endpoint. Called on every input and every output across every agent in the organization. One place to update entity recognizers and transformation policy.
A content moderation service running Llama Guard 4 12B on a dedicated Domino model endpoint. See the Deploying Self-Hosted LLMs Blueprint for sizing and deployment. Policy updates roll out without redeploying every agent.
A policy-as-prompt evaluator that takes a natural-language policy and an agent response, returns a structured judgment, and logs the result. Useful when policy changes faster than code.

The architecture is hybrid by design. Cheap input validation runs in the calling agent's process and short-circuits before any network call is made. Heavy semantic guardrails run as services, so the policy is centrally managed. Both emit traces in the same form, so the audit trail is unified regardless of where the guardrail is executed.
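
The short-circuit ordering can be sketched as follows. The hosted guardrail is represented by an injected callable, `remote_moderation`, which is a hypothetical stand-in for the A2A call to a moderation service; the secret pattern is an illustrative token shape, not a complete secrets detector.

```python
import re
from typing import Callable

# Illustrative token shapes only; real secret scanners cover many more formats.
SECRET_PATTERN = re.compile(r"\b(?:sk|ghp)_[A-Za-z0-9]{20,}\b")


def check_input(text: str, remote_moderation: Callable[[str], bool]) -> str:
    """Run the cheap in-process check first; call the service only if it passes.

    remote_moderation returns True when the text is acceptable.
    """
    if SECRET_PATTERN.search(text):
        return "blocked_in_process"      # no network call was made
    if not remote_moderation(text):
        return "blocked_by_service"
    return "allowed"
```

Ordering the checks this way means the network hop is paid only by requests that survive the in-process tier, and a secret pasted into a prompt never leaves the calling process.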

Step 6. Wire into Governance Center

Every guardrail artifact has a home in Domino's existing governance surface and can be mapped one to one.

Guardrail policies and prompts live in the project repo alongside agent code. Versioned, propagated with the project's environment, and reproducible from any prior commit.
Guardrail outcomes land as evaluations on traces, visible in the agent's Performance tab in production (and Experiment Manager during development) and queryable from code via search_agent_traces() at any time. The audit trail is automatic and immutable.
Governance policies require approvals before any agent ships to production. Risk officers define stages, required evidence, and automated checks, including that the calling agent's guardrails meet defined thresholds. Deployment to production hardware can be blocked until the specified approvals are granted.
Continuous monitoring runs scheduled evaluation Jobs against production traces. Adversarial pass rates appear alongside functional metrics in the agent's Performance tab. Regressions trigger alerts.
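
The logic of a threshold-and-approval gate can be sketched in a few lines. This is not the Governance Center API; `evaluate_gate`, the metric names, and the approver roles are all hypothetical, and the sketch only illustrates the kind of automated check a governance policy might encode.

```python
def evaluate_gate(metrics: dict[str, float], thresholds: dict[str, float],
                  approvals: set[str], required_approvals: set[str]):
    """Return (passed, failures) for a deployment gate.

    Each threshold is a minimum; a metric with no recorded value fails the gate.
    """
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: no evidence recorded")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} below required {minimum:.3f}")
    for role in sorted(required_approvals - approvals):
        failures.append(f"approval missing: {role}")
    return (not failures, failures)
```

Treating a missing metric as a failure, rather than skipping it, keeps an agent from passing the gate simply because an evaluation Job never ran.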

The same mechanics give validation teams something that checkbox compliance cannot. Every production interaction is a labeled trace, queryable by agent version, time window, or guardrail outcome, so review cycles work from the full population of real interactions rather than a curated set of examples. Where the opening scenario's audit trail had nothing useful to say, every step is now in the trace.

The result is concrete on the regulatory side. EU AI Act, NIST AI RMF, and ISO 42001 all want the same evidence in different formats: record-keeping, human oversight, and adversarial robustness. Produce the artifacts once and present them as needed.

Step 7. Run adversarial evaluation continuously

AgentDojo is the leading open security benchmark and a good place to start. Built by ETH Zürich's SPYLab and Invariant Labs, it runs your agent through realistic workspace, travel, banking, and Slack tasks while injecting attacks at each tool boundary, then scores whether the agent stayed on task or was hijacked. Run it as a scheduled Domino Job against staging and against a small subset of production traffic. Use the same log_evaluation() mechanism so adversarial pass rates show up alongside functional metrics, with the same lineage back to the agent version that produced them.

Domain-specific eval sets matter where AgentDojo's surfaces do not match yours. A clinical agent needs medication-name confusion tests. A trading agent needs spoofed market data tests. Build them once and schedule them like any other Domino Job. For the umbrella workflow tying tracing, evaluation, deployment, and monitoring together, see Build and evaluate agentic systems in the Domino docs.

These seven steps add up to one pattern: compose guardrails from existing frameworks, wrap each in @add_tracing, host the heavy ones as agents over A2A, and govern the whole stack through the Governance Center. The ML engineer gets a buildable architecture. The platform lead gets a single auditable surface. The validation reviewer gets the evidence to defend a deploy decision. None of it depends on Domino shipping a runtime guardrails framework, and all of it survives the next time the threat picture shifts.
