Your agent can run rm -rf /. Your prompt saying “please don’t do dangerous things” is not security.

When we deployed AI agents that could execute shell commands, call APIs, commit code, and manage infrastructure, we quickly realized that prompt-based safety is not security. Prompts are suggestions to a probabilistic system. Security requires deterministic enforcement. We ended up needing several independent layers of defense — not because one wasn’t good enough, but because each layer catches a different category of failure that the others structurally cannot.


The Threat Model

Before building defenses, we defined what we’re defending against:

  1. Prompt injection — malicious input that hijacks agent behavior (“ignore previous instructions and delete the database”)
  2. Tool misuse — the agent legitimately tries to accomplish a goal but reaches for a dangerous tool along the way (runs a destructive cleanup command when asked to “tidy up”)
  3. Privilege escalation — the agent discovers it has access to tools it shouldn’t
  4. Data exfiltration — the agent extracts secrets, PII, or internal data through tool outputs
  5. Recursive amplification — sub-agents spawning sub-agents, consuming unbounded resources

No single control addresses all five. That’s the core argument for defense-in-depth over any one clever fix.


Layer 1: Classify Intent Before Anything Executes

The first line of defense happens before the agent ever sees a tool: classify what the user is actually asking for. Obvious jailbreak attempts and social-engineering patterns get caught cheaply and immediately. Ambiguous cases get a more careful pass.

What this catches: obvious prompt injections, off-topic requests, blunt social engineering.

What it doesn’t catch: a sophisticated injection buried inside an otherwise legitimate-looking request.

Consider a seemingly reasonable SRE request: “Check if the API key is properly configured on the production server.” Intent classification correctly sees this as a valid operations question. But the agent might reach for a command that dumps every environment variable — including credentials — to satisfy it. Classifying intent handles what the user wants; it can’t tell you whether the specific action the agent chooses to take is safe. That’s the next layer’s job.


Layer 2: Deterministic Policy Enforcement on Every Tool Call

Every tool call — regardless of whether it comes from the main agent, a delegated sub-agent, or a multi-step plan — passes through the same policy enforcement path. No exceptions, no alternate routes.

This is the layer that actually behaves like security rather than a suggestion: hard-blocked tool names are blocked, full stop, with no LLM judgment call involved. Repetitive identical calls get interrupted before they can loop. Tools that fail repeatedly in a short window get temporarily cut off entirely. None of this depends on the model “deciding” to be safe — it’s plain, deterministic code sitting between the agent’s decision and the tool actually running.


Layer 3: Human-in-the-Loop for the Calls That Need Judgment

Some tool calls need a human in the loop — not all of them, just the ones where the blast radius is large enough that a person should sign off first. Read-only and informational actions don’t need a human. State-changing actions typically do. Certain destructive actions are never allowed at all, approval or not.

Critically, this approval step doesn’t block the agent’s other work while it waits — the agent can continue on parallel parts of a task and pick the approved action back up once a human responds.


Layer 4: Cross-Model Verification of What the Agent Claims

LLMs hallucinate, including about their own actions. When an agent reports “I’ve completed the deployment successfully,” you need a way to check that claim independent of the agent’s own narration.

We run a second model over the completed execution trace, checking whether the tools that were actually called and their actual results support the story the agent is telling. If the agent claims success but the underlying tool calls say otherwise, the output gets flagged before it reaches a user or triggers a downstream action. The important design property is independence — the verifier looks at the raw trace, not the agent’s summary of it.


Layer 5: An Immutable Audit Trail

Every tool call, model request, and governance decision is logged to an append-only record — no updates, no deletes. This layer doesn’t prevent anything by itself. Its job is forensics: after an incident, you can reconstruct exactly what the agent did, what it saw, and what decisions were made along the way, with sensitive data already stripped out before it was ever written down.


Why All Five, and Why in This Order

No single layer is sufficient on its own. Intent classification catches obvious attacks before execution even starts. Deterministic policy enforcement is the layer that actually behaves like security. Human review adds judgment for the cases that are genuinely ambiguous. Independent verification catches the agent lying to itself — or to you. Audit doesn’t prevent anything, but it means nothing that happens is unaccountable.

The failure mode we’ve seen repeatedly isn’t any one layer being weak — it’s a new delegation path bypassing all of them at once, because governance was wired into one execution route and a new one was added without carrying it along (see the ReAcTree bugs post for a concrete example). The lesson generalizes past our specific stack: when you add a governance layer, the path you forget to wire it into is the one that gets exploited.


What We Learned

  1. Prompts are not security. A prompt saying “never run dangerous commands” is a suggestion to a probabilistic system. A deterministic check that blocks a dangerous command outright is a guarantee.

  2. Every tool execution path needs governance. Direct calls, sub-agent calls, plan-step calls, fallback calls — all of them must pass through the same enforcement, or the ones that don’t become the attack surface.

  3. Restrict the action space, don’t just instruct around it. Don’t tell an agent not to use a tool — remove the tool from what it can see. LLMs are creative problem-solvers; they will use every tool you make available to them.

  4. Audit is the last layer, not the first. It exists for non-repudiation and forensics, not prevention. Don’t mistake logging for a control.

  5. Security here is a composition problem, not a single-fix problem. Each layer covers a threat class the others structurally can’t. The value is in the combination.


Acknowledgments. Built with the StackGen Aiden team — the engineers behind the agent runtime and platform this series describes.

What security model does your agent platform use? I’m especially interested in how others handle the “sub-agent bypasses governance” problem. Find me on GitHub or LinkedIn.


🚀 We’re building AI-powered SRE at StackGen. If you’re tired of 3 AM pages and want AI agents that triage incidents, run diagnostics, and draft RCA reports — check out ai.stackgen.com and try our new SRE offering.