arXiv — AgentDoG: a diagnostic guardrail framework for AI agent safety and security
• Category: Research
- AgentDoG (Diagnostic Guardrail for Agents) is a new open-source framework that monitors AI agent trajectories — the full sequence of tool calls, reasoning steps, and environment interactions — and provides fine-grained diagnostic labels rather than simple safe/unsafe binary outputs.
- The paper introduces a three-dimensional risk taxonomy that classifies agentic risks along orthogonal axes: source (where the risk originates — user input, tool output, environment), failure mode (how it manifests — misuse, overreach, deception), and consequence (what happens — data leak, unauthorized action, system compromise).
- Built on this taxonomy, a new benchmark called ATBench provides structured evaluation scenarios covering complex multi-step agentic interactions — going beyond single-turn prompt-injection tests to capture real-world agent failure chains.
- AgentDoG models are released in three sizes (4B, 7B, 8B parameters) across Qwen and Llama families, making deployment practical across different compute budgets — from edge inference to datacenter monitoring.
- A key differentiator is root-cause diagnosis: when AgentDoG flags an action, it explains the provenance — which step in the trajectory introduced the risk, what failure mode applies, and what downstream consequence is expected. This is designed to support iterative agent alignment rather than just blocking actions.
- The framework can also detect "seemingly safe but unreasonable" actions — cases where an agent's intermediate step doesn't violate any explicit rule but is logically inconsistent with the stated goal, a pattern common in tool-shadowing and intent-drift attacks.
- The authors report state-of-the-art performance on agentic safety moderation across diverse interactive scenarios, outperforming prior guardrail approaches that lack trajectory-level context.
- All models, the ATBench dataset, and code are publicly released. The paper lists 43 co-authors from multiple institutions.
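
The three-axis taxonomy and the root-cause diagnosis described above can be pictured as a small data structure. The following is a minimal sketch, not AgentDoG's actual output schema: the class names (`Source`, `FailureMode`, `Consequence`, `Diagnosis`) are hypothetical, and the enum members are only the examples named in this summary, not the paper's full category list.

```python
from dataclasses import dataclass
from enum import Enum


# Enum members below are the example categories quoted in this summary;
# the paper's full taxonomy may differ (assumption).
class Source(Enum):
    USER_INPUT = "user input"
    TOOL_OUTPUT = "tool output"
    ENVIRONMENT = "environment"


class FailureMode(Enum):
    MISUSE = "misuse"
    OVERREACH = "overreach"
    DECEPTION = "deception"


class Consequence(Enum):
    DATA_LEAK = "data leak"
    UNAUTHORIZED_ACTION = "unauthorized action"
    SYSTEM_COMPROMISE = "system compromise"


@dataclass(frozen=True)
class Diagnosis:
    """Hypothetical diagnostic label: one value per axis plus the step that introduced the risk."""
    step_index: int
    source: Source
    failure_mode: FailureMode
    consequence: Consequence

    def summary(self) -> str:
        return (
            f"step {self.step_index}: {self.failure_mode.value} "
            f"from {self.source.value}, expected {self.consequence.value}"
        )


# Invented example: a tool result at step 3 misleads the agent into leaking data.
diagnosis = Diagnosis(
    step_index=3,
    source=Source.TOOL_OUTPUT,
    failure_mode=FailureMode.DECEPTION,
    consequence=Consequence.DATA_LEAK,
)
print(diagnosis.summary())
```

Because the dataclass is frozen it is hashable, so a team could tally diagnoses by (source, failure mode, consequence) with `collections.Counter` instead of relying on a single flat severity score.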
Why it matters
- Current guardrail solutions (Llama Guard, NeMo Guardrails, etc.) primarily operate at the input/output level. AgentDoG is one of the first open frameworks purpose-built for the agentic context — monitoring entire tool-call trajectories, not just individual prompts.
- The diagnostic approach directly addresses the opacity problem in agent security: knowing that something is unsafe is less useful than knowing why and where in the chain the risk was introduced, especially for teams building and debugging agent systems.
- With models as small as 4B parameters, AgentDoG can run as a sidecar alongside production agents without requiring a separate GPU cluster — lowering the barrier to deploying runtime safety monitoring.
What to do
- Evaluate your current guardrail stack against ATBench to identify gaps in trajectory-level risk detection, especially for multi-tool workflows where single-turn filters miss compounding risks.
- Test AgentDoG models as a complement (not replacement) to existing input/output guardrails — use them for the reasoning-layer monitoring that traditional guardrails can't cover.
- Adopt the three-dimensional taxonomy for internal agent risk assessments: classifying risks by source × failure mode × consequence gives more actionable results than flat severity ratings.
- Integrate trajectory logging: AgentDoG requires access to the agent's full action sequence — if your agent framework doesn't emit structured trajectory data, start instrumenting it now.
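
As a starting point for the trajectory instrumentation mentioned above, the sketch below records reasoning steps, tool calls, and observations as JSON Lines. It is an illustration under stated assumptions: the `Step` and `Trajectory` classes and all field names are hypothetical, and this summary does not specify the input format AgentDoG expects, so check the released code before adopting a schema.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Step:
    """One entry in an agent trajectory (hypothetical schema)."""
    index: int
    kind: str  # "reasoning", "tool_call", or "observation"
    content: str
    tool_name: Optional[str] = None
    tool_args: Optional[Dict[str, Any]] = None
    timestamp: float = field(default_factory=time.time)


@dataclass
class Trajectory:
    """Ordered record of everything the agent did for one task."""
    task: str
    steps: List[Step] = field(default_factory=list)

    def record(
        self,
        kind: str,
        content: str,
        tool_name: Optional[str] = None,
        tool_args: Optional[Dict[str, Any]] = None,
    ) -> Step:
        # The index is assigned from the current length so steps stay ordered.
        step = Step(len(self.steps), kind, content, tool_name, tool_args)
        self.steps.append(step)
        return step

    def to_jsonl(self) -> str:
        # One JSON object per line, each carrying the task for context.
        return "\n".join(
            json.dumps({"task": self.task, **asdict(step)}) for step in self.steps
        )


# Invented example run of a file-reading agent.
trajectory = Trajectory(task="Summarize the latest invoice")
trajectory.record("reasoning", "I need to read the invoice file first.")
trajectory.record(
    "tool_call", "read_file", tool_name="read_file", tool_args={"path": "invoice.txt"}
)
trajectory.record("observation", "Invoice total: 120.00")
print(trajectory.to_jsonl())
```

Keeping each step as a self-contained line makes the log easy to append to during a run and easy to replay later through any trajectory-level monitor.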