arXiv — Bypassing AI control protocols via Agent-as-a-Proxy attacks

2026-02-07 • Category: Research

AI relevance: The work targets agent monitoring defenses, showing bypasses that directly affect how AI systems validate tool use and intent.

The paper introduces Agent-as-a-Proxy attacks that treat the agent as a delivery mechanism for indirect prompt injection.
These attacks aim to bypass monitoring protocols that jointly evaluate chain-of-thought and tool-use actions.
The authors argue that even frontier-scale monitors are vulnerable, not just small overseers.
On the AgentDojo benchmark, the attack bypasses AlignmentCheck and Extract-and-Evaluate monitors across different monitoring LLMs.
Reported results show high attack success against monitors like Qwen2.5-72B when paired with capable agents.
The findings suggest monitoring-based defenses may be fundamentally fragile regardless of model scale.

Why it matters

Many agent platforms rely on monitoring as a safety layer; if it can be bypassed, tool misuse risk increases.
Proxy-style attacks resemble real-world workflows where agents relay untrusted content between systems.
Security teams need to test not just agents, but the monitor-agent interaction itself.

Red-team the monitor layer with proxy-style prompt injection scenarios.
Limit high-privilege tool scopes and require explicit human approval for sensitive actions.
Instrument provenance checks so monitors can weigh trust in the source of agent context.