arXiv: AI Agents May Always Fall for Prompt Injections
AI relevance: This paper presents an impossibility result for prompt injection in autonomous agents: current defenses degrade legitimate contextual behavior, while attackers achieve a 96.7% success rate through contextual manipulation alone.
What the paper shows
- Abdelnabi (ELLIS Institute Tübingen) and Bagdasarian (UMass Amherst) recast prompt injection through Contextual Integrity (CI), a privacy theory that evaluates whether information flows respect contextual norms.
- They prove an impossibility result: any fixed "never do X" rule will block legitimate flows, while any "allow X" rule admits attacks that construct a context where X appears appropriate.
- A CI-informed red-team loop targeting context parameters achieves a 96.7% attack success rate against an email assistant — compared to under 1% for static baseline attacks.
- Existing prompt injection classifiers perform near chance (AUROC 0.43–0.59, where 0.5 is random guessing) when attacks rely on contextual manipulation rather than injection keywords such as "ignore all instructions."
- Agents fail to separate simultaneous information flows in a single message, letting authorization for one flow leak into another in up to 65% of cases.
- Even safety-trained models like Meta SecAlign and frontier models (gpt-5.4, claude-sonnet-4-6, gemini-3-pro) are vulnerable to context-based attacks that contain no explicit adversarial instructions.
- The paper cites LLMail-Inject results showing privacy violations in up to 88% of cases and security breaches in up to 60% through multi-turn agent-to-agent discourse.
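The rule dilemma behind the impossibility result can be illustrated with a toy example. This is a minimal sketch, not the paper's formalism: `Flow`, `deny_rule`, `allow_rule`, and the two sample flows are hypothetical names and data invented for illustration. It assumes the agent can only see the context as presented in the message, which an attacker can imitate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """A toy information flow; every field is an illustrative assumption."""
    action: str           # what the agent is asked to do
    claimed_context: str  # context as presented in the message (attacker-controllable)
    legitimate: bool      # ground truth, not visible to the rule at decision time


def deny_rule(flow: Flow) -> bool:
    """Fixed 'never do X' rule: never forward documents. True means allowed."""
    return flow.action != "forward_document"


def allow_rule(flow: Flow) -> bool:
    """'Allow X' rule: forward documents when the context looks like a manager request."""
    return flow.action != "forward_document" or flow.claimed_context == "manager_request"


flows = [
    # A genuine request from the user's manager.
    Flow("forward_document", "manager_request", legitimate=True),
    # An attacker constructs the same apparent context.
    Flow("forward_document", "manager_request", legitimate=False),
]


def errors(rule, flows):
    """Return (legitimate flows blocked, attacks admitted) for a rule."""
    blocked_legit = sum(1 for f in flows if f.legitimate and not rule(f))
    admitted_attacks = sum(1 for f in flows if not f.legitimate and rule(f))
    return blocked_legit, admitted_attacks


print(errors(deny_rule, flows))   # (1, 0): the fixed rule blocks the genuine request
print(errors(allow_rule, flows))  # (0, 1): the permissive rule admits the imitation
```

Because both flows look identical to the rule, no function of the visible fields alone can get both cases right. That is the intuition the sketch is meant to convey, under its simplifying assumptions.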
Why it matters
- If the impossibility result holds, the industry's current defense paradigm — data-instruction separation, keyword-based detectors, system-level isolation — addresses only a shrinking fraction of the attack surface.
- The paper shows that agent workflows inherently blur data and instructions: skills, memory, and third-party interactions are instructional by design, making clean separation impossible without breaking agentic workflows.
- The authors propose CI-aware alignment as a principled evaluation framework, rather than a silver-bullet defense.
What to do
- Shift from "block all injections" to "evaluate contextual appropriateness" — design guardrails that reason about sender identity, transmission principle, and normative legitimacy.
- Test agent defenses against context-manipulation attacks, not just keyword-based injection prompts.
- Limit agent access to external content where context manipulation is hardest to detect (email processing, document summarization, multi-agent communication).
- Track the CI-aware alignment framework as a potential path forward for next-generation agent security.
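One way to read the first recommendation is as an explicit check on flow parameters instead of a keyword filter. The sketch below is an assumption-laden illustration, not the authors' framework: `FlowRequest`, `NORMS`, `is_appropriate`, and the example addresses are all hypothetical, and it assumes the system can verify sender identity out of band instead of trusting what the message text claims.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRequest:
    """Parameters of a requested information flow (hypothetical schema)."""
    sender: str                  # identity verified by the system, not claimed in the text
    recipient: str               # where the information would go
    info_type: str               # what kind of information is being sent
    transmission_principle: str  # the condition under which the flow happens


# Hypothetical norms: for each (info type, transmission principle) pair,
# the set of recipients for whom the flow is considered appropriate.
NORMS = {
    ("calendar_availability", "user_consent"): {"colleague@example.com"},
    ("financial_record", "confidentiality"): {"accountant@example.com"},
}


def is_appropriate(req: FlowRequest, verified_senders: set) -> bool:
    """Allow a flow only if the sender is verified and the flow matches a known norm."""
    if req.sender not in verified_senders:
        return False
    allowed_recipients = NORMS.get((req.info_type, req.transmission_principle), set())
    return req.recipient in allowed_recipients


verified = {"user@example.com"}

ok = FlowRequest("user@example.com", "accountant@example.com",
                 "financial_record", "confidentiality")
wrong_recipient = FlowRequest("user@example.com", "stranger@example.net",
                              "financial_record", "confidentiality")
unverified_sender = FlowRequest("unknown@example.net", "accountant@example.com",
                                "financial_record", "confidentiality")

print(is_appropriate(ok, verified))                 # True
print(is_appropriate(wrong_recipient, verified))    # False
print(is_appropriate(unverified_sender, verified))  # False
```

A static norm table like this does not escape the impossibility result described above. It only makes the contextual decision explicit and auditable, which fits the paper's framing of CI as an evaluation lens and not as a complete defense.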
Sources: