arXiv: AI Agents May Always Fall for Prompt Injections

AI relevance: This paper proves that prompt injection in autonomous agents is fundamentally an impossibility problem — current defenses degrade legitimate contextual behavior while attackers achieve 96.7% success through contextual manipulation alone.

What the paper shows

  • Abdelnabi (ELLIS Institute Tübingen) and Bagdasarian (UMass Amherst) recast prompt injection through Contextual Integrity (CI), a privacy theory that evaluates whether information flows respect contextual norms.
  • They prove an impossibility result: any fixed "never do X" rule will block legitimate flows, while any "allow X" rule admits attacks that construct a context where X appears appropriate.
  • A CI-informed red-team loop targeting context parameters achieves a 96.7% attack success rate against an email assistant — compared to under 1% for static baseline attacks.
  • Existing prompt injection classifiers perform at near-chance (AUROC 0.43–0.59) when attacks use contextual manipulation rather than injection keywords like "ignore all instructions."
  • Agents fail to separate simultaneous information flows in a single message, letting authorization for one flow leak into another in up to 65% of cases.
  • Even safety-trained models like Meta SecAlign and frontier models (gpt-5.4, claude-sonnet-4-6, gemini-3-pro) are vulnerable to context-based attacks that contain no explicit adversarial instructions.
  • The paper references LLMail-Inject achieving privacy violations in up to 88% of cases and security breaches in up to 60% through multi-turn agent-to-agent discourse.

Why it matters

  • If the impossibility result holds, the industry's current defense paradigm — data-instruction separation, keyword-based detectors, system-level isolation — addresses only a shrinking fraction of the attack surface.
  • The paper shows that agent workflows inherently blur data and instructions: skills, memory, and third-party interactions are instructional by design, making clean separation impossible without breaking agentic workflows.
  • The authors propose CI-aware alignment as a principled evaluation framework, rather than a silver-bullet defense.

What to do

  • Shift from "block all injections" to "evaluate contextual appropriateness" — design guardrails that reason about sender identity, transmission principle, and normative legitimacy.
  • Test agent defenses against context-manipulation attacks, not just keyword-based injection prompts.
  • Limit agent access to external content where context manipulation is hardest to detect (email processing, document summarization, multi-agent communication).
  • Track the CI-aware alignment framework as a potential path forward for next-generation agent security.

Sources: