arXiv: AI Agents May Always Fall for Prompt Injections

AI relevance: If current defense paradigms are fundamentally incomplete as this paper argues, every production AI agent deployment is operating with an unpatchable vulnerability class that grows as agents gain autonomy and tool access.

Published May 17, 2026 by Sahar Abdelnabi and Eugene Bagdasarian (arXiv:2605.17634), this paper attacks the core assumption behind prompt injection defenses: that data and instructions can be meaningfully separated in an agent's context window.

  • The paper shows that the prevailing defense paradigm — data-instruction separation — both fails to detect contextual manipulation attacks and degrades contextually appropriate behavior when tightened.
  • Recasts prompt injection through Contextual Integrity (CI), a privacy theory that judges whether an information flow complies with the norms of its context, and uses it to decompose a flow into dimensions: sender identity, transmission principle, information type, and normative legitimacy.
  • Demonstrates an impossibility result: an adversary can always construct a context under which a blocked flow appears legitimate, while a defender who tightens norms will block genuinely legitimate flows.
  • Identifies three attack classes: misrepresenting information flow, manipulating contextual norms, and mixing multiple flows to create ambiguity.
  • Empirically validates the argument across four axes: adversarial prompt injection, multi-agent discourse embedding (ConVerse framework), indirect injection, and agentic workflow attacks.
  • Concludes that current research addresses a "shrinking fraction" of future attack surfaces as agents become more autonomous.
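
The tradeoff in the impossibility bullet can be made concrete with a toy model. The sketch below is an illustration only, not the paper's formalism: the Flow fields loosely mirror three of the CI dimensions listed above, and the norm sets, names, and values are all hypothetical. It assumes the agent can only see the attributes a flow claims for itself, so an injected flow that claims a legitimate context is indistinguishable from the real one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """A claimed information flow (hypothetical, simplified CI tuple)."""
    sender: str      # sender identity as claimed in the context
    info_type: str   # kind of information being moved
    principle: str   # transmission principle the flow claims to follow


# Hypothetical norm sets: each entry is an acceptable (sender, type, principle).
LOOSE_NORMS = {
    ("user", "calendar", "user_request"),
    ("colleague", "calendar", "scheduling"),
}
# "Tightened" norms: only direct user requests are accepted.
STRICT_NORMS = {("user", "calendar", "user_request")}


def allowed(flow: Flow, norms: set) -> bool:
    """Allow a flow only if its claimed attributes match a norm."""
    return (flow.sender, flow.info_type, flow.principle) in norms


# A genuine scheduling request from a colleague.
legit = Flow("colleague", "calendar", "scheduling")
# Injected text that merely claims the same context.
spoofed = Flow("colleague", "calendar", "scheduling")

print(allowed(legit, LOOSE_NORMS), allowed(spoofed, LOOSE_NORMS))
print(allowed(legit, STRICT_NORMS), allowed(spoofed, STRICT_NORMS))
```

Under the loose norms both flows pass, so the injection succeeds; under the strict norms both are blocked, so the genuine request is lost too. No choice of norm set over claimed attributes separates the two in this toy setting.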

Why it matters

This is a theoretical impossibility result, not an incremental finding. It suggests that prompt injection is not a bug to be patched but a structural property of agents that must reason about untrusted content. For security teams, this means defense-in-depth — not detection — is the only viable strategy: sandboxing, egress controls, tool-level authorization, and zero-trust identity for agents.

What to do

  • Stop treating prompt injection filters as security boundaries; they are best-effort detection with a known false-positive/false-negative tradeoff.
  • Enforce tool-level authorization (e.g., Microsoft Agent Governance Toolkit policy evaluation) so injected commands fail at the enforcement layer.
  • Segment and restrict agent network egress to limit data exfiltration even if an injection succeeds.
  • Treat agent memory and skill/plugin loading as untrusted third-party inputs that require explicit policy evaluation.
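
The tool-authorization and egress items above can be sketched as a deny-by-default check that runs outside the model, so it holds no matter what text reached the context window. This is a minimal sketch under stated assumptions: the tool names, the allowlists, and the ToolCall and authorize names are hypothetical, and it does not represent the API of the Microsoft toolkit or any other product.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

# Hypothetical policy: tools the agent may call and hosts it may reach.
ALLOWED_TOOLS = {"search_docs", "http_get"}
ALLOWED_EGRESS_HOSTS = {"docs.example.com"}


@dataclass
class ToolCall:
    """A tool invocation proposed by the agent (hypothetical shape)."""
    tool: str
    args: dict = field(default_factory=dict)


def authorize(call: ToolCall) -> bool:
    """Deny by default; allow only listed tools and listed egress hosts."""
    if call.tool not in ALLOWED_TOOLS:
        return False
    if call.tool == "http_get":
        # hostname is None for a missing or malformed URL, which is denied.
        host = urlparse(call.args.get("url", "")).hostname
        return host in ALLOWED_EGRESS_HOSTS
    return True


# An injected instruction may convince the model to propose these calls,
# but the enforcement layer rejects them regardless of the prompt.
print(authorize(ToolCall("send_email", {"to": "attacker@example.net"})))
print(authorize(ToolCall("http_get", {"url": "https://attacker.example/?d=1"})))
print(authorize(ToolCall("http_get", {"url": "https://docs.example.com/a"})))
```

The design choice is that the check never inspects the prompt or tries to classify intent. It evaluates only the proposed action against a static policy, which is why a successful injection still fails at this layer.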

Sources