MPBench Exposes Memory Poisoning Attacks in LLM Agents

AI relevance: Persistent memory systems in LLM agents create a new attack surface where a single adversarial write can exert long-term behavioral influence, and existing prompt injection defenses don't cover this threat class.

  • Researchers present a systematic study of memory poisoning in LLM-based agents, identifying four memory write channels and nine structural vulnerabilities across model capabilities, system prompt design, and agent architecture.
  • The team developed a taxonomy of six memory poisoning attack classes and introduced MPBench, a benchmark for evaluating these attacks against agent systems.
  • Key finding: agents designed to write and retrieve memory more aggressively are significantly more exploitable — the very features that make agents useful also make them vulnerable.
  • Existing prompt injection defenses (input filtering, output validation, sandboxing) fail to cover memory poisoning attacks because the threat operates through persistent state rather than immediate execution.
  • A single adversarial memory write can exert long-term influence over agent behavior, persisting across sessions and affecting future decisions without triggering immediate detection.
  • The nine vulnerabilities span three categories: model capability gaps (inability to distinguish trusted vs untrusted sources), system prompt design flaws (overly permissive memory write instructions), and architectural weaknesses (lack of memory isolation or provenance tracking).
  • The four memory write channels identified: direct user input, tool output, retrieved context, and inter-agent communication — each with distinct attack vectors and mitigation requirements.
  • The paper proposes non-malleable, origin-bound authority for agent long-term memory with machine-checked guarantees, achieving zero success against memory-laundering attacks in formal verification.
  • MPBench provides standardized evaluation scenarios for testing agent resilience to memory poisoning, enabling comparative security assessment across different agent architectures.

Why it matters

Memory is what makes agents useful across sessions — they accumulate context, learn preferences, and build institutional knowledge. But that same persistence creates a new attack surface that traditional security controls don't address. A prompt injection defense might catch malicious input in real-time, but if that input gets written to memory and retrieved in a future session, the attack succeeds despite the guardrail. This research quantifies the gap between "agent is secure during this conversation" and "agent's memory hasn't been poisoned for future conversations." For enterprises deploying agents with persistent memory, this is a critical blind spot.

What to do

  • If you're building agents with persistent memory, audit your memory write paths: which inputs can trigger writes, and do you validate provenance before storing?
  • Test your agent against MPBench scenarios to understand your exposure to each of the six attack classes identified in the taxonomy.
  • Consider memory isolation strategies: separate high-trust memories (user preferences, authenticated data) from low-trust memories (web content, tool outputs) with different write permissions.
  • Implement memory provenance tracking — every stored fact should carry metadata about its source, timestamp, and confidence level.
  • For red teams: memory poisoning is a new attack class. Test whether your target agent's memory can be poisoned through tool outputs, retrieved documents, or inter-agent messages.

Sources: