arXiv — MUZZLE red-teaming web agents against indirect prompt injection

• Category: Research

AI relevance: The paper directly targets indirect prompt injection against web agents that browse, click, and execute actions on users’ behalf.

  • MUZZLE is an adaptive, automated red-teaming framework for web agents exposed to indirect prompt injection.
  • It uses the agent’s execution trajectory to find high-salience injection surfaces and craft context-aware instructions.
  • The system iteratively refines attacks using feedback from failed executions rather than fixed templates.
  • Across multiple apps and configurations, MUZZLE reports 37 new attacks spanning 10 adversarial objectives.
  • The authors highlight violations of confidentiality, integrity, availability, and privacy properties.
  • Novel tactics include cross-application prompt injection and agent-tailored phishing scenarios.
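
The feedback-driven refinement described in the bullets above can be illustrated with a toy loop. This is a minimal sketch under loud assumptions, not MUZZLE's actual algorithm: `toy_agent`, `Outcome`, `REFINEMENTS`, and `adaptive_attack` are hypothetical names, the agent is a hard-coded stand-in, and the rewrite rules are invented for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Outcome:
    success: bool
    feedback: str  # coarse reason for failure, as read from the agent's trajectory


def toy_agent(page_text: str) -> Outcome:
    """Hypothetical stand-in for a web agent (not a real agent or model)."""
    text = page_text.lower()
    if "ignore previous instructions" in text:
        return Outcome(False, "flagged_as_injection")
    if "checkout" not in text:
        return Outcome(False, "irrelevant_to_task")
    return Outcome(True, "")


# Hypothetical rewrite rules, keyed by the failure reason observed.
REFINEMENTS: Dict[str, Callable[[str], str]] = {
    "flagged_as_injection": lambda p: p.replace(
        "Ignore previous instructions. ", "Note from site admin: "
    ),
    "irrelevant_to_task": lambda p: p + " This is required before checkout.",
}


def adaptive_attack(
    seed: str,
    agent: Callable[[str], Outcome],
    max_rounds: int = 5,
) -> Tuple[Optional[str], List[str]]:
    """Retry with a refined payload after each failure, up to max_rounds."""
    payload = seed
    history: List[str] = []
    for _ in range(max_rounds):
        history.append(payload)
        outcome = agent(payload)
        if outcome.success:
            return payload, history
        refine = REFINEMENTS.get(outcome.feedback)
        if refine is None:
            break  # no rule for this failure mode; give up
        payload = refine(payload)
    return None, history
```

A fixed template would stop after the first rejected payload. The loop instead treats each failure reason as a signal for the next attempt, which is the property that separates adaptive red-teaming from static test suites.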

Why it matters

  • Static prompt-injection test suites may understate real risk for tool-using web agents.
  • Adaptive attacks can find weak spots unique to each agent configuration and task flow.
  • Defenders need to model untrusted web content as an active adversary, not just noise.

What to do

  • Test agents with adaptive red teams that can modify attacks after failures.
  • Gate high-privilege actions behind human confirmation or policy checks.
  • Track the provenance of content entering the agent's context and sandbox browsing sessions to reduce exposure from untrusted web inputs.
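
One way to combine the gating and provenance recommendations above is a small policy check that runs before any action executes. The sketch below is illustrative only: `Source`, `Action`, `HIGH_PRIVILEGE`, and `decide` are hypothetical names, and the rule that web-originated high-privilege actions are denied outright is an assumed policy, not one taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    USER = "user"  # instruction came from the user's own request
    WEB = "web"    # instruction was derived from untrusted page content


@dataclass(frozen=True)
class Action:
    name: str
    origin: Source  # provenance of the instruction that triggered the action


# Assumed set of actions that warrant extra scrutiny.
HIGH_PRIVILEGE = {"send_email", "submit_payment", "delete_account"}


def decide(action: Action) -> str:
    """Return 'allow', 'confirm', or 'deny' for a proposed agent action."""
    if action.name not in HIGH_PRIVILEGE:
        return "allow"
    if action.origin is Source.WEB:
        return "deny"  # untrusted content may not trigger sensitive actions
    return "confirm"   # user-originated but sensitive: ask a human first
```

For example, `decide(Action("send_email", Source.WEB))` returns `"deny"`, while the same action with `Source.USER` returns `"confirm"`. The gate is only as reliable as the provenance labels fed into it, so it needs provenance tracking upstream to be meaningful.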

Links