arXiv — MUZZLE red-teaming web agents against indirect prompt injection
• Category: Research
AI relevance: The paper directly targets indirect prompt injection against web agents that browse, click, and execute actions on users’ behalf.
- MUZZLE is an adaptive, automated red-teaming framework for web agents exposed to indirect prompt injection.
- It uses the agent’s execution trajectory to find high-salience injection surfaces and craft context-aware instructions.
- The system iteratively refines attacks using feedback from failed executions rather than fixed templates.
- Across multiple apps and configurations, MUZZLE reports 37 new attacks spanning 10 adversarial objectives.
- The authors highlight violations of confidentiality, integrity, availability, and privacy properties.
- Novel tactics include cross-application prompt injection and agent-tailored phishing scenarios.
Why it matters
- Static prompt-injection test suites may understate real risk for tool-using web agents.
- Adaptive attacks can find weak spots unique to each agent configuration and task flow.
- Defenders need to model untrusted web content as an active adversary, not just noise.
What to do
- Test agents with adaptive red teams that can modify attacks after failures.
- Limit high-privilege actions behind human confirmation or policy gates.
- Instrument provenance and sandboxing to reduce exposure from untrusted web inputs.