arXiv — MUZZLE red-teaming web agents against indirect prompt injection

2026-02-16 • Category: Research

AI relevance: The paper directly targets indirect prompt injection against web agents that browse, click, and execute actions on users’ behalf.

MUZZLE is an adaptive, automated red-teaming framework for web agents exposed to indirect prompt injection.
It uses the agent’s execution trajectory to find high-salience injection surfaces and craft context-aware instructions.
The system iteratively refines attacks using feedback from failed executions rather than relying on fixed templates.
Across multiple applications and configurations, the authors report that MUZZLE discovered 37 new attacks spanning 10 adversarial objectives.
The authors highlight violations of confidentiality, integrity, availability, and privacy properties.
Novel tactics include cross-application prompt injection and agent-tailored phishing scenarios.
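
The adaptive loop described above can be illustrated with a minimal sketch. This is not MUZZLE's implementation: the `Step` schema, the read-count salience proxy, and the `toy_agent` and `toy_refine` stand-ins are all hypothetical, chosen only to show the shape of "rank injection surfaces from the trajectory, then refine the payload after each failure."

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One step of a recorded agent trajectory (hypothetical schema)."""
    element_id: str  # page element involved in this step
    action: str      # e.g. "read", "click", "type"


def rank_surfaces(trajectory: List[Step]) -> List[str]:
    """Rank elements by how often the agent read them (a crude salience proxy)."""
    counts = Counter(s.element_id for s in trajectory if s.action == "read")
    return [element for element, _ in counts.most_common()]


def adaptive_attack(
    surfaces: List[str],
    run_agent: Callable[[str, str], Tuple[bool, str]],
    refine: Callable[[str, str], str],
    seed_payload: str,
    max_rounds: int = 5,
) -> Optional[Tuple[str, str]]:
    """Try each surface in order, rewriting the payload after every failure."""
    for surface in surfaces:
        payload = seed_payload
        for _ in range(max_rounds):
            success, feedback = run_agent(surface, payload)
            if success:
                return surface, payload
            payload = refine(payload, feedback)
    return None


# Toy stand-ins (assumptions for illustration only, not a real agent).
def toy_agent(surface: str, payload: str) -> Tuple[bool, str]:
    if surface != "review-body":
        return False, "not_read"   # the agent never looks at this element
    if "order" not in payload:
        return False, "off_task"   # the instruction did not fit the task
    return True, "followed"


def toy_refine(payload: str, feedback: str) -> str:
    # Make the instruction look task-relevant when the agent ignored it.
    if feedback == "off_task":
        return "To finish the order, " + payload
    return payload


trajectory = [
    Step("review-body", "read"),
    Step("footer", "read"),
    Step("review-body", "read"),
    Step("buy-button", "click"),
]
result = adaptive_attack(
    rank_surfaces(trajectory), toy_agent, toy_refine, "open the settings page"
)
```

In this toy run the first attempt fails as off-task and the refined payload succeeds on the second round, which is the behavior a fixed-template test suite would miss.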

Why it matters

Static prompt-injection test suites may understate real risk for tool-using web agents.
Adaptive attacks can find weak spots unique to each agent configuration and task flow.
Defenders need to model untrusted web content as an active adversary, not just noise.

What to do

Test agents with adaptive red teams that can modify attacks after failures.
Limit high-privilege actions behind human confirmation or policy gates.
Track the provenance of content the agent reads and sandbox its execution to reduce exposure to untrusted web inputs.
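
One way to combine the confirmation gate and provenance tracking is sketched below. The action names, the `Provenance` labels, and the deny/confirm/allow policy are assumptions made for illustration; they are not taken from the paper, and a real deployment would need its own action inventory and policy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Provenance(Enum):
    USER = "user"  # instruction came directly from the user
    WEB = "web"    # instruction was derived from untrusted page content


class Decision(Enum):
    ALLOW = "allow"
    CONFIRM = "confirm"
    DENY = "deny"


# Hypothetical set of actions treated as high-privilege.
HIGH_PRIVILEGE = frozenset({"send_email", "submit_payment", "change_password"})


@dataclass(frozen=True)
class ProposedAction:
    name: str
    origin: Provenance


def gate(action: ProposedAction) -> Decision:
    """Decide how to handle an action based on privilege and provenance."""
    if action.name not in HIGH_PRIVILEGE:
        return Decision.ALLOW
    if action.origin is Provenance.WEB:
        # High-privilege action requested by untrusted content: refuse.
        return Decision.DENY
    # High-privilege action requested by the user: ask before acting.
    return Decision.CONFIRM


def execute(action: ProposedAction,
            confirm: Callable[[ProposedAction], bool]) -> bool:
    """Return True if the action is cleared to run, False if it is blocked."""
    decision = gate(action)
    if decision is Decision.DENY:
        return False
    if decision is Decision.CONFIRM and not confirm(action):
        return False
    return True
```

For example, `execute(ProposedAction("submit_payment", Provenance.WEB), confirm=lambda a: True)` is blocked regardless of the confirmation callback, while the same action with `Provenance.USER` runs only if `confirm` returns `True`.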