arXiv — MOSAIC-Bench: Coding Agents Ship Exploitable Code via Sequenced Prompts
AI relevance: This research reveals a fundamental weakness in how coding agent safety systems evaluate prompts in isolation — attackers can split a malicious objective across multiple routine tickets that individually pass safety review but collectively produce exploitable code.
What the paper shows
- MOSAIC-Bench (Malicious Objectives Sequenced As Innocuous Compliance) introduces 199 three-stage attack chains across 10 web-application substrates, 31 CWE classes, and 5 programming languages, with deterministic exploit oracles.
- Nine production coding agents from Anthropic, OpenAI, Google, Moonshot, Zhipu, and Minimax compose innocuous tickets into exploitable code at 53–86% end-to-end attack success rate — with only two refusals across all staged runs.
- In a matched direct-prompt experiment, vulnerable-output rates for the same frontier Claude and Codex agents fall to 0–20.4%: Claude primarily refuses, while Codex hardens its output. Ticket staging silences both defense modes simultaneously.
- Downstream code reviewer agents approve 25.8% of confirmed-vulnerable cumulative diffs as routine pull requests.
- A full-context implementation protocol (providing the agent with all tickets at once) closes only 50% of the staged-vs-direct gap, meaning context fragmentation alone does not explain the vulnerability.
- As a non-adaptive mitigation, reframing the reviewer as an adversarial pentester reduces evasion to 3.0–17.6%, and an open-weight Gemma-4-E4B-it reviewer under this framing detects 88.4% of attacks on the dataset with a 4.6% false-positive rate on 608 real-world GitHub PRs.
Why it matters
Organizations deploying coding agents in CI/CD pipelines assume that per-prompt safety alignment protects them. MOSAIC-Bench demonstrates that an attacker (or malicious insider) can achieve the same result as a direct malicious prompt by sequencing seemingly benign engineering tickets — a technique that bypasses current safety evaluation frameworks. This is especially relevant for enterprise AI development workflows where tasks are naturally decomposed into tickets.
What to do
- Evaluate multi-turn code review: If your organization uses AI coding agents, review diffs at the aggregate level, not per-commit. A single commit may be benign while the cumulative change introduces a vulnerability.
- Consider adversarial reviewer framing: The paper shows that framing code reviewers as pentesters significantly improves detection rates of staged attacks.
- Monitor for ticket sequencing patterns: Be alert to unusual patterns where multiple small, innocuous-looking changes accumulate toward a vulnerable end-state.
- Read the paper: The full methodology and dataset are available for teams wanting to test their own coding agent deployments against MOSAIC-Bench.