arXiv — MOSAIC-Bench: Coding Agents Ship Exploitable Code via Sequenced Prompts

2026-05-07 Research by al-ice.ai Editorial

AI relevance: This research reveals a fundamental weakness in how coding agent safety systems evaluate prompts in isolation — attackers can split a malicious objective across multiple routine tickets that individually pass safety review but collectively produce exploitable code.

What the paper shows

MOSAIC-Bench (Malicious Objectives Sequenced As Innocuous Compliance) introduces 199 three-stage attack chains across 10 web-application substrates, 31 CWE classes, and 5 programming languages, with deterministic exploit oracles.
Nine production coding agents from Anthropic, OpenAI, Google, Moonshot, Zhipu, and Minimax compose innocuous tickets into exploitable code at 53–86% end-to-end attack success rate — with only two refusals across all staged runs.
In a matched direct-prompt experiment, vulnerable-output rates for the same frontier Claude and Codex agents fall to 0–20.4%: Claude primarily refuses, while Codex hardens its output. Ticket staging silences both defense modes simultaneously.
Downstream code reviewer agents approve 25.8% of confirmed-vulnerable cumulative diffs as routine pull requests.
A full-context implementation protocol (providing the agent with all tickets at once) closes only 50% of the staged-vs-direct gap, meaning context fragmentation alone does not explain the vulnerability.
As a non-adaptive mitigation, reframing the reviewer as an adversarial pentester reduces evasion to 3.0–17.6%, and an open-weight Gemma-4-E4B-it reviewer under this framing detects 88.4% of attacks on the dataset with a 4.6% false-positive rate on 608 real-world GitHub PRs.

Why it matters

Organizations deploying coding agents in CI/CD pipelines assume that per-prompt safety alignment protects them. MOSAIC-Bench demonstrates that an attacker (or malicious insider) can achieve the same result as a direct malicious prompt by sequencing seemingly benign engineering tickets — a technique that bypasses current safety evaluation frameworks. This is especially relevant for enterprise AI development workflows where tasks are naturally decomposed into tickets.

What to do

Evaluate multi-turn code review: If your organization uses AI coding agents, review diffs at the aggregate level, not per-commit. A single commit may be benign while the cumulative change introduces a vulnerability.
Consider adversarial reviewer framing: The paper shows that framing code reviewers as pentesters significantly improves detection rates of staged attacks.
Monitor for ticket sequencing patterns: Be alert to unusual patterns where multiple small, innocuous-looking changes accumulate toward a vulnerable end-state.
Read the paper: The full methodology and dataset are available for teams wanting to test their own coding agent deployments against MOSAIC-Bench.

arXiv — MOSAIC-Bench: Coding Agents Ship Exploitable Code via Sequenced Prompts

What the paper shows

Why it matters

What to do

Sources