arXiv — IterInject: Feedback-Guided Iterative Prompt Injection Against Agents

AI relevance: IterInject demonstrates that indirect prompt injection payloads can self-evolve via closed-loop LLM optimization, moving from static one-shot attacks to adaptive campaigns that learn from each failed attempt against real agent defenses.

What the paper does

Researchers from Shanghai Jiao Tong University and HKU introduce IterInject, a framework that treats indirect prompt injection as an iterative optimization problem rather than a one-off payload-crafting task. An LLM-based optimizer refines each payload using structured diagnostic feedback from a rule-based diagnoser, and a synthesis step generates new disguise seeds from observed failure patterns.

  • Feedback loop: For each target, the framework injects a payload, collects a four-level diagnostic label (Success, Partial, Detected, Ignored) plus a behavioral description, then feeds both into the optimizer for the next iteration.
  • Self-evolving seed bank: Disguise templates are seeded from large-scale public red-teaming submissions, then extended per setting. Failed patterns trigger synthesis of new templates, expanding the strategy space beyond the initial set.
  • Benchmark results: On AgentDojo (510 instances, four task suites) IterInject achieves the highest attack success rate across four victim models, with the largest gain on DeepSeek (47.8% vs 32.9% for static baselines). On InjectAgent, Total ASR rises from near-zero to 33–90%.
  • Real agent target: Extension experiments on Claude Code, a production-grade coding agent with layered defenses, show optimized payloads achieving full success on 5 of 9 targets. Resistant targets still exhibit measurable improvement (e.g., advancing from Ignored to Partial).
  • Mechanistic analysis: The team identifies an attention-mediated threshold mechanism for indirect prompt injection (IPI) in mid-to-late model layers on Qwen3.5-27B, validated through three causal interventions that point toward concrete defense directions.
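
The feedback loop described above can be sketched in a few lines, for example as the skeleton of an internal red-team harness. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Label`, `Feedback`, `Attempt`, and `run_feedback_loop` are hypothetical, and the `evaluate` and `refine` callables stand in for the rule-based diagnoser and the LLM-based optimizer, whose actual interfaces the summary does not specify.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Label(Enum):
    """The four diagnostic labels named in the summary."""
    SUCCESS = "Success"
    PARTIAL = "Partial"
    DETECTED = "Detected"
    IGNORED = "Ignored"


@dataclass
class Feedback:
    label: Label
    description: str  # behavioral description of what the agent did


@dataclass
class Attempt:
    payload: str
    feedback: Feedback


def run_feedback_loop(
    seed: str,
    evaluate: Callable[[str], Feedback],     # hypothetical stand-in for the diagnoser
    refine: Callable[[str, Feedback], str],  # hypothetical stand-in for the optimizer
    max_iters: int = 5,
) -> List[Attempt]:
    """Iterate until the evaluator reports Success or the budget runs out."""
    history: List[Attempt] = []
    payload = seed
    for _ in range(max_iters):
        feedback = evaluate(payload)
        history.append(Attempt(payload, feedback))
        if feedback.label is Label.SUCCESS:
            break
        # Both the label and the description condition the next candidate.
        payload = refine(payload, feedback)
    return history
```

Keeping the diagnoser and optimizer behind plain callables means the same loop can be driven by toy stubs in tests or by real components in a sandboxed evaluation; the returned history is what a seed-synthesis step would presumably mine for failure patterns.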

Why it matters

Most indirect prompt injection research assumes static or heuristically mutated payloads. IterInject closes a critical gap: it shows that an attacker with an optimization loop can adapt to agent-specific defenses in real time. This shifts the threat model from "can a single crafted payload bypass the filter" to "can repeated probing converge on a bypass", which is closer to how real attackers operate against production systems.

The Claude Code results are especially relevant for coding-agent deployments: even with multiple defensive layers, 5 of 9 targets were fully compromised through iterative refinement alone.

What to do

  • Treat one-shot detection as insufficient. Agents that process untrusted external content need runtime guardrails that survive repeated adversarial probing, not just a static filter.
  • Minimize the attack surface. Restrict which external data sources agents can read; apply content sanitization before feeding tool outputs into prompts.
  • Monitor for partial success. Even a shift from Ignored to Partial indicates a weakening defense, so treat incremental attacker gains as a sign that the defense is eroding.
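
One way to act on the last recommendation is to track outcomes per content source and alert when they worsen. The sketch below is hypothetical: `Outcome`, `ErosionMonitor`, and the alert strings are illustrative names, the severity ordering (Ignored < Detected < Partial < Success) is an assumption not stated in the source, and it presumes some classifier or red-team harness already assigns a label to each observed injection attempt.

```python
from collections import defaultdict
from enum import IntEnum
from typing import Dict, List


class Outcome(IntEnum):
    """Assumed severity ordering; the source only names the four labels."""
    IGNORED = 0
    DETECTED = 1
    PARTIAL = 2
    SUCCESS = 3


class ErosionMonitor:
    """Flags sources whose injection outcomes worsen or that are probed repeatedly."""

    def __init__(self, max_probes: int = 3) -> None:
        self.max_probes = max_probes
        self._history: Dict[str, List[Outcome]] = defaultdict(list)

    def record(self, source: str, outcome: Outcome) -> List[str]:
        """Record one observed outcome; return any alerts it triggers."""
        past = self._history[source]
        alerts: List[str] = []
        # An outcome worse than anything seen before from this source is an escalation.
        if past and outcome > max(past):
            alerts.append(f"escalation:{max(past).name}->{outcome.name}")
        past.append(outcome)
        # Repeated attempts from one source suggest iterative probing.
        if len(past) > self.max_probes:
            alerts.append("probe_budget_exceeded")
        return alerts
```

For example, two Ignored outcomes followed by a Partial from the same source would return an `escalation:IGNORED->PARTIAL` alert on the third call, which is the kind of incremental gain a one-shot filter would not surface.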

Sources