Kill-Chain Canaries (arXiv:2603.28013) — Stage-Level Prompt Injection Tracking
AI relevance: Kill-Chain Canaries reframes prompt injection as a pipeline-architecture problem, showing that the critical defense decision in multi-agent systems is not which model you pick — but where you place write-stage verification in the pipeline.
Haochuan Kevin Wang (MIT) and Zechen Zhang (University of Chicago) published arXiv:2603.28013, introducing a novel evaluation methodology that tracks prompt injection through four discrete stages — Exposed, Persisted, Relayed, and Executed — rather than relying on the traditional binary "did it succeed?" metric.
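The four-stage methodology can be sketched as a canary scan over per-stage outputs. Everything below is illustrative — the stage names come from the paper, but the canary token format, data shapes, and helper function are assumptions, not the authors' harness:

```python
from enum import IntEnum
from typing import Dict, Optional

class Stage(IntEnum):
    """The paper's four kill-chain stages, in pipeline order."""
    EXPOSED = 1    # injected text reached the model's input
    PERSISTED = 2  # injection survived into memory / storage
    RELAYED = 3    # injection was handed to a downstream agent
    EXECUTED = 4   # the injected instruction was acted on

# A canary token planted inside the injected payload; each stage's
# recorded output is scanned for it (token format is hypothetical).
CANARY = "CANARY-7f3a"

def deepest_stage_reached(stage_outputs: Dict[Stage, str]) -> Optional[Stage]:
    """Return the furthest stage whose output still contains the canary,
    or None if the injection never appeared at all."""
    deepest = None
    for stage in sorted(Stage):
        if CANARY in stage_outputs.get(stage, ""):
            deepest = stage
    return deepest
```

A run where the canary appears at EXPOSED but not at PERSISTED is the "blocked at memory-write" outcome; one where it survives to EXECUTED is a full kill-chain compromise. Both would collapse into the same number under a binary success metric.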
Key findings
- 950 runs across 5 frontier models: Claude, GPT-4o-mini, DeepSeek, and others tested across six attack surfaces: web text, memory, tool streams, visible PDF, invisible PDF, and audio injection.
- Claude blocks all injections at memory-write (0/164 ASR): While every model is fully exposed to injection at the input stage, Claude's behavior diverges dramatically at the write/persistence stage — stopping all attacks before they propagate.
- GPT-4o-mini propagates at 53%: The same injection that Claude fully neutralizes passes through GPT-4o-mini's memory-write stage in over half of cases.
- DeepSeek shows a 0%/100% split across surfaces: For the same model, ASR ranges from 0% on some injection channels to 100% on others — showing that surface coverage, not just model choice, determines safety posture.
- Write-node placement is the highest-leverage decision: Routing memory writes through a verified model eliminates propagation entirely, regardless of downstream model vulnerabilities.
- All four tested defenses fail on at least one surface: Channel mismatch alone is enough — a defense tuned to one injection channel misses payloads arriving on another, with no adversarial adaptation required.
- Invisible white-font PDF payloads match or exceed visible-text ASR: Rendered-layer screening is insufficient; attackers can hide injections in document layers that humans never see but LLMs process fully.
- Real-world applicability: The research specifically addresses institutional NLP pipelines over earnings calls, SEC filings, and analyst reports — document-ingestion workflows now migrating to LLM agents.
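The write-node finding above reduces to a simple gate: a minimal sketch, assuming the verified model is exposed as a boolean check (the `verifier` callable and the `toy_verifier` stand-in are illustrative; a real deployment would call the defended LLM at this point):

```python
from typing import Callable, List

def gated_memory_write(content: str,
                       memory: List[str],
                       verifier: Callable[[str], bool]) -> bool:
    """Persist content only if the verified write-stage model approves.
    Returns True if the write went through."""
    if verifier(content):
        memory.append(content)
        return True
    return False  # injection stopped before it can propagate

# Toy stand-in for the verified model: flag an obvious embedded
# instruction. A real gate would be an LLM call, not a substring test.
def toy_verifier(text: str) -> bool:
    return "ignore previous instructions" not in text.lower()
```

Because downstream agents only ever read from `memory`, a blocked write means their individual vulnerabilities never come into play — which is why the paper calls write-node placement the highest-leverage decision.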
Why it matters
The traditional approach to evaluating prompt injection defenses reports a single attack success rate. This paper demonstrates why that's inadequate: a 0% ASR could mean the injection was blocked at the write stage, or it could mean it survived all the way to the terminal agent, which merely refused to execute it — architecturally very different outcomes. The kill-chain canary methodology gives engineers the diagnostic visibility to harden real pipelines rather than chasing aggregate metrics. For anyone deploying multi-agent systems, the finding that routing memory writes through a verified model is the single highest-leverage safety decision should reshape how these pipelines are architected.
What to do
- Map your agent pipeline's write stages — where does external content enter persistent storage or get passed to downstream agents?
- Consider routing memory-write operations through a verified/defended model, even if other stages use different models.
- Test document ingestion pipelines against invisible PDF payloads, not just visible text injection.
- Review the open-source code and run logs (GitHub) to adapt the evaluation methodology to your own agent architectures.
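The invisible-payload test in the list above can be driven by a throwaway fixture. Below is a minimal hand-rolled PDF built with only the standard library (the helper name and object layout are my own, not from the paper's code): one visible text line plus one line in text-render mode 3 (`3 Tr`), which draws nothing on screen but is still read by text-layer extractors — the same idea as a white-font payload.

```python
def make_invisible_text_pdf(path: str, visible: str, hidden: str) -> None:
    """Write a one-page PDF with a visible line and an invisible line.
    Note: `visible`/`hidden` must not contain (, ) or \\ — PDF string
    escaping is omitted to keep the sketch short."""
    # "3 Tr" sets text rendering mode 3: glyphs are laid out in the
    # text layer but never painted, so humans see nothing.
    stream = (f"BT /F1 12 Tf 72 720 Td ({visible}) Tj ET\n"
              f"BT /F1 12 Tf 3 Tr 72 700 Td ({hidden}) Tj ET").encode()
    objs = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        b"<< /Length %d >>\nstream\n%s\nendstream" % (len(stream), stream),
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for i, body in enumerate(objs, start=1):
        offsets.append(len(out))
        out += b"%d 0 obj\n%s\nendobj\n" % (i, body)
    xref = len(out)
    out += b"xref\n0 %d\n0000000000 65535 f \n" % (len(objs) + 1)
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF"
            % (len(objs) + 1, xref))
    with open(path, "wb") as f:
        f.write(bytes(out))
```

Feed the resulting file through your ingestion pipeline: if the hidden line shows up in extracted text or in a memory write, your rendered-layer screening has the exact blind spot the paper measures.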