Kill-Chain Canaries — arXiv:2603.28013 Stage-Level Prompt Injection Tracking

AI relevance: Kill-Chain Canaries reframes prompt injection as a pipeline-architecture problem, showing that the critical defense decision in multi-agent systems is not which model you pick — but where you place write-stage verification in the pipeline.

Haochuan Kevin Wang (MIT) and Zechen Zhang (University of Chicago) published arXiv:2603.28013, introducing a novel evaluation methodology that tracks prompt injection through four discrete stages — Exposed, Persisted, Relayed, and Executed — rather than relying on the traditional binary "did it succeed?" metric.
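The four-stage taxonomy lends itself to a simple tracking sketch: embed a unique canary token in the injected payload, then check which captured pipeline artifacts it surfaces in. A minimal illustration follows; the stage names come from the paper, but `classify_run` and the artifact layout are hypothetical, not the authors' code.

```python
# Hypothetical sketch of stage-level canary tracking, assuming a harness
# that captures text at each pipeline stage; names are illustrative.
STAGES = ["exposed", "persisted", "relayed", "executed"]

def classify_run(canary: str, artifacts: dict[str, str]) -> str:
    """Return the deepest kill-chain stage the canary reached.

    artifacts maps stage name -> text captured at that stage
    (model input, memory store, inter-agent message, tool call).
    """
    deepest = "blocked"
    for stage in STAGES:
        if canary in artifacts.get(stage, ""):
            deepest = stage
        else:
            break  # chain is broken: the canary did not propagate further
    return deepest
```

Classifying every run this way, instead of recording a single pass/fail bit, is what lets the paper distinguish a write-stage block from a terminal refusal.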

Key findings

  • 950 runs across 5 frontier models: Claude, GPT-4o-mini, DeepSeek, and others, each tested against six attack surfaces: web text, memory, tool streams, visible PDF, invisible PDF, and audio injection.
  • Claude blocks all injections at memory-write (0/164 ASR): While every model is fully exposed to injection at the input stage, Claude's behavior diverges dramatically at the write/persistence stage — stopping all attacks before they propagate.
  • GPT-4o-mini propagates at 53%: The same injection that Claude fully neutralizes passes through GPT-4o-mini's memory-write stage in over half of cases.
  • DeepSeek shows a 0%/100% split across surfaces: For the same model, ASR ranges from zero to total compromise depending on the injection channel, proving that surface coverage, not just model choice, determines safety posture.
  • Write-node placement is the highest-leverage decision: Routing memory writes through a verified model eliminates propagation entirely, regardless of downstream model vulnerabilities.
  • All four tested defenses fail on at least one surface: Channel mismatch alone is enough to defeat a defense; no adversarial adaptation is required.
  • Invisible white-font PDF payloads match or exceed visible-text ASR: Rendered-layer screening is insufficient, since attackers can hide injections in document layers that humans never see but LLMs process fully.
  • Real-world applicability: The research specifically addresses institutional NLP pipelines over earnings calls, SEC filings, and analyst reports — document-ingestion workflows now migrating to LLM agents.
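The write-node finding above can be pictured as a gate in front of persistent memory: every write passes through a verifier before anything is stored. A minimal sketch, assuming a verifier callable stands in for the defended model; the paper does not prescribe this API, and all names here are illustrative.

```python
# Illustrative write-gate: route every memory-write through a verifier
# before persisting, so an injection cannot propagate past this node.
from typing import Callable

class GatedMemory:
    def __init__(self, verify: Callable[[str], bool]):
        self._verify = verify          # stand-in for a defended model
        self._store: list[str] = []

    def write(self, content: str) -> bool:
        # Persist only content the verifier clears; flagged content is
        # dropped before any downstream agent can read it.
        if not self._verify(content):
            return False
        self._store.append(content)
        return True

    def read(self) -> list[str]:
        return list(self._store)
```

In a real deployment the verifier would itself call a hardened model, such as the write-stage configuration the paper measures; a keyword check like `lambda s: "ignore previous" not in s.lower()` is only a placeholder for demonstration.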

Why it matters

The traditional approach to evaluating prompt injection defenses reports a single attack success rate. This paper demonstrates why that is inadequate: a 0% ASR could mean the injection was blocked at the write stage, or that it survived the whole pipeline and only the terminal agent refused to execute, which are architecturally very different outcomes. The kill-chain canary methodology gives engineers the diagnostic visibility needed to harden real pipelines rather than chasing aggregate metrics. For anyone deploying multi-agent systems, the finding that routing memory writes through a verified model is the single highest-leverage safety decision should reshape architecture choices.
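The diagnostic the paper argues for can be made concrete: rather than one aggregate number, report the fraction of runs in which the injection reached at least each stage. A hypothetical sketch, assuming per-run labels produced by some kill-chain classifier (the function name and label set are mine, not the paper's):

```python
# Hypothetical per-stage breakdown: each run is labeled with the deepest
# stage its injection reached; "blocked" runs count toward no stage.
from collections import Counter

def stage_asr(outcomes: list[str], n_runs: int) -> dict[str, float]:
    """Fraction of runs in which the injection reached at least each stage."""
    order = ["exposed", "persisted", "relayed", "executed"]
    depth = {s: i for i, s in enumerate(order)}
    counts = Counter()
    for label in outcomes:
        if label in depth:
            # a run that reached stage k also reached every earlier stage
            for s in order[: depth[label] + 1]:
                counts[s] += 1
    return {s: counts[s] / n_runs for s in order}
```

Two pipelines with identical executed-stage ASR can show very different persisted-stage numbers here, which is exactly the architectural distinction a single aggregate metric erases.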

What to do

  • Map your agent pipeline's write stages — where does external content enter persistent storage or get passed to downstream agents?
  • Consider routing memory-write operations through a verified/defended model, even if other stages use different models.
  • Test document ingestion pipelines against invisible PDF payloads, not just visible text injection.
  • Review the open-source code and run logs (GitHub) to adapt the evaluation methodology to your own agent architectures.

Sources