Prompt Injection as Role Confusion — CoT Forgery Achieves 60% ASR on Frontier Models
AI relevance: This paper explains why prompt injection persists despite extensive safety training — models confuse attacker-crafted text for trusted instructions based on how it sounds, not where it comes from, directly impacting agent security design.
Charles Ye and colleagues released an updated version (v4, April 15) of "Prompt Injection as Role Confusion" (arxiv:2603.12277), tracing the fundamental cause of prompt injection vulnerabilities to how language models internally represent textual authority.
Key findings
- Role confusion is structural: Models infer the source of text based on stylistic cues — syntactic patterns, lexical choice — rather than actual provenance. A command hidden in a webpage hijacks an agent because it sounds like a user instruction.
- Internal representations confirm it: In the model's activation space, text that merely sounds like it comes from a trusted source occupies the same region as text that actually does. The model has no internal signal that separates the two.
- CoT Forgery, a zero-shot attack: The researchers inject fabricated chain-of-thought reasoning into user prompts or ingested webpages. Models mistake the forged reasoning for their own thoughts, achieving 60% attack success on StrongREJECT across frontier models, compared with a near-0% baseline for comparable attacks.
- Role confusion predicts attack success: The degree of role confusion measured by the authors' "role probes" strongly correlates with attack success rate — providing a measurable signal for vulnerability assessment.
- Unifying framework: The paper reframes prompt injection not as an ad-hoc exploit but as a measurable consequence of how models represent role, one that generalizes across standard agent prompt-injection attacks.
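The role-probe idea can be illustrated with a toy sketch: train a linear classifier on activation vectors labeled by role, then check how often attacker-crafted text that mimics the trusted style gets scored as trusted. All data below is synthetic stand-in material, not the paper's actual probes or activations; the dimensionality, cluster structure, and numbers are fabricated for illustration.

```python
# Toy sketch of a "role probe": a logistic-regression classifier over
# (synthetic) activation vectors that predicts whether text came from a
# trusted role (system/user) or an untrusted one (web content). The
# synthetic clusters are a stand-in for real model activations.
import numpy as np

rng = np.random.default_rng(0)
d = 32  # toy activation dimensionality

# Synthetic activation clusters for trusted vs. untrusted text.
mu_trusted = rng.normal(size=d)
mu_untrusted = rng.normal(size=d)
X = np.vstack([
    rng.normal(mu_trusted, 0.5, size=(200, d)),    # label 1: trusted
    rng.normal(mu_untrusted, 0.5, size=(200, d)),  # label 0: untrusted
])
y = np.array([1] * 200 + [0] * 200)

# Train the probe with plain gradient descent on logistic loss.
w, b = np.zeros(d), 0.0
for _ in range(500):
    z = np.clip(X @ w + b, -30, 30)   # clip to avoid exp overflow
    p = 1 / (1 + np.exp(-z))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

# Attacker text that *mimics* the trusted style lands near the trusted
# cluster, so the probe scores it as trusted: measurable role confusion.
attacker = rng.normal(mu_trusted, 0.5, size=(50, d))
z_att = np.clip(attacker @ w + b, -30, 30)
confusion = np.mean(1 / (1 + np.exp(-z_att)) > 0.5)
print(f"attacker text scored as trusted: {confusion:.0%}")
```

In this toy setup the confusion rate is trivially high because the attacker samples are drawn from the trusted cluster itself; the paper's point is that style-mimicking text empirically lands there too, which is what makes the probe's score a usable vulnerability signal.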
Why it matters
Most prompt injection defenses rely on delimiters, system prompt hardening, or output filtering, all of which treat the symptom rather than the cause. If the model's internal representation of "who is speaking" is fundamentally confused by attacker-controllable signals, then superficial defenses will continue to fail. CoT Forgery is particularly concerning because it targets the model's own reasoning process, making it invisible to output-based detection.
What to do
- Recognize that delimiter-based defenses address surface syntax, not the underlying role confusion mechanism.
- Implement defense-in-depth: combine input sanitization, output validation, and tool-level authorization (least privilege for agent tools).
- Monitor for role-probe-like signals in model activations as an early-warning research direction.
- Design agent architectures that separate untrusted data processing from privileged tool execution.
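The tool-level authorization point above can be sketched as a small registry where every tool declares a privilege level, and side-effecting tools are denied whenever the current turn includes attacker-controllable text. The names here (`ToolRegistry`, `Privilege`, `untrusted_context`) are illustrative, not from the paper or any specific agent framework.

```python
# Hedged sketch: least-privilege tool dispatch for an agent. Tools are
# tagged READ_ONLY or PRIVILEGED; calls made while processing untrusted
# data (e.g. fetched web content) may only use READ_ONLY tools.
from enum import Enum

class Privilege(Enum):
    READ_ONLY = 1   # safe while handling untrusted input
    PRIVILEGED = 2  # side effects: email, file writes, payments

class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name, fn, privilege):
        self._tools[name] = (fn, privilege)

    def call(self, name, *args, untrusted_context=False):
        fn, privilege = self._tools[name]
        # Authorization lives outside the model, so it holds even if
        # the model itself is role-confused by injected text.
        if untrusted_context and privilege is Privilege.PRIVILEGED:
            raise PermissionError(f"{name} blocked in untrusted context")
        return fn(*args)

registry = ToolRegistry()
registry.register("search", lambda q: f"results for {q}", Privilege.READ_ONLY)
registry.register("send_email", lambda to: f"sent to {to}", Privilege.PRIVILEGED)

print(registry.call("search", "role confusion", untrusted_context=True))
try:
    registry.call("send_email", "a@b.example", untrusted_context=True)
except PermissionError as e:
    print(e)
```

The design choice worth noting: the check is enforced in the dispatch layer rather than in the prompt, so a successful injection can at worst invoke read-only tools, not privileged ones.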