arXiv — ChatInject: abusing chat templates for prompt injection in LLM agents
• Category: Research
AI relevance: The paper demonstrates a higher-success prompt-injection method that targets how agentic LLMs format and interpret chat templates.
- ChatInject embeds malicious instructions inside content that mimics native chat-template structure, exploiting the model’s instruction-following bias rather than plain-text injections.
- The authors propose a multi-turn persuasion variant that primes agents across several turns to accept suspicious actions, not just a single injected response.
- Across benchmarks, the method boosts average attack success rates: 5.18% → 32.05% on AgentDojo and 15.13% → 45.90% on InjecAgent.
- Multi-turn dialogues reach an average 52.33% success rate on InjecAgent, indicating compounding risk in longer agent conversations.
- Chat-template payloads show transferability across models, remaining effective even when the target model’s internal template is unknown.
- The paper reports that prompt-based defenses are largely ineffective against the chat-template and multi-turn variants.
- The study focuses on indirect prompt injection — adversarial instructions embedded in external environment output that an agent ingests.
Why it matters
- Agent builders often rely on message-formatting as a safety boundary; ChatInject suggests that boundary can be co-opted by attackers who imitate template structure.
- Multi-turn persuasion increases real-world risk because production agents routinely operate over long conversations with external systems, not isolated prompts.
- Defenses that only scan raw text may miss payloads that look like “normal” chat formatting.
What to do
- Audit template handling in your agent framework: ensure system/user roles cannot be forged by untrusted inputs.
- Instrument multi-turn monitoring to detect gradual persuasion patterns that shift agent intent over time.
- Test against ChatInject-style prompts in red-team evaluations, especially for agents consuming external web or tool outputs.