arXiv — ChatInject: abusing chat templates for prompt injection in LLM agents

• Category: Research

AI relevance: The paper demonstrates a higher-success prompt-injection method that targets how agentic LLMs format and interpret chat templates.

  • ChatInject embeds malicious instructions inside content that mimics native chat-template structure, exploiting the model’s instruction-following bias rather than plain-text injections.
  • The authors propose a multi-turn persuasion variant that primes agents across several turns to accept suspicious actions, not just a single injected response.
  • Across benchmarks, the method boosts average attack success rates: 5.18% → 32.05% on AgentDojo and 15.13% → 45.90% on InjecAgent.
  • Multi-turn dialogues reach an average 52.33% success rate on InjecAgent, indicating compounding risk in longer agent conversations.
  • Chat-template payloads show transferability across models, remaining effective even when the target model’s internal template is unknown.
  • The paper reports that prompt-based defenses are largely ineffective against the chat-template and multi-turn variants.
  • The study focuses on indirect prompt injection — adversarial instructions embedded in external environment output that an agent ingests.

Why it matters

  • Agent builders often rely on message-formatting as a safety boundary; ChatInject suggests that boundary can be co-opted by attackers who imitate template structure.
  • Multi-turn persuasion increases real-world risk because production agents routinely operate over long conversations with external systems, not isolated prompts.
  • Defenses that only scan raw text may miss payloads that look like “normal” chat formatting.

What to do

  • Audit template handling in your agent framework: ensure system/user roles cannot be forged by untrusted inputs.
  • Instrument multi-turn monitoring to detect gradual persuasion patterns that shift agent intent over time.
  • Test against ChatInject-style prompts in red-team evaluations, especially for agents consuming external web or tool outputs.

Links