arXiv — Analysis of LLMs against prompt injection and jailbreak attacks

AI relevance: The paper benchmarks prompt-injection and jailbreak robustness across widely used open-source LLMs and shows that lightweight defenses used in production can be bypassed by longer, reasoning-heavy prompts.

  • Evaluates prompt-injection and jailbreak risks using a manually curated attack dataset.
  • Benchmarks multiple open-source families, including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen, and Gemma.
  • Finds wide behavioral variance between models, including refusal responses and silent non-responsiveness.
  • Tests lightweight inference-time defenses that act as filters without retraining or heavy fine-tuning.
  • Reports that these defenses are consistently bypassed by long, reasoning-heavy prompts.
  • Highlights that “silent” failures can mask successful injections by suppressing obvious refusal signals.

Why it matters

  • Model choice changes real-world injection risk, so security posture varies across stacks.
  • Filter-only defenses are fragile against adaptive attacks, especially when prompts are long.

What to do

  • Benchmark your exact model with long-form, reasoning-heavy attacks—not just short jailbreak prompts.
  • Log silent failures (non-responses, truncation) to detect stealthy injection outcomes.
  • Layer defenses with tool allowlists, egress controls, and human-in-the-loop gates for high-risk actions.

Sources