arXiv — Analysis of LLMs against prompt injection and jailbreak attacks
AI relevance: The paper benchmarks prompt-injection and jailbreak robustness across widely used open-source LLMs and shows that lightweight defenses used in production can be bypassed by longer, reasoning-heavy prompts.
- Evaluates prompt-injection and jailbreak risks using a manually curated attack dataset.
- Benchmarks multiple open-source families, including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen, and Gemma.
- Finds wide behavioral variance between models, including refusal responses and silent non-responsiveness.
- Tests lightweight inference-time defenses that act as filters without retraining or heavy fine-tuning.
- Reports that these defenses are consistently bypassed by long, reasoning-heavy prompts.
- Highlights that “silent” failures can mask successful injections by suppressing obvious refusal signals.
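The split between refusals and silent non-responses noted above can be made concrete with a small outcome classifier for benchmark runs. This is a minimal sketch, not the paper's methodology: the marker phrases, the `min_chars` threshold, and the names `classify_response` and `tally_outcomes` are illustrative assumptions.

```python
from collections import Counter

# Hypothetical refusal phrases (assumption); a real evaluation needs a validated list.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "unable to help")


def classify_response(text: str, min_chars: int = 20) -> str:
    """Label a model response as 'silent', 'refusal', or 'answered'."""
    stripped = text.strip()
    if not stripped:
        # Empty or whitespace-only output: no refusal signal at all.
        return "silent"
    # Normalize curly apostrophes so "I’m sorry" matches "i'm sorry".
    lowered = stripped.lower().replace("\u2019", "'")
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return "refusal"
    if len(stripped) < min_chars:
        # Very short, non-refusing output is treated as a likely truncation.
        return "silent"
    return "answered"


def tally_outcomes(responses):
    """Count outcome labels across a batch of responses from one model."""
    return Counter(classify_response(r) for r in responses)
```

An "answered" label only means the model produced substantive text; whether the injection actually succeeded still needs a separate check, such as manual review.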
Why it matters
- Model choice changes real-world injection risk, so security posture varies across stacks.
- Filter-only defenses are fragile against adaptive attacks, especially when prompts are long.
What to do
- Benchmark the exact model you deploy with long-form, reasoning-heavy attacks, not just short jailbreak prompts.
- Log silent failures (non-responses, truncation) to detect stealthy injection outcomes.
- Layer defenses with tool allowlists, egress controls, and human-in-the-loop gates for high-risk actions.
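The layering idea in the last item can be sketched as a gate that every model-requested tool call must pass before it runs. The tool names, the `ToolCall` type, and the `approve` callback are hypothetical and stand in for whatever agent framework is in use. Egress controls are not shown, since they are usually enforced at the network layer rather than in application code.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical tool names (assumption); replace with the tools your agent exposes.
ALLOWED_TOOLS = {"search_docs", "read_file", "send_email"}
HIGH_RISK_TOOLS = {"send_email"}


@dataclass
class ToolCall:
    name: str
    args: dict


def gate_tool_call(call: ToolCall, approve: Callable[[ToolCall], bool]) -> str:
    """Return 'blocked', 'denied', 'approved', or 'allowed' for a requested call."""
    if call.name not in ALLOWED_TOOLS:
        # Allowlist check: anything not explicitly permitted never runs.
        return "blocked"
    if call.name in HIGH_RISK_TOOLS:
        # Human-in-the-loop gate: a reviewer decides on high-risk actions.
        return "approved" if approve(call) else "denied"
    return "allowed"
```

For example, `gate_tool_call(ToolCall("send_email", {"to": "x@example.com"}), approve=ask_reviewer)` would defer to a reviewer, where `ask_reviewer` is a placeholder for your own approval flow. Because the gate sits outside the model, an injected prompt that slips past an input filter still cannot trigger a tool that is off the allowlist.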