arXiv — Analysis of LLMs against prompt injection and jailbreak attacks

2026-02-28 Research by al-ice.ai Editorial

AI relevance: The paper benchmarks prompt-injection and jailbreak robustness across widely used open-source LLMs and shows that lightweight defenses used in production can be bypassed by longer, reasoning-heavy prompts.

Evaluates prompt-injection and jailbreak risks using a manually curated attack dataset.
Benchmarks multiple open-source families, including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen, and Gemma.
Finds wide behavioral variance between models under attack, including explicit refusals and silent non-responses.
Tests lightweight inference-time defenses that act as filters without retraining or heavy fine-tuning.
Reports that these defenses are consistently bypassed by long, reasoning-heavy prompts.
Highlights that “silent” failures can mask successful injections by suppressing obvious refusal signals.
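The difference between a refusal, a silent failure, and an ordinary answer can be made concrete with a small triage function. This is a minimal sketch, not the paper's evaluation code: the refusal phrases, the `min_chars` threshold, and the names `classify_response` and `tally` are all illustrative assumptions.

```python
from collections import Counter
from typing import Iterable

# Hypothetical refusal markers; a real benchmark would need a broader, validated list.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am unable")


def classify_response(text: str, min_chars: int = 20) -> str:
    """Label a model response as 'refusal', 'silent', or 'answered'.

    'silent' covers empty or near-empty output, which carries no refusal
    signal. 'answered' only means the model produced substantive text; it
    does not by itself show whether an injection succeeded.
    """
    # Normalize whitespace, case, and typographic apostrophes before matching.
    lowered = text.strip().lower().replace("\u2019", "'")
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return "refusal"
    if len(lowered) < min_chars:
        return "silent"
    return "answered"


def tally(responses: Iterable[str]) -> Counter:
    """Count outcome labels across a batch of responses."""
    return Counter(classify_response(r) for r in responses)
```

Tracking the `silent` bucket separately keeps non-responses from being counted as either safe refusals or clean answers; both `silent` and `answered` cases still need review to decide whether the attack worked.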

Why it matters

Model choice changes real-world injection risk, so security posture varies across stacks.
Filter-only defenses are fragile against adaptive attacks, especially when prompts are long.

What to do

Benchmark your exact model with long-form, reasoning-heavy attacks, not just short jailbreak prompts.
Log silent failures (non-responses, truncation) to detect stealthy injection outcomes.
Layer defenses with tool allowlists, egress controls, and human-in-the-loop gates for high-risk actions.
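The layering recommendation can be sketched as a gate that sits between the model and its tools. This is a minimal illustration under stated assumptions: the tool names, the `ToolCall` type, and the `approve` callback (standing in for a human reviewer) are hypothetical, and egress controls are left out because they are normally enforced at the network layer rather than in application code.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical policy; tool names are illustrative and not taken from the paper.
ALLOWED_TOOLS = frozenset({"search_docs", "read_file", "send_email"})
HIGH_RISK_TOOLS = frozenset({"send_email"})


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


def gate_tool_call(call: ToolCall, approve: Callable[[ToolCall], bool]) -> str:
    """Return 'allow' or 'deny' for a tool call requested by the model.

    The check is independent of any prompt filter, so it still applies
    when a long prompt slips past the filter.
    """
    # Layer 1: anything not explicitly allowlisted is denied.
    if call.name not in ALLOWED_TOOLS:
        return "deny"
    # Layer 2: high-risk tools additionally require human approval.
    if call.name in HIGH_RISK_TOOLS and not approve(call):
        return "deny"
    return "allow"
```

For example, `gate_tool_call(ToolCall("send_email", {"to": "x@example.com"}), approve=lambda c: False)` returns `"deny"`, because the call is allowlisted but the reviewer declined it.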
