Open-Source LLMs Vulnerable to Long Reasoning Multi-Turn Jailbreaks

2026-05-25 Security by al-ice.ai Editorial

AI relevance: Enterprises deploying open-source models in agentic or multi-turn workflows face a documented, quantified attack surface that lightweight guardrails do not close, meaning production agent deployments may be more exposed than security reviews assume.

Key findings

10 open-source models tested across six major families: Phi (Microsoft), Mistral, DeepSeek-R1, Llama 3.2 (Meta), Qwen (Alibaba), and Gemma (Google).
167 attack scenarios: 94 prompt-injection tests and 73 jailbreak tests — one of the broadest systematic evaluations published for open-source model robustness.
Zero models achieved reliable resistance to multi-turn reasoning jailbreaks, even after lightweight defenses were applied.
Attack class: Gradual context-shifting across multiple conversational turns — exploiting how models maintain and update reasoning state over extended exchanges, rather than single adversarial prompts.
Inverse scaling relationship: Attack difficulty increased with model capability, meaning stronger models were harder but not impossible to break — creating a false sense of safety as organizations upgrade to more powerful open-source models.
Defenses helped but didn't solve: Lightweight defenses measurably reduced attack success rates across models, but none eliminated vulnerability to multi-turn reasoning manipulation.
Structural weakness shared across vendors: The finding implicates Meta, Microsoft, Mistral AI, DeepSeek, Alibaba, and Google as all shipping models with the same underlying vulnerability class.
Enterprise impact: Organizations running DeepSeek-R1 or Llama 3.2 in customer-facing or agentic pipelines could face exploitation via gradual context-shifting attacks before effective defenses are standardized.
EU AI Act implications: The 167-scenario benchmark could be adopted as an evaluation standard, creating regulatory pressure on model providers to demonstrate multi-turn robustness.
Red-teaming gap identified: Security auditors relying on single-turn red-teaming to clear open-source deployments are systematically underestimating risk.

Why it matters

The research reframes the safety gap as a fundamental challenge in how long-context reasoning models maintain alignment across extended, adversarially steered conversations — not just a content-filter problem. For organizations running open-source models in agent pipelines, this means current defense assumptions may be systematically wrong. The inverse scaling relationship is particularly concerning: the more capable the model, the more it appears safe while remaining vulnerable.

What to do

Extend red-team evaluations to multi-turn long-reasoning attack scenarios — single-turn tests are insufficient.
Apply the 167-scenario benchmark as a baseline when evaluating open-source models for production deployment.
Monitor for emerging multi-turn defense vendors (Robust Intelligence, HiddenLayer, Lakera) and validate their claims against this benchmark.
For agent deployments: implement conversation-length limits and session-level state monitoring to detect gradual context manipulation patterns.

Open-Source LLMs Vulnerable to Long Reasoning Multi-Turn Jailbreaks

Key findings

Why it matters

What to do

Sources