Open-Source LLMs Vulnerable to Long Reasoning Multi-Turn Jailbreaks
AI relevance: Enterprises deploying open-source models in agentic or multi-turn workflows face a documented, quantified attack surface that lightweight guardrails do not close, meaning production agent deployments may be more exposed than security reviews assume.
Key findings
- 10 open-source models tested across six major families: Phi (Microsoft), Mistral, DeepSeek-R1, Llama 3.2 (Meta), Qwen (Alibaba), and Gemma (Google).
- 167 attack scenarios: 94 prompt-injection tests and 73 jailbreak tests — one of the broadest systematic evaluations published for open-source model robustness.
- Zero models achieved reliable resistance to multi-turn reasoning jailbreaks, even after lightweight defenses were applied.
- Attack class: Gradual context-shifting across multiple conversational turns — exploiting how models maintain and update reasoning state over extended exchanges, rather than single adversarial prompts.
- Inverse scaling relationship: Attack difficulty increased with model capability, meaning stronger models were harder but not impossible to break — creating a false sense of safety as organizations upgrade to more powerful open-source models.
- Defenses helped but didn't solve: Lightweight defenses measurably reduced attack success rates across models, but none eliminated vulnerability to multi-turn reasoning manipulation.
- Structural weakness shared across vendors: The finding implicates Meta, Microsoft, Mistral AI, DeepSeek, Alibaba, and Google as all shipping models with the same underlying vulnerability class.
- Enterprise impact: Organizations running DeepSeek-R1 or Llama 3.2 in customer-facing or agentic pipelines could face exploitation via gradual context-shifting attacks before effective defenses are standardized.
- EU AI Act implications: The 167-scenario benchmark could be adopted as an evaluation standard, creating regulatory pressure on model providers to demonstrate multi-turn robustness.
- Red-teaming gap identified: Security auditors relying on single-turn red-teaming to clear open-source deployments are systematically underestimating risk.
Why it matters
The research reframes the safety gap as a fundamental challenge in how long-context reasoning models maintain alignment across extended, adversarially steered conversations — not just a content-filter problem. For organizations running open-source models in agent pipelines, this means current defense assumptions may be systematically wrong. The inverse scaling relationship is particularly concerning: the more capable the model, the more it appears safe while remaining vulnerable.
What to do
- Extend red-team evaluations to multi-turn long-reasoning attack scenarios — single-turn tests are insufficient.
- Apply the 167-scenario benchmark as a baseline when evaluating open-source models for production deployment.
- Monitor for emerging multi-turn defense vendors (Robust Intelligence, HiddenLayer, Lakera) and validate their claims against this benchmark.
- For agent deployments: implement conversation-length limits and session-level state monitoring to detect gradual context manipulation patterns.