Open-Source LLMs Vulnerable to Long-Reasoning Multi-Turn Jailbreaks

AI relevance: Enterprises deploying open-source models in agentic or multi-turn workflows face a documented, quantified attack surface that lightweight guardrails do not close. Production agent deployments may therefore be more exposed than security reviews assume.

Key findings

  • 10 open-source models tested across six major families: Phi (Microsoft), Mistral, DeepSeek-R1, Llama 3.2 (Meta), Qwen (Alibaba), and Gemma (Google).
  • 167 attack scenarios: 94 prompt-injection tests and 73 jailbreak tests, making this one of the broadest systematic evaluations published for open-source model robustness.
  • Zero models achieved reliable resistance to multi-turn reasoning jailbreaks, even after lightweight defenses were applied.
  • Attack class: Gradual context-shifting across multiple conversational turns. Rather than relying on a single adversarial prompt, the attacks exploit how models maintain and update reasoning state over extended exchanges.
  • Inverse scaling relationship: Attack difficulty increased with model capability. Stronger models were harder, but not impossible, to break, which can create a false sense of safety as organizations upgrade to more powerful open-source models.
  • Defenses helped but did not solve the problem: Lightweight defenses measurably reduced attack success rates across models, but none eliminated vulnerability to multi-turn reasoning manipulation.
  • Structural weakness shared across vendors: The finding implicates Meta, Microsoft, Mistral AI, DeepSeek, Alibaba, and Google as all shipping models with the same underlying vulnerability class.
  • Enterprise impact: Organizations running DeepSeek-R1 or Llama 3.2 in customer-facing or agentic pipelines could face exploitation via gradual context-shifting attacks before effective defenses are standardized.
  • EU AI Act implications: The 167-scenario benchmark could be adopted as an evaluation standard, creating regulatory pressure on model providers to demonstrate multi-turn robustness.
  • Red-teaming gap identified: Security auditors relying on single-turn red-teaming to clear open-source deployments are systematically underestimating risk.
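
The findings above are framed in terms of attack success rates with and without lightweight defenses. As a rough illustration of how such a rate can be tallied, here is a minimal sketch. The `ScenarioResult` record, the field names, and the demo outcomes are all hypothetical and are not taken from the study.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioResult:
    model: str        # hypothetical model label
    category: str     # e.g. "prompt_injection" or "jailbreak"
    defended: bool    # True if a lightweight defense was active
    succeeded: bool   # True if the attack achieved its goal


def attack_success_rate(results):
    """Return {(model, defended): fraction of scenarios where the attack succeeded}."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for r in results:
        key = (r.model, r.defended)
        totals[key] += 1
        wins[key] += int(r.succeeded)
    return {key: wins[key] / totals[key] for key in totals}


# Made-up outcomes for illustration only; these are not figures from the study.
demo = [
    ScenarioResult("model-a", "jailbreak", False, True),
    ScenarioResult("model-a", "jailbreak", False, True),
    ScenarioResult("model-a", "jailbreak", True, True),
    ScenarioResult("model-a", "jailbreak", True, False),
]
rates = attack_success_rate(demo)
print(rates[("model-a", False)], rates[("model-a", True)])
```

Grouping by (model, defended) makes it easy to see the pattern the brief describes: a defense can lower the rate without bringing it to zero.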

Why it matters

The research reframes the safety gap as more than a content-filter problem: it is a fundamental challenge in how long-context reasoning models maintain alignment across extended, adversarially steered conversations. For organizations running open-source models in agent pipelines, this means current defense assumptions may be systematically wrong. The inverse scaling relationship is particularly concerning: the more capable the model, the safer it appears, even though it remains vulnerable.

What to do

  • Extend red-team evaluations to multi-turn long-reasoning attack scenarios; single-turn tests alone are insufficient.
  • Apply the 167-scenario benchmark as a baseline when evaluating open-source models for production deployment.
  • Monitor for emerging multi-turn defense vendors (Robust Intelligence, HiddenLayer, Lakera) and validate their claims against this benchmark.
  • For agent deployments: implement conversation-length limits and session-level state monitoring to detect gradual context manipulation patterns.
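
The last recommendation, conversation-length limits plus session-level monitoring, can be sketched as follows. This is a minimal illustration under stated assumptions, not a vetted defense: the `SessionGuard` class, its thresholds, and the word-overlap drift measure are all hypothetical, and a real deployment would likely swap the word-set comparison for an embedding model or a trained classifier.

```python
from dataclasses import dataclass, field


def _tokens(text):
    """Lowercased word set; a crude stand-in for a real embedding."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two word sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


@dataclass
class SessionGuard:
    max_turns: int = 20          # hypothetical conversation-length limit
    min_similarity: float = 0.1  # hypothetical drift threshold
    anchor: set = field(default_factory=set)
    turns: int = 0

    def check(self, user_message):
        """Return "ok", "drift", or "limit" for the next user turn."""
        self.turns += 1
        if self.turns > self.max_turns:
            return "limit"
        words = _tokens(user_message)
        if self.turns == 1:
            self.anchor = words  # first turn defines the session's stated topic
            return "ok"
        if jaccard(self.anchor, words) < self.min_similarity:
            return "drift"
        return "ok"


guard = SessionGuard(max_turns=3)
print(guard.check("help me reset my account password"))
print(guard.check("my account password reset email never arrived"))
print(guard.check("describe how to bypass building alarm systems"))
print(guard.check("one more question"))
```

Each turn is compared against the first turn instead of the previous one. A gradual shift keeps consecutive turns similar to each other, so a turn-to-turn comparison would tend to miss exactly the pattern this attack class relies on.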

Sources