Adversa AI — IICL Bypasses GPT-5.4 Safety While GPT-5 Remains Immune

Researchers at Adversa AI published results from 3,500+ controlled probes showing that GPT-5.4 is vulnerable to a novel safety bypass technique called Involuntary In-Context Learning (IICL), achieving a 60% attack success rate (ASR), while GPT-5 and GPT-5-mini remain at 0%.

How IICL works

Unlike conventional jailbreaks that rely on role-play, encoding tricks, or multi-turn escalation, IICL targets the structural layer of safety alignment rather than the content layer:

  • The attack provides just 10 examples and 2 structural keywords (answer and is_valid) — no ciphers, no persona prompts, no obfuscation.
  • Harmful content sits in plain text, fully visible to the model.
  • IICL exploits the tension between a model's in-context learning mechanism and its safety training — pattern completion wins because the model processes task structure before content semantics.
  • This is a different attack surface from most red-teaming programs, which test the linguistic layer (personas, encodings) rather than the structural layer.
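To make the "structural layer" concrete, here is a minimal sketch of the kind of scaffold the report describes: a set of few-shot examples keyed by the two structural fields (answer and is_valid), serialized as plain text with no encoding or obfuscation. The field names come from the report; the question field, helper function, and the entirely harmless placeholder content are illustrative assumptions.

```python
# Hypothetical illustration of the structural pattern described above.
# Only the field names "answer" and "is_valid" come from the report;
# everything else is an assumption, and the content is deliberately benign.

few_shot_examples = [
    {"question": "What is the capital of France?",
     "answer": "Paris",
     "is_valid": True},
    {"question": "What is 2 + 2?",
     "answer": "4",
     "is_valid": True},
    # ... 8 more examples in the same shape, per the reported 10-example setup
]

def render_prompt(examples, final_question):
    """Serialize examples into a plain-text few-shot block.

    Note that nothing here is encoded or role-played: the repeated
    example -> answer -> is_valid scaffold is the whole trick, which is
    why the report calls this a structural rather than linguistic attack.
    """
    lines = []
    for ex in examples:
        lines.append(f"question: {ex['question']}")
        lines.append(f"answer: {ex['answer']}")
        lines.append(f"is_valid: {ex['is_valid']}")
        lines.append("")
    lines.append(f"question: {final_question}")
    lines.append("answer:")  # the model pattern-completes from here
    return "\n".join(lines)
```

The prompt ends mid-pattern ("answer:"), which is what invites pure pattern completion before content semantics are weighed.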

Key findings

  • 60% ASR on GPT-5.4 under optimal configuration — keylogger generation, molotov instructions, and phishing content all bypassed.
  • 0% ASR on GPT-5, GPT-5-mini, GPT-5-pro, GPT-5.2, GPT-5.2-pro, and GPT-5.4-pro. The vulnerability was introduced in updates after GPT-5.
  • 632 average words per bypassed response — the model produces substantial harmful output, not just token-level slips.
  • The "pro" variants across all generations remain immune, suggesting that enhanced safety training or different alignment approaches close this structural gap.

Why it matters

  • Newer ≠ safer. GPT-5.4 — the newest standard model — is less resistant to this attack class than the model it replaced. Teams that upgraded from GPT-5 to GPT-5.4 may have introduced a safety regression without knowing it.
  • Structural attacks bypass standard red-teaming. Most safety evaluations focus on linguistic jailbreaks. IICL demonstrates that attacks targeting the model's in-context learning pipeline operate on a different attack surface that routine testing misses.
  • Continuous model-level red-teaming is essential. Each model update — even minor version bumps — should be re-tested against structural attack classes, not just the ones that were previously patched.

What to do

  • If you use GPT-5.4 via API, test your deployment against IICL-style patterns. The attack requires only standard API access, not model weights or any internal access.
  • Add structural-pattern detection to your guardrail stack. Look for prompt patterns that set up example-answer structures with is_valid-type framing, which may signal IICL-style attempts.
  • Prefer "pro" variants for safety-critical applications. The data shows pro models across all generations resist this attack class entirely.
  • Run continuous red-teaming against each model version you deploy. Don't assume safety properties carry forward from one version to the next.
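The structural-pattern detection suggested above can be sketched as a simple heuristic: count how often each structural key appears as a field label in an incoming prompt, and flag prompts that repeat the scaffold many times. The key names (answer, is_valid) follow the report; the regex, threshold, and function name are illustrative assumptions, not a production guardrail.

```python
import re

# Structural keys named in the Adversa AI report; the detection
# heuristic built around them is an illustrative assumption.
STRUCT_KEYS = ("answer", "is_valid")

def looks_like_iicl(prompt: str, min_repeats: int = 5) -> bool:
    """Flag prompts that repeat an answer/is_valid scaffold.

    Counts occurrences of each structural key used as a field label
    (start of line, followed by ':' or '='). If both keys repeat at
    least min_repeats times, the prompt resembles the few-shot
    scaffold IICL relies on and deserves closer inspection.
    """
    for key in STRUCT_KEYS:
        pattern = re.compile(rf'^\s*"?{key}"?\s*[:=]',
                             re.MULTILINE | re.IGNORECASE)
        if len(pattern.findall(prompt)) < min_repeats:
            return False
    return True
```

A flagged prompt is not necessarily malicious (legitimate few-shot tasks use similar scaffolds), so this is best treated as a signal to route the request for stricter review rather than an outright block.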

Sources