Adversa AI — IICL Bypasses GPT-5.4 Safety While GPT-5 Remains Immune
Researchers at Adversa AI published results from 3,500+ controlled probes showing that GPT-5.4 is vulnerable to a novel safety bypass technique called Involuntary In-Context Learning (IICL), achieving a 60% attacker success rate — while GPT-5 and GPT-5-mini remain at 0%.
How IICL works
Unlike conventional jailbreaks that rely on role-play, encoding tricks, or multi-turn escalation, IICL targets the structural layer of safety alignment rather than the content layer:
- The attack provides just 10 examples and 2 structural keywords ("answer" and "is_valid"): no ciphers, no persona prompts, no obfuscation.
- Harmful content sits in plain text, fully visible to the model.
- IICL exploits the tension between a model's in-context learning mechanism and its safety training — pattern completion wins because the model processes task structure before content semantics.
- This is a different attack surface from most red-teaming programs, which test the linguistic layer (personas, encodings) rather than the structural layer.
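To make the structural layer concrete, here is a minimal sketch of the kind of few-shot template the attack class relies on, using only benign content. The field names and layout are assumptions for illustration; the article names the two structural keywords ("answer" and "is_valid") but does not publish the exact template.

```python
# Illustrative sketch (benign content only) of the structural pattern IICL
# exploits: a few-shot block whose examples teach a completion pattern keyed
# by the two structural keywords the researchers cite. The exact template
# Adversa AI used is not reproduced here; this layout is an assumption.

def build_pattern_prompt(examples, target_question):
    """Assemble a few-shot prompt where each example pairs a question with
    an 'answer' field and an 'is_valid' flag. The final entry leaves
    'answer' blank, so in-context pattern completion fills it in."""
    lines = []
    for question, answer in examples:
        lines.append(f"question: {question}")
        lines.append(f"answer: {answer}")
        lines.append("is_valid: true")
        lines.append("")
    lines.append(f"question: {target_question}")
    lines.append("answer:")
    return "\n".join(lines)

# Benign demonstration of the structure:
demo = build_pattern_prompt(
    [("What is 2+2?", "4"), ("Capital of France?", "Paris")],
    "What is 3+3?",
)
```

The point is that nothing in the prompt is encoded or disguised; the leverage comes entirely from the repeated structure, which the article argues the model processes before content semantics.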
Key findings
- 60% ASR on GPT-5.4 under optimal configuration — keylogger generation, molotov instructions, and phishing content all bypassed.
- 0% ASR on GPT-5, GPT-5-mini, GPT-5-pro, GPT-5.2, GPT-5.2-pro, and GPT-5.4-pro. The vulnerability was introduced in updates after GPT-5.
- 632 average words per bypassed response — the model produces substantial harmful output, not just token-level slips.
- The "pro" variants across all generations remain immune, suggesting that enhanced safety training or different alignment approaches close this structural gap.
Why it matters
- Newer ≠ safer. GPT-5.4 — the newest standard model — is less resistant to this attack class than the model it replaced. Teams that upgraded from GPT-5 to GPT-5.4 may have introduced a safety regression without knowing it.
- Structural attacks bypass standard red-teaming. Most safety evaluations focus on linguistic jailbreaks. IICL demonstrates that attacks targeting the model's in-context learning pipeline operate on a different attack surface that routine testing misses.
- Continuous model-level red-teaming is essential. Each model update — even minor version bumps — should be re-tested against structural attack classes, not just the ones that were previously patched.
What to do
- If you use GPT-5.4 via API, consider testing your specific prompts against IICL-style patterns. The attack requires only standard API access — no model weights or internal access.
- Add structural-pattern detection to your guardrail stack. Look for prompt patterns that set up example-answer structures with "is_valid"-type framing, which may signal IICL-style attempts.
- Prefer "pro" variants for safety-critical applications. The data shows pro models across all generations resist this attack class entirely.
- Run continuous red-teaming against each model version you deploy. Don't assume safety properties carry forward from one version to the next.
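The structural-pattern detection suggested above can start as a simple heuristic. The sketch below flags prompts that repeat answer/is_valid framing enough times to resemble a forced in-context-learning setup; the field names, regexes, and threshold are assumptions to tune against your own traffic, not a vetted detector.

```python
import re

# Heuristic guardrail check (a sketch, not a product): flags prompts that
# repeat example-answer structures with "is_valid"-style framing, the
# structural signature the article associates with IICL-style attempts.
# Field names and the repeat threshold are illustrative assumptions.

ANSWER_RE = re.compile(r"^\s*answer\s*[:=]", re.IGNORECASE | re.MULTILINE)
VALID_RE = re.compile(r"^\s*is_valid\s*[:=]", re.IGNORECASE | re.MULTILINE)

def looks_like_iicl(prompt: str, min_repeats: int = 5) -> bool:
    """Return True when the prompt contains at least `min_repeats`
    lines of both 'answer:' and 'is_valid:' framing."""
    answers = len(ANSWER_RE.findall(prompt))
    valids = len(VALID_RE.findall(prompt))
    return answers >= min_repeats and valids >= min_repeats
```

A legitimate prompt with a single "is_valid" key (say, one JSON payload) stays below the threshold, while a ten-example block trips it; expect to pair this with allowlisting for workloads that legitimately use many few-shot examples.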