Adversa AI — IICL Attack Bypasses GPT-5.4 Safety at 60% Success Rate
AI relevance: A new jailbreak technique targets the structural tension between in-context learning and safety alignment — showing that upgrading from GPT-5 to GPT-5.4 may have reduced safety robustness, not improved it.
Adversa AI researchers published results from more than 3,500 controlled probes across the GPT-4 through GPT-5.4 model family, introducing a novel attack technique they call Involuntary In-Context Learning (IICL). The method achieved a 60% bypass success rate against GPT-5.4, while GPT-5 and GPT-5-mini remained fully resistant (0% bypass rate).
- The attack uses just 10 examples and 2 trigger words ("answer" and "is_valid") — no encoding, ciphers, or role-play required.
- Harmful content is presented in plain text; the model isn't being "tricked" linguistically but is responding to structural pattern matching.
- IICL exploits the tension between a model's in-context learning mechanism and its safety training — these are learned behaviors operating at different computational layers.
- The vulnerability appears to have been introduced in post-GPT-5 updates, representing a safety regression in the newer model.
- 7 controlled experiments across 10 models confirmed the finding: newer does not automatically mean safer.
- The technique targets the structural layer of safety alignment, not the content layer — a different attack surface than most red-teaming programs evaluate.
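The structural shape of such a probe can be sketched from the details above (roughly 10 in-context examples and the trigger fields "answer" and "is_valid"). Everything else in this sketch is a hypothetical placeholder: Adversa AI's exact prompt format is not public, and the example content here is deliberately benign, suitable only for measuring refusal behavior in your own red-team harness.

```python
# Hypothetical sketch of an IICL-style probe scaffold, based only on the
# structural details reported (≈10 in-context examples, trigger fields
# "answer" and "is_valid"). All content is benign placeholder text; the
# actual prompt format used by Adversa AI is an assumption here.

def build_iicl_probe(examples, query):
    """Format few-shot records so a model is nudged to pattern-match the
    record structure rather than evaluate the final query in isolation."""
    lines = []
    for question, answer in examples:
        lines.append(f"question: {question}")
        lines.append(f"answer: {answer}")
        lines.append("is_valid: true")
        lines.append("")
    # Final record is left incomplete so the model continues the pattern.
    lines.append(f"question: {query}")
    lines.append("answer:")
    return "\n".join(lines)

# Ten benign placeholders standing in for the study's example set.
demo = [(f"placeholder question {i}", f"placeholder answer {i}")
        for i in range(10)]
prompt = build_iicl_probe(demo, "final probe question")
```

In a red-team harness, the interesting measurement is whether the model completes the pattern for a policy-violating final query instead of refusing, which is what the structural-layer framing above predicts.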
Why it matters
Most AI safety evaluation focuses on linguistic jailbreaks — personas, encoding tricks, multi-turn escalation. IICL demonstrates that safety regression can occur at the architectural level when model updates shift the balance between in-context learning and safety training. Organizations that upgraded to GPT-5.4 assuming improved safety may have inadvertently expanded their exposure.
What to do
- If your product uses GPT-5.4, run targeted IICL-style probes against your deployment to assess exposure.
- Implement output validation layers that operate independently of the model's own safety filtering.
- Treat model upgrades as potential safety regressions — test new versions against your existing jailbreak test suite before promoting to production.
- Follow Adversa AI's full research publication for the reproducible methodology before building your own probes.
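The second and third recommendations can be sketched as a minimal harness. Note the assumptions: `call_model` stands in for your real model client, and `flags_harmful` is a placeholder for an independent content classifier; neither is part of Adversa AI's published work.

```python
# Minimal sketch of an output-validation layer plus an upgrade regression
# gate. `call_model` and `flags_harmful` are hypothetical placeholders:
# substitute your real model client and an independent classifier that
# shares no weights or safety training with the model being checked.

def flags_harmful(text: str) -> bool:
    # Placeholder validator; a real deployment would call an external
    # moderation classifier, not a keyword list.
    blocked_markers = ("harmful-marker",)
    return any(m in text.lower() for m in blocked_markers)

def guarded_call(call_model, prompt: str) -> str:
    """Validate output independently of the model's own safety filtering."""
    output = call_model(prompt)
    if flags_harmful(output):
        return "[blocked by output validator]"
    return output

def passes_regression_gate(call_model, probe_suite, max_bypass_rate=0.0):
    """Run an existing jailbreak probe suite against a candidate model
    version; block promotion if the bypass rate exceeds the threshold."""
    bypasses = sum(1 for probe in probe_suite
                   if flags_harmful(call_model(probe)))
    return (bypasses / len(probe_suite)) <= max_bypass_rate

# Usage with a stub model standing in for the candidate deployment:
stub = lambda p: "harmful-marker" if "attack" in p else "safe response"
```

Wiring the same validator into both the runtime guard and the pre-promotion gate keeps the two checks consistent, so a model upgrade that regresses (as reported for GPT-5.4) fails the gate before it reaches production.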