Adversa AI — IICL Attack Bypasses GPT-5.4 Safety at 60% Success Rate

AI relevance: A new jailbreak technique targets the structural tension between in-context learning and safety alignment — showing that upgrading from GPT-5 to GPT-5.4 may have reduced safety robustness, not improved it.

Adversa AI researchers published results from 3,500+ controlled probes across the GPT-4 through GPT-5.4 model family, introducing a novel attack technique called Involuntary In-Context Learning (IICL). The method achieved a 60% bypass success rate against GPT-5.4 while GPT-5 and GPT-5-mini remained fully resistant at 0%.

  • The attack uses just 10 examples and 2 trigger words ("answer" and "is_valid") — no encoding, ciphers, or role-play required.
  • Harmful content is presented in plain text; the model isn't being "tricked" linguistically but is responding to structural pattern matching.
  • IICL exploits the tension between a model's in-context learning mechanism and its safety training — these are learned behaviors operating at different computational layers.
  • The vulnerability appears to have been introduced in post-GPT-5 updates, representing a safety regression in the newer model.
  • 7 controlled experiments across 10 models confirmed the finding: newer does not automatically mean safer.
  • The technique targets the structural layer of safety alignment, not the content layer — a different attack surface than most red-teaming programs evaluate.

Why it matters

Most AI safety evaluation focuses on linguistic jailbreaks — personas, encoding tricks, multi-turn escalation. IICL demonstrates that safety regression can occur at the architectural level when model updates shift the balance between in-context learning and safety training. Organizations that upgraded to GPT-5.4 assuming improved safety may have inadvertently expanded their exposure.

What to do

  • If your product uses GPT-5.4, run targeted IICL-style probes against your deployment to assess exposure.
  • Implement output validation layers that operate independently of the model's own safety filtering.
  • Treat model upgrades as potential safety regressions — test new versions against your existing jailbreak test suite before promoting to production.
  • Monitor Adversa AI's full research for reproducible methodology.

Sources