arXiv — Jailbreaking leaves a trace via latent representations

• Category: Research

  • The paper studies jailbreaks by inspecting internal representations rather than only prompt-level defenses.
  • Authors analyze layer-wise activations across GPT-J, LLaMA, Mistral, and Mamba to find consistent latent patterns.
  • They propose a tensor-based latent representation framework for lightweight jailbreak detection.
  • The method does not require fine-tuning or auxiliary LLM-based detectors.
  • Latent signals can be used to disrupt jailbreak execution at inference time.
  • On an abliterated LLaMA-3.1-8B model, selectively bypassing high-susceptibility layers blocked 78% of jailbreak attempts while preserving 94% benign behavior.

Why it matters

  • Prompt filters alone miss adversarial strategies that exploit model internals.
  • Inference-time controls could provide a scalable defense for open-weight deployments.
  • The work suggests jailbreak signals are detectable in latent space across model families.

What to do

  • Evaluate internal-activation defenses if you operate open-weight models.
  • Test across attack distributions to confirm robustness beyond a single benchmark.
  • Combine latent checks with policy gating for higher-risk actions.

Links