arXiv — Jailbreaking leaves a trace via latent representations
• Category: Research
- The paper studies jailbreaks by inspecting internal representations rather than only prompt-level defenses.
- Authors analyze layer-wise activations across GPT-J, LLaMA, Mistral, and Mamba to find consistent latent patterns.
- They propose a tensor-based latent representation framework for lightweight jailbreak detection.
- The method does not require fine-tuning or auxiliary LLM-based detectors.
- Latent signals can be used to disrupt jailbreak execution at inference time.
- On an abliterated LLaMA-3.1-8B model, selectively bypassing high-susceptibility layers blocked 78% of jailbreak attempts while preserving 94% benign behavior.
Why it matters
- Prompt filters alone miss adversarial strategies that exploit model internals.
- Inference-time controls could provide a scalable defense for open-weight deployments.
- The work suggests jailbreak signals are detectable in latent space across model families.
What to do
- Evaluate internal-activation defenses if you operate open-weight models.
- Test across attack distributions to confirm robustness beyond a single benchmark.
- Combine latent checks with policy gating for higher-risk actions.