arXiv — Jailbreaking leaves a trace via latent representations

2026-02-17 • Category: Research

The paper studies jailbreaks by inspecting internal representations rather than relying only on prompt-level defenses.
The authors analyze layer-wise activations across GPT-J, LLaMA, Mistral, and Mamba and report consistent latent patterns.
They propose a tensor-based latent representation framework for lightweight jailbreak detection.
The method does not require fine-tuning or auxiliary LLM-based detectors.
Latent signals can be used to disrupt jailbreak execution at inference time.
On an abliterated LLaMA-3.1-8B model, selectively bypassing high-susceptibility layers blocked 78% of jailbreak attempts while preserving 94% of benign behavior.
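
To make the layer-level idea concrete, here is a minimal sketch in Python with NumPy. It is not the paper's tensor-based framework: it swaps in a much simpler stand-in, a per-layer mean-separation score, plus a toy residual stack to show what bypassing a layer means. The function names (`layer_susceptibility`, `forward_with_bypass`), the array layout, and the scoring rule are all assumptions made for illustration.

```python
import numpy as np


def layer_susceptibility(benign, jailbreak):
    """Score each layer by how far apart the two prompt classes sit.

    Both inputs are assumed to have shape (n_prompts, n_layers, hidden_dim).
    The score is the distance between class means divided by a pooled
    standard deviation. This is a simple stand-in, not the paper's method.
    """
    # Distance between class means, one value per layer.
    gap = np.linalg.norm(jailbreak.mean(axis=0) - benign.mean(axis=0), axis=1)
    # Pooled per-dimension variance, shape (n_layers, hidden_dim).
    pooled_var = 0.5 * (benign.var(axis=0) + jailbreak.var(axis=0))
    spread = np.sqrt(pooled_var.mean(axis=1))
    return gap / (spread + 1e-8)


def forward_with_bypass(x, layers, bypass=frozenset()):
    """Run a toy residual stack, skipping the layers listed in `bypass`.

    A bypassed layer contributes nothing, so the residual stream passes
    through it unchanged.
    """
    for index, layer in enumerate(layers):
        if index in bypass:
            continue
        x = x + layer(x)
    return x
```

In a real deployment the activations would come from hooks on the model's forward pass, and the layers to bypass would be those with the highest scores. How many layers to bypass is a tuning choice this sketch does not settle.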

Why it matters

Prompt filters alone miss adversarial strategies that exploit model internals.
Inference-time controls could provide a scalable defense for open-weight deployments.
The work suggests jailbreak signals are detectable in latent space across model families.

Evaluate internal-activation defenses if you operate open-weight models.
Test across attack distributions to confirm robustness beyond a single benchmark.
Combine latent checks with policy gating for higher-risk actions.
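
The last two recommendations can be sketched in a few lines of Python. This is a hedged illustration, not something the paper provides: the risk tiers, the threshold values, and the names `gate` and `block_rate_by_family` are hypothetical, and the latent score is assumed to be a number in [0, 1] where higher means more jailbreak-like.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    allow: bool
    reason: str


# Hypothetical thresholds: higher-risk actions are blocked at lower scores.
THRESHOLDS = {"low": 0.9, "medium": 0.6, "high": 0.3}


def gate(latent_score, risk_tier):
    """Combine a latent jailbreak score with a per-action risk tier."""
    threshold = THRESHOLDS[risk_tier]
    if latent_score >= threshold:
        return GateDecision(False, f"score {latent_score:.2f} >= {threshold} for {risk_tier}-risk action")
    return GateDecision(True, "below threshold")


def block_rate_by_family(results):
    """Compute the block rate separately for each attack family.

    `results` is an iterable of (attack_family, was_blocked) pairs.
    """
    totals = defaultdict(int)
    blocked = defaultdict(int)
    for family, was_blocked in results:
        totals[family] += 1
        blocked[family] += int(was_blocked)
    return {family: blocked[family] / totals[family] for family in totals}
```

Reporting block rates per attack family, rather than one aggregate number, makes it easier to spot a defense that works on one benchmark's attack style and fails on another.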