arXiv — Jailbreaking LLMs & VLMs: mechanisms and unified defenses
AI relevance: The survey maps jailbreak attack/defense mechanics across text and multimodal models, giving AI security teams a unified playbook for testing and mitigation.
- It frames jailbreaks as structural vulnerabilities tied to incomplete training data, linguistic ambiguity, and generative uncertainty.
- The authors distinguish hallucinations from jailbreaks by intent and triggering mechanism, so that reliability failures are not conflated with adversarial control.
- Attack taxonomy spans template/encoding tricks, in-context manipulation, RL/adversarial training attacks, and LLM-assisted/fine-tuned jailbreaks.
- For VLMs, the survey adds image-level perturbations and agent-based transfer attacks beyond text prompts.
- Defense taxonomy clusters into prompt obfuscation, output evaluation, and model-level alignment/fine-tuning strategies.
- Evaluation metrics include attack success rate (ASR), toxicity, and query/time cost, plus clean accuracy and attribute success rate for multimodal models.
- The proposed unified defense combines variant-consistency and gradient-sensitivity checks at the perception stage with safety-aware decoding at the generation stage.
- It calls for automated red-teaming, cross-modal defense, and standardized benchmarks to make results comparable.
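As a rough illustration of the variant-consistency idea mentioned above: this summary does not spell out the survey's exact procedure, so the sketch below is one possible reading, not the authors' method. It rewrites a prompt in a few surface-level ways and measures how often a safety classifier gives the same verdict across the rewrites; disagreement suggests the verdict depends on surface form and the input deserves a closer look. The names make_variants, variant_consistency, should_escalate, and the is_unsafe callback are all hypothetical.

```python
from typing import Callable, List


def make_variants(prompt: str) -> List[str]:
    """Hypothetical surface-level rewrites of a prompt (illustrative only)."""
    return [
        prompt,
        prompt.lower(),                      # case change
        " ".join(prompt.split()),            # collapse whitespace
        prompt.replace("\n", " ").strip(),   # flatten line breaks
    ]


def variant_consistency(prompt: str, is_unsafe: Callable[[str], bool]) -> float:
    """Fraction of variants whose verdict matches the majority verdict."""
    verdicts = [is_unsafe(v) for v in make_variants(prompt)]
    majority = sum(verdicts) * 2 >= len(verdicts)  # ties count as unsafe
    return sum(v == majority for v in verdicts) / len(verdicts)


def should_escalate(prompt: str, is_unsafe: Callable[[str], bool],
                    threshold: float = 1.0) -> bool:
    """Flag the prompt for further review when verdicts disagree (assumed policy)."""
    return variant_consistency(prompt, is_unsafe) < threshold
```

In practice is_unsafe would wrap whatever moderation model or rule set a team already runs, and the variant set would include stronger rewrites (paraphrase, translation, encoding changes) than the trivial ones shown here.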
Why it matters
- Security teams need a single framework that applies to both text agents and multimodal assistants.
- Without standardized metrics, jailbreak defenses are hard to compare or validate in production.
- Multimodal jailbreaks expand the attack surface for agents that ingest screenshots, video, or camera feeds.
What to do
- Adopt layered testing: evaluate prompt, output, and model-layer defenses separately.
- Track metrics consistently: monitor ASR, toxicity, and cost to compare defenses over time.
- Include VLM scenarios: add image/vision prompts to red-team suites if your agents see pixels.
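The checklist above can be sketched as a small reporting helper. This is a minimal example under stated assumptions: the Trial record, its field names, and the layer and modality labels are hypothetical, and the toxicity score is assumed to come from whatever scorer a team already uses. ASR is computed in the usual way, as successful attacks divided by attempts, per (layer, modality) group.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class Trial:
    layer: str       # defense under test, e.g. "prompt", "output", "model" (assumed labels)
    modality: str    # "text" or "image"
    success: bool    # True if the attack elicited the disallowed behavior
    toxicity: float  # score in [0, 1] from an external scorer (assumed)
    queries: int     # model calls spent on this attempt


def summarize(trials: Iterable[Trial]) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Aggregate ASR, mean toxicity, and mean query cost per (layer, modality)."""
    groups = defaultdict(list)
    for t in trials:
        groups[(t.layer, t.modality)].append(t)

    report = {}
    for key, ts in groups.items():
        n = len(ts)
        report[key] = {
            "n": n,
            "asr": sum(t.success for t in ts) / n,  # attack success rate
            "mean_toxicity": sum(t.toxicity for t in ts) / n,
            "mean_queries": sum(t.queries for t in ts) / n,
        }
    return report
```

Keeping text and image trials in separate groups makes it visible when a defense that holds against text prompts degrades once the agent sees pixels.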