arXiv — Jailbreaking LLMs & VLMs: mechanisms and unified defenses

AI relevance: The survey maps jailbreak attack/defense mechanics across text and multimodal models, giving AI security teams a unified playbook for testing and mitigation.

  • It frames jailbreaks as structural vulnerabilities tied to incomplete training data, linguistic ambiguity, and generative uncertainty.
  • The authors distinguish hallucinations from jailbreaks by intent and triggering mechanism, so that reliability failures are not conflated with adversarial control.
  • Attack taxonomy spans template/encoding tricks, in-context manipulation, RL/adversarial training attacks, and LLM-assisted/fine-tuned jailbreaks.
  • For VLMs, the survey adds image-level perturbations and agent-based transfer attacks beyond text prompts.
  • Defense taxonomy clusters into prompt obfuscation, output evaluation, and model-level alignment/fine-tuning strategies.
  • Evaluation metrics include attack success rate, toxicity, query/time cost, plus multimodal clean accuracy and attribute success rate.
  • The proposed unified defense is layered: variant-consistency and gradient-sensitivity checks at the perception stage, followed by safety-aware decoding at the generation stage.
  • It calls for automated red-teaming, cross-modal defense, and standardized benchmarks to make results comparable.
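
To make the variant-consistency idea more concrete, here is a minimal sketch. It is not the survey's implementation: it assumes that a check of this kind compares a safety decision across surface-level rewrites of the same prompt and flags disagreement. The names make_variants, variant_consistency, and the is_refused callback are hypothetical, and the rewrites shown are deliberately trivial.

```python
from typing import Callable, Dict, List


def make_variants(prompt: str) -> List[str]:
    """Return cheap surface rewrites of a prompt (illustrative, not from the survey)."""
    return [
        prompt,
        prompt.lower(),                 # case-folded variant
        " ".join(prompt.split()),       # whitespace-collapsed variant
        prompt.rstrip(".!? ") + ".",    # normalized trailing punctuation
    ]


def variant_consistency(
    prompt: str,
    is_refused: Callable[[str], bool],  # hypothetical safety decision for one input
    min_agreement: float = 1.0,
) -> Dict[str, object]:
    """Flag a prompt when the safety decision differs across its variants."""
    decisions = [is_refused(v) for v in make_variants(prompt)]
    refused = sum(decisions)
    # Fraction of variants that share the majority decision.
    agreement = max(refused, len(decisions) - refused) / len(decisions)
    return {
        "decisions": decisions,
        "agreement": agreement,
        "suspicious": agreement < min_agreement,
    }
```

In practice is_refused would wrap a real model or guard classifier, and the variants would be paraphrases or, for VLMs, lightly transformed images. The assumption behind the sketch is that a prompt whose safety decision flips under harmless rewrites deserves closer inspection.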

Why it matters

  • Security teams need a single framework that applies to both text agents and multimodal assistants.
  • Without standardized metrics, jailbreak defenses are hard to compare or validate in production.
  • Multimodal jailbreaks expand the attack surface for agents that ingest screenshots, video, or camera feeds.

What to do

  • Adopt layered testing: evaluate prompt, output, and model-layer defenses separately.
  • Track metrics consistently: monitor attack success rate (ASR), toxicity, and query/time cost to compare defenses over time.
  • Include VLM scenarios: add image/vision prompts to red-team suites if your agents see pixels.
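
The tracking step above can be sketched as a small aggregation over red-team trial records. This is an illustrative layout rather than a format the survey prescribes: the Trial fields and the summarize function are assumptions, and toxicity scoring is left out because it depends on whichever classifier a team uses.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Trial:
    defense: str     # defense layer under test, e.g. "prompt", "output", "model"
    modality: str    # "text" or "image"
    succeeded: bool  # True if the attack bypassed the defense
    queries: int     # number of model queries the attack used


def summarize(trials: Iterable[Trial]) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Compute ASR and mean query cost per (defense, modality) group."""
    groups: Dict[Tuple[str, str], List[Trial]] = defaultdict(list)
    for t in trials:
        groups[(t.defense, t.modality)].append(t)

    summary: Dict[Tuple[str, str], Dict[str, float]] = {}
    for key, ts in groups.items():
        n = len(ts)
        summary[key] = {
            "n": n,
            "asr": sum(t.succeeded for t in ts) / n,          # attack success rate
            "avg_queries": sum(t.queries for t in ts) / n,    # mean query cost
        }
    return summary
```

Grouping by defense layer and modality keeps the prompt, output, and model layers separate and reports image-based attacks alongside text ones, so the same report can be regenerated after each defense change.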

Sources