arXiv — Jailbreaking LLMs & VLMs: mechanisms and unified defenses

2026-03-02 Research by al-ice.ai Editorial

AI relevance: The survey maps jailbreak attack/defense mechanics across text and multimodal models, giving AI security teams a unified playbook for testing and mitigation.

It frames jailbreaks as structural vulnerabilities tied to incomplete training data, linguistic ambiguity, and generative uncertainty.
The authors separate hallucinations vs. jailbreaks by intent and triggering mechanism, avoiding conflating reliability with adversarial control.
Attack taxonomy spans template/encoding tricks, in-context manipulation, RL/adversarial training attacks, and LLM-assisted/fine-tuned jailbreaks.
For VLMs, the survey adds image-level perturbations and agent-based transfer attacks beyond text prompts.
Defense taxonomy clusters into prompt obfuscation, output evaluation, and model-level alignment/fine-tuning strategies.
Evaluation metrics include attack success rate, toxicity, query/time cost, plus multimodal clean accuracy and attribute success rate.
The proposed unified defense stacks variant-consistency and gradient-sensitivity checks (perception) with safety-aware decoding (generation).
It calls for automated red-teaming, cross-modal defense, and standardized benchmarks to make results comparable.

Why it matters

Security teams need a single framework that applies to both text agents and multimodal assistants.
Without standardized metrics, jailbreak defenses are hard to compare or validate in production.
Multimodal jailbreaks expand the attack surface for agents that ingest screenshots, video, or camera feeds.

What to do

Adopt layered testing: evaluate prompt, output, and model-layer defenses separately.
Track metrics consistently: monitor ASR, toxicity, and cost to compare defenses over time.
Include VLM scenarios: add image/vision prompts to red-team suites if your agents see pixels.

arXiv — Jailbreaking LLMs & VLMs: mechanisms and unified defenses

Why it matters

What to do

Sources