arXiv — Morality Attacks Jailbreak Both LLMs and Guardrail Models

AI relevance: A new jailbreak class targets the moral-value alignment layer of LLMs rather than syntactic prompt tricks, showing that both frontier models and guardrail systems used in production deployments are vulnerable to moral-framing manipulation.

  • Ying Su et al. published "Jailbreaking Large Language Models with Morality Attacks" (arXiv:2604.17053), accepted at ACL 2026 Findings.
  • The researchers constructed a 10,300-instance morality dataset split across two attack surfaces: Value Ambiguity (situations where moral judgment is contextually unclear) and Value Conflict (situations where competing moral values produce contradictory alignment signals).
  • Four adversarial attack strategies were formalized to manipulate how LLMs render moral judgments on ambiguous or conflicting questions.
  • The attacks were evaluated against both frontier LLMs and guardrail models — the latter being the safety filters deployed in production generative systems with flexible user input.
  • Results show a "critical vulnerability" in both model classes to these moral-aware attacks, meaning guardrails alone do not protect against this jailbreak class.
  • The paper positions this as a research tool for studying LLM internal pluralistic values, but the attack mechanics are directly applicable as a jailbreak method against aligned systems.

Why it matters

Most jailbreak research focuses on syntactic tricks — roleplay framing, encoding obfuscation, or delimiter injection. Morality attacks target the semantic alignment layer: the very values that post-training alignment (SFT, RLHF, DPO) instills in the model. If a model's moral reasoning can be manipulated through carefully constructed ambiguous or conflicting scenarios, then guardrails built on the same alignment foundation inherit the same weakness. This is particularly relevant for systems that use separate guardrail models as their safety layer — the paper shows these guardrails are themselves susceptible.

What to do

  • Test guardrail models independently. Don't assume a separate guardrail layer protects against all jailbreak classes. Red-team guardrails with moral-framing and value-ambiguity inputs, not just known prompt-injection patterns.
  • Evaluate pluralism alignment robustness. If your application serves diverse user populations across cultural contexts, moral-value attacks may be more effective than traditional jailbreaks. Include value-ambiguity scenarios in your safety evaluation pipeline.
  • Consider defense-in-depth. Layered safety — combining input filtering, output validation, and behavioral monitoring — is more resilient than relying on any single alignment or guardrail mechanism.

Sources