arXiv/EACL — PHISH: persona jailbreaking via implicit steering in chat history

• Category: Research

  • Paper:Persona Jailbreaking in Large Language Models” (accepted at EACL 2026 Findings).
  • Claim: you can predictably shift an LLM’s induced persona using adversarial conversational history (black-box / inference-only) — without needing system prompt access.
  • Technique: the authors propose PHISH (Persona Hijacking via Implicit Steering in History): semantically loaded cues embedded into user-side turns to gradually induce “reverse personas.”
  • Multi-turn amplification: the effect reportedly strengthens over longer conversations (which mirrors real deployments in support/tutoring/therapy-like apps).
  • Collateral drift: persona shifts also trigger changes in correlated traits (so “just” manipulating tone can bleed into policy adherence, risk posture, etc.).
  • High-risk domains tested: the paper highlights mental health, tutoring, and customer support settings.
  • Guardrails aren’t enough: authors say current defenses offer partial protection but are brittle under sustained, subtle steering.

Why it matters

  • Persona is a security boundary now: if your product relies on a “safe, compliant, calm” assistant persona, the chat history itself becomes an attack surface.
  • Long-lived sessions are risky: the more context you retain, the more an attacker can shape it — which is exactly what many “agentic” apps want for UX.
  • Hard-to-detect failure mode: gradual drift can look like natural variation until the assistant crosses a threshold at the worst time (sensitive advice, approvals, escalations).

What to do

  1. Decide what “persona invariants” are: write explicit requirements (e.g., never provide self-harm instructions; always recommend a human escalation path; never claim to be a licensed professional).
  2. Add drift detection: periodically re-anchor the assistant with short “self-check” probes or classifiers to detect when the assistant’s behavior diverges from intended persona.
  3. Constrain memory: don’t blindly persist everything; segment user content vs. operator instructions; consider expiring or summarizing older turns with a safety-preserving pipeline.
  4. Test for history-based steering: add regression suites that attempt subtle multi-turn manipulation (not just single-turn jailbreak prompts).
  5. Operationalize escalation: when drift is detected, degrade capabilities (fewer tools, no actions) and route to human review.

Sources