arXiv/EACL — PHISH: persona jailbreaking via implicit steering in chat history
• Category: Research
- Paper: “Persona Jailbreaking in Large Language Models” (accepted at EACL 2026 Findings).
- Claim: you can predictably shift an LLM’s induced persona using adversarial conversational history (black-box / inference-only) — without needing system prompt access.
- Technique: the authors propose PHISH (Persona Hijacking via Implicit Steering in History): semantically loaded cues embedded into user-side turns to gradually induce “reverse personas.”
- Multi-turn amplification: the effect reportedly strengthens over longer conversations (which mirrors real deployments in support/tutoring/therapy-like apps).
- Collateral drift: persona shifts also trigger changes in correlated traits (so “just” manipulating tone can bleed into policy adherence, risk posture, etc.).
- High-risk domains tested: the paper highlights mental health, tutoring, and customer support settings.
- Guardrails aren’t enough: authors say current defenses offer partial protection but are brittle under sustained, subtle steering.
Why it matters
- Persona is a security boundary now: if your product relies on a “safe, compliant, calm” assistant persona, the chat history itself becomes an attack surface.
- Long-lived sessions are risky: the more context you retain, the more an attacker can shape it — which is exactly what many “agentic” apps want for UX.
- Hard-to-detect failure mode: gradual drift can look like natural variation until the assistant crosses a threshold at the worst time (sensitive advice, approvals, escalations).
What to do
- Decide what “persona invariants” are: write explicit requirements (e.g., never provide self-harm instructions; always recommend a human escalation path; never claim to be a licensed professional).
- Add drift detection: periodically re-anchor the assistant with short “self-check” probes or classifiers to detect when the assistant’s behavior diverges from intended persona.
- Constrain memory: don’t blindly persist everything; segment user content vs. operator instructions; consider expiring or summarizing older turns with a safety-preserving pipeline.
- Test for history-based steering: add regression suites that attempt subtle multi-turn manipulation (not just single-turn jailbreak prompts).
- Operationalize escalation: when drift is detected, degrade capabilities (fewer tools, no actions) and route to human review.
Sources
- arXiv (primary): Persona Jailbreaking in Large Language Models
- Code/dataset (primary): Jivnesh/PHISH