arXiv/EACL — PHISH: persona jailbreaking via implicit steering in chat history

2026-01-30 • Category: Research

Paper: “Persona Jailbreaking in Large Language Models” (accepted at EACL 2026 Findings).
Claim: you can predictably shift an LLM’s induced persona using adversarial conversational history (black-box / inference-only) — without needing system prompt access.
Technique: the authors propose PHISH (Persona Hijacking via Implicit Steering in History): semantically loaded cues embedded into user-side turns to gradually induce “reverse personas.”
Multi-turn amplification: the effect reportedly strengthens over longer conversations (which mirrors real deployments in support/tutoring/therapy-like apps).
Collateral drift: persona shifts also trigger changes in correlated traits (so “just” manipulating tone can bleed into policy adherence, risk posture, etc.).
High-risk domains tested: the paper highlights mental health, tutoring, and customer support settings.
Guardrails aren’t enough: authors say current defenses offer partial protection but are brittle under sustained, subtle steering.

Why it matters

Persona is a security boundary now: if your product relies on a “safe, compliant, calm” assistant persona, the chat history itself becomes an attack surface.
Long-lived sessions are risky: the more context you retain, the more an attacker can shape it — which is exactly what many “agentic” apps want for UX.
Hard-to-detect failure mode: gradual drift can look like natural variation until the assistant crosses a threshold at the worst time (sensitive advice, approvals, escalations).

Decide what “persona invariants” are: write explicit requirements (e.g., never provide self-harm instructions; always recommend a human escalation path; never claim to be a licensed professional).
Add drift detection: periodically re-anchor the assistant with short “self-check” probes or classifiers to detect when the assistant’s behavior diverges from intended persona.
Constrain memory: don’t blindly persist everything; segment user content vs. operator instructions; consider expiring or summarizing older turns with a safety-preserving pipeline.
Test for history-based steering: add regression suites that attempt subtle multi-turn manipulation (not just single-turn jailbreak prompts).
Operationalize escalation: when drift is detected, degrade capabilities (fewer tools, no actions) and route to human review.