Trend Micro — Sockpuppeting: Single-Line Jailbreak for 11 Major AI Models
AI relevance: This vulnerability affects core AI safety mechanisms across major commercial and open-source LLMs, exposing fundamental weaknesses in how AI systems handle message validation and prefill features.
Trend Micro researchers have disclosed a novel jailbreak technique called "sockpuppeting" that allows attackers to bypass safety guardrails in 11 major large language models using just a single line of code.
Key Findings
- Single-line exploit: Attackers abuse assistant prefill APIs to inject compliant response prefixes
- Widespread impact: Affects 11 major LLMs including OpenAI ChatGPT, Anthropic Claude, Google Gemini, Meta Llama, and Mistral models
- High success rates: Gemini 2.5 Flash most vulnerable at 15.7% success rate
- Black-box technique: Requires no optimization and no access to model weights
- API-level vulnerability: Exploits legitimate prefill features meant for response formatting
How Sockpuppeting Works
The attack exploits assistant prefill, a legitimate API feature that developers use to force specific response formats. Attackers abuse this by injecting a compliant prefix (e.g., "Sure, here is how to do it,") directly into the assistant's role message, tricking the model into continuing prohibited content.
Affected Platforms
- Self-hosted inference servers: Ollama, vLLM, and other platforms that don't enforce message validation by default
- Commercial APIs: OpenAI, Anthropic, Google, and other major providers
- Open-source models: Llama, Mistral, and other popular open-weight models
Why This Matters
Sockpuppeting represents a significant shift in AI security threats:
- Moves beyond traditional prompt injection to API-level exploitation
- Demonstrates how legitimate features can be weaponized
- Highlights the need for better message validation in AI infrastructure
- Shows that black-box attacks can be highly effective without model access
What to Do
- API providers: Implement strict message validation and role checking
- Self-hosted deployments: Manually enforce message ordering and validation
- Developers: Audit prefill usage and implement input sanitization
- Security teams: Add sockpuppeting to AI red teaming exercises