Trend Micro — Sockpuppeting: Single-Line Jailbreak for 11 Major AI Models

2026-04-12 Security

AI relevance: This vulnerability affects core AI safety mechanisms across major commercial and open-source LLMs, exposing fundamental weaknesses in how AI systems handle message validation and prefill features.

Trend Micro researchers have disclosed a novel jailbreak technique called "sockpuppeting" that allows attackers to bypass safety guardrails in 11 major large language models using just a single line of code.

Key Findings

Single-line exploit: Attackers abuse assistant prefill APIs to inject compliant response prefixes
Widespread impact: Affects 11 major LLMs including OpenAI ChatGPT, Anthropic Claude, Google Gemini, Meta Llama, and Mistral models
High success rates: Gemini 2.5 Flash most vulnerable at 15.7% success rate
Black-box technique: Requires no optimization and no access to model weights
API-level vulnerability: Exploits legitimate prefill features meant for response formatting

How Sockpuppeting Works

The attack exploits assistant prefill, a legitimate API feature that developers use to force specific response formats. Attackers abuse this by injecting a compliant prefix (e.g., "Sure, here is how to do it,") directly into the assistant's role message, tricking the model into continuing prohibited content.

Affected Platforms

Self-hosted inference servers: Ollama, vLLM, and other platforms that don't enforce message validation by default
Commercial APIs: OpenAI, Anthropic, Google, and other major providers
Open-source models: Llama, Mistral, and other popular open-weight models

Why This Matters

Sockpuppeting represents a significant shift in AI security threats:

Moves beyond traditional prompt injection to API-level exploitation
Demonstrates how legitimate features can be weaponized
Highlights the need for better message validation in AI infrastructure
Shows that black-box attacks can be highly effective without model access

What to Do

API providers: Implement strict message validation and role checking
Self-hosted deployments: Manually enforce message ordering and validation
Developers: Audit prefill usage and implement input sanitization
Security teams: Add sockpuppeting to AI red teaming exercises