Trend Micro — Sockpuppeting: Single-Line Jailbreak for 11 Major AI Models

AI relevance: This vulnerability affects core AI safety mechanisms across major commercial and open-source LLMs, exposing fundamental weaknesses in how AI systems handle message validation and prefill features.

Trend Micro researchers have disclosed a novel jailbreak technique called "sockpuppeting" that allows attackers to bypass safety guardrails in 11 major large language models using just a single line of code.

Key Findings

  • Single-line exploit: Attackers abuse assistant prefill APIs to inject compliant response prefixes
  • Widespread impact: Affects 11 major LLMs including OpenAI ChatGPT, Anthropic Claude, Google Gemini, Meta Llama, and Mistral models
  • High success rates: Gemini 2.5 Flash most vulnerable at 15.7% success rate
  • Black-box technique: Requires no optimization and no access to model weights
  • API-level vulnerability: Exploits legitimate prefill features meant for response formatting

How Sockpuppeting Works

The attack exploits assistant prefill, a legitimate API feature that developers use to force specific response formats. Attackers abuse this by injecting a compliant prefix (e.g., "Sure, here is how to do it,") directly into the assistant's role message, tricking the model into continuing prohibited content.

Affected Platforms

  • Self-hosted inference servers: Ollama, vLLM, and other platforms that don't enforce message validation by default
  • Commercial APIs: OpenAI, Anthropic, Google, and other major providers
  • Open-source models: Llama, Mistral, and other popular open-weight models

Why This Matters

Sockpuppeting represents a significant shift in AI security threats:

  • Moves beyond traditional prompt injection to API-level exploitation
  • Demonstrates how legitimate features can be weaponized
  • Highlights the need for better message validation in AI infrastructure
  • Shows that black-box attacks can be highly effective without model access

What to Do

  • API providers: Implement strict message validation and role checking
  • Self-hosted deployments: Manually enforce message ordering and validation
  • Developers: Audit prefill usage and implement input sanitization
  • Security teams: Add sockpuppeting to AI red teaming exercises

Sources