vLLM — Speculative decoding cache poisoning (CVE-2026-3108)

AI relevance: A core vulnerability in the vLLM inference engine's performance optimization layer allows attackers to leak data from concurrent user sessions via KV cache pollution.

  • CVE-2026-3108 affects vLLM deployments using speculative decoding.
  • The vulnerability stems from improper isolation of the Key-Value (KV) cache when multiple request sequences share prefix tokens.
  • Attackers can craft specific prompts to "poison" the shared cache, leading the draft model to leak snippets of other users' session data into the speculative verification step.
  • This represents a significant risk for multi-tenant AI providers and enterprise RAG deployments.
  • The issue is addressed in vLLM v0.7.2 and later.
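The failure mode described above can be illustrated with a toy model. This is not vLLM's actual block-manager code, just a minimal sketch of why a KV cache keyed only by prefix tokens, and shared across tenants, lets one session observe state written by another; the class and names here are invented for illustration.

```python
class SharedPrefixCache:
    """Toy prefix-keyed KV cache shared across tenants (illustrative only)."""

    def __init__(self):
        # Maps a tuple of prefix tokens -> (tenant that wrote it, cached state)
        self._blocks = {}

    def lookup_or_compute(self, tenant, prompt_tokens, compute):
        key = tuple(prompt_tokens)
        if key in self._blocks:
            owner, state = self._blocks[key]
            # The flaw being illustrated: the cache hit ignores which tenant
            # produced the entry, so another session's state is served back.
            return state, owner != tenant  # (state, cross_tenant_hit)
        state = compute(prompt_tokens)
        self._blocks[key] = (tenant, state)
        return state, False


# Tenant A populates the cache; an attacker in tenant B replays the same
# prefix and receives A's cached state without recomputing it.
cache = SharedPrefixCache()
_, cross_a = cache.lookup_or_compute("tenant-A", [101, 7, 42], lambda t: sum(t))
state_b, cross_b = cache.lookup_or_compute("tenant-B", [101, 7, 42], lambda t: 0)
```

Even when the cached state itself is not directly readable, the hit/miss difference gives an attacker a probe: by guessing prefixes and observing latency, they can test whether another user's prompt begins with those tokens.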

Why it matters

  • Performance optimizations like speculative decoding are critical for scaling LLM inference but introduce complex side-channel risks.
  • This CVE highlights the difficulty of maintaining "hard" security boundaries in dynamic, GPU-memory-constrained environments where cache reuse is a primary efficiency driver.

What to do

  • Upgrade: Immediately update to vLLM v0.7.2+.
  • Disable Speculation: If an upgrade isn't immediately possible, disable speculative decoding entirely; `--use-v2-block-manager` may offer partial mitigation, but explicitly turning speculation off is the safer stopgap.
  • Audit Multi-tenancy: Ensure your orchestration layer (e.g., Triton, Ray) is not implicitly sharing inference state across different security contexts.
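As part of the upgrade step, it helps to verify programmatically that every deployed instance meets the patched floor. A minimal stdlib-only sketch follows; `is_patched` and its parsing are assumptions of this example (a production check would typically use `packaging.version` against `importlib.metadata.version("vllm")`, and this sketch ignores pre-release suffixes).

```python
def is_patched(version: str, floor: tuple = (0, 7, 2)) -> bool:
    """Return True if a dotted version string meets the patched floor.

    `floor` defaults to v0.7.2, the first fixed release per the advisory.
    Non-numeric suffixes (e.g. "rc1") are stripped, which is deliberately
    conservative for this sketch.
    """
    parts = []
    for piece in version.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    # Tuple comparison: a missing component compares as shorter/older,
    # so "0.7" is treated as below the "0.7.2" floor.
    return tuple(parts) >= floor
```

In a deployment audit this check would run against the version string each serving node reports, flagging any node still below v0.7.2.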

Sources