vLLM — Speculative decoding cache poisoning (CVE-2026-3108)
AI relevance: A vulnerability in the vLLM inference engine's performance-optimization layer allows attackers to leak data from concurrent user sessions by polluting the shared KV cache.
- CVE-2026-3108 affects vLLM deployments using speculative decoding.
- The vulnerability stems from improper isolation of the Key-Value (KV) cache when multiple request sequences share prefix tokens.
- Attackers can craft specific prompts to "poison" the shared cache, causing the draft model to surface snippets of other users' session data during the speculative verification step.
- This represents a significant risk for multi-tenant AI providers and enterprise RAG deployments.
- The issue is addressed in vLLM v0.7.2 and later.
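The failure mode described above can be illustrated with a toy prefix cache. This is a minimal sketch, not vLLM's actual implementation, and every class and name here is hypothetical: if cache blocks are keyed only by the token prefix, a second tenant whose prompt shares that prefix hits blocks computed from another tenant's session; scoping the key by tenant restores isolation.

```python
# Toy illustration of cross-tenant KV-cache reuse (NOT vLLM internals).
# A cache keyed only by token prefix lets tenant B hit a block that was
# computed while serving tenant A.

class NaivePrefixCache:
    def __init__(self):
        self._blocks = {}  # cache key -> cached KV block (here: a stand-in string)

    def lookup_or_compute(self, tenant, prefix):
        key = tuple(prefix)  # BUG: the tenant is not part of the key
        if key not in self._blocks:
            self._blocks[key] = f"kv-block(owner={tenant})"
        return self._blocks[key]

class IsolatedPrefixCache(NaivePrefixCache):
    def lookup_or_compute(self, tenant, prefix):
        key = (tenant, tuple(prefix))  # FIX: scope cache entries per tenant
        if key not in self._blocks:
            self._blocks[key] = f"kv-block(owner={tenant})"
        return self._blocks[key]

naive = NaivePrefixCache()
naive.lookup_or_compute("tenant-A", [1, 2, 3])
leaked = naive.lookup_or_compute("tenant-B", [1, 2, 3])
print(leaked)  # → kv-block(owner=tenant-A)  — B is served A's block

isolated = IsolatedPrefixCache()
isolated.lookup_or_compute("tenant-A", [1, 2, 3])
own = isolated.lookup_or_compute("tenant-B", [1, 2, 3])
print(own)  # → kv-block(owner=tenant-B)  — each tenant gets its own block
```

The same shape of bug applies to speculative decoding: the draft model's proposals are verified against cached state, so any cross-tenant block reuse at that step can expose another session's data.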
Why it matters
- Performance optimizations like speculative decoding are critical for scaling LLM inference but introduce complex side-channel risks.
- This CVE highlights the difficulty of maintaining "hard" security boundaries in dynamic, GPU-memory-constrained environments where cache reuse is a primary efficiency driver.
What to do
- Upgrade: Immediately update to vLLM v0.7.2+.
- Disable Speculation: If an upgrade isn't immediately possible, disable speculative decoding (`--use-v2-block-manager` may offer partial mitigation, but explicit disabling is safer).
- Audit Multi-tenancy: Ensure your orchestration layer (e.g., Triton, Ray) is not implicitly sharing inference state across different security contexts.
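One way to make the multi-tenancy audit concrete is a pre-dispatch check that no inference-state namespace is claimed by more than one security context. The request schema below is hypothetical (the field names do not come from Triton, Ray, or vLLM); it is a sketch of the kind of invariant an orchestration layer could enforce.

```python
# Hypothetical audit helper: flag any cache/state namespace shared across
# security contexts before a batch reaches the inference engine.
from collections import defaultdict

def audit_batch(requests):
    """requests: iterable of dicts with 'security_context' and 'cache_namespace'.
    Returns the namespaces claimed by more than one security context."""
    owners = defaultdict(set)
    for req in requests:
        owners[req["cache_namespace"]].add(req["security_context"])
    return [ns for ns, ctxs in owners.items() if len(ctxs) > 1]

batch = [
    {"security_context": "tenant-A", "cache_namespace": "shared-0"},
    {"security_context": "tenant-B", "cache_namespace": "shared-0"},  # violation
    {"security_context": "tenant-B", "cache_namespace": "b-1"},
]
print(audit_batch(batch))  # → ['shared-0']
```

Rejecting (or re-namespacing) flagged batches turns the "no implicit state sharing" requirement into an enforced invariant rather than a convention.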