vLLM — Speculative decoding cache poisoning (CVE-2026-3108)
AI relevance: A vulnerability in the vLLM inference engine's performance-optimization layer allows attackers to leak data from concurrent user sessions by polluting the shared KV cache.
- CVE-2026-3108 affects vLLM deployments using speculative decoding.
- The vulnerability stems from improper isolation of the Key-Value (KV) cache when multiple request sequences share prefix tokens.
- Attackers can craft specific prompts to "poison" the shared cache, causing the draft model to surface snippets of other users' session data during the speculative verification step.
- This represents a significant risk for multi-tenant AI providers and enterprise RAG deployments.
- The issue is addressed in vLLM v0.7.2 and later.
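The failure mode described above can be illustrated with a toy prefix cache. This is a minimal sketch, not vLLM's actual implementation, and every class and name here is hypothetical: if cache blocks are keyed only by the token prefix, a second tenant whose prompt shares that prefix hits blocks computed from another tenant's session; scoping the key by tenant restores isolation.

```python
# Toy illustration of cross-tenant KV-cache reuse (NOT vLLM internals).
# A cache keyed only by token prefix lets tenant B hit a block that was
# computed while serving tenant A.

class NaivePrefixCache:
    def __init__(self):
        self._blocks = {}  # cache key -> cached KV block (here: a stand-in string)

    def lookup_or_compute(self, tenant, prefix):
        key = tuple(prefix)  # BUG: the tenant is not part of the key
        if key not in self._blocks:
            self._blocks[key] = f"kv-block(owner={tenant})"
        return self._blocks[key]

class IsolatedPrefixCache(NaivePrefixCache):
    def lookup_or_compute(self, tenant, prefix):
        key = (tenant, tuple(prefix))  # FIX: scope cache entries per tenant
        if key not in self._blocks:
            self._blocks[key] = f"kv-block(owner={tenant})"
        return self._blocks[key]

naive = NaivePrefixCache()
naive.lookup_or_compute("tenant-A", [1, 2, 3])
leaked = naive.lookup_or_compute("tenant-B", [1, 2, 3])
print(leaked)  # → kv-block(owner=tenant-A)  — B is served A's block

isolated = IsolatedPrefixCache()
isolated.lookup_or_compute("tenant-A", [1, 2, 3])
own = isolated.lookup_or_compute("tenant-B", [1, 2, 3])
print(own)  # → kv-block(owner=tenant-B)  — each tenant gets its own block
```

The same shape of bug applies to speculative decoding: the draft model's proposals are verified against cached state, so any cross-tenant block reuse at that step can expose another session's data.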
Why it matters
- Performance optimizations like speculative decoding are critical for scaling LLM inference but introduce complex side-channel risks.
- This CVE highlights the difficulty of maintaining "hard" security boundaries in dynamic, GPU-memory-constrained environments where cache reuse is a primary efficiency driver.
What to do
- Upgrade: Immediately update to vLLM v0.7.2+.
- Disable Speculation: If an upgrade isn't immediately possible, disable speculative decoding (`--use-v2-block-manager` may offer partial mitigation, but explicit disabling is safer).
- Audit Multi-tenancy: Ensure your orchestration layer (e.g., Triton, Ray) is not implicitly sharing inference state across different security contexts.
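One way to make the multi-tenancy audit concrete is a pre-dispatch check that no inference-state namespace is claimed by more than one security context. The request schema below is hypothetical (the field names do not come from Triton, Ray, or vLLM); it is a sketch of the kind of invariant an orchestration layer could enforce.

```python
# Hypothetical audit helper: flag any cache/state namespace shared across
# security contexts before a batch reaches the inference engine.
from collections import defaultdict

def audit_batch(requests):
    """requests: iterable of dicts with 'security_context' and 'cache_namespace'.
    Returns the namespaces claimed by more than one security context."""
    owners = defaultdict(set)
    for req in requests:
        owners[req["cache_namespace"]].add(req["security_context"])
    return [ns for ns, ctxs in owners.items() if len(ctxs) > 1]

batch = [
    {"security_context": "tenant-A", "cache_namespace": "shared-0"},
    {"security_context": "tenant-B", "cache_namespace": "shared-0"},  # violation
    {"security_context": "tenant-B", "cache_namespace": "b-1"},
]
print(audit_batch(batch))  # → ['shared-0']
```

Rejecting (or re-namespacing) flagged batches turns the "no implicit state sharing" requirement into an enforced invariant rather than a convention.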