CVE-2026-53923 — vLLM GGUF Dequantization Bug Leaks GPU Memory Between Tenants
AI relevance: Multi-tenant vLLM inference clusters risk cross-tenant GPU memory leakage when a GGUF dequantization kernel truncates tensor dimensions and overflows a device buffer.
What Happened
- CVE-2026-53923 affects vLLM versions from 0.5.5 up to, but not including, the patched 0.23.1rc0 release, a wide range covering most production GGUF-serving deployments.
- The bug sits in the GGUF dequantization kernels: tensor dimension values are integer-truncated before being used to size GPU buffer allocations.
- A crafted GGUF model with inflated dimension metadata causes the kernel to allocate a smaller buffer than the dequantization write requires, producing a heap-based buffer overflow on the GPU.
- The overflow exposes uninitialized GPU memory from prior operations — which in a multi-tenant setup may contain embeddings, prompt activations, or API tokens from other users' inference requests.
- No remote code execution on the host CPU has been demonstrated through this path alone; the primary risk in shared inference clusters is the information leak.
- vLLM patched the issue in version 0.23.1rc0 by widening the dimension type before allocation sizing.
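
The truncation described above can be illustrated with plain integer arithmetic. This is a minimal sketch, not vLLM's kernel code: it assumes the dimension is narrowed to a signed 32-bit integer and that each element takes 2 bytes, and the names `truncate_to_int32`, `buffer_sizes`, and `is_safe_dimension` are hypothetical. It shows how an inflated dimension can yield an allocation far smaller than the write that follows.

```python
# Hypothetical illustration only: this is not vLLM's kernel code.
# Assumption: dimensions are narrowed to a signed 32-bit integer before sizing.

INT32_MASK = 0xFFFFFFFF
INT32_MAX = 0x7FFFFFFF


def truncate_to_int32(value: int) -> int:
    """Mimic a C-style cast of a wider integer to a signed 32-bit int."""
    low = value & INT32_MASK
    # Reinterpret the top bit as the sign bit, as two's complement would.
    return low - (1 << 32) if low & 0x80000000 else low


def buffer_sizes(rows: int, cols: int, bytes_per_elem: int = 2):
    """Return (bytes allocated from truncated dims, bytes the write needs)."""
    allocated = truncate_to_int32(rows) * truncate_to_int32(cols) * bytes_per_elem
    required = rows * cols * bytes_per_elem
    return allocated, required


def is_safe_dimension(dim: int) -> bool:
    """Defensive check: accept only dimensions that survive the narrowing."""
    return 0 < dim <= INT32_MAX


# Inflated metadata value whose low 32 bits equal 16.
rows = (1 << 32) + 16
cols = 4096

allocated, required = buffer_sizes(rows, cols)
print(f"allocated={allocated} bytes, required={required} bytes")
print("rows accepted by validation:", is_safe_dimension(rows))
```

Under these assumptions the allocation is sized for 16 rows while the write covers more than four billion. Widening the type before sizing, as the patch is described as doing, removes the mismatch; rejecting out-of-range dimensions at load time is a complementary check.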
Why It Matters
vLLM is one of the most widely used open-source inference engines for serving open-weight models at scale. Many providers run it in multi-tenant configurations, sharing a single GPU across multiple customers or workloads. GGUF is the de facto standard format for distributing quantized models in the llama.cpp ecosystem. A malicious model publisher can craft a GGUF file that triggers the overflow on any vulnerable vLLM instance that loads it. The leaked memory may contain fragments of other tenants' prompts, completions, or loaded model weights, which is a direct confidentiality breach in shared inference infrastructure.
What to Do
- Upgrade immediately to vLLM ≥ 0.23.1rc0. This is the only fix.
- Audit GGUF model sources — only load models from trusted publishers with verified checksums. Treat any untrusted GGUF file as potentially weaponized.
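- Isolate tenants — if you run multi-tenant vLLM, enforce per-tenant GPU partitioning to limit blast radius. MIG provides hardware-level memory isolation between partitions; MPS or separate containers that share one GPU give weaker separation and should not be relied on alone.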
- Monitor GPU error logs for dequantization failures, OOM errors on unexpectedly small allocations, or CUDA memory violations — these may indicate exploitation attempts.
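
The checksum audit recommended above can be sketched as a gate that runs before a model file reaches the inference engine. This is an illustrative sketch, not a vLLM feature: the `TRUSTED_SHA256` allowlist and the `is_trusted_gguf` helper are hypothetical names, and the allowlist is assumed to be filled from digests published by a source you trust. The only format detail it relies on is that GGUF files begin with the four magic bytes `GGUF`.

```python
import hashlib
from pathlib import Path

# Hypothetical allowlist: fill with SHA-256 hex digests from a trusted publisher.
TRUSTED_SHA256 = set()


def sha256_of_file(path, chunk_size=1 << 20) -> str:
    """Hash the file in chunks so large model files are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_trusted_gguf(path, trusted=None) -> bool:
    """Return True only if the file has the GGUF magic and an allowlisted digest."""
    trusted = TRUSTED_SHA256 if trusted is None else trusted
    path = Path(path)
    with open(path, "rb") as handle:
        if handle.read(4) != b"GGUF":
            return False
    return sha256_of_file(path) in trusted
```

A loader wrapper would call `is_trusted_gguf(model_path)` and refuse to continue on `False`. A matching digest only proves the file is the one the publisher released; it does not prove the file is benign, so this check complements the upgrade rather than replacing it.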