CVE-2026-53923 — vLLM GGUF Dequantization Bug Leaks GPU Memory Between Tenants

2026-06-25 AI CVEs by al-ice.ai Editorial

AI relevance: Multi-tenant vLLM inference clusters risk cross-tenant GPU memory leakage when a GGUF dequantization kernel truncates tensor dimensions and overflows a device buffer.

What Happened

CVE-2026-53923 affects vLLM versions from 0.5.5 up to, but not including, 0.23.1rc0, a wide range covering most production GGUF-serving deployments.
The bug sits in the GGUF dequantization kernels: tensor dimension values are integer-truncated before being used to size GPU buffer allocations.
A crafted GGUF model with inflated dimension metadata causes the kernel to allocate a smaller buffer than the dequantization write requires, producing a heap-based buffer overflow on the GPU.
The overflow exposes uninitialized GPU memory from prior operations — which in a multi-tenant setup may contain embeddings, prompt activations, or API tokens from other users' inference requests.
No remote code execution on the host CPU has been demonstrated through this path alone; the primary risk in shared inference clusters is the information leak.
vLLM patched the issue in version 0.23.1rc0 by widening the dimension type before allocation sizing.
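
The truncation arithmetic described above can be sketched in a few lines of Python. This is an illustration of the bug class only, not vLLM's kernel code: the 64-bit to 32-bit narrowing, the function names, and the crafted dimension values are all assumptions, since the advisory does not state the integer widths involved.

```python
# Illustrative model of "truncate, then size the allocation".
# ASSUMPTION: dimensions are narrowed from 64 to 32 bits; the advisory
# does not specify the real widths used by the affected kernels.

UINT32_MASK = 0xFFFFFFFF


def truncated_alloc_elems(rows: int, cols: int) -> int:
    """Element count when each dimension is narrowed to 32 bits first."""
    return (rows & UINT32_MASK) * (cols & UINT32_MASK)


def widened_alloc_elems(rows: int, cols: int) -> int:
    """Element count computed from the full-width dimension values."""
    return rows * cols


# Hypothetical crafted metadata: one dimension just past the 32-bit limit.
rows = 2**32 + 16
cols = 256

allocated = truncated_alloc_elems(rows, cols)  # sized from rows truncated to 16
required = widened_alloc_elems(rows, cols)     # what the write loop would cover

# Any positive difference means the write runs past the end of the buffer.
overflow_elems = required - allocated
```

With these hypothetical values the buffer is sized for 4,096 elements while the write covers far more. Computing the size on the widened type, as the fix is described, closes that gap.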

Why It Matters

vLLM is the dominant open-source inference engine for serving open-weight models at scale. Many providers run it in multi-tenant configurations — sharing a single GPU across multiple customers or workloads. The GGUF format is the standard for quantized model distribution (llama.cpp ecosystem). A malicious model publisher can craft a GGUF file that triggers the overflow on any vulnerable vLLM instance that loads it. The leaked memory may contain fragments of other tenants' prompts, completions, or loaded model weights — a direct confidentiality breach in shared inference infrastructure.

What to Do

Upgrade immediately to vLLM ≥ 0.23.1rc0. This is the only fix.
Audit GGUF model sources — only load models from trusted publishers with verified checksums. Treat any untrusted GGUF file as potentially weaponized.
Isolate tenants — if you run multi-tenant vLLM, enforce per-tenant GPU partitioning (MIG, MPS, or separate containers) to limit blast radius.
Monitor GPU error logs for dequantization failures, OOM errors on unexpectedly small allocations, or CUDA memory violations — these may indicate exploitation attempts.
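
The checksum step in the list above could be enforced with a small gate that runs before a GGUF file is handed to the server. The following is a minimal sketch using only the Python standard library; the allowlist structure and the names `sha256_of` and `is_trusted_model` are hypothetical and not part of vLLM.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20) -> str:
    """Stream the file in chunks so large model files are not read into RAM at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_trusted_model(path, allowlist) -> bool:
    """Return True only if the file name is allowlisted and its digest matches.

    `allowlist` is a hypothetical dict mapping file name -> expected
    SHA-256 hex digest, as published by a trusted model source.
    """
    expected = allowlist.get(Path(path).name)
    if expected is None:
        return False  # unknown files are rejected by default
    return hmac.compare_digest(sha256_of(path), expected.lower())


# Example usage (path and digest are placeholders):
# allowlist = {"model.Q4_K_M.gguf": "<expected sha256 hex digest>"}
# if not is_trusted_model("/models/model.Q4_K_M.gguf", allowlist):
#     raise SystemExit("refusing to load unverified GGUF file")
```

A matching checksum only shows that the file is the one the publisher released. It does not show that the file is benign, so this check complements the upgrade and does not replace it.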
