Orca Security — Pickle RCE in SGLang LLM Framework (CVE-2026-3059/3060)

AI relevance: SGLang is a widely deployed open-source framework for serving LLMs and multimodal models; pickle deserialization on network-exposed ZMQ ports gives any network-adjacent attacker unauthenticated RCE on GPU inference servers running production AI workloads.

  • CVE-2026-3059 (CVSS 9.8): unsafe pickle.loads() on untrusted data received via ZMQ in the multimodal generation broker (scheduler_client.py). No authentication required; network-reachable.
  • CVE-2026-3060 (CVSS 9.8): same pickle deserialization flaw in the disaggregation encoder receiver (encode_receiver.py), also network-reachable and unauthenticated.
  • A third flaw, CVE-2026-3989 (CVSS 7.8), covers insecure pickle deserialization in the crash-dump replay utility (replay_request_dump.py); it is local-only but still exploitable.
  • All three stem from Python's pickle module processing data from network sockets — a pattern Orca calls "Pickle in the Pipeline," endemic across ML/AI infrastructure.
  • Orca disclosed the flaws through CERT/CC (case VU#665416). At time of publication, no official SGLang patch exists and the maintainers have not responded to coordinated disclosure.
  • The proposed patch (unmerged) replaces pickle with a safe serialization format, but operators must manually apply it or block network access to the affected ports.
  • SGLang competes with vLLM and TGI for LLM serving; deployments often expose ZMQ ports on shared GPU clusters, making network-adjacent exploitation realistic.
  • The broader pattern — pickle on untrusted network data — repeats across MLflow, Ray, PyTorch, and other ML frameworks, making this a systemic supply-chain class issue.
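The bug class behind all three CVEs can be shown in a few lines. This is a minimal, self-contained illustration of why pickle.loads() on attacker-reachable bytes is RCE, not SGLang's actual code: pickle's `__reduce__` protocol lets serialized data name an arbitrary callable that runs during deserialization.

```python
import pickle

hits = []

def record(msg):
    # Stand-in for attacker code; a real payload would invoke os.system,
    # subprocess, or similar with an attacker-chosen command.
    hits.append(msg)

class Malicious:
    def __reduce__(self):
        # pickle will call record(...) during deserialization
        return (record, ("attacker-controlled call",))

blob = pickle.dumps(Malicious())   # what an attacker would send over ZMQ
pickle.loads(blob)                 # the callable executes here, before any type check
print(hits)                        # → ['attacker-controlled call']
```

Because the callable runs before the receiver can inspect the deserialized object, no validation after `pickle.loads()` can make the pattern safe; the only fix is to not deserialize untrusted bytes with pickle at all.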

Why it matters

Pickle deserialization bugs in ML infrastructure are not new, but they keep appearing because the Python ML ecosystem defaults to pickle for inter-process communication. SGLang's case is particularly dangerous because the vulnerable ports are network-exposed by design (for multimodal and disaggregated serving), require zero authentication, and sit on GPU servers that typically hold model weights, API keys, and access to training data stores. A single unauthenticated packet turns an inference server into a foothold for lateral movement across the AI cluster. The lack of a vendor patch at disclosure time is a reminder that open-source AI frameworks often lack mature security response processes.
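The remediation direction described above, replacing pickle with a safe serialization format, amounts to parsing untrusted bytes with a codec that cannot execute code and rejecting anything off-schema. A minimal sketch, assuming a hypothetical message schema (the field names here are illustrative, not taken from the actual proposed patch):

```python
import json

def safe_decode(raw: bytes) -> dict:
    """Decode an untrusted message without code execution.

    json.loads never invokes callables from the payload; malformed or
    unexpected input raises instead of running attacker code.
    """
    msg = json.loads(raw.decode("utf-8"))
    # Hypothetical schema check: reject anything that isn't the expected shape
    if not isinstance(msg, dict) or "request_id" not in msg:
        raise ValueError("rejected message with unexpected shape")
    return msg

print(safe_decode(b'{"request_id": "abc", "prompt": "hi"}'))
```

The same idea applies to msgpack or protobuf; the essential property is that deserialization is data-only and validation happens before the message reaches any business logic.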

What to do

  • Block exposed ports: Immediately firewall ZMQ ports used by SGLang's multimodal broker and disaggregation encoder. Restrict access to only trusted cluster nodes.
  • Apply the unmerged patch: Review the proposed fix from Orca/CERT/CC and consider cherry-picking it if you run SGLang in production.
  • Scan for pickle usage: Audit your broader AI/ML stack for pickle.loads() on network-received data. Orca's research notes this is a systemic pattern.
  • Network segmentation: Isolate inference-serving infrastructure from general network access. GPU servers should not be reachable from untrusted segments.
  • Monitor for exploitation: Look for unexpected ZMQ connections, anomalous Python child processes on inference nodes, or outbound network calls from SGLang hosts.
  • Consider alternatives: If SGLang maintainers remain unresponsive, evaluate whether vLLM or TGI better meet your security response requirements.
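The "scan for pickle usage" step above can be approximated with a small AST pass over your own codebase. This sketch catches the common `pickle.load()`/`pickle.loads()` spelling; it will miss aliased imports and indirect calls, so treat it as a starting point rather than a complete audit:

```python
import ast

def find_pickle_loads(source: str, filename: str = "<src>"):
    """Return (filename, lineno) for calls that look like pickle.load(s)()."""
    findings = []
    tree = ast.parse(source, filename=filename)
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr in ("load", "loads")
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "pickle"):
            findings.append((filename, node.lineno))
    return findings

# Hypothetical handler resembling the vulnerable pattern (not SGLang's code)
sample = (
    "import pickle\n"
    "\n"
    "def handler(sock):\n"
    "    msg = sock.recv()\n"
    "    return pickle.loads(msg)\n"
)
print(find_pickle_loads(sample, "handler.py"))  # → [('handler.py', 5)]
```

Running this across every .py file in a deployment (and flagging any hit whose input traces back to a socket) gives a quick first-pass inventory of the "Pickle in the Pipeline" pattern Orca describes.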

Sources