vLLM — Mixture-of-Models routing on AMD GPUs (vLLM-SR)

2026-01-31 • Category: Research

AI relevance: If you operate an agent or LLM gateway, a MoM router is infrastructure that decides which model answers and which guardrails run — it directly shapes cost, latency, and safety outcomes in production.

Concept: “Mixture-of-Models” (MoM) is inter-model orchestration (route a whole request to a model), not “Mixture-of-Experts” (MoE) inside one model.
Claimed system: vLLM Semantic Router (vLLM-SR) v0.1 routes queries across 6 specialized models using multiple signals and rule-based decisions.
Hardware note: the authors describe running the demo on AMD MI300X/MI355X GPUs, a useful data point if you’re evaluating non-NVIDIA inference capacity.
Signals as policy surface: routing decisions use signals like domain/intent/language/“deep thinking” vs “fast QA” to select a model and reasoning mode.
Guardrails as infrastructure: the decision matrix includes a “guardrails” path (e.g., a jailbreak keyword trigger) that routes the request to a smaller model.
Operational takeaway: once routing is externalized, you can add controls (cost caps, safety tiers, audit logging) without prompt spaghetti.
Tradeoff: more moving parts means more failure modes (signal drift, rule bugs, per-model regression) — you need observability around routing correctness.
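
To make the signal-then-rule flow concrete, here is a minimal sketch of a rule-based router. It is not vLLM-SR's implementation: the signal fields, keyword lists, and model names (guard-small, code-specialist, reasoner-large, general-small) are all hypothetical, and a real router would likely use classifiers rather than keyword matching.

```python
from dataclasses import dataclass

# Hypothetical signal set and model names; none of these come from vLLM-SR.
@dataclass(frozen=True)
class Signals:
    domain: str          # e.g. "code" or "general"
    language: str        # e.g. "en" (fixed here for simplicity)
    deep_thinking: bool  # True for multi-step reasoning, False for fast QA
    jailbreak_hit: bool  # True if a guardrail keyword matched

@dataclass(frozen=True)
class Decision:
    model: str
    reasoning: bool
    rule: str            # name of the rule that fired, useful for auditing

JAILBREAK_KEYWORDS = ("ignore previous instructions", "disable safety")

def extract_signals(prompt: str) -> Signals:
    """Toy keyword-based extraction, standing in for real classifiers."""
    text = prompt.lower()
    domain = "code" if "def " in text or "stack trace" in text else "general"
    deep = any(w in text for w in ("prove", "step by step", "derive"))
    jailbreak = any(k in text for k in JAILBREAK_KEYWORDS)
    return Signals(domain, "en", deep, jailbreak)

def route(signals: Signals) -> Decision:
    """First matching rule wins; the guardrail rule is checked first."""
    if signals.jailbreak_hit:
        return Decision("guard-small", False, "guardrail")
    if signals.domain == "code":
        return Decision("code-specialist", signals.deep_thinking, "domain:code")
    if signals.deep_thinking:
        return Decision("reasoner-large", True, "deep-thinking")
    return Decision("general-small", False, "default")
```

Ordering the guardrail rule first is a deliberate choice in this sketch: a prompt that trips a safety signal cannot be claimed by a later domain or reasoning rule.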

Why it matters

Model selection becomes a security boundary: if unsafe inputs can steer a request toward a less-restricted model, routing logic is part of your threat model.
Cost/latency control: MoM offers a practical way to avoid sending trivial prompts to your most expensive model.
Multi-vendor reality: most teams will run multiple models (open + closed, small + large); routers are how that becomes coherent.
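
One way to treat model selection as a security boundary is to state it as an invariant that can be checked on every decision. The sketch below assumes a hypothetical mapping from model name to safety tier (a higher number means stricter guardrails); the tiers and model names are illustrative only.

```python
# Hypothetical safety tiers: a higher number means stricter guardrails.
SAFETY_TIER = {"guard-small": 3, "reasoner-large": 2, "general-small": 1}

def violates_boundary(flagged: bool, selected_model: str,
                      min_tier_for_flagged: int = 3) -> bool:
    """Return True if a flagged request landed on a model below the required tier."""
    if not flagged:
        return False
    # Unknown models are treated as tier 0, so the check fails closed.
    return SAFETY_TIER.get(selected_model, 0) < min_tier_for_flagged
```

Running a check like this in the routing path, or over logged decisions, turns "unsafe inputs must not reach a less-restricted model" into something that can alert rather than something reviewers have to remember.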

What to do

Define routing objectives: pick a small set of measurable goals (cost ceiling, p95 latency, safety tiering) before adding rules.
Log the routing decision: record selected model, matched rule/signal, and any safety flags for every request (this is your “router audit trail”).
Test for steering attacks: add red-team prompts that try to push “unsafe” content into a cheaper/unguarded path (treat it like authz testing).
Version rules like code: change control + canarying for router policies; a routing tweak can be as impactful as a model upgrade.
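
A router audit trail can be as simple as one structured record per request. The following sketch assumes a hypothetical record shape and derives a policy version from a hash of the rule text, so each logged decision can be tied back to the exact rule set that produced it; the field names are not taken from vLLM-SR.

```python
import hashlib
import json
import time

def policy_version(rules_text: str) -> str:
    """Short content hash of the rule set; it changes whenever the rules change."""
    return hashlib.sha256(rules_text.encode("utf-8")).hexdigest()[:12]

def audit_record(request_id, model, rule, safety_flags, rules_text, now=None):
    """Serialize one routing decision as a JSON line (hypothetical schema)."""
    return json.dumps({
        "ts": time.time() if now is None else now,
        "request_id": request_id,
        "model": model,                        # selected model
        "rule": rule,                          # matched rule or signal
        "safety_flags": sorted(safety_flags),  # sorted for stable output
        "policy_version": policy_version(rules_text),
    }, sort_keys=True)
```

Logging the policy version alongside the decision also helps with canarying: two rule sets running side by side can be compared by filtering on that field.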
