vLLM — Mixture-of-Models routing on AMD GPUs (vLLM-SR)
• Category: Research
AI relevance: If you operate an agent or LLM gateway, a MoM router is infrastructure that decides which model answers and which guardrails run — it directly shapes cost, latency, and safety outcomes in production.
- Concept: “Mixture-of-Models” (MoM) is inter-model orchestration (route a whole request to a model), not “Mixture-of-Experts” (MoE) inside one model.
- Claimed system: vLLM Semantic Router (vLLM-SR) v0.1 routes queries across 6 specialized models using multiple signals and rule-based decisions.
- Hardware note: the authors describe running the demo on AMD MI300X/MI355X GPUs, a useful data point if you’re evaluating non-NVIDIA inference capacity.
- Signals as policy surface: routing decisions use signals like domain/intent/language/“deep thinking” vs “fast QA” to select a model and reasoning mode.
- Guardrails as infrastructure: the decision matrix includes a “guardrails” path (e.g., a jailbreak keyword trigger) that routes the request to a smaller model.
- Operational takeaway: once routing is externalized, you can add controls (cost caps, safety tiers, audit logging) without prompt spaghetti.
- Tradeoff: more moving parts means more failure modes (signal drift, rule bugs, per-model regression) — you need observability around routing correctness.
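To make the signal-plus-rules idea concrete, here is a minimal sketch of a first-match rule router. It is not vLLM-SR's implementation or API: the model names, keyword lists, and the `Signals`, `Rule`, and `route` names are all hypothetical, and the keyword matching stands in for whatever classifiers a real router would use.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Signals:
    domain: str                 # "math", "code", or "general"
    needs_deep_thinking: bool   # "deep thinking" vs "fast QA"
    jailbreak_keyword: bool     # guardrail trigger

@dataclass(frozen=True)
class Decision:
    model: str       # hypothetical model name
    reasoning: bool  # whether to enable the reasoning mode
    rule: str        # name of the rule that fired

@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[Signals], bool]
    model: str
    reasoning: bool

# Assumption: toy keyword lists, purely for illustration.
JAILBREAK_KEYWORDS = ("ignore previous instructions", "developer mode")
MATH_WORDS = ("prove", "integral", "equation")
CODE_WORDS = ("python", "function", "stack trace")

def extract_signals(prompt: str) -> Signals:
    """Derive routing signals from the raw prompt with simple keyword checks."""
    text = prompt.lower()
    if any(w in text for w in MATH_WORDS):
        domain = "math"
    elif any(w in text for w in CODE_WORDS):
        domain = "code"
    else:
        domain = "general"
    return Signals(
        domain=domain,
        needs_deep_thinking=domain != "general",
        jailbreak_keyword=any(k in text for k in JAILBREAK_KEYWORDS),
    )

# Rules are evaluated in order; the guardrail rule comes first so it always wins.
RULES = [
    Rule("guardrail-jailbreak", lambda s: s.jailbreak_keyword, "guard-small", False),
    Rule("math-deep", lambda s: s.domain == "math", "math-specialist", True),
    Rule("code-deep", lambda s: s.domain == "code", "code-specialist", True),
]
DEFAULT = Decision(model="general-fast", reasoning=False, rule="default")

def route(prompt: str) -> Decision:
    """Return the decision of the first matching rule, or the default."""
    signals = extract_signals(prompt)
    for rule in RULES:
        if rule.matches(signals):
            return Decision(rule.model, rule.reasoning, rule.name)
    return DEFAULT
```

Because every decision carries the name of the rule that fired, rule ordering (here, guardrails before domain rules) becomes something you can inspect and test rather than behavior buried in prompts.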
Why it matters
- Model selection becomes a security boundary: if unsafe inputs can steer a request toward a less-restricted model, routing logic is part of your threat model.
- Cost/latency control: MoM offers a practical way to avoid sending trivial prompts to your most expensive model.
- Multi-vendor reality: most teams will run multiple models (open + closed, small + large); routers are how that becomes coherent.
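One way to treat model selection as a security boundary is a safety floor applied after routing: a flagged request may never land on a model below a required tier. The sketch below assumes a hypothetical tier table and model names; how tiers are actually assigned is a policy decision this brief does not specify.

```python
from typing import Set

# Assumption: hypothetical safety tiers; higher means more guarded.
MODEL_SAFETY_TIER = {
    "guard-small": 3,
    "general-fast": 2,
    "math-specialist": 1,
    "code-specialist": 1,
}

def required_tier(safety_flags: Set[str]) -> int:
    """Any safety flag raises the floor to the most guarded tier."""
    return 3 if safety_flags else 1

def enforce_safety_floor(selected_model: str,
                         safety_flags: Set[str],
                         fallback: str = "guard-small") -> str:
    """Keep the router's choice only if it meets the required tier."""
    # Unknown models get tier 0, so they always fail the floor check.
    tier = MODEL_SAFETY_TIER.get(selected_model, 0)
    if tier >= required_tier(safety_flags):
        return selected_model
    return fallback
```

Running this check as a separate step means a bug in the routing rules cannot by itself send flagged input to a less-restricted model.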
What to do
- Define routing objectives: pick a small set of measurable goals (cost ceiling, p95 latency, safety tiering) before adding rules.
- Log the routing decision: record selected model, matched rule/signal, and any safety flags for every request (this is your “router audit trail”).
- Test for steering attacks: add red-team prompts that try to push “unsafe” content into a cheaper/unguarded path (treat it like authz testing).
- Version rules like code: apply change control and canarying to router policies; a routing tweak can be as impactful as a model upgrade.
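The logging and canarying steps above can be sketched together: one structured audit record per request, plus deterministic bucketing that sends a fixed share of traffic to a new policy version. The field names, version labels, and the `audit_record` and `policy_for` helpers are assumptions for illustration, not part of any router's API.

```python
import hashlib
import json
import time

def audit_record(request_id, model, rule, signals, safety_flags, policy_version):
    """Serialize one routing decision as a JSON line for the audit trail."""
    return json.dumps({
        "ts": time.time(),
        "request_id": request_id,
        "policy_version": policy_version,
        "model": model,
        "rule": rule,
        "signals": signals,
        "safety_flags": sorted(safety_flags),
    }, sort_keys=True)

def policy_for(request_id, stable="v1", canary="v2", canary_percent=5):
    """Pick a policy version by hashing the request ID into 100 buckets.

    The same request ID always maps to the same bucket, so a replayed
    request is routed by the same policy version.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return canary if bucket < canary_percent else stable
```

Recording `policy_version` in every audit line lets you compare routing outcomes between the stable and canary policies before widening the rollout.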