MCPTox — Tool Poisoning Benchmark Shows 73% Attack Success Rate on MCP Agents
AI relevance: Tool poisoning embeds malicious instructions directly in MCP server tool descriptions — no code execution needed — making every agent that connects to untrusted MCP servers vulnerable to credential theft and unauthorized operations.
Key Findings
- First benchmark for MCP tool poisoning: MCPTox evaluates 20 prominent LLM agents against poisoned tool descriptions across 45 live MCP servers and 353 authentic tools.
- 72.8% attack success rate on o1-mini — and more capable models tend to be more susceptible, because the attack exploits their stronger instruction-following abilities.
- Three attack templates generate 1,312 test cases across 10 risk categories, including SSH key exfiltration, file system abuse, and unauthorized API calls — all triggered through tool metadata alone.
- Safety alignment is largely ineffective: even the most resistant model (Claude 3.7 Sonnet) refused fewer than 3% of poisoning attacks, because the malicious instructions use legitimate tools for unauthorized operations.
- Distinct from indirect prompt injection: repurposing IPI benchmark payloads for tool poisoning yielded near-zero success, confirming this is a separate attack vector that existing benchmarks miss.
- Attack mechanism: poisoned tool descriptions are injected during the MCP registration phase, entering the LLM's context before any user request — the agent then follows hidden rules embedded in seemingly legitimate tool metadata.
Why It Matters
Tool poisoning is fundamentally different from code-level vulnerabilities. The attack lives in plain text — tool descriptions that hosts load without scrutiny. An attacker doesn't need to compromise server code; they just need to publish a poisoned server to a registry. When an agent connects, the malicious instructions are baked into the context and executed alongside legitimate tool calls. The benchmark proves this works at scale across real-world servers and modern agents.
What to Do
- Filter tool descriptions: Run heuristic or ML-based checks on MCP tool metadata before loading into agent context, flagging instructions that attempt to override agent behavior.
- Least-privilege tool access: Scope MCP server permissions to only the tools your agent actually needs — reduce the blast radius when a poisoned tool is discovered.
- Monitor tool call patterns: Alert on unusual tool invocations (e.g., a file tool accessing credential paths during an unrelated task).
- Use MCPTox for testing: Run your agent against the MCPTox benchmark before deploying to production to establish a baseline robustness score.