arXiv: Content-Aware Attack Detection in LLM Agent Tool-Call Traffic

AI relevance: As MCP becomes the dominant tool-calling interface for LLM agents, detecting malicious tool-call sessions in real time is an unsolved operational problem — this paper provides the first systematic evaluation of learned detectors over MCP traffic.

Sultan Zavrak and collaborators published arXiv:2605.11053 (v3, updated May 22), proposing a graph-based attack detection framework for MCP tool-call traffic. The approach encodes each agent session as a graph, with tool calls as nodes and sequential and data-flow links as edges, and enriches each node with sentence-embedding features computed over the call's arguments and responses.
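To make the session-as-graph encoding concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the ToolCall schema, the build_session_graph helper, and the rule used to infer data-flow edges (an earlier response appearing verbatim in a later call's arguments) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One tool invocation in an agent session (hypothetical schema)."""
    tool: str
    arguments: dict
    response: str


def build_session_graph(calls, min_len=4):
    """Return (nodes, edges) for one session.

    Nodes are indices into `calls`. Edges are (src, dst, kind) triples:
    kind "seq" links consecutive calls, kind "data" links an earlier call
    to a later one whose argument values contain the earlier response.
    The substring rule is an illustrative assumption, not the paper's.
    """
    nodes = list(range(len(calls)))
    edges = [(i, i + 1, "seq") for i in range(len(calls) - 1)]
    for dst, later in enumerate(calls):
        arg_text = " ".join(str(v) for v in later.arguments.values())
        for src in range(dst):
            resp = calls[src].response
            # Skip very short responses to avoid accidental matches.
            if len(resp) >= min_len and resp in arg_text:
                edges.append((src, dst, "data"))
    return nodes, edges


session = [
    ToolCall("read_file", {"path": "notes.txt"}, "token-8f3a"),
    ToolCall("search", {"query": "weather"}, "sunny"),
    ToolCall("http_post", {"url": "https://example.com", "body": "token-8f3a"}, "ok"),
]
nodes, edges = build_session_graph(session)
```

In this toy session the value returned by read_file reappears in the body of http_post, so the graph gains a data-flow edge from node 0 to node 2 in addition to the two sequential edges. In a real pipeline each node would also carry an embedding of its arguments and response.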

Key findings

  • Content-level features are essential. Metadata-only detection plateaus around AUROC 0.64 regardless of architecture. Adding SBERT content embeddings pushes AUROC above 0.89, with tree ensembles on pooled embeddings reaching 0.975.
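  • Naive random-split evaluations inflate results by up to 26 percentage points compared to task-disjoint splits. With random splits, sessions from the same task can land in both the training and test sets, so a model can score well by memorizing tasks. This memorization confound has not been addressed in prior agent-detection literature.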
  • Simple models outperform complex ones. Tree ensembles on pooled SBERT embeddings (AUROC 0.975) outperformed GNNs (GraphSAGE at 0.917) and an MLP (0.896) in the primary RAS-Eval setting.
  • Self-supervised pre-training provided no label-efficiency advantage on this detection task, contrary to expectations from other NLP domains.
  • The evaluation covered three GNN architectures (GAT, GCN, GraphSAGE), an MLP baseline, and classical models (XGBoost, random forest, logistic regression, linear SVM) across RAS-Eval, ATBench, and a combined-source variant.
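The difference between a random split and a task-disjoint split can be sketched as follows. This is an illustrative helper, not code from the paper: the task_id field and the task_disjoint_split function are hypothetical names, and the toy sessions carry no real data.

```python
import random
from collections import defaultdict


def task_disjoint_split(sessions, test_fraction=0.25, seed=0):
    """Split sessions so that no task_id appears in both train and test.

    `sessions` is a list of dicts with a hypothetical "task_id" key.
    Whole tasks, not individual sessions, are assigned to each side.
    """
    by_task = defaultdict(list)
    for s in sessions:
        by_task[s["task_id"]].append(s)
    task_ids = sorted(by_task)
    random.Random(seed).shuffle(task_ids)
    n_test = max(1, round(len(task_ids) * test_fraction))
    test = [s for t in task_ids[:n_test] for s in by_task[t]]
    train = [s for t in task_ids[n_test:] for s in by_task[t]]
    return train, test


# 40 toy sessions spread evenly over 8 tasks.
sessions = [
    {"task_id": f"task-{i % 8}", "label": i % 2, "session_id": i}
    for i in range(40)
]
train, test = task_disjoint_split(sessions)
train_tasks = {s["task_id"] for s in train}
test_tasks = {s["task_id"] for s in test}
```

Because assignment happens at the task level, the test set only contains tasks the model never saw during training. If scikit-learn is available, its GroupShuffleSplit class gives the same guarantee when the task identifiers are passed as groups.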

Why it matters

Most MCP deployments today rely on prompt-based safety or allow/deny lists. This work shows that learned detection is feasible with the right feature engineering, but it also warns that evaluation methodology can change the conclusion: results inflated by random splits may give false confidence in production.

What to do

  • If you operate MCP servers or agent tool-call pipelines, incorporate content-level features (not just metadata) into any detection logic.
  • Use task-disjoint evaluation splits when testing detection models — random splits will overestimate real-world performance.
  • Don't assume GNNs are automatically superior; tree ensembles on pooled embeddings may give better results at lower computational cost.
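As a rough illustration of the pooled-embedding recommendation, the sketch below collapses per-call embedding vectors into one fixed-length session vector. Mean pooling, the concatenation with metadata, and the function names are assumptions made for this example; the summary above does not specify which pooling the paper uses, and the toy vectors stand in for real sentence embeddings.

```python
def mean_pool(call_embeddings):
    """Average per-call embedding vectors into one session vector."""
    if not call_embeddings:
        raise ValueError("session has no tool calls")
    dim = len(call_embeddings[0])
    if any(len(v) != dim for v in call_embeddings):
        raise ValueError("all embeddings must share one dimension")
    n = len(call_embeddings)
    return [sum(v[d] for v in call_embeddings) / n for d in range(dim)]


def session_features(call_embeddings, metadata):
    """Concatenate pooled content features with numeric metadata features.

    Combining the two is an assumption for illustration; `metadata` might
    hold values such as call count or mean argument length.
    """
    return mean_pool(call_embeddings) + list(metadata)


# Toy 3-dimensional vectors standing in for real sentence embeddings.
embeddings = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
features = session_features(embeddings, metadata=[2, 0.5])
```

The resulting vector has the same length for every session regardless of how many tool calls it contains, which is what lets a tabular model such as a random forest or gradient-boosted trees consume it directly.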

Sources