arXiv — LLMs Fail at Open-Ended Threat Hunting (3.8% Best Score)

AI relevance: This benchmark directly measures whether LLM-powered SecOps agents can autonomously perform the core SOC task of threat hunting — a capability vendors are actively marketing today.

Key Findings

  • Researchers Chona, Kozlov, and Kumar introduced the Cyber Defense Benchmark, wrapping 106 real attack procedures from the OTRF Security-Datasets corpus into a Gymnasium RL environment (arXiv:2604.19533).
  • The benchmark spans 86 MITRE ATT&CK sub-techniques across 12 tactics, with each episode presenting 75,000–135,000 raw Windows event log records in an obfuscated SQLite database.
  • Five frontier models were tested: Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash.
  • The best model, Claude Opus 4.6, correctly flagged only 3.8% of malicious events on average. No run by any model found every flag in any campaign.
  • Passing was defined as ≥50% recall on every ATT&CK tactic — the minimum bar for unsupervised SOC deployment. Claude Opus 4.6 cleared this on only 5 of 13 tactics; all other models cleared zero.
  • Models that perform strongly on curated Q&A security benchmarks perform abysmally on open-ended, evidence-driven hunting, which suggests that existing eval suites substantially overestimate real-world SecOps readiness.
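
The pass criterion above (at least 50% recall on every ATT&CK tactic) can be illustrated with a minimal Python sketch. This is not the paper's scoring code: the data shapes, the function names per_tactic_recall and passes, and the toy event IDs are all assumptions made for illustration.

```python
from collections import defaultdict

PASS_THRESHOLD = 0.5  # the >=50% per-tactic recall bar described in the findings


def per_tactic_recall(ground_truth, flagged):
    """Return recall per tactic.

    ground_truth: dict mapping malicious event id -> ATT&CK tactic (hypothetical shape)
    flagged: set of event ids the agent reported as malicious
    """
    total = defaultdict(int)
    found = defaultdict(int)
    for event_id, tactic in ground_truth.items():
        total[tactic] += 1
        if event_id in flagged:
            found[tactic] += 1
    return {tactic: found[tactic] / total[tactic] for tactic in total}


def passes(recalls, threshold=PASS_THRESHOLD):
    """A run passes only if every tactic meets the threshold."""
    return all(recall >= threshold for recall in recalls.values())


# Toy data, invented for this example only.
ground_truth = {
    "e1": "execution",
    "e2": "execution",
    "e3": "persistence",
    "e4": "persistence",
    "e5": "lateral-movement",
}
# "e99" is a false positive; recall ignores it.
flagged = {"e1", "e2", "e3", "e99"}

recalls = per_tactic_recall(ground_truth, flagged)
cleared = sorted(t for t, r in recalls.items() if r >= PASS_THRESHOLD)

print(recalls)
print("tactics cleared:", cleared)
print("passes:", passes(recalls))
```

Because the bar applies to every tactic, a single missed tactic fails the whole run, which is why clearing some tactics but not all still counts as not passing.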

Why It Matters

Multiple vendors are shipping or marketing "AI SOC analyst" products that promise autonomous threat hunting. This paper provides the first rigorous, CTF-scored evaluation of that claim against real attack telemetry — and the results are sobering. Organizations deploying LLM agents for security operations should treat autonomous hunting as experimental, not production-ready.

What to Do

  • Use LLM agents for triage and summarization of identified alerts, not for open-ended discovery across raw telemetry.
  • Require corroboration from a Sigma rule or other detection rule before trusting agent-flagged events.
  • Monitor this benchmark's updates — newer models and prompting strategies may narrow the gap.
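
One way to apply the corroboration recommendation above is to gate agent output on independent rule matches. The sketch below is a simplified assumption, not a Sigma implementation: the rule is a plain Python predicate standing in for a real detection rule, and the event layout, the "id" key, and the function names are hypothetical.

```python
def rule_encoded_powershell(event):
    """Hypothetical stand-in for a detection rule (not real Sigma syntax).

    Matches Windows process-creation events (Event ID 4688) where
    PowerShell is launched with an encoded-command style flag.
    """
    return (
        event.get("EventID") == 4688
        and "powershell" in event.get("NewProcessName", "").lower()
        and "-enc" in event.get("CommandLine", "").lower()
    )


RULES = [rule_encoded_powershell]


def split_by_corroboration(events, agent_flagged_ids, rules=RULES):
    """Split agent-flagged events into rule-corroborated and unconfirmed ids."""
    corroborated, needs_review = [], []
    for event in events:
        if event["id"] not in agent_flagged_ids:
            continue
        if any(rule(event) for rule in rules):
            corroborated.append(event["id"])
        else:
            needs_review.append(event["id"])
    return corroborated, needs_review


# Toy events, invented for this example only.
events = [
    {
        "id": 1,
        "EventID": 4688,
        "NewProcessName": r"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe",
        "CommandLine": "powershell.exe -enc SQBFAFgA",
    },
    {
        "id": 2,
        "EventID": 4688,
        "NewProcessName": r"C:\Windows\System32\notepad.exe",
        "CommandLine": "notepad.exe report.txt",
    },
    {"id": 3, "EventID": 4624},
]

corroborated, needs_review = split_by_corroboration(events, agent_flagged_ids={1, 2})
print("corroborated:", corroborated)
print("needs analyst review:", needs_review)
```

Events the agent flags without a matching rule are routed to an analyst rather than discarded, so the gate limits automated trust without hiding potential findings.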

Sources