arXiv — LLMs Fail at Open-Ended Threat Hunting (3.8% Best Score)
AI relevance: This benchmark directly measures whether LLM-powered SecOps agents can autonomously perform the core SOC task of threat hunting — a capability vendors are actively marketing today.
Key Findings
- Researchers Chona, Kozlov, and Kumar introduced the Cyber Defense Benchmark, wrapping 106 real attack procedures from the OTRF Security-Datasets corpus into a Gymnasium RL environment (arXiv:2604.19533).
- The benchmark spans 86 MITRE ATT&CK sub-techniques across 12 tactics, with each episode presenting 75,000–135,000 raw Windows event log records in an obfuscated SQLite database.
- Five frontier models were tested: Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash.
- The best model, Claude Opus 4.6, correctly flagged only 3.8% of malicious events on average. No single run by any model found all flags across any campaign.
- Passing was defined as ≥50% recall on every ATT&CK tactic — the minimum bar for unsupervised SOC deployment. Claude Opus 4.6 cleared this on only 5 of 13 tactics; all other models cleared zero.
- Models that score well on curated Q&A security benchmarks post very low recall on open-ended, evidence-driven hunting, which suggests that existing eval suites substantially overestimate real-world SecOps readiness.
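To make the pass criterion concrete, here is a minimal sketch of per-tactic recall with a 50% bar applied to every tactic. It is not the benchmark's scoring code: the function names, the event IDs, and the toy tactic labels are all assumptions made for illustration, and the paper's CTF-style scoring may differ in detail.

```python
from collections import defaultdict

PASS_THRESHOLD = 0.5  # the >=50% recall bar quoted in the findings


def per_tactic_recall(ground_truth, flagged):
    """Return recall per tactic.

    ground_truth: dict mapping malicious event ID -> ATT&CK tactic (hypothetical labels)
    flagged: set of event IDs the agent reported as malicious
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    for event_id, tactic in ground_truth.items():
        total[tactic] += 1
        if event_id in flagged:
            hits[tactic] += 1
    return {tactic: hits[tactic] / total[tactic] for tactic in total}


def passes(recalls, threshold=PASS_THRESHOLD):
    """A run passes only if every tactic meets the threshold."""
    return all(r >= threshold for r in recalls.values())


# Toy data, invented for this example.
ground_truth = {
    "e1": "execution", "e2": "execution",
    "e3": "persistence", "e4": "persistence",
    "e5": "discovery",
}
flagged = {"e1", "e3", "e4", "e9"}  # "e9" is a false positive; it does not affect recall

recalls = per_tactic_recall(ground_truth, flagged)
cleared = sorted(t for t, r in recalls.items() if r >= PASS_THRESHOLD)
print(recalls, cleared, passes(recalls))
```

In this toy run the agent clears two tactics but misses the third entirely, so it fails overall. Because every tactic must clear the bar, one blind spot is enough to fail a run, however well the agent does elsewhere.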
Why It Matters
Multiple vendors are shipping or marketing "AI SOC analyst" products that promise autonomous threat hunting. This paper provides the first rigorous, CTF-scored evaluation of that claim against real attack telemetry — and the results are sobering. Organizations deploying LLM agents for security operations should treat autonomous hunting as experimental, not production-ready.
What to Do
- Use LLM agents to triage and summarize alerts that existing detections have already raised, not for open-ended discovery across raw telemetry.
- Require corroboration from a Sigma rule or other detection rule before acting on agent-flagged events.
- Track updates to this benchmark; newer models and prompting strategies may narrow the gap.
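One way to apply the corroboration recommendation is a simple gate between the agent and any automated response. The sketch below assumes the agent emits a set of event IDs and that detection-rule hits are available as a mapping from event ID to rule names. Both data shapes, the function name, and the rule names are hypothetical and do not come from the paper or from any specific product.

```python
def split_by_corroboration(agent_flags, rule_hits):
    """Separate agent-flagged events into corroborated and uncorroborated.

    agent_flags: set of event IDs flagged by the LLM agent
    rule_hits: dict mapping event ID -> list of detection rules that fired on it
    """
    corroborated = sorted(e for e in agent_flags if rule_hits.get(e))
    needs_review = sorted(e for e in agent_flags if not rule_hits.get(e))
    return corroborated, needs_review


# Toy data, invented for this example; the rule names are placeholders.
agent_flags = {"evt-101", "evt-205", "evt-309"}
rule_hits = {
    "evt-101": ["example_suspicious_powershell_rule"],
    "evt-400": ["example_lsass_access_rule"],  # a rule hit the agent did not flag
}

corroborated, needs_review = split_by_corroboration(agent_flags, rule_hits)
print("act on:", corroborated)
print("send to analyst:", needs_review)
```

Uncorroborated flags are routed to a human rather than discarded, so the agent can still surface leads without triggering actions on its own authority.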