arXiv — Measuring AI agents on multi-step cyber attack scenarios
AI relevance: The paper measures how modern agentic models perform across long attack chains, which is directly relevant to defenders assessing real-world offensive lift from frontier AI systems.
- The authors evaluate seven frontier models released between August 2024 and February 2026 on two purpose-built cyber ranges.
- The main benchmark is a 32-step corporate network attack that requires chaining reconnaissance, access, privilege escalation, and follow-on actions over long horizons.
- A second benchmark models a 7-step industrial control system attack, which remains materially harder for the tested systems.
- The paper reports log-linear scaling with inference-time compute: performance rose roughly linearly in the logarithm of the token budget, and moving from 10M to 100M tokens improved it by as much as 59% in their setup.
- Model generations also improved at fixed budgets: on the corporate range, average completed steps at 10M tokens increased from 1.7 for GPT-4o to 9.8 for Opus 4.6.
- The best single run reached 22 of 32 steps, which the authors equate to roughly 6 of the estimated 14 hours a human expert would need for that scenario.
- ICS performance is still limited, but the latest models were the first to complete steps reliably, averaging about 1.2–1.4 of 7 steps with a maximum of 3.
- The results suggest frontier models are not yet autonomous end-to-end operators, but they are becoming meaningfully better at sustained, multi-tool offensive workflows.
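The log-linear relationship mentioned above can be made concrete with a small sketch. It assumes the simplest form, score = intercept + slope * log10(tokens), fitted through two points. The 10.0-step baseline is a hypothetical placeholder, not a figure from the paper; only the 10M and 100M budgets and the 59% upper-bound gain come from the summary, and treating that gain as a relative increase in completed steps is also an assumption.

```python
import math


def fit_log_linear(budget_lo, score_lo, budget_hi, score_hi):
    """Fit score = intercept + slope * log10(tokens) through two points."""
    decades = math.log10(budget_hi) - math.log10(budget_lo)
    slope = (score_hi - score_lo) / decades
    intercept = score_lo - slope * math.log10(budget_lo)
    return intercept, slope


def predict(intercept, slope, budget):
    """Score predicted by the fitted line at a given token budget."""
    return intercept + slope * math.log10(budget)


# Hypothetical baseline: 10.0 completed steps at 10M tokens (illustrative only).
BASELINE_STEPS = 10.0
# Upper-bound gain quoted in the summary, read here as a relative increase.
RELATIVE_GAIN = 0.59

intercept, slope = fit_log_linear(
    10e6, BASELINE_STEPS,
    100e6, BASELINE_STEPS * (1 + RELATIVE_GAIN),
)

# Under a log-linear model, each 10x increase in tokens adds `slope` steps.
for budget in (10e6, 30e6, 100e6):
    print(f"{budget:>12,.0f} tokens -> {predict(intercept, slope, budget):.2f} steps")
```

The point of the sketch is the shape of the curve: every tenfold increase in budget buys the same absolute gain, so extra spend helps steadily but with diminishing returns per token. Extrapolating beyond the measured 10M to 100M range is not supported by the summary.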
Why it matters
- Short benchmark wins can understate risk; this paper focuses on long action sequences, which is closer to how real intrusions unfold.
- The clean scaling with token budget means capability can improve through operator-side spend and persistence, not just new model releases.
- Security teams should plan around AI raising the speed and parallelism of offensive experimentation even before fully autonomous attacks arrive.
What to do
- Benchmark your own exposure: test whether your detections and controls still hold when attackers can iterate faster with agentic help.
- Defend the full chain: prioritize controls that break multi-step progression, not just single-event alerts.
- Track inference-time scaling as a risk factor: policy and eval work should account for what models can do with larger token budgets and longer runs.
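The "defend the full chain" advice above can be illustrated with simple arithmetic. This is a sketch under a strong simplifying assumption that is not from the paper: each step is detected independently with the same probability. The 32-step chain length matches the corporate range; the 5% per-step detection rate and the function name are hypothetical.

```python
def undetected_chain_probability(steps, per_step_detection):
    """Chance an attacker completes `steps` actions without any detection,
    assuming each step is detected independently with the same probability."""
    if not 0.0 <= per_step_detection <= 1.0:
        raise ValueError("per_step_detection must be between 0 and 1")
    return (1.0 - per_step_detection) ** steps


CHAIN_LENGTH = 32          # length of the corporate-range scenario
PER_STEP_DETECTION = 0.05  # hypothetical per-step detection rate

single = undetected_chain_probability(1, PER_STEP_DETECTION)
full = undetected_chain_probability(CHAIN_LENGTH, PER_STEP_DETECTION)

print(f"Evades detection on one step:  {single:.2%}")
print(f"Evades detection on all {CHAIN_LENGTH}:   {full:.2%}")
```

With these illustrative numbers, a control that misses 95% of individual events still leaves only about a 19% chance of an undetected full chain. Real steps are neither independent nor equally detectable, so treat this as intuition for why modest coverage across many steps compounds, not as a risk estimate.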