arXiv — Measuring AI agents on multi-step cyber attack scenarios

AI relevance: The paper measures how modern agentic models perform across long attack chains, which is directly relevant to defenders assessing real-world offensive lift from frontier AI systems.

  • The authors evaluate seven frontier models released between August 2024 and February 2026 on two purpose-built cyber ranges.
  • The main benchmark is a 32-step corporate network attack that requires chaining reconnaissance, access, privilege, and follow-on actions over long horizons.
  • A second benchmark models a 7-step industrial control system attack, which remains materially harder for the tested systems.
  • The paper reports a log-linear gain from more inference-time compute: performance rose roughly linearly with the logarithm of the token budget, and moving from 10M to 100M tokens improved it by as much as 59% in their setup.
  • Model generations also improved at fixed budgets: on the corporate range, average completed steps at 10M tokens increased from 1.7 for GPT-4o to 9.8 for Opus 4.6.
  • The best single run reached 22 of 32 steps, which the authors equate to roughly 6 of the estimated 14 hours a human expert would need for that scenario.
  • ICS performance is still limited, but the latest models were the first to complete steps reliably, averaging about 1.2–1.4 of 7 steps with a maximum of 3.
  • The results suggest frontier models are not yet autonomous end-to-end operators, but they are becoming meaningfully better at sustained, multi-tool offensive workflows.
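
The scaling and step-count figures above can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes "log-linear" means completed steps grow roughly linearly in log10 of the token budget and that the 59% figure is a relative gain. The function names and the 10-step baseline are hypothetical and are not results from the paper.

```python
import math


def fit_log_linear(t1, s1, t2, s2):
    """Return (a, b) so that steps = a + b * log10(tokens) passes through both points."""
    b = (s2 - s1) / (math.log10(t2) - math.log10(t1))
    a = s1 - b * math.log10(t1)
    return a, b


def predict_steps(a, b, tokens):
    """Evaluate the fitted log-linear model at a given token budget."""
    return a + b * math.log10(tokens)


# Hypothetical baseline (NOT from the paper): 10 steps at 10M tokens,
# with a 59% relative gain at 100M tokens.
base_steps = 10.0
a, b = fit_log_linear(10e6, base_steps, 100e6, base_steps * 1.59)

# Ratios computed directly from figures quoted in the summary.
step_fraction = 22 / 32        # share of corporate-range steps in the best run
hour_fraction = 6 / 14         # share of the estimated human-expert hours
generation_ratio = 9.8 / 1.7   # Opus 4.6 vs GPT-4o average steps at 10M tokens

print(f"steps gained per 10x tokens (hypothetical baseline): {b:.2f}")
print(f"model value at 30M tokens (hypothetical baseline): {predict_steps(a, b, 30e6):.2f}")
print(f"best run: {step_fraction:.1%} of steps, {hour_fraction:.1%} of expert hours")
print(f"generation-over-generation ratio at 10M tokens: {generation_ratio:.2f}x")
```

Under this reading, every tenfold increase in tokens adds the same number of steps (the slope `b`). The best run covered about 69% of the steps but only about 43% of the estimated expert hours, which would imply that the remaining steps account for a disproportionate share of the human effort.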

Why it matters

  • Wins on short benchmarks can understate risk; this paper focuses on long action sequences, which are closer to how real intrusions unfold.
  • The clean scaling with token budget means capability can improve through operator-side spend and persistence, not just new model releases.
  • Security teams should plan around AI raising the speed and parallelism of offensive experimentation even before fully autonomous attacks arrive.

What to do

  • Benchmark your own exposure: test whether your detections and controls still hold when attackers can iterate faster with agentic help.
  • Defend the full chain: prioritize controls that break multi-step progression, not just single-event alerts.
  • Track inference-time scaling as a risk factor: policy and eval work should account for what models can do with larger token budgets and longer runs.
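
One way to see why breaking multi-step progression pays off is a toy model in which an attack succeeds only if every step does. The sketch below assumes independent steps that share a single per-step success probability, which real intrusions do not obey. The function name and the probabilities are hypothetical and are not taken from the paper; only the chain lengths (32 and 7) come from the summary.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming independent steps."""
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be in [0, 1]")
    return p_step ** n_steps


# Hypothetical per-step success rates, applied to the two chain lengths
# mentioned in the summary (32-step corporate range, 7-step ICS range).
for p in (0.99, 0.95, 0.90):
    print(
        f"p_step={p:.2f}: "
        f"32-step chain={chain_success(p, 32):.3f}, "
        f"7-step chain={chain_success(p, 7):.3f}"
    )
```

In this toy model, small reductions in per-step success compound across a long chain, which is the intuition behind controls aimed at the whole sequence rather than at any single event.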

Sources