arXiv — Measuring AI agents on multi-step cyber attack scenarios

2026-03-22 Research by al-ice.ai Editorial

AI relevance: The paper measures how modern agentic models perform across long attack chains, which is directly relevant to defenders assessing real-world offensive lift from frontier AI systems.

The authors evaluate seven frontier models released between August 2024 and February 2026 on two purpose-built cyber ranges.
The main benchmark is a 32-step corporate network attack that requires chaining reconnaissance, initial access, privilege escalation, and follow-on actions over long horizons.
A second benchmark models a 7-step industrial control system attack, which remains materially harder for the tested systems.
The paper reports log-linear gains from additional inference-time compute: raising the budget from 10M to 100M tokens improved performance by as much as 59% in the authors' setup.
Model generations also improved at fixed budgets: on the corporate range, average completed steps at 10M tokens increased from 1.7 for GPT-4o to 9.8 for Opus 4.6.
The best single run reached 22 of 32 steps, which the authors map to roughly 6 of an estimated 14 human-expert hours on that scenario.
ICS performance is still limited, but the latest models were the first to complete steps reliably, averaging about 1.2–1.4 of 7 steps with a maximum of 3.
The results suggest frontier models are not yet autonomous end-to-end operators, but they are becoming meaningfully better at sustained, multi-tool offensive workflows.
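
To put the reported figures side by side, here is a small arithmetic sketch in Python. It uses only the numbers quoted above; the variable names are ours, and the "remaining" values assume the authors' step-to-hours mapping is cumulative, which the summary does not confirm.

```python
# Figures as quoted in the summary above (corporate range).
TOTAL_STEPS = 32
BEST_RUN_STEPS = 22
EXPERT_HOURS_TOTAL = 14    # authors' estimate for the full scenario
EXPERT_HOURS_COVERED = 6   # approximate, per the authors' mapping
AVG_STEPS_GPT4O = 1.7      # average completed steps at 10M tokens
AVG_STEPS_OPUS46 = 9.8     # average completed steps at 10M tokens

# Share of the scenario covered by the best single run.
step_share = BEST_RUN_STEPS / TOTAL_STEPS
hour_share = EXPERT_HOURS_COVERED / EXPERT_HOURS_TOTAL

# Improvement across model generations at a fixed 10M-token budget.
generation_ratio = AVG_STEPS_OPUS46 / AVG_STEPS_GPT4O

# What the best run left undone (assumes a cumulative step-to-hours mapping).
remaining_steps = TOTAL_STEPS - BEST_RUN_STEPS
remaining_hours = EXPERT_HOURS_TOTAL - EXPERT_HOURS_COVERED

print(f"steps completed by best run: {step_share:.1%}")
print(f"expert hours covered:        {hour_share:.1%}")
print(f"Opus 4.6 vs GPT-4o at 10M:   {generation_ratio:.1f}x")
print(f"left undone: {remaining_steps} steps, ~{remaining_hours} expert hours")
```

The best run covers roughly 69% of the steps but only about 43% of the estimated expert time, so step counts alone overstate how much of the human effort was automated. On these figures, the 10 unfinished steps account for more than half of the estimated hours.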

Why it matters

Wins on short-horizon benchmarks can understate risk; this paper focuses on long action sequences, which is closer to how real intrusions unfold.
The clean scaling with token budget means capability can improve through operator-side spend and persistence, not just new model releases.
Security teams should plan around AI raising the speed and parallelism of offensive experimentation even before fully autonomous attacks arrive.
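
As a rough illustration of what log-linear scaling with token budget implies, the Python sketch below fits a line in log10(tokens) through two points and reads off intermediate budgets. The 5.0-step baseline is hypothetical and not taken from the paper; the 59% gain is the upper bound quoted above; the function names are ours. This is a sketch under those assumptions, not the authors' fitting procedure.

```python
import math

def log_linear_fit(tokens_1, score_1, tokens_2, score_2):
    """Return (a, b) so that score = a + b * log10(tokens) passes through both points."""
    b = (score_2 - score_1) / (math.log10(tokens_2) - math.log10(tokens_1))
    a = score_1 - b * math.log10(tokens_1)
    return a, b

def predict(a, b, tokens):
    """Score predicted by the fitted log-linear model at a given token budget."""
    return a + b * math.log10(tokens)

# HYPOTHETICAL baseline: 5.0 completed steps at 10M tokens (not from the paper).
baseline_steps = 5.0
# Upper-bound gain quoted in the summary: up to 59% when going from 10M to 100M tokens.
steps_at_100m = baseline_steps * 1.59

a, b = log_linear_fit(10e6, baseline_steps, 100e6, steps_at_100m)

# Under this model, every 10x increase in tokens adds the same number of steps (b).
gain_per_decade = b

for budget in (10e6, 30e6, 100e6):
    print(f"{budget / 1e6:>5.0f}M tokens -> {predict(a, b, budget):.2f} steps")
```

Under a log-linear model the gain per tenfold increase in tokens is a constant number of steps rather than a constant percentage, so the absolute improvement available to a persistent operator depends on the slope. The fit should not be extrapolated beyond the measured 10M to 100M range.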

What to do

Benchmark your own exposure: test whether your detections and controls still hold when attackers can iterate faster with agentic help.
Defend the full chain: prioritize controls that break multi-step progression, not just single-event alerts.
Track inference-time scaling as a risk factor: policy and eval work should account for what models can do with larger token budgets and longer runs.
