arXiv — FORGE Multi-Agent Automated Exploitation and Detection Pipeline
AI relevance: FORGE uses five specialized LLM agents to automate exploit generation, vulnerability prioritization, and detection-rule engineering — directly applicable to AI-operated vulnerability management pipelines in enterprise SOC workflows.
What happened
- Researchers published FORGE, a multi-agent system accepted at the AgentCy Workshop (ARES 2026), that bridges three siloed research areas: proof-of-concept exploit generation, vulnerability prioritization, and detection rule engineering.
- The pipeline uses five specialized agents (Intel, Generator, Planner, Exploit, and Detector) that execute in a fixed sequence, operating on CVE metadata.
- The key innovation is "graduated exploitation depth": exploits are scored on a four-level taxonomy (L0: no evidence through L3: full compromise), and deeper exploitation produces richer OpenTelemetry traces for detection-rule generation.
- Evaluation on 603 CVEs from the CVE-GENIE dataset achieved 67.8% end-to-end L1+ exploitation at USD 1.50 per CVE, across eight languages and 187 CWE types.
- Exploitation rates remained near 68% regardless of EPSS or CVSS band, suggesting that pattern-level exploit reachability is largely independent of existing metadata-based prioritization schemes.
- Detection rules generated from L2+ (deeper) exploitation achieved significantly higher span-normalized grounding than L1-derived rules (p=0.035), validating the graduated-depth approach for detection engineering.
- 93.4% of generated Snort rules produced zero false positives against a synthetic benign corpus.
- A tiered knowledge architecture transfers build and exploitation experience across assessments, creating cumulative intelligence.
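The agent sequence and depth taxonomy described above can be sketched roughly as follows. This is a minimal illustration, not FORGE's implementation: the `Assessment` record, `make_stub`, `run_pipeline`, and the two helper predicates are hypothetical names, and the agent bodies are stubs. Only the agent names, their fixed order, and the L0 to L3 labels come from the summary; L1 and L2 are left undescribed because the summary does not define them.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Callable, List


class Depth(IntEnum):
    """Four-level exploitation depth; only L0 and L3 are described in the summary."""
    L0 = 0  # no evidence
    L1 = 1  # intermediate level (not defined in the summary)
    L2 = 2  # intermediate level (not defined in the summary)
    L3 = 3  # full compromise


@dataclass
class Assessment:
    """Hypothetical record handed from one agent to the next."""
    cve_id: str
    depth: Depth = Depth.L0
    log: List[str] = field(default_factory=list)


Agent = Callable[[Assessment], Assessment]


def make_stub(name: str) -> Agent:
    """Build a placeholder agent that only records that it ran."""
    def agent(a: Assessment) -> Assessment:
        a.log.append(name)
        return a
    return agent


# Fixed order taken from the summary; the agent bodies are stubs.
AGENT_ORDER = ["Intel", "Generator", "Planner", "Exploit", "Detector"]
PIPELINE: List[Agent] = [make_stub(n) for n in AGENT_ORDER]


def run_pipeline(cve_id: str) -> Assessment:
    """Run every agent once, in the fixed order, on a single CVE."""
    a = Assessment(cve_id)
    for agent in PIPELINE:
        a = agent(a)
    return a


def is_exploited(depth: Depth) -> bool:
    """'L1+' in the summary: any depth at or above L1."""
    return depth >= Depth.L1


def has_deep_traces(depth: Depth) -> bool:
    """'L2+' in the summary: the depths whose derived rules grounded better."""
    return depth >= Depth.L2
```

Calling `run_pipeline("CVE-0000-0000")` (a placeholder identifier) returns an `Assessment` whose `log` lists the five agents in order. Using an `IntEnum` keeps the "L1+" and "L2+" thresholds expressible as plain comparisons.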
Why it matters
Vulnerability disclosure volumes now outpace organizational assessment capacity. FORGE demonstrates that multi-agent systems can compress the cycle from CVE disclosure through exploit validation to detection-rule creation, and its reported cost of USD 1.50 per CVE suggests the approach can scale. The finding that exploitation rates did not vary with EPSS or CVSS band challenges the prioritization frameworks that security teams rely on daily.
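To put the cost figure in context, the reported numbers can be combined in a quick back-of-the-envelope calculation. This sketch assumes the USD 1.50 figure applies to every CVE attempted (the summary does not say whether it is per attempt or per success), and it rounds 67.8% of 603 to a whole number of CVEs, so the derived values are approximations rather than results reported by the paper.

```python
TOTAL_CVES = 603          # CVEs evaluated, from the summary
SUCCESS_RATE = 0.678      # end-to-end L1+ exploitation rate, from the summary
COST_PER_CVE_USD = 1.50   # assumption: charged per CVE attempted

# Total spend across the evaluation set under the per-attempt assumption.
total_cost = TOTAL_CVES * COST_PER_CVE_USD

# Approximate count of CVEs reaching L1+ (rounded; not a reported figure).
approx_successes = round(TOTAL_CVES * SUCCESS_RATE)

# Effective cost of each CVE that reached L1+.
cost_per_success = total_cost / approx_successes

print(f"total cost: ${total_cost:.2f}")
print(f"approx. L1+ CVEs: {approx_successes}")
print(f"cost per L1+ CVE: ${cost_per_success:.2f}")
```

Under these assumptions the full evaluation costs about USD 904.50, roughly 409 CVEs reach L1+, and the effective cost per exploited CVE is about USD 2.21.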
What to do
- Review your vulnerability prioritization process: if you rely solely on CVSS/EPSS, the FORGE results suggest you may be missing exploitable bugs that receive low scores.
- Watch for similar multi-agent automation in commercial vulnerability management tooling — this research will likely be productized within 12 months.
- Invest in OpenTelemetry instrumentation for your critical services; the paper shows that richer exploitation traces directly enable higher-quality detection rules.
- Consider what defensive posture you will need once automated exploit generation becomes a commodity capability.
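One way to start the prioritization review above is to look for findings that score low on both CVSS and EPSS but still have validated exploit evidence. The sketch below is illustrative only: the `Finding` record, the `overlooked` function, and both cutoff values are hypothetical choices rather than anything the paper prescribes, and the sample identifiers are placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    cve_id: str
    cvss: float          # base score, 0.0 to 10.0
    epss: float          # exploitation probability, 0.0 to 1.0
    exploit_depth: int   # 0 to 3, mirroring the L0 to L3 labels


def overlooked(findings: List[Finding],
               cvss_cutoff: float = 7.0,    # hypothetical threshold
               epss_cutoff: float = 0.1     # hypothetical threshold
               ) -> List[Finding]:
    """Return findings below both cutoffs that still show L1+ exploit evidence."""
    return [
        f for f in findings
        if f.cvss < cvss_cutoff
        and f.epss < epss_cutoff
        and f.exploit_depth >= 1
    ]


# Placeholder data for illustration; not taken from the paper.
sample = [
    Finding("CVE-0000-0001", cvss=9.8, epss=0.90, exploit_depth=3),
    Finding("CVE-0000-0002", cvss=5.3, epss=0.02, exploit_depth=2),
    Finding("CVE-0000-0003", cvss=4.0, epss=0.01, exploit_depth=0),
]

flagged = [f.cve_id for f in overlooked(sample)]
```

Here only the second placeholder entry is flagged: it would be deprioritized by score alone, yet it has exploit evidence at depth 2.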