arXiv — FORGE Multi-Agent Automated Exploitation and Detection Pipeline

AI relevance: FORGE uses five specialized LLM agents to automate exploit generation, vulnerability prioritization, and detection-rule engineering — directly applicable to AI-operated vulnerability management pipelines in enterprise SOC workflows.

What happened

  • Researchers published FORGE, a multi-agent system accepted at the AgentCy Workshop (ARES 2026), that bridges three siloed research areas: proof-of-concept exploit generation, vulnerability prioritization, and detection rule engineering.
  • The pipeline uses five specialized agents — Intel, Generator, Planner, Exploit, and Detector — executing in a fixed sequence across CVE metadata.
  • The key innovation is "graduated exploitation depth": exploits are scored on a four-level taxonomy (L0: no evidence through L3: full compromise), and deeper exploitation produces richer OpenTelemetry traces for detection-rule generation.
  • Evaluated on 603 CVEs from the CVE-GENIE dataset, spanning eight languages and 187 CWE types, FORGE achieved end-to-end exploitation at L1 or deeper for 67.8% of CVEs, at USD 1.50 per CVE.
  • Exploitation rates remained near 68% regardless of EPSS or CVSS band, suggesting that pattern-level exploit reachability is largely independent of existing metadata-based prioritization schemes.
  • Detection rules generated from L2+ (deeper) exploitation achieved significantly higher span-normalized grounding than L1-derived rules (p=0.035), validating the graduated-depth approach for detection engineering.
  • 93.4% of generated Snort rules produced zero false positives against a synthetic benign corpus.
  • A tiered knowledge architecture transfers build and exploitation experience across assessments, creating cumulative intelligence.
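
To make the fixed agent sequence and the depth taxonomy concrete, here is a minimal Python sketch. It is not FORGE's implementation: the `Assessment` record, the `run_pipeline` and `l1_plus_rate` helpers, and the stub agents are hypothetical names introduced for illustration, and the CVE identifiers are placeholders. Only the five agent names, their order as listed above, and the L0 and L3 labels come from the summary; the meaning of L1 and L2 is not spelled out here, so the sketch leaves them undefined.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Callable, Dict, List


class Depth(IntEnum):
    """Four-level exploitation depth scale (L0 and L3 labels from the summary)."""
    L0 = 0  # no evidence of exploitation
    L1 = 1  # intermediate level (not defined in this summary)
    L2 = 2  # intermediate level (not defined in this summary)
    L3 = 3  # full compromise


@dataclass
class Assessment:
    """Hypothetical per-CVE record passed from agent to agent."""
    cve_id: str
    depth: Depth = Depth.L0
    log: List[str] = field(default_factory=list)


# Agent names in the order the summary lists them.
AGENT_ORDER = ["Intel", "Generator", "Planner", "Exploit", "Detector"]

Agent = Callable[[Assessment], Assessment]


def run_pipeline(cve_id: str, agents: Dict[str, Agent]) -> Assessment:
    """Run every agent once, in the fixed order, on a fresh assessment."""
    state = Assessment(cve_id)
    for name in AGENT_ORDER:
        state = agents[name](state)
        state.log.append(name)  # record which agents have run
    return state


def l1_plus_rate(results: List[Assessment]) -> float:
    """Fraction of assessments that reached depth L1 or deeper."""
    if not results:
        return 0.0
    return sum(r.depth >= Depth.L1 for r in results) / len(results)


def passthrough(state: Assessment) -> Assessment:
    """Stub agent that does nothing (stands in for a real LLM agent)."""
    return state


def stub_exploit(state: Assessment) -> Assessment:
    """Stub Exploit agent that pretends it reached depth L2."""
    state.depth = Depth.L2
    return state


agents = {name: passthrough for name in AGENT_ORDER}
agents["Exploit"] = stub_exploit
result = run_pipeline("CVE-0000-0001", agents)  # placeholder identifier
```

With these stubs, `result.log` lists the five agents in order and `result.depth` is `Depth.L2`. A real system would replace each stub with an LLM-backed agent and feed the traces from deeper exploits into rule generation.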

Why it matters

Vulnerability disclosure volumes now outpace organizational assessment capacity. FORGE demonstrates that multi-agent systems can meaningfully compress the cycle from CVE disclosure through exploit validation to detection-rule creation, and the reported cost of USD 1.50 per CVE suggests the approach can scale. The finding that EPSS and CVSS scores do not track automated exploit reachability challenges the prioritization frameworks that security teams rely on daily.
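
A quick back-of-the-envelope calculation shows what the reported figures imply at the scale of the evaluation. This is a sketch under one labeled assumption: that USD 1.50 is the average cost per attempted CVE, failures included. The brief does not say how the cost was computed, and the function names are illustrative.

```python
TOTAL_CVES = 603         # evaluation set size reported above
L1_PLUS_RATE = 0.678     # reported end-to-end L1+ exploitation rate
COST_PER_CVE_USD = 1.50  # assumed to be the average per attempted CVE


def batch_cost(n_cves: int, cost_per_cve: float = COST_PER_CVE_USD) -> float:
    """Total spend for attempting n_cves assessments."""
    return n_cves * cost_per_cve


def expected_successes(n_cves: int, rate: float = L1_PLUS_RATE) -> int:
    """Approximate number of CVEs reaching L1 or deeper."""
    return round(n_cves * rate)


def cost_per_success(cost_per_cve: float = COST_PER_CVE_USD,
                     rate: float = L1_PLUS_RATE) -> float:
    """Average spend per CVE that reaches L1 or deeper."""
    return cost_per_cve / rate


total = batch_cost(TOTAL_CVES)              # 904.5 USD for the whole set
successes = expected_successes(TOTAL_CVES)  # about 409 CVEs
per_success = round(cost_per_success(), 2)  # about 2.21 USD each
```

Under that assumption, the full 603-CVE evaluation would cost roughly USD 905, with about 409 CVEs exploited at L1 or deeper, which works out to about USD 2.21 per successful exploit.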

What to do

  • Review your vulnerability prioritization process: if you rely solely on CVSS/EPSS, the FORGE results suggest you may be missing exploitable vulnerabilities that receive low scores.
  • Watch for similar multi-agent automation in commercial vulnerability management tooling — this research will likely be productized within 12 months.
  • Invest in OpenTelemetry instrumentation for your critical services; the paper shows that richer exploitation traces directly enable higher-quality detection rules.
  • Consider what defensive posture you will need once automated exploit generation becomes a commodity capability.

Sources