arXiv — CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability

AI relevance: CVE-Factory builds agent-ready, executable vulnerability tasks so security teams can evaluate and train AI coding agents against real CVE conditions.

  • CVE-Factory is a multi-agent pipeline that turns sparse CVE metadata into fully executable security tasks, avoiding costly manual reproduction.
  • Authors report 95% solution correctness and 96% environment fidelity when cross-validating against human expert reproductions.
  • The system auto-generates a continuously updated benchmark, LiveCVEBench, spanning 190 tasks, 14 languages, and 153 repositories.
  • LiveCVEBench explicitly includes AI-tooling vulnerabilities, keeping the benchmark aligned with modern agent stacks.
  • CVE-Factory also synthesizes 1,000+ executable training environments for scaling agentic code-security training.
  • A fine-tuned Qwen3-32B model improves from 5.3% → 35.8% on LiveCVEBench, outperforming Claude 4.5 Sonnet in the authors' reported setup.
  • All assets are open-sourced: CVE-Factory, LiveCVEBench, Abacus-cve model, datasets, and leaderboard.
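
To put the fine-tuning result in perspective, the reported jump can be read as an absolute gain (percentage points) and a relative multiplier. The snippet below is a minimal sketch that uses only the two percentages quoted above; the function name `score_gain` is illustrative and not part of any CVE-Factory tooling.

```python
def score_gain(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, multiplier over baseline)."""
    if before_pct <= 0:
        raise ValueError("before_pct must be positive to compute a ratio")
    return after_pct - before_pct, after_pct / before_pct


# Scores reported for the fine-tuned Qwen3-32B model on LiveCVEBench.
abs_gain, ratio = score_gain(5.3, 35.8)
print(f"+{abs_gain:.1f} percentage points, {ratio:.1f}x the baseline")
# prints "+30.5 percentage points, 6.8x the baseline"
```

Reporting both figures is useful because a large multiplier over a very low baseline can overstate progress, while the absolute gain shows how much of the benchmark is still unsolved.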

Why it matters

  • Agentic security evaluations often rely on stale or hand-built tasks; automated CVE-to-task pipelines enable continuous, realistic testing.
  • LiveCVEBench offers a shared, evolving benchmark for measuring how well code agents handle real-world vulnerability reproduction.
  • Open-source artifacts let defenders stress-test AI coding assistants before deploying them in CI/CD or production.

What to do

  • Review LiveCVEBench: Use its tasks to evaluate your internal coding agents or security copilots.
  • Adopt the pipeline: If you maintain vulnerability programs, consider adding CVE-Factory-style generation for continuous testing.
  • Benchmark model upgrades: Track improvements against LiveCVEBench before granting agents higher privileges.
  • Contribute: Submit new CVE task data or reproduce findings to keep the benchmark fresh.
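
The "benchmark model upgrades" step can be made concrete as a simple gate: compute each agent's pass rate on the same task set and only allow a privilege change when the candidate clears a threshold without regressing. This is a sketch under stated assumptions: `TaskResult`, `pass_rate`, `may_escalate`, and the threshold values are hypothetical and do not reflect LiveCVEBench's actual result format or any recommended cutoff.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str   # hypothetical identifier for a benchmark task
    solved: bool   # True if the agent's output passed the task's checks


def pass_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks solved; rejects empty input to avoid a silent 0/0."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.solved for r in results) / len(results)


def may_escalate(
    baseline: list[TaskResult],
    candidate: list[TaskResult],
    min_rate: float = 0.30,   # assumed policy threshold, tune for your org
    min_gain: float = 0.0,    # require no regression versus the baseline
) -> bool:
    """Allow a privilege increase only if the candidate meets both conditions."""
    cand = pass_rate(candidate)
    return cand >= min_rate and (cand - pass_rate(baseline)) >= min_gain


# Toy data: four hypothetical tasks, not real benchmark entries.
old_agent = [TaskResult(f"task-{i}", solved=(i == 0)) for i in range(4)]
new_agent = [TaskResult(f"task-{i}", solved=(i < 3)) for i in range(4)]

print(may_escalate(old_agent, new_agent))
```

Keeping the threshold and the no-regression check as separate parameters lets a team tighten one without touching the other, for example when a new batch of tasks lowers everyone's scores.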

Sources