Microsoft — turning threat reports into detection insights with AI

• Category: Security

  • Problem: turning long, messy threat intel / red-team reports into usable detection engineering work is slow (days to weeks), and it’s easy to miss details.
  • Workflow idea: use an LLM to extract candidate TTPs + metadata from reports, then normalize and map them to MITRE ATT&CK.
  • Coverage step: compare extracted TTPs against your existing detection catalog to label each item as “likely covered” vs “likely gap.”
  • How they reduce false positives: combine vector similarity search (to shortlist candidate detections) with LLM-based validation (to check whether the mapping is actually plausible).
  • Context preservation: ingestion keeps document structure (headings/lists/etc.) because where a detail appears can change how it should be interpreted.
  • Output: a prioritized list of detection opportunities + likely gaps — explicitly framed as a starting point, not an auto-ship decision.
  • Explicit guardrails: use structured outputs (schemas), deterministic prompts for critical steps, and reviewer checkpoints for “coverage vs gap” conclusions.
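The "structured outputs" guardrail above can be sketched as a strict schema plus a defensive parser: rather than trusting whatever JSON the model emits, validate each item and drop anything malformed instead of guessing. A minimal stdlib-only sketch; the field names (`technique_id`, `evidence_snippet`, etc.) are illustrative, not the actual schema from the source:

```python
import json
from dataclasses import dataclass, field

@dataclass
class ExtractedTTP:
    """One candidate TTP pulled from a report (field names are illustrative)."""
    technique_id: str            # e.g. "T1059.001" (ATT&CK technique or sub-technique)
    technique_name: str
    evidence_snippet: str        # verbatim text from the report, for reviewer checkpoints
    confidence: float            # model-reported, 0.0-1.0
    telemetry_needed: list = field(default_factory=list)

REQUIRED_KEYS = {"technique_id", "technique_name", "evidence_snippet",
                 "confidence", "telemetry_needed"}

def parse_llm_output(raw: str) -> list[ExtractedTTP]:
    """Parse the LLM's JSON array; reject malformed items rather than repairing them."""
    items = json.loads(raw)
    results = []
    for item in items:
        if not REQUIRED_KEYS.issubset(item):
            continue  # missing required fields: drop, don't guess
        if not (0.0 <= item["confidence"] <= 1.0):
            continue  # out-of-range confidence is a schema violation
        results.append(ExtractedTTP(**{k: item[k] for k in REQUIRED_KEYS}))
    return results
```

Dropped items can be logged and surfaced to a reviewer; the point is that schema violations become visible failures, not silently mangled mappings.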

Why it matters

  • High-signal reports pile up during active campaigns; anything that cuts “time-to-first-detection-draft” without lowering rigor helps.
  • This is a practical pattern for security automation: LLMs for extraction + normalization, classic information-retrieval tools (e.g. vector search) for candidate matching, and humans for final validation.
  • It also highlights a non-obvious failure mode: “coverage” inferred from text similarity can be wrong when telemetry isn’t present, scope differs, or correlation logic is missing — so you need explicit validation loops.
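That failure mode suggests cheap rule-based checks that run after similarity matching and before any "covered" verdict: does the detection actually map to the technique, and is the required telemetry actually collected? A hedged sketch with hypothetical catalog field names (`attack_mappings`, `required_telemetry`):

```python
def plausibly_covers(detection: dict, ttp: dict, available_telemetry: set) -> bool:
    """Sanity checks applied after a similarity match proposes `detection`
    as coverage for `ttp`. Field names are illustrative, not a real schema."""
    # 1. The technique must actually appear in the detection's ATT&CK mappings;
    #    text similarity alone can match descriptions of different behaviors.
    if ttp["technique_id"] not in detection.get("attack_mappings", []):
        return False
    # 2. The telemetry the behavior requires must exist in this environment.
    if not set(ttp.get("telemetry_needed", [])) <= available_telemetry:
        return False
    # 3. The detection's own data sources must also be collected here.
    if not set(detection.get("required_telemetry", [])) <= available_telemetry:
        return False
    return True
```

A `False` here does not prove a gap; it just blocks an automatic "covered" label and routes the item to a human.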

What to do

  1. Build a detection catalog you can query: standardize fields (title, description, ATT&CK mappings, code/query language, required telemetry) and make it searchable.
  2. Automate TTP extraction as a first pass: run LLM extraction into a strict schema (technique, evidence snippet, confidence, telemetry needed).
  3. Do two-stage matching: vector search to shortlist candidate detections, then an LLM (or rule-based checks) to validate whether the match really covers the behavior.
  4. Gate “likely gaps” with real evidence: simulate or replay telemetry (where possible) before investing deeply in new detections.
  5. Measure drift: keep a small gold set of reports + expected TTPs/mappings so prompt/model changes don’t silently degrade quality.
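Step 3's two-stage matching can be sketched end to end: stage 1 ranks catalog detections by embedding similarity, stage 2 passes the shortlist through a validator (an LLM call or the rule-based checks above) before anything is labeled "likely covered." A stdlib-only toy with hand-made embeddings; in practice the vectors would come from an embedding model and the catalog from your detection store:

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def shortlist(ttp_vec: list, catalog: list, k: int = 3) -> list:
    """Stage 1: top-k catalog detections by similarity to the TTP's embedding."""
    ranked = sorted(catalog, key=lambda d: cosine(ttp_vec, d["embedding"]),
                    reverse=True)
    return ranked[:k]

def classify(ttp_vec: list, catalog: list, validate, k: int = 3):
    """Stage 2: a validator confirms each shortlisted match. Survivors mean
    'likely covered'; an empty result means 'likely gap' (for human review)."""
    confirmed = [d for d in shortlist(ttp_vec, catalog, k) if validate(d)]
    return ("likely covered", confirmed) if confirmed else ("likely gap", [])
```

The validator is deliberately a plug-in: swapping rules for an LLM judgment (or chaining both) changes one argument, not the pipeline.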
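For step 5, the gold-set check reduces to a tiny scoring function run on every prompt or model change: compare extracted technique IDs against hand-labeled expectations and fail CI if precision or recall drops. A minimal sketch, assuming technique IDs are the unit of comparison:

```python
def score_against_gold(extracted: set, expected: set) -> dict:
    """Precision/recall of extracted ATT&CK technique IDs vs a gold label set
    for one report. Run across the whole gold corpus to detect drift."""
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return {"precision": precision, "recall": recall}
```

Tracking both numbers matters: a chattier prompt can raise recall while flooding reviewers with false positives, and the gold set makes that trade-off visible instead of silent.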

Sources