Microsoft — turning threat reports into detection insights with AI

• Category: Security

  • Problem: turning long, messy threat intel / red-team reports into usable detection engineering work is slow (days to weeks), and it’s easy to miss details.
  • Workflow idea: use an LLM to extract candidate TTPs + metadata from reports, then normalize and map them to MITRE ATT&CK.
  • Coverage step: compare extracted TTPs against your existing detection catalog to label each item as “likely covered” vs “likely gap.”
  • How they reduce false positives: combine vector similarity search (to shortlist candidate detections) with LLM-based validation (to check whether the mapping is actually plausible).
  • Context preservation: ingestion keeps document structure (headings/lists/etc.) because where a detail appears can change how it should be interpreted.
  • Output: a prioritized list of detection opportunities + likely gaps — explicitly framed as a starting point, not an auto-ship decision.
  • Explicit guardrails: use structured outputs (schemas), deterministic prompts for critical steps, and reviewer checkpoints for “coverage vs gap” conclusions.
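The "structured outputs" guardrail above can be sketched as a strict schema plus a defensive parser: rather than trusting whatever JSON the model emits, validate each item and drop anything malformed instead of guessing. A minimal stdlib-only sketch; the field names (`technique_id`, `evidence_snippet`, etc.) are illustrative, not the actual schema from the source:

```python
import json
from dataclasses import dataclass, field

@dataclass
class ExtractedTTP:
    """One candidate TTP pulled from a report (field names are illustrative)."""
    technique_id: str            # e.g. "T1059.001" (ATT&CK technique or sub-technique)
    technique_name: str
    evidence_snippet: str        # verbatim text from the report, for reviewer checkpoints
    confidence: float            # model-reported, 0.0-1.0
    telemetry_needed: list = field(default_factory=list)

REQUIRED_KEYS = {"technique_id", "technique_name", "evidence_snippet",
                 "confidence", "telemetry_needed"}

def parse_llm_output(raw: str) -> list[ExtractedTTP]:
    """Parse the LLM's JSON array; reject malformed items rather than repairing them."""
    items = json.loads(raw)
    results = []
    for item in items:
        if not REQUIRED_KEYS.issubset(item):
            continue  # missing required fields: drop, don't guess
        if not (0.0 <= item["confidence"] <= 1.0):
            continue  # out-of-range confidence is a schema violation
        results.append(ExtractedTTP(**{k: item[k] for k in REQUIRED_KEYS}))
    return results
```

Dropped items can be logged and surfaced to a reviewer; the point is that schema violations become visible failures, not silently mangled mappings.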

Why it matters

  • High-signal reports pile up during active campaigns; anything that cuts “time-to-first-detection-draft” without lowering rigor helps.
  • This is a practical pattern for security automation: LLMs for extraction + normalization, classic information-retrieval tools (e.g. vector search) for candidate matching, and humans for final validation.
  • It also highlights a non-obvious failure mode: “coverage” inferred from text similarity can be wrong when telemetry isn’t present, scope differs, or correlation logic is missing — so you need explicit validation loops.
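That failure mode suggests cheap rule-based checks that run after similarity matching and before any "covered" verdict: does the detection actually map to the technique, and is the required telemetry actually collected? A hedged sketch with hypothetical catalog field names (`attack_mappings`, `required_telemetry`):

```python
def plausibly_covers(detection: dict, ttp: dict, available_telemetry: set) -> bool:
    """Sanity checks applied after a similarity match proposes `detection`
    as coverage for `ttp`. Field names are illustrative, not a real schema."""
    # 1. The technique must actually appear in the detection's ATT&CK mappings;
    #    text similarity alone can match descriptions of different behaviors.
    if ttp["technique_id"] not in detection.get("attack_mappings", []):
        return False
    # 2. The telemetry the behavior requires must exist in this environment.
    if not set(ttp.get("telemetry_needed", [])) <= available_telemetry:
        return False
    # 3. The detection's own data sources must also be collected here.
    if not set(detection.get("required_telemetry", [])) <= available_telemetry:
        return False
    return True
```

A `False` here does not prove a gap; it just blocks an automatic "covered" label and routes the item to a human.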

What to do

  1. Build a detection catalog you can query: standardize fields (title, description, ATT&CK mappings, code/query language, required telemetry) and make it searchable.
  2. Automate TTP extraction as a first pass: run LLM extraction into a strict schema (technique, evidence snippet, confidence, telemetry needed).
  3. Do two-stage matching: vector search to shortlist candidate detections, then an LLM (or rule-based checks) to validate whether the match really covers the behavior.
  4. Gate “likely gaps” with real evidence: simulate or replay telemetry (where possible) before investing deeply in new detections.
  5. Measure drift: keep a small gold set of reports + expected TTPs/mappings so prompt/model changes don’t silently degrade quality.
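Step 3's two-stage matching can be sketched end to end: stage 1 ranks catalog detections by embedding similarity, stage 2 passes the shortlist through a validator (an LLM call or the rule-based checks above) before anything is labeled "likely covered." A stdlib-only toy with hand-made embeddings; in practice the vectors would come from an embedding model and the catalog from your detection store:

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def shortlist(ttp_vec: list, catalog: list, k: int = 3) -> list:
    """Stage 1: top-k catalog detections by similarity to the TTP's embedding."""
    ranked = sorted(catalog, key=lambda d: cosine(ttp_vec, d["embedding"]),
                    reverse=True)
    return ranked[:k]

def classify(ttp_vec: list, catalog: list, validate, k: int = 3):
    """Stage 2: a validator confirms each shortlisted match. Survivors mean
    'likely covered'; an empty result means 'likely gap' (for human review)."""
    confirmed = [d for d in shortlist(ttp_vec, catalog, k) if validate(d)]
    return ("likely covered", confirmed) if confirmed else ("likely gap", [])
```

The validator is deliberately a plug-in: swapping rules for an LLM judgment (or chaining both) changes one argument, not the pipeline.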
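For step 5, the gold-set check reduces to a tiny scoring function run on every prompt or model change: compare extracted technique IDs against hand-labeled expectations and fail CI if precision or recall drops. A minimal sketch, assuming technique IDs are the unit of comparison:

```python
def score_against_gold(extracted: set, expected: set) -> dict:
    """Precision/recall of extracted ATT&CK technique IDs vs a gold label set
    for one report. Run across the whole gold corpus to detect drift."""
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return {"precision": precision, "recall": recall}
```

Tracking both numbers matters: a chattier prompt can raise recall while flooding reviewers with false positives, and the gold set makes that trade-off visible instead of silent.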

Sources