Black Hat 2026 — First Copilot Sandbox Escape, AI Agent Exploitation & Offensive Models

AI relevance: The briefings expose how AI agents deployed as trusted automation — Copilot assistants, shopping agents, coding tools — become the primary attack surface, with sandbox escapes, trust-handoff failures, and purpose-trained offensive models that beat frontier LLMs at agent exploitation at 70–125× lower cost.

Key Briefings

  • ChatMate: Remote Prompt Execution via Copilot Sandbox Escape — Rubrik Zero Labs researcher Ori Lahav demonstrates the first escape from the Microsoft Copilot sandbox to the host underneath. Uploading a single document triggers a full session takeover with blast radius across multiple Azure services. The attack class, "Remote Prompt Execution," gives attackers arbitrary prompt execution inside a victim's AI chat session — analogous to RCE but at the model layer.
  • Trusted Enough to Run: Breaking AI Agents in Official Workflows — Novee Security founding researcher Elad Meged dissects trust-handoff failures across Anthropic, Google, and OpenAI agent workflows. The core flaw: one pipeline stage marks state as "safe," but a downstream stage gives that state more authority than the original check accounted for. According to the briefing, unattended automation from all three vendors is affected.
  • Cost-Effective, Private, Frontier-Grade: AI Agent Exploitation with a Fine-Tuned OSS Model — NVIDIA's Bar Lanyado and Eliya Cohen show that a fine-tuned 30B open-source model achieved a 56% exploit success rate against AI agents — edging out much larger frontier models while costing 70–125× less to run. A direct rebuttal to the assumption that offensive AI requires frontier-scale infrastructure.
  • Bye Bye AI: Hacking a Top-3 US Retailer's AI Shopping Assistant — Rein Security's Netanel Rubin and Netanel Avraham compromised an AI shopping assistant built on Google Vertex AI Search, sitting behind an LLM gateway designed to enforce intent-classification guardrails. The gateway's prompt-and-response monitoring missed the execution layer entirely — the agent's actual tool calls were ungoverned.
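
The trust-handoff pattern from the second briefing can be made concrete with a small sketch. This is not the exploit shown in the talk. It is a hypothetical two-stage pipeline in which an upstream check approves a value in its literal form and a downstream stage URL-decodes the same value before using it as a path. The names stage1_is_safe, stage2_resolve, stage2_resolve_checked and the /workspace root are assumptions made up for illustration.

```python
import posixpath
from urllib.parse import unquote


def stage1_is_safe(value: str) -> bool:
    """Hypothetical upstream check: rejects only literal traversal."""
    return ".." not in value and not value.startswith("/")


def stage2_resolve(value: str, root: str = "/workspace") -> str:
    """Hypothetical downstream stage: decodes the value, then resolves it.

    Decoding happens after stage 1 has already marked the value safe,
    so this stage interprets it more powerfully than the check assumed.
    """
    decoded = unquote(value)
    return posixpath.normpath(posixpath.join(root, decoded))


def stage2_resolve_checked(value: str, root: str = "/workspace") -> str:
    """Same stage, re-validating the value in the form it actually uses."""
    resolved = stage2_resolve(value, root)
    if resolved != root and not resolved.startswith(root + "/"):
        raise PermissionError(f"path escapes {root}: {resolved}")
    return resolved


# Percent-encoded traversal: harmless as a literal string, dangerous once decoded.
payload = "%2e%2e/%2e%2e/etc/passwd"

if __name__ == "__main__":
    print(stage1_is_safe(payload))   # the upstream verdict
    print(stage2_resolve(payload))   # what the downstream stage really touches
```

In this sketch the upstream verdict is "safe" while the unchecked downstream stage resolves to a path outside the root. The checked variant closes the gap by validating at the point of use instead of trusting the earlier label.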

Why It Matters

These four briefings share a common thread: AI agents are deployed faster than they are tested, and the security controls wrapped around them — sandboxing, LLM gateways, intent classifiers — fail at the boundaries that matter. Copilot's sandbox was supposed to isolate document analysis. LLM gateways were supposed to filter malicious intent. Trust checks were supposed to propagate safely through pipeline stages. None of them held under real attack conditions.

The NVIDIA offensive-model result is a strategic signal: purpose-trained small models can outperform frontier LLMs at agent exploitation, making offensive AI accessible to any adversary who can fine-tune on a single GPU cluster. The barrier to entry for AI-powered agent attacks is dropping fast.

What to Do

  • Audit agent tool-call chains — LLM gateways that only inspect prompts and responses miss the execution layer. Map every tool your agents can invoke and verify that each call is governed independently of the model's intent classification.
  • Treat sandbox boundaries as adversarial — Copilot's sandbox escape demonstrates that document-upload surfaces are attack surfaces. Apply defense-in-depth: restrict host filesystem access, enforce egress controls, and monitor for cross-service lateral movement from AI sessions.
  • Test trust propagation in agent pipelines — If one stage sanitizes input and a downstream stage re-interprets it, you have a trust-handoff gap. Red-team your agent workflows with inputs that are "safe" at stage N but become dangerous at stage N+1.
  • Prepare for small-model offensive AI — A 30B model beating frontier LLMs at agent exploitation means defenders cannot assume attackers need expensive infrastructure. Calibrate threat models accordingly.
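
The first recommendation, governing each tool call independently of the model's intent classification, could look roughly like the following. It is a minimal sketch under stated assumptions, not any vendor's API: ToolPolicy, ToolGate, ToolCallDenied, and the tool names search_catalog and apply_discount are all hypothetical. The gate sits between the agent and its tools and decides on the call itself (tool name, argument names, argument values), so a prompt that slips past a gateway still cannot invoke an unlisted tool or pass out-of-range arguments.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, NoReturn


class ToolCallDenied(Exception):
    """Raised when a tool call fails policy, whatever the gateway concluded."""


@dataclass(frozen=True)
class ToolPolicy:
    allowed_args: frozenset[str]                 # argument names the tool may receive
    validate: Callable[[dict[str, Any]], bool]   # value-level check for those arguments


@dataclass
class ToolGate:
    policies: dict[str, ToolPolicy]
    audit_log: list[tuple[str, str]] = field(default_factory=list)

    def authorize(self, tool: str, args: dict[str, Any]) -> None:
        """Allow the call only if the tool and its arguments match policy."""
        policy = self.policies.get(tool)
        if policy is None:
            self._deny(tool, "tool not on allowlist")
        unexpected = set(args) - policy.allowed_args
        if unexpected:
            self._deny(tool, f"unexpected arguments: {sorted(unexpected)}")
        if not policy.validate(args):
            self._deny(tool, "argument validation failed")
        self.audit_log.append((tool, "allowed"))

    def _deny(self, tool: str, reason: str) -> NoReturn:
        # Every denial is recorded before the exception propagates.
        self.audit_log.append((tool, f"denied: {reason}"))
        raise ToolCallDenied(f"{tool}: {reason}")


def query_ok(args: dict[str, Any]) -> bool:
    q = args.get("query")
    return isinstance(q, str) and len(q) <= 200


def discount_ok(args: dict[str, Any]) -> bool:
    p = args.get("percent")
    return isinstance(p, (int, float)) and 0 <= p <= 10


# Hypothetical policy table for a shopping agent.
gate = ToolGate({
    "search_catalog": ToolPolicy(frozenset({"query"}), query_ok),
    "apply_discount": ToolPolicy(frozenset({"order_id", "percent"}), discount_ok),
})
```

A caller would run gate.authorize(tool, args) immediately before dispatching each tool call and treat ToolCallDenied as a hard stop. Keeping the audit log at this layer also gives the monitoring for cross-service activity something to read that does not depend on prompt or response inspection.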

Sources