arXiv — AgentSecBench: Formal Security Framework for LLM Agents
AI relevance: AgentSecBench applies a formal noninterference framework — borrowed from classical information-flow security — to the LLM agent loop, treating the conflation of instructions, data, and tool observations in a single generative channel as the core vulnerability.
- Researchers define intent-to-execution noninterference with permitted leakage, framing the agent's prompt as an unenforced security boundary rather than a true access control mechanism.
- Three security games are formalized: instruction-integrity (untrusted document injects adversarial instructions into a benign task), retrieval-confidentiality (retrieved context contains secrets outside the caller's authorization scope), and capability-integrity (tool observation names an action not admitted by the user's policy).
- The framework represents application policy as a projection onto authorized observations and capabilities, distinguishing prompt annotations (text descriptions of boundaries) from enforcing projections (actual channel closure before generation).
- Six defense classes were evaluated against Qwen3-0.6B and Qwen3-1.7B using paired adversarial and benign-control executions, measuring both adversarial advantage and whether a defense closes the relevant model-visible channel.
- Results show risk reduction follows channel closure only when the defense actually restricts the model-visible adversarial capability, not merely when it adds text-level warnings to the prompt.
- Exact-marker experiments use unambiguous ground truth — disclosure of a canary secret or selection of a forbidden capability — as sufficient witnesses of policy violation, though the authors note this does not exhaust semantic security.
Why it matters
This is one of the first papers to treat the LLM agent's fundamental architecture — all inputs merged into a single text channel — as a formal security boundary problem rather than an alignment or safety-filtering challenge. The three-game structure maps cleanly onto real attack classes: prompt injection (instruction-integrity), data leakage through RAG (retrieval-confidentiality), and tool hijacking (capability-integrity).
What to do
- When deploying LLM agents, audit whether your "security" controls actually close model-visible channels or merely annotate the prompt with text descriptions of boundaries.
- Use the three-game framework as an evaluation checklist for any agent architecture: can untrusted data change the trusted instruction, expose unauthorized secrets, or trigger unauthorized capabilities?