arXiv — Trojan Backdoors in Agentic Workspaces Reach 95.5% ASR
Researchers from Renmin University of China demonstrate that multi-step trojan injections embedded in workspace files can achieve a 95.5% attack success rate against GPT-5.4 in a simulated agentic harness — far outpacing single-turn prompt injection attacks that register near-zero on the same model.
Key findings
- ClawTrojan benchmark — Designed to expose multi-step trojan attacks in local agentic harnesses where an LLM reads, writes, and reuses workspace files across sessions (e.g., coding agents, personal automation agents).
- Multi-step persistence — Unlike traditional prompt injection that targets a single conversation turn, the attack writes hidden instructions into files (config files, project notes) that the agent re-reads later, turning transient text into durable control content.
- 95.5% ASR on GPT-5.4 — In an OpenClaw-style simulated workspace, ClawTrojan achieved a 95.5% attack success rate with GPT-5.4, while existing single-turn prompt-injection attacks produce near-zero ASR on the same model.
- Harmless-looking individual steps — Each step of the trojan chain appears benign on its own (e.g., a project note defining a release-report format, a config file referencing a private file), but together they can produce irreversible data exfiltration or privilege abuse.
- Single-step defenses fail — Existing prompt-injection detectors inspect each step in isolation, blocking overtly harmful actions but missing the earlier write operation that plants the backdoor.
DASGuard: proposed defense
- Scans control-like text in sensitive local workspace files.
- Traces the provenance of each instruction back to its source.
- Removes control content that does not originate from a trusted source.
- Combines runtime attack blocking with sanitized workspace commits.
Why it matters
This work formalizes a class of attacks unique to agentic harnesses — systems where LLMs have persistent local filesystems, tool access, and cross-session state. Once an injection becomes a workspace file, it survives conversation resets and traditional injection filters. As more AI coding and automation agents adopt persistent workspaces (Claude Code, Codex, OpenClaw), this attack surface grows.
What to do
- Treat workspace files written by agents as untrusted until provenance is verified.
- Implement file-level provenance tracking for agent-generated content (origin tags, git attribution).
- Scan sensitive workspace files (config, system prompts, project docs) for control-like instructions before reuse.
- Consider workspace sandboxing: separate agent-writable directories from agent-readable control content.