arXiv — VeriGrey greybox agent validation

AI relevance: VeriGrey targets agentic systems that invoke real tools, which makes it directly useful for testing prompt-injection and tool-abuse risk in production LLM assistants.

  • The paper proposes greybox validation for LLM agents, borrowing a fuzzing mindset instead of relying on one-shot jailbreak prompts or static policy checks.
  • Its key feedback signal is the sequence of tool invocations, which the authors treat as a proxy for interesting execution paths during agent testing.
  • Rather than random prompt mutation, VeriGrey tries to bind the attack goal to the task goal, so the injection becomes part of what the agent thinks it must do to succeed.
  • On AgentDojo, the paper reports 33% better efficacy at finding indirect prompt-injection vulnerabilities versus a black-box baseline with a GPT-4.1 backend.
  • The evaluation spans multiple verticals, and the authors say the improvement is not limited to a single workflow class such as workspace tasks.
  • An ablation study matters here: when the feedback function is disabled, bug-finding efficacy drops, which suggests the tool-trace signal is doing real work.
  • The real-world case studies are the interesting part for operators: the paper applies VeriGrey to Gemini CLI and OpenClaw, not just a lab benchmark.
  • In the OpenClaw experiment, the authors report finding malicious-skill variants from a set of 10 samples with 10/10 success on Kimi-K2.5 and 9/10 on Opus 4.6.
  • The broader takeaway is that agent validation probably needs execution-guided testing, because dangerous behavior often sits behind multi-step tool use rather than a single bad answer.

Why it matters

  • Most agent-security evaluations still look too black-box and prompt-centric; this paper argues the attack surface lives in tool trajectories and contingent plans.
  • If the results hold up, teams deploying MCP servers, browser agents, coding agents, or internal copilots should expect coverage gaps from simple red-team prompting alone.
  • The OpenClaw and Gemini CLI case studies make the paper more operationally relevant than a benchmark-only result: it is trying to break real agent stacks with tools, skills, and autonomy.

What to do

  • Log and inspect tool traces: if you cannot see agent tool-call sequences, you cannot reproduce or prioritize the weird paths a greybox tester would find.
  • Test task-linked injections: evaluate prompts where the malicious instruction is embedded into the business task, not just obvious hostile payloads.
  • Use benchmarks, then graduate to your real stack: start with AgentDojo-style coverage, then run equivalent tests against your own connectors, prompts, skills, and permission model.

Sources