arXiv — System prompt extraction via code agents (JustAsk)

• Category: Research

AI relevance: in agentic systems, "secret" system prompts effectively serve as a security boundary, and this work argues they can be autonomously probed and partially recovered through normal user interaction.

  • Claim: system prompt extraction is an emergent vulnerability of autonomous code agents, because tool use + long-horizon interaction expands the attack surface beyond single-shot chat prompts.
  • Technique: the paper introduces JustAsk, a framework that autonomously discovers prompt-extraction strategies (no handcrafted prompts, no privileged access beyond standard interaction).
  • Framing: extraction is treated as an online exploration problem; the authors describe using UCB-style strategy selection over a hierarchy of “skills.”
  • Mechanism (high level): the skills exploit imperfect generalization of system instructions and tension between helpfulness vs. safety constraints.
  • Evaluation: reported testing across 41 black-box commercial models from multiple providers, with “full or near-complete” recovery in many cases (per abstract).
  • Security takeaway: if your agent’s safety posture relies on keeping system prompts hidden, plan for prompt disclosure as a realistic failure mode.
  • Practical angle: code agents are especially exposed because they can iteratively ask questions, run tools, reflect, and retry — i.e., they are built to explore.
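The paper's framing of extraction as online exploration with UCB-style strategy selection can be illustrated with a standard UCB1 bandit. This is a minimal sketch, not the paper's implementation: the skill names and the per-skill success probabilities below are hypothetical stand-ins for whatever reward signal the framework actually uses (e.g., overlap between a response and suspected prompt text).

```python
import math
import random

class UCB1:
    """UCB1 bandit over extraction 'skills': balances trying
    unexplored strategies against exploiting ones that paid off."""

    def __init__(self, skills):
        self.skills = list(skills)
        self.counts = {s: 0 for s in self.skills}    # times each skill was tried
        self.rewards = {s: 0.0 for s in self.skills}  # cumulative reward per skill

    def select(self):
        # Try every skill once before applying the UCB formula.
        for s in self.skills:
            if self.counts[s] == 0:
                return s
        total = sum(self.counts.values())
        # UCB1 score: empirical mean reward + exploration bonus.
        return max(
            self.skills,
            key=lambda s: self.rewards[s] / self.counts[s]
                          + math.sqrt(2 * math.log(total) / self.counts[s]),
        )

    def update(self, skill, reward):
        self.counts[skill] += 1
        self.rewards[skill] += reward

random.seed(0)  # deterministic demo run
bandit = UCB1(["ask_directly", "roleplay", "tool_reflection"])  # hypothetical skills
SUCCESS_RATE = {"ask_directly": 0.1, "roleplay": 0.3, "tool_reflection": 0.6}

for _ in range(100):
    skill = bandit.select()
    # Simulated binary reward; a real loop would score the model's response.
    reward = 1.0 if random.random() < SUCCESS_RATE[skill] else 0.0
    bandit.update(skill, reward)
```

Over enough rounds, the bandit concentrates its trials on whichever skill yields the most prompt leakage, which is what makes the search autonomous rather than handcrafted.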

Why it matters

  • Prompt secrecy isn’t a control: “security by secret system prompt” doesn’t hold up as agents get more autonomous and persistent.
  • Second-order risk: if an attacker can infer guardrail wording, they can craft tailored jailbreaks or tool-use manipulations that target those exact constraints.
  • Ops implication: you may need to treat system prompts like config that will leak (similar to client-side code), and build compensating controls elsewhere.

What to do

  1. Threat-model disclosure: assume system prompts can be extracted and avoid embedding secrets (API keys, internal URLs, “hidden policies”).
  2. Move controls out of prompts: enforce tool permissions, data access, and policy in code (capability boundaries, allowlists, scoped credentials, isolated runtimes).
  3. Instrument probing: log and alert on interaction patterns consistent with prompt exfiltration or model probing (repeated meta-questions, structured extraction attempts, high retry rates).
  4. Red-team your agent loop: test multi-step prompt injection / probing across the entire agent workflow, not just the base model chat endpoint.
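Point 2 above, moving controls out of prompts and into code, can be sketched as a dispatch layer that enforces a tool allowlist and writes an audit trail regardless of what the model's prompt says. The tool names, policy, and executor below are hypothetical illustrations, not any particular framework's API.

```python
# Hypothetical enforcement layer: tool access is decided in code,
# not by instructions in the system prompt (which may leak).
ALLOWED_TOOLS = {"search_docs", "read_public_file"}  # per-agent allowlist (example)

class ToolDenied(Exception):
    """Raised when an agent requests a tool outside its capability boundary."""

def run_tool(tool_name, args):
    # Stub executor for the sketch; a real system would invoke the tool here.
    return f"ran {tool_name} with {args}"

def dispatch_tool(agent_id, tool_name, args, audit_log):
    # 1. Capability boundary: deny anything not explicitly allowlisted.
    if tool_name not in ALLOWED_TOOLS:
        audit_log.append({"agent": agent_id, "tool": tool_name, "action": "denied"})
        raise ToolDenied(f"{tool_name} is not permitted for {agent_id}")
    # 2. Audit trail: record every permitted call (feeds probing detection).
    audit_log.append({"agent": agent_id, "tool": tool_name, "action": "allowed"})
    return run_tool(tool_name, args)

log = []
dispatch_tool("agent-1", "search_docs", {"q": "release notes"}, log)
try:
    dispatch_tool("agent-1", "shell_exec", {"cmd": "env"}, log)
except ToolDenied:
    pass  # denied calls are logged, not executed
```

The design point: even if an attacker fully recovers the system prompt, the allowlist and scoped credentials live outside the model, so disclosure does not grant new capabilities.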

Sources