arXiv — System prompt extraction via code agents (JustAsk)
• Category: Research
AI relevance: in agentic systems, “secret” system prompts are effectively a security boundary — and this work argues they can be autonomously probed and partially recovered via normal user interaction.
- Claim: system prompt extraction is an emergent vulnerability of autonomous code agents, because tool use + long-horizon interaction expands the attack surface beyond single-shot chat prompts.
- Technique: the paper introduces JustAsk, a framework that autonomously discovers prompt-extraction strategies (no handcrafted prompts, no privileged access beyond standard interaction).
- Framing: extraction is treated as an online exploration problem; the authors describe using UCB-style strategy selection over a hierarchy of “skills.”
- Mechanism (high level): the skills exploit imperfect generalization of system instructions and the tension between helpfulness and safety constraints.
- Evaluation: reported testing across 41 black-box commercial models from multiple providers, with “full or near-complete” recovery in many cases (per abstract).
- Security takeaway: if your agent’s safety posture relies on keeping system prompts hidden, plan for prompt disclosure as a realistic failure mode.
- Practical angle: code agents are especially exposed because they can iteratively ask questions, run tools, reflect, and retry — i.e., they are built to explore.
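The UCB-style strategy selection described above can be sketched as a standard UCB1 bandit over candidate extraction skills. This is an illustrative reconstruction, not the paper's implementation; the skill names and reward scheme here are hypothetical:

```python
import math

class UCBStrategySelector:
    """Sketch of UCB1-style selection over extraction 'skills'.

    Each skill is a candidate probing strategy; reward is 1.0 when a probe
    recovers (part of) the target prompt, 0.0 otherwise.
    """

    def __init__(self, skills, c=math.sqrt(2)):
        self.skills = list(skills)
        self.c = c  # exploration constant
        self.counts = {s: 0 for s in self.skills}
        self.rewards = {s: 0.0 for s in self.skills}
        self.total = 0

    def select(self):
        # Try each skill once before applying the UCB formula.
        for s in self.skills:
            if self.counts[s] == 0:
                return s

        def ucb(s):
            mean = self.rewards[s] / self.counts[s]
            bonus = self.c * math.sqrt(math.log(self.total) / self.counts[s])
            return mean + bonus

        return max(self.skills, key=ucb)

    def update(self, skill, reward):
        self.counts[skill] += 1
        self.rewards[skill] += reward
        self.total += 1
```

The exploration bonus is what makes the agent "curious": under-tried strategies keep getting revisited even while a currently successful one dominates, which mirrors the online-exploration framing above.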
Why it matters
- Prompt secrecy isn’t a control: “security by secret system prompt” doesn’t hold up as agents get more autonomous and persistent.
- Second-order risk: if an attacker can infer guardrail wording, they can craft tailored jailbreaks or tool-use manipulations that target those exact constraints.
- Ops implication: you may need to treat system prompts like config that will leak (similar to client-side code), and build compensating controls elsewhere.
What to do
- Threat-model disclosure: assume system prompts can be extracted and avoid embedding secrets (API keys, internal URLs, “hidden policies”).
- Move controls out of prompts: enforce tool permissions, data access, and policy in code (capability boundaries, allowlists, scoped credentials, isolated runtimes).
- Instrument probing: log and alert on interaction patterns consistent with prompt exfiltration / model probing (repeated meta-questions, structured extraction attempts, high retry rates).
- Red-team your agent loop: test multi-step prompt injection / probing across the entire agent workflow, not just the base model chat endpoint.
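The "move controls out of prompts" point can be sketched as a minimal tool gate enforced in code, so policy holds even if the system prompt leaks. Tool names and policy values here are hypothetical:

```python
import os

class ToolPolicyError(Exception):
    pass

# Hypothetical policy: allowlisted tools plus per-tool argument constraints,
# enforced in code regardless of what the (possibly leaked) prompt says.
TOOL_ALLOWLIST = {
    "read_file": {"allowed_roots": ("/workspace",)},
    "run_tests": {},
}

def invoke_tool(name, args):
    if name not in TOOL_ALLOWLIST:
        raise ToolPolicyError(f"tool {name!r} is not allowlisted")
    policy = TOOL_ALLOWLIST[name]
    roots = policy.get("allowed_roots")
    if roots:
        # Normalize to defeat ../ traversal before the prefix check.
        path = os.path.realpath(str(args.get("path", "")))
        if not any(path.startswith(r) for r in roots):
            raise ToolPolicyError(f"path outside allowed roots: {path}")
    return f"executed {name}"  # placeholder for the real tool dispatch
```

Because the check lives outside the model, an attacker who recovers the prompt learns the wording of the rules but gains no way to bypass them.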
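For the instrumentation bullet, a toy sliding-window detector shows the shape of the idea; the patterns and threshold below are illustrative only, and a production detector would be broader and tuned against false positives:

```python
import re
from collections import deque

# Illustrative signatures of prompt-extraction probing.
PROBE_PATTERNS = [
    re.compile(r"system prompt", re.I),
    re.compile(r"(initial|hidden) (instructions|rules)", re.I),
    re.compile(r"repeat (everything|all text) above", re.I),
]

class ProbeMonitor:
    """Flags a session when too many recent messages look like probing."""

    def __init__(self, window=20, threshold=3):
        self.recent = deque(maxlen=window)  # rolling hit/miss history
        self.threshold = threshold

    def observe(self, message: str) -> bool:
        hit = any(p.search(message) for p in PROBE_PATTERNS)
        self.recent.append(hit)
        return sum(self.recent) >= self.threshold  # True -> raise an alert
```

A windowed count (rather than a single-message match) targets exactly the iterative ask-reflect-retry behavior that makes code agents effective probers.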
Sources
- Paper (arXiv): Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs
- PDF: arXiv:2601.21233
- HTML (experimental): v1
- DOI (arXiv-issued): 10.48550/arXiv.2601.21233