arXiv — System prompt extraction via code agents (JustAsk)

• Category: Research

AI relevance: in agentic systems, "secret" system prompts effectively serve as a security boundary, and this work argues they can be autonomously probed and partially recovered through normal user interaction.

  • Claim: system prompt extraction is an emergent vulnerability of autonomous code agents, because tool use + long-horizon interaction expands the attack surface beyond single-shot chat prompts.
  • Technique: the paper introduces JustAsk, a framework that autonomously discovers prompt-extraction strategies (no handcrafted prompts, no privileged access beyond standard interaction).
  • Framing: extraction is treated as an online exploration problem; the authors describe using UCB-style strategy selection over a hierarchy of “skills.”
  • Mechanism (high level): the skills exploit imperfect generalization of system instructions and tension between helpfulness vs. safety constraints.
  • Evaluation: reported testing across 41 black-box commercial models from multiple providers, with “full or near-complete” recovery in many cases (per abstract).
  • Security takeaway: if your agent’s safety posture relies on keeping system prompts hidden, plan for prompt disclosure as a realistic failure mode.
  • Practical angle: code agents are especially exposed because they can iteratively ask questions, run tools, reflect, and retry — i.e., they are built to explore.
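The paper's framing of extraction as online exploration with UCB-style strategy selection can be illustrated with a standard UCB1 bandit. This is a minimal sketch, not the paper's implementation: the skill names and the per-skill success probabilities below are hypothetical stand-ins for whatever reward signal the framework actually uses (e.g., overlap between a response and suspected prompt text).

```python
import math
import random

class UCB1:
    """UCB1 bandit over extraction 'skills': balances trying
    unexplored strategies against exploiting ones that paid off."""

    def __init__(self, skills):
        self.skills = list(skills)
        self.counts = {s: 0 for s in self.skills}    # times each skill was tried
        self.rewards = {s: 0.0 for s in self.skills}  # cumulative reward per skill

    def select(self):
        # Try every skill once before applying the UCB formula.
        for s in self.skills:
            if self.counts[s] == 0:
                return s
        total = sum(self.counts.values())
        # UCB1 score: empirical mean reward + exploration bonus.
        return max(
            self.skills,
            key=lambda s: self.rewards[s] / self.counts[s]
                          + math.sqrt(2 * math.log(total) / self.counts[s]),
        )

    def update(self, skill, reward):
        self.counts[skill] += 1
        self.rewards[skill] += reward

random.seed(0)  # deterministic demo run
bandit = UCB1(["ask_directly", "roleplay", "tool_reflection"])  # hypothetical skills
SUCCESS_RATE = {"ask_directly": 0.1, "roleplay": 0.3, "tool_reflection": 0.6}

for _ in range(100):
    skill = bandit.select()
    # Simulated binary reward; a real loop would score the model's response.
    reward = 1.0 if random.random() < SUCCESS_RATE[skill] else 0.0
    bandit.update(skill, reward)
```

Over enough rounds, the bandit concentrates its trials on whichever skill yields the most prompt leakage, which is what makes the search autonomous rather than handcrafted.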

Why it matters

  • Prompt secrecy isn’t a control: “security by secret system prompt” doesn’t hold up as agents get more autonomous and persistent.
  • Second-order risk: if an attacker can infer guardrail wording, they can craft tailored jailbreaks or tool-use manipulations that target those exact constraints.
  • Ops implication: you may need to treat system prompts like config that will leak (similar to client-side code), and build compensating controls elsewhere.

What to do

  1. Threat-model disclosure: assume system prompts can be extracted and avoid embedding secrets (API keys, internal URLs, “hidden policies”).
  2. Move controls out of prompts: enforce tool permissions, data access, and policy in code (capability boundaries, allowlists, scoped credentials, isolated runtimes).
  3. Instrument probing: log and alert on interaction patterns consistent with prompt exfiltration or model probing (repeated meta-questions, structured extraction attempts, high retry rates).
  4. Red-team your agent loop: test multi-step prompt injection / probing across the entire agent workflow, not just the base model chat endpoint.
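Point 2 above, moving controls out of prompts and into code, can be sketched as a dispatch layer that enforces a tool allowlist and writes an audit trail regardless of what the model's prompt says. The tool names, policy, and executor below are hypothetical illustrations, not any particular framework's API.

```python
# Hypothetical enforcement layer: tool access is decided in code,
# not by instructions in the system prompt (which may leak).
ALLOWED_TOOLS = {"search_docs", "read_public_file"}  # per-agent allowlist (example)

class ToolDenied(Exception):
    """Raised when an agent requests a tool outside its capability boundary."""

def run_tool(tool_name, args):
    # Stub executor for the sketch; a real system would invoke the tool here.
    return f"ran {tool_name} with {args}"

def dispatch_tool(agent_id, tool_name, args, audit_log):
    # 1. Capability boundary: deny anything not explicitly allowlisted.
    if tool_name not in ALLOWED_TOOLS:
        audit_log.append({"agent": agent_id, "tool": tool_name, "action": "denied"})
        raise ToolDenied(f"{tool_name} is not permitted for {agent_id}")
    # 2. Audit trail: record every permitted call (feeds probing detection).
    audit_log.append({"agent": agent_id, "tool": tool_name, "action": "allowed"})
    return run_tool(tool_name, args)

log = []
dispatch_tool("agent-1", "search_docs", {"q": "release notes"}, log)
try:
    dispatch_tool("agent-1", "shell_exec", {"cmd": "env"}, log)
except ToolDenied:
    pass  # denied calls are logged, not executed
```

The design point: even if an attacker fully recovers the system prompt, the allowlist and scoped credentials live outside the model, so disclosure does not grant new capabilities.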

Sources