arXiv — Learning to Inject: automated prompt injection via reinforcement learning

2026-02-07 • Category: Research

AI relevance: The paper automates prompt injection against LLM agents, showing scalable attacks that can transfer across models used in agentic systems.

AutoInject is a reinforcement-learning framework that generates adversarial suffixes to maximize prompt-injection success while preserving benign utility.
The method targets universal, transferable suffixes rather than single-use prompts, enabling reuse across tasks and targets.
It supports both query-based optimization and transfer attacks to unseen models and tasks.
The authors use a 1.5B-parameter suffix generator to craft attacks without white-box access to target models.
On the AgentDojo benchmark, AutoInject establishes stronger baselines for automated prompt-injection success.
The study emphasizes black-box feasibility, highlighting risk for production agents exposed to untrusted inputs.

Why it matters

Automation removes the need for human-crafted payloads, making prompt-injection attacks more scalable and repeatable.
Transferable suffixes reduce attacker cost, increasing risk for agents deployed across heterogeneous model stacks.
Defenders need to evaluate robustness against optimized, utility-preserving attacks—not just obvious malicious prompts.

What to do

Red-team with automated generators to probe agent pipelines at scale, not just hand-made prompts.
Instrument input provenance and isolate untrusted content before it reaches high-privilege tool calls.
Measure utility-preserving attacks in evals to avoid defenses that only catch overtly malicious prompts.

arXiv — Learning to Inject: automated prompt injection via reinforcement learning

Why it matters

What to do

Links