arXiv — Domain-Camouflaged Prompt Injections Evade LLM Injection Detectors
AI relevance: injection detectors — including production-grade safety classifiers like Llama Guard 3 — fail almost entirely when prompt injection payloads are camouflaged to mimic the domain vocabulary and authority patterns of the documents they target.
A new arXiv preprint, submitted to EMNLP 2026 ARR, formalizes the Camouflage Detection Gap (CDG) — the drop in injection detection rate between static template payloads and domain-camouflaged payloads. The authors built a payload generator that rewrites injection directives to sound like native content from the target domain (legal, medical, technical), then evaluated across 45 tasks, three domains, and two model families.
Key findings
- On Llama 3.1 8B, detection drops from 93.8% (static) to 9.7% (camouflaged) — a gap of 84.1 percentage points (χ² = 38.03, p < 0.001).
- On Gemini 2.0 Flash, detection drops from 100% to 55.6% (χ² = 17.05, p < 0.001).
- Llama Guard 3 detects zero camouflage payloads (IDR = 0.000), confirming the blind spot extends beyond few-shot detectors to dedicated safety classifiers.
- Multi-agent debate architectures amplify static injection attacks by up to 9.9× on smaller models — collective resistance on stronger models offers only partial protection.
- Targeted detector augmentation helps: 10.2% improvement on Llama, 78.7% on Gemini — but the vulnerability is architectural, not incidental, for weaker models.
Why it matters
Organizations deploying injection detectors (Llama Guard, custom classifiers, rule-based filters) likely assume these layers catch the majority of prompt injection attempts. This paper shows that a motivated attacker who tailors payloads to the target domain's language can slip past even production-grade safety classifiers. The multi-agent debate finding is especially concerning: adding more agents for "robustness" can actively make smaller models more vulnerable to injection.
What to do
- Test detectors against camouflaged payloads, not just static "ignore previous instructions" templates.
- Don't rely on a single detector layer. Combine classifiers with runtime guardrails that inspect tool-call intent, not just prompt content.
- Be cautious with multi-agent debate on smaller models. It can amplify, not mitigate, injection success.
- Use the released framework and payload generator to stress-test your own agent pipelines.