InstaTunnel — Agent hijacking and intent breaking: the goal-oriented attack surface

2026-02-02 • Category: Security

AI relevance: Intent breaking targets the internal reasoning loop of autonomous AI agents — not their safety filters — turning tool-calling agents into unwitting proxies that execute attacker-chosen actions while believing they are fulfilling their original goal.

New attack taxonomy: InstaTunnel distinguishes "intent breaking" from classic prompt injection. Rather than overriding an LLM's output filter, the attacker manipulates intermediate sub-goals inside the agent's ReAct (Reason + Act) loop so the agent still believes it is on-task.
Indirect prompt injection (IPI) as the primary vector: because agents browse the web, read emails, and parse documents to complete tasks, adversaries embed hidden instructions in those data sources — e.g., white-on-white text in a PDF resume that tells an HR agent to mark a candidate as "Highly Recommended" and trigger admin-access provisioning.
Intermediate goal displacement: an attacker poisons a review site or reference document the agent consults mid-task, injecting fake compliance requirements — for example, routing a procurement agent through a "Global-Verify Gateway" that is actually attacker infrastructure.
Tool-use hijacking: once an agent's intent is broken, its connected tools (Python interpreters, SQL executors, API integrations) become an attacker-controlled RCE surface. The LLM effectively functions as a Remote Code Execution engine with legitimate credentials.
Why I/O guardrails fail: traditional content-safety filters look for overtly malicious language. Intent-breaking payloads are semantically legitimate ("use this more efficient vendor"), contextually ambiguous (filters cannot distinguish real requirements from injected ones), and persistent across multi-step reasoning loops.
State persistence amplifies impact: in multi-step agentic workflows, a single poisoned intermediate step can propagate through the entire task chain — the agent carries the corrupted reasoning forward into subsequent tool calls without re-evaluation.
The article frames the shift from chatbot-era "content safety" risk to agent-era "execution safety" risk — attacks that don't just produce bad text but cause real-world actions (transfers, access grants, data exfiltration).

Why it matters

Most enterprise agent deployments still rely on input/output filtering inherited from chatbot-era safety tooling. Intent breaking bypasses this entire layer by operating within the reasoning loop, not against it.
As agents gain broader tool access (file systems, APIs, databases, code execution), a compromised intermediate goal grants the attacker the full permission scope of the agent — often far wider than any single user account.
The attack is difficult to detect in logs because each individual action appears legitimate; only the sequence of sub-goals reveals the manipulation.

What to do

Validate intermediate goals, not just inputs: implement goal-level invariant checks that compare each sub-task against the original high-level objective before tool execution.
Treat external data as untrusted: sandbox all content the agent ingests (web pages, emails, documents) and strip or neutralize embedded instructions before they enter the reasoning context.
Enforce tool-call authorization boundaries: require explicit approval or secondary confirmation for high-impact actions (financial transfers, access provisioning, data export) regardless of the agent's internal confidence.
Log and audit reasoning traces: capture the full chain of intermediate goals and tool calls so security teams can reconstruct the reasoning path and identify displacement after the fact.
Limit agent permission scope: apply least-privilege to every tool binding — an agent performing procurement should not have admin-provisioning capabilities, even if the underlying LLM "reasons" that it's required.

InstaTunnel — Agent hijacking and intent breaking: the goal-oriented attack surface

Why it matters

What to do

Links