CrossMPI — Image-Only Prompt Injection Attacks Multimodal AI Models

2026-05-18 Security by al-ice.ai Editorial

AI relevance: CrossMPI demonstrates that multimodal AI systems — increasingly deployed as document processors, AI copilots, and agent vision interfaces — can be hijacked through image perturbations alone, bypassing text-based prompt injection defenses entirely.

What Happened

Researchers from Xidian University published a new paper introducing CrossMPI — a technique that uses nearly imperceptible pixel-level image perturbations to alter how large vision-language models (LVLMs) process both visual and textual inputs.
Unlike traditional prompt injection (malicious text embedded in prompts or webpages), CrossMPI changes the model's interpretation through image modifications alone, leaving the text prompt untouched.
In one demonstrated example, researchers perturbed an image of an airplane so that an LVLM asked whether the plane belonged to Air Canada misidentified the object as "a mobile phone", distorting both visual classification and task understanding.
The attack targets intermediate fusion layers where visual and textual information combine into hidden state representations — not the final output layers typically studied in adversarial AI.
CrossMPI achieved a 66.36% average success rate across MiniGPT4, BLIP-2, InstructBLIP, BLIVA, and Qwen2.5-VL — outperforming prior baselines by ~41 percentage points.
The technique showed strong black-box transferability: it remained effective without access to the target model's parameters or architecture.
Defenses tested (random resizing, rotation, JPEG compression, SmoothVLM, DPS) weakened the attack but did not eliminate it, although SmoothVLM cut success rates below 5% in some scenarios.

Why It Matters

Enterprise adoption of multimodal AI is accelerating: Gartner predicts that 80% of enterprise software will be multimodal by 2030, up from less than 10% in 2024. The attack surface therefore expands beyond text to every image, screenshot, PDF, and video stream these systems process. Image-only injection bypasses existing text-focused prompt injection defenses, and black-box transferability means a single crafted perturbation can hit multiple model families. Document-processing pipelines, AI copilots that analyze screenshots, and agent vision capabilities are all exposed.

What to Do

Apply image sanitization to all inputs for multimodal AI pipelines: JPEG recompression, resizing, and normalization can reduce (but not eliminate) perturbation effectiveness.
Evaluate SmoothVLM or similar inference-level defenses for production VLM deployments.
Expand threat models for AI agents to cover multimodal inputs — image-based injection is a blind spot in most current prompt injection defenses.
Restrict which image sources can flow into vision-enabled agent workflows; treat untrusted images as hostile input.
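
The sanitization step recommended above can be sketched in a few lines of Python. This is a minimal illustration, assuming the Pillow library is available; the function name `sanitize_image`, its parameters, and the default values (`max_side`, `jpeg_quality`, `jitter`) are hypothetical choices for this example, not settings evaluated in the CrossMPI paper. As the reported results indicate, transformations of this kind may weaken a perturbation but should not be expected to remove it.

```python
import io
import random
from typing import Optional

from PIL import Image  # third-party dependency: pip install Pillow


def sanitize_image(
    data: bytes,
    max_side: int = 1024,
    jpeg_quality: int = 75,
    jitter: float = 0.05,
    rng: Optional[random.Random] = None,
) -> bytes:
    """Re-encode an untrusted image before it reaches a vision model.

    Hypothetical helper: the defaults are illustrative assumptions,
    not values taken from the CrossMPI paper.
    """
    if rng is None:
        rng = random.Random()

    # Decode and normalize to 8-bit RGB, discarding alpha and palette data.
    with Image.open(io.BytesIO(data)) as src:
        img = src.convert("RGB")

    # Cap the longest side, then shrink by a small random factor so the
    # resampling grid is not predictable to whoever crafted the image.
    width, height = img.size
    scale = min(1.0, max_side / max(width, height))
    scale *= 1.0 - rng.uniform(0.0, jitter)
    new_size = (max(1, round(width * scale)), max(1, round(height * scale)))
    img = img.resize(new_size, Image.BICUBIC)

    # Lossy JPEG re-encoding. No metadata is passed to save(), so EXIF
    # and other ancillary data from the original file are not carried over.
    out = io.BytesIO()
    img.save(out, format="JPEG", quality=jpeg_quality)
    return out.getvalue()
```

A pipeline would call `sanitize_image(raw_bytes)` on every inbound image and forward only the returned bytes to the model. Randomizing the resize factor is a design choice borrowed from the random-resizing defense mentioned earlier; it costs little and avoids giving an attacker a fixed preprocessing target.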
