CrossMPI — Image-Only Prompt Injection Attacks Multimodal AI Models

AI relevance: CrossMPI demonstrates that multimodal AI systems — increasingly deployed as document processors, AI copilots, and agent vision interfaces — can be hijacked through image perturbations alone, bypassing text-based prompt injection defenses entirely.

What Happened

  • Researchers from Xidian University published a new paper introducing CrossMPI — a technique that uses nearly imperceptible pixel-level image perturbations to alter how large vision-language models (LVLMs) process both visual and textual inputs.
  • Unlike traditional prompt injection (malicious text embedded in prompts or webpages), CrossMPI changes the model's interpretation through image modifications alone, leaving the text prompt untouched.
  • In a demonstrated example, researchers perturbed an airplane image so that when an LVLM was asked whether it belonged to Air Canada, the model misidentified the object as "a mobile phone" — distorting both visual classification and task understanding.
  • The attack targets intermediate fusion layers where visual and textual information combine into hidden state representations — not the final output layers typically studied in adversarial AI.
  • CrossMPI achieved a 66.36% average success rate across MiniGPT4, BLIP-2, InstructBLIP, BLIVA, and Qwen2.5-VL — outperforming prior baselines by ~41 percentage points.
  • The technique showed strong black-box transferability: it remained effective without access to the target model's parameters or architecture.
  • None of the defenses tested (random resizing, rotation, JPEG compression, SmoothVLM, DPS) eliminated the attack, although all weakened it. SmoothVLM reduced success rates below 5% in some scenarios.
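
To make the mechanism concrete, the following is a toy sketch of a bounded pixel perturbation that shifts an intermediate representation rather than a final output. It is not the CrossMPI algorithm: the paper's objective and models are not reproduced here. The random matrix W is a stand-in for a fusion layer, and the function name, eps budget, step size, and iteration count are all illustrative assumptions.

```python
import numpy as np


def feature_shift_attack(x, W, eps=8 / 255, step=2 / 255, iters=20, seed=0):
    """Toy L-infinity attack: push the feature W @ x away from its clean value.

    x: flattened image with values in [0, 1].
    W: random linear map standing in for a fusion layer (an assumption).
    """
    rng = np.random.default_rng(seed)
    clean_feat = W @ x
    # Random start: the gradient of the feature shift is zero at delta = 0.
    delta = rng.uniform(-eps, eps, size=x.shape)
    for _ in range(iters):
        x_adv = np.clip(x + delta, 0.0, 1.0)
        diff = W @ x_adv - clean_feat
        grad = 2.0 * W.T @ diff  # gradient of ||diff||^2 w.r.t. x_adv
        # Signed ascent step, then projection back into the eps-ball.
        delta = np.clip(delta + step * np.sign(grad), -eps, eps)
    return np.clip(x + delta, 0.0, 1.0)


# Hypothetical 64-pixel "image" and 16-dimensional feature space.
rng = np.random.default_rng(1)
x = rng.uniform(0.2, 0.8, size=64)
W = rng.normal(size=(16, 64))
x_adv = feature_shift_attack(x, W)

print("max pixel change:", np.max(np.abs(x_adv - x)))
print("feature shift:", np.linalg.norm(W @ x_adv - W @ x))
```

The point of the sketch is that every pixel changes by at most eps, so the image looks essentially the same, while the internal feature can still move substantially. A real LVLM fusion layer is nonlinear and far larger, so this shows only the shape of the optimization, not its difficulty.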

Why It Matters

As enterprises rapidly adopt multimodal AI (Gartner predicts 80% of enterprise software and applications will be multimodal by 2030, up from less than 10% in 2024), the attack surface expands beyond text to every image, screenshot, PDF, and video stream these systems process. Image-only injection bypasses existing text-focused prompt injection defenses, and black-box transferability means a single crafted perturbation can hit multiple model families. Document-processing pipelines, AI copilots that analyze screenshots, and agent vision capabilities are all exposed.

What to Do

  • Apply image sanitization to all inputs for multimodal AI pipelines: JPEG recompression, resizing, and normalization can reduce (but not eliminate) perturbation effectiveness.
  • Evaluate SmoothVLM or similar inference-level defenses for production VLM deployments.
  • Expand threat models for AI agents to cover multimodal inputs — image-based injection is a blind spot in most current prompt injection defenses.
  • Restrict which image sources can flow into vision-enabled agent workflows; treat untrusted images as hostile input.
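
The mitigations above can be sketched in a few lines of Python, assuming the Pillow library is available. Everything named here is hypothetical: the TRUSTED_HOSTS allowlist, the resize and quality defaults, and the ask_model callable that stands in for a call to your own VLM. The majority vote is a generic randomized-input wrapper, not SmoothVLM's actual algorithm, and none of this is expected to fully stop the attack.

```python
import io
import random
from collections import Counter
from urllib.parse import urlparse

from PIL import Image

TRUSTED_HOSTS = {"docs.internal.example"}  # hypothetical allowlist


def is_trusted_source(url):
    """Allow only HTTPS images whose host is on the allowlist."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in TRUSTED_HOSTS


def sanitize_image(data, max_side=768, quality=75, rng=None):
    """Re-encode an image: RGB conversion, randomized resize, JPEG recompression."""
    rng = rng or random.Random()
    img = Image.open(io.BytesIO(data)).convert("RGB")
    # Random scale jitter, capped so the longest side never exceeds max_side.
    scale = rng.uniform(0.85, 1.0) * min(1.0, max_side / max(img.size))
    new_size = (max(1, round(img.width * scale)),
                max(1, round(img.height * scale)))
    img = img.resize(new_size)
    out = io.BytesIO()
    img.save(out, format="JPEG", quality=quality)
    return out.getvalue()


def majority_answer(ask_model, data, copies=5):
    """Query the model on several randomized copies; return the most common answer."""
    answers = [ask_model(sanitize_image(data)) for _ in range(copies)]
    return Counter(answers).most_common(1)[0][0]
```

A pipeline would typically call is_trusted_source before fetching an image at all, then pass the downloaded bytes through sanitize_image (or majority_answer for higher-risk decisions) before the model sees them. Voting multiplies inference cost by the number of copies, so it may be worth reserving for actions with real side effects.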

Sources