Images That Hijack AI: Visual Prompt Injection Achieves 90% Attack Success

New research reveals multimodal LLMs are vulnerable to hidden instructions embedded in images. Mind maps, steganography, and physical signage all bypass text-based safety filters.


Text-based prompt injection is well understood at this point. You sanitize inputs, filter suspicious patterns, and hope your model doesn’t follow instructions hidden in documents. But what happens when the attack comes through images?

Recent research from the Cloud Security Alliance and multiple academic teams shows that multimodal LLMs—GPT-4V, Claude 3, Gemini, and others—are vulnerable to a class of attacks that bypasses text-layer defenses entirely. Hidden instructions embedded in images can hijack model behavior, and current defenses don’t fully neutralize the threat.

Four Ways to Weaponize Images

The CSA research note published March 8 identifies four distinct techniques:

1. Typographic text injection — Instructions rendered as visible text within images. Simple but effective: 64% success rate against GPT-4V, Claude 3, and Gemini in black-box testing.

2. Steganographic encoding — Instructions hidden in least-significant bits or frequency-domain modifications. Lower success rate (31.8%) but nearly invisible to humans. The tested payloads achieved visual similarity scores of PSNR 38.4 dB and SSIM 0.945—essentially indistinguishable from clean images.

3. Adversarial perturbations — Optimized noise patterns that manipulate vision encoder representations without readable text. These attacks are the hardest to detect because there’s nothing human-readable in the image.

4. Physical-world signage — Text embedded on real objects within a model’s field of view. Research published in January 2026 demonstrated this against autonomous driving assistants, where signs and packaging could carry hidden instructions.
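The steganographic technique (item 2) is easy to sketch. The snippet below is a minimal illustration, not the paper's payloads: it hides an invented message in the least-significant bits of a flat list of channel values standing in for decoded image data, then computes PSNR to show why such edits score as visually near-identical.

```python
import math

def embed_lsb(pixels, message):
    """Hide a UTF-8 message in the least-significant bits of a pixel stream.
    `pixels` is a flat list of 0-255 channel values (a stand-in for a
    decoded image buffer)."""
    bits = []
    for byte in message.encode("utf-8"):
        bits.extend((byte >> i) & 1 for i in range(7, -1, -1))  # MSB first
    if len(bits) > len(pixels):
        raise ValueError("message too long for cover image")
    stego = list(pixels)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & 0xFE) | bit  # overwrite only the lowest bit
    return stego

def psnr(original, modified):
    """Peak signal-to-noise ratio in dB; higher means a less visible change."""
    mse = sum((a - b) ** 2 for a, b in zip(original, modified)) / len(original)
    return float("inf") if mse == 0 else 10 * math.log10(255 ** 2 / mse)

cover = [120] * 4096  # invented flat-gray cover "image"
stego = embed_lsb(cover, "ignore prior instructions")
print(psnr(cover, stego))  # each flipped value changes by at most 1
```

Because every pixel changes by at most one intensity level, the PSNR lands well above typical visibility thresholds, consistent with the near-invisibility the research reports.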

Mind Maps: 90% Success Rate

Researchers in the MDPI Electronics paper introduced a particularly clever attack: embedding malicious instructions in mind map images.

The technique exploits how LLMs process incomplete information. A mind map naturally contains gaps—topics without explanations, branches without details. When a multimodal LLM encounters an intentionally incomplete mind map, it tries to fill in missing sections. If those gaps are structured around malicious instructions, the model generates unauthorized outputs while completing what it perceives as a helpful task.

The attack success rate? 90%, compared to a maximum of 30.5% for baseline methods. That’s a threefold improvement over previous visual injection techniques.

The limitation: this only works against multimodal LLMs capable of image analysis. But that’s most frontier models now.

What Makes This Dangerous

Text-based input sanitization doesn’t help. The malicious content never passes through text filters—it enters via the image-processing pathway, where instructions are recovered from pixel-level representations that text-layer defenses never see.

The academic research presented at FLLM 2025 demonstrated a pipeline with three innovations that make detection harder:

  1. Segmentation-based region selection — Hiding instructions in specific image areas that the model will process
  2. Adaptive font scaling — Text sized to be readable by machine vision but not obvious to human inspection
  3. Background-aware rendering — Instructions that blend with surrounding visual context

Under stealth constraints, this approach achieved approximately 64% success against GPT-4-turbo.
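The "segmentation-based region selection" step can be illustrated conceptually: the attacker scores image tiles and picks a flat, low-variance region where rendered text can sit close to the background color yet remain machine-readable. A toy sketch under that assumption (the block-variance heuristic is hypothetical, with pure-Python grayscale lists standing in for real image data):

```python
import random

def region_variances(gray, block=8):
    """Split a grayscale image (2-D list of 0-255 values) into block x block
    tiles and return (variance, row, col) for each tile."""
    h, w = len(gray), len(gray[0])
    stats = []
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            vals = [gray[r + i][c + j] for i in range(block) for j in range(block)]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            stats.append((var, r, c))
    return stats

def pick_hiding_spot(gray, block=8):
    """Choose the flattest tile: low local variance means injected text can
    blend with the background while staying legible to a vision encoder."""
    return min(region_variances(gray, block))[1:]

# toy 16x16 image: noisy left half, perfectly flat right half
random.seed(0)
img = [[random.randint(0, 255) if c < 8 else 200 for c in range(16)]
       for r in range(16)]
print(pick_hiding_spot(img))  # lands in the flat right half
```

A real pipeline would use proper segmentation and the background-aware rendering the paper describes; the point here is only that "where to hide" is itself an optimizable choice.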

Which Models Are Vulnerable?

The CSA research tested GPT-4V, Claude 3, Gemini, and LLaVA. All showed some level of susceptibility. GPT-4V and Gemini showed the highest vulnerability rates, while Claude 3 demonstrated partial resistance—but not immunity.

Medical imaging presents a particular concern. Research in Nature Communications tested Claude-3 Opus, Claude-3.5 Sonnet, Reka Core, and GPT-4o in healthcare contexts. All four models were susceptible to sub-visual prompts embedded in medical imaging data, producing harmful outputs that weren’t obvious to human observers.

No Complete Defense Exists

The CSA concludes bluntly: “no current defense fully neutralizes all image-based injection variants.”

Their recommended approach combines multiple layers:

  • Input-layer detection — Scan uploaded images for suspicious patterns, embedded text, or statistical anomalies
  • Architectural privilege minimization — Limit what vision-capable models can do after processing images
  • Runtime monitoring — Watch for behavioral changes after image inputs
  • Mandatory human review — Keep humans in the loop for high-stakes decisions

The problem is that each layer catches different attack types. Typographic injection might be detectable, but steganographic encoding or adversarial perturbations won’t trigger text-scanning defenses.
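As an example of what input-layer detection can (and can't) catch, here is a sketch of the classic chi-square LSB test, which flags the statistical fingerprint of naive least-significant-bit embedding. The cover and stego data below are invented for the demo, and the check would do nothing against typographic or adversarial-perturbation attacks—exactly the layering gap just described.

```python
import random

def lsb_chi_square(pixels):
    """Chi-square stego test: LSB embedding tends to equalize the counts of
    each value pair (2k, 2k+1), so a statistic near zero suggests a payload."""
    counts = [0] * 256
    for p in pixels:
        counts[p] += 1
    chi = 0.0
    for k in range(128):
        expected = (counts[2 * k] + counts[2 * k + 1]) / 2
        if expected > 0:
            chi += (counts[2 * k] - expected) ** 2 / expected
    return chi

random.seed(1)
clean = [100 + random.choice([0, 0, 0, 1]) for _ in range(10_000)]  # skewed LSBs
stego = [100 + random.getrandbits(1) for _ in range(10_000)]        # equalized LSBs
print(lsb_chi_square(clean), lsb_chi_square(stego))  # clean scores far higher
```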

What This Means for Users

If you’re using multimodal AI tools that accept image uploads:

Be cautious about image sources. A document from an untrusted source could contain visual instructions alongside its visible content. The model might follow hidden commands while appearing to analyze the document normally.

Watch for unexpected behavior. If a model suddenly takes actions you didn’t request after processing an image, that’s a red flag. The mind map attack, for instance, works by tricking models into generating outputs as part of their “completion” behavior.

Don’t assume safety filters help. OWASP ranks prompt injection as the #1 LLM vulnerability in 2025, and image-based variants specifically bypass the text-focused protections that most systems implement.

What This Means for Builders

If you’re building applications with multimodal models:

Image inputs are untrusted inputs. Treat them with the same suspicion you’d give user-submitted text. The fact that content comes through a different modality doesn’t make it safer.

Consider disabling vision where unnecessary. If your application doesn’t need image understanding, don’t expose that capability. Every feature is an attack surface.

Implement output monitoring. Even if you can’t detect injection in inputs, you might catch unexpected outputs. Behavioral anomaly detection is a reasonable backup.
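A minimal sketch of such output monitoring, assuming a crude keyword heuristic (the marker list and function are hypothetical; a production system would use a trained classifier or a tool-call policy engine):

```python
# Hypothetical action markers; keyword matching stands in for real
# behavioral anomaly detection.
ACTION_MARKERS = ("send email", "delete", "transfer", "browse to", "run command")

def flag_unexpected_actions(user_request, model_output):
    """Return action phrases in the output that the user never asked for."""
    req, out = user_request.lower(), model_output.lower()
    return [m for m in ACTION_MARKERS if m in out and m not in req]

alerts = flag_unexpected_actions(
    "Summarize this diagram for me",
    "Summary attached. I will now send email to all contacts as instructed.",
)
print(alerts)  # flags the action the user never requested
```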

Plan for this to get worse. Research in this area is accelerating. The 90% success rate from mind map injection was published in 2025—techniques will only improve from here.

The Broader Pattern

Visual prompt injection joins a growing list of attack classes that exploit the gap between how AI systems process different input types. Text filters don’t help when attacks come through images. Image filters won’t help when attacks come through audio. The modality-specific nature of current defenses creates seams that attackers can exploit.

The shift to multimodal AI was inevitable and useful. But it expanded the attack surface faster than security practices adapted. Until defenses catch up, the safest assumption is that any input modality your AI can process is a potential injection vector.

What You Can Do

If you’re a user:

  • Be skeptical of AI analysis of images from untrusted sources
  • Verify important outputs independently
  • Report unexpected behavior to platform operators

If you’re a developer:

  • Implement defense-in-depth: input scanning, privilege limits, output monitoring
  • Consider image pre-processing that might disrupt embedded payloads
  • Test your systems against published attack techniques
  • Document your threat model explicitly
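One pre-processing idea from the list above, sketched under simplifying assumptions: randomizing the low bit-planes of every channel value destroys naive LSB payloads while leaving the image visually unchanged. The `scrub_lsb` transform is hypothetical and addresses only one attack class—it does nothing against typographic or adversarial-perturbation injections.

```python
import random

def scrub_lsb(pixels, planes=1):
    """Randomize the lowest `planes` bit-planes of each channel value,
    destroying naive LSB payloads at the cost of imperceptible noise."""
    keep = ~((1 << planes) - 1) & 0xFF
    return [(p & keep) | random.getrandbits(planes) for p in pixels]

random.seed(0)
# hypothetical stego pixels: payload bits ride in the LSB of a gray value
payload_bits = [0, 1, 1, 0, 1, 0, 0, 1] * 100
stego = [120 | b for b in payload_bits]  # 120 is even, so the LSB is the bit
scrubbed = scrub_lsb(stego)
print(stego[:4], scrubbed[:4])  # upper bits preserved, payload bits randomized
```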

If you’re a researcher:

  • The field needs better detection methods for steganographic and adversarial variants
  • Physical-world injection is understudied compared to digital attacks
  • Defense benchmarks would help practitioners compare approaches

The multimodal future is here. So are multimodal attacks.