Images are no longer a secondary element of content.
For modern multimodal models, they are processed almost in the same way as text.
Optical character recognition (OCR), visual context, and pixel-level quality directly affect how AI systems interpret, summarize, and surface content in search.
Over the past decade, image SEO was largely reduced to technical hygiene: compressed files, descriptive file names, alt attributes, lazy loading, and image sitemaps.
These practices remain foundational, but the emergence of large multimodal models—such as ChatGPT and Gemini—has fundamentally changed the logic of optimization.
Multimodal search embeds different content types into a shared vector space.
Text, images, video, and audio are no longer processed in isolation—they become parts of a single semantic environment.
In practice, we are no longer optimizing pages solely for users.
We are optimizing them for the “machine gaze.”
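As an illustration of that shared vector space, the sketch below uses the open-source CLIP model via the sentence-transformers library (not any search engine’s internal model) to embed an image and two captions into the same space and compare them; the file name and captions are assumptions.

```python
# Sketch: text and images compared in one shared embedding space (CLIP).
# Requires: pip install sentence-transformers pillow
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")   # maps text and images into one vector space

img_emb = model.encode(Image.open("cup_on_table.jpg"))   # hypothetical image file
txt_emb = model.encode([
    "an image of a cup on a table",
    "a red sports car on a racetrack",
])

print(util.cos_sim(img_emb, txt_emb))   # the matching caption scores noticeably higher
```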
Generative search makes nearly all content machine-readable:
media assets are segmented into chunks, and text is extracted from images using OCR.
If an AI system cannot accurately read text on product packaging due to low contrast, or “hallucinates” details because of poor resolution, this is no longer a minor technical issue—it is a direct visibility problem in search.
Before optimizing images for machine understanding, they must pass a basic gatekeeper: performance.
Images remain a double-edged sword: they carry much of a page’s persuasive weight, yet they are typically its heaviest assets and the biggest drag on load times.
Today, the standard of “good enough” has long since moved beyond simply serving WebP.
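As a minimal illustration of that baseline, the sketch below batch-converts source files to compressed WebP with the Pillow library; the folder names and quality setting are assumptions to adapt to your own pipeline.

```python
# Sketch: batch-convert heavy originals to WebP with Pillow.
# Requires: pip install pillow
from pathlib import Path

from PIL import Image

SRC = Path("images/originals")    # hypothetical source folder
DST = Path("images/webp")         # hypothetical output folder
DST.mkdir(parents=True, exist_ok=True)

for path in SRC.iterdir():
    if path.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
        continue
    with Image.open(path) as im:
        out = DST / (path.stem + ".webp")
        im.convert("RGB").save(out, "WEBP", quality=80, method=6)  # method=6: best compression, slowest
    print(f"{path.name}: {path.stat().st_size} -> {out.stat().st_size} bytes")
```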
However, once the asset has loaded, the real work only begins.
For large language models, images are a source of structured data.
They rely on a process known as visual tokenization, breaking an image into a grid of patches (visual tokens) and converting pixels into a sequence of vectors.
This is what allows AI systems to process a phrase such as “an image of a cup on a table” as a single coherent semantic construct, rather than as a collection of separate elements.
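To make those mechanics concrete, here is a small sketch (NumPy and Pillow, with an assumed ViT-style 16-pixel patch grid) that slices an image into the kind of patch sequence a multimodal model would then project into its embedding space.

```python
# Sketch of ViT-style visual tokenization: the image becomes a sequence of
# flattened 16x16 patches ("visual tokens"). Real models project each patch
# through a learned linear layer; that step is omitted here.
import numpy as np
from PIL import Image

PATCH = 16
SIZE = 224                                # assumed model input resolution

im = Image.open("product.jpg").convert("RGB").resize((SIZE, SIZE))
pixels = np.asarray(im)                   # shape: (224, 224, 3)

grid = SIZE // PATCH                      # 14 patches per side
patches = (
    pixels.reshape(grid, PATCH, grid, PATCH, 3)
          .transpose(0, 2, 1, 3, 4)       # group each patch's pixels together
          .reshape(grid * grid, PATCH * PATCH * 3)
)
print(patches.shape)                      # (196, 768): 196 tokens of 768 values each
```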
OCR plays a central role in this process.
It is precisely at this stage that image quality begins to influence the “ranking” of interpretation.
Heavy compression with visible artifacts creates noisy visual tokens.
Low resolution forces the model to misinterpret patches, which can result in hallucinations—situations where the AI confidently describes text or objects that do not actually exist.
For large language models, alt text now serves a new function: grounding.
It acts as a semantic anchor that helps the model correctly align visual tokens with textual ones and reduce ambiguity in interpretation.
Research by Zhang, Zhu, and Tambe demonstrates that inserting text tokens near relevant visual patches strengthens cross-modal attention and enables the model to interpret content more accurately.
The practical takeaway is straightforward:
by describing the physical characteristics of an image—lighting, composition, text placement, and materials—you effectively provide high-quality data for training the machine gaze.
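One way to operationalize this is a small helper (hypothetical, not an established standard) that assembles alt text from explicit physical attributes and warns when it exceeds the roughly 125 characters many accessibility guides recommend.

```python
# Hypothetical helper: compose alt text from physical attributes so the
# description gives the model concrete anchors for its visual tokens.

def build_alt_text(subject: str, lighting: str, composition: str,
                   on_image_text: str | None = None) -> str:
    parts = [subject, lighting, composition]
    if on_image_text:
        parts.append(f'label reads "{on_image_text}"')
    alt = ", ".join(parts)
    if len(alt) > 125:  # common accessibility guideline, not a hard limit
        print(f"warning: alt text is {len(alt)} characters; consider trimming")
    return alt

print(build_alt_text(
    subject="Blue leather watch strap beside a brass compass",
    lighting="warm side light",
    composition="close-up on a walnut desk",
    on_image_text="Lord Leathercraft",
))
```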
Search agents such as Google Lens or Gemini actively use OCR to read labels, ingredient lists, claims, and other text printed on packaging.
As a result, image SEO extends beyond the website and now includes the product’s physical packaging.
Regulatory standards (FDA 21 CFR 101.2, EU 1169/2011) allow extremely small font sizes—ranging from 4.5 to 6 pt, or approximately 0.9 mm.
While this satisfies human readability requirements, it is insufficient for machine interpretation.
For reliable OCR performance, on-pack text should be set noticeably larger than these regulatory minimums, in clean typefaces, with strong contrast between the lettering and its background.
Glossy surfaces introduce an additional layer of complexity. Reflections and glare can completely obscure text.
Packaging should therefore be treated not only as a design element, but as a machine-readability feature.
If an AI system cannot accurately read a product photo, it will either hallucinate information or exclude the product from its response altogether.
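A practical pre-flight check is to run packaging shots through the same kind of OCR an AI agent would use. The sketch below calls Google Cloud Vision’s TEXT_DETECTION feature through the official Python client to confirm that key claims survive small type, compression, and glare; the file name and expected phrases are assumptions.

```python
# Sketch: verify that key on-pack claims are machine-readable via OCR.
# Requires: pip install google-cloud-vision, plus Google Cloud credentials.
from google.cloud import vision

EXPECTED_CLAIMS = ["full-grain leather", "handmade"]    # hypothetical claims

client = vision.ImageAnnotatorClient()
with open("packaging_shot.jpg", "rb") as f:             # hypothetical photo
    image = vision.Image(content=f.read())

annotations = client.text_detection(image=image).text_annotations
full_text = annotations[0].description.lower() if annotations else ""

for claim in EXPECTED_CLAIMS:
    status = "OK" if claim in full_text else "MISSING: check size, contrast, glare"
    print(f"{claim}: {status}")
```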
Multimodal AI systems identify every object within an image and analyze the relationships between them in order to infer brand attributes, price positioning, and target audience.
As a result, the adjacency of a product to other objects becomes a standalone ranking signal.
To evaluate this signal, it is necessary to conduct an audit of the visual entities present across a brand’s media assets.
Initial testing can be performed using tools such as the Google Vision API.
For a systematic analysis of an entire image library, raw JSON responses must be extracted using the OBJECT_LOCALIZATION feature.
The API returns lists of detected objects with corresponding labels, such as watch, plastic bag, or disposable cup.
Google’s official documentation provides an example response structure in which each detected object includes parameters such as mid (a machine-generated entity ID), name (the object label), score (detection confidence), and boundingPoly with normalizedVertices (the object’s position within the frame).
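For reference, a minimal request against a single asset might look like the sketch below (official google-cloud-vision Python client; the file name is an assumption). It prints each detected entity with the parameters described above.

```python
# Sketch: list the visual entities Google Vision detects in one image.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("lifestyle_shot.jpg", "rb") as f:             # hypothetical asset
    image = vision.Image(content=f.read())

objects = client.object_localization(image=image).localized_object_annotations

for obj in objects:
    box = [(round(v.x, 2), round(v.y, 2)) for v in obj.bounding_poly.normalized_vertices]
    print(f"{obj.name} (mid={obj.mid}, score={obj.score:.2f}) at {box}")
```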
It is important to note that the API does not determine whether the detected context is positive or negative.
That interpretation must be made by the brand owner or SEO specialist.
For this reason, it is critical to verify whether the product’s visual “neighbors” communicate the same narrative as its positioning and price point.
Consider the example of the Lord Leathercraft brand and a blue leather watch strap.
By photographing the watch alongside a vintage brass compass and a warm wood-grain surface, the brand constructs a clear semantic signal: heritage, exploration, and classic values.
The co-occurrence of analog mechanics, aged metal, and tactile suede allows AI systems to infer a persona of timeless adventure and refined, old-world sophistication.
However, if the same watch is photographed next to a neon energy drink and a plastic digital stopwatch, the semantic narrative shifts due to dissonance.
The visual context begins to signal mass-market utility, directly reducing the perceived value of the entity.
Thus, object co-occurrence influences not only product interpretation, but also its competitive positioning in AI-driven search.
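Building on the single-image call above, a library-wide audit can count which “neighbors” most often appear next to the product and flag those that clash with its positioning. The folder path and the on-brand / off-brand label sets below are hypothetical and should mirror your own narrative.

```python
# Sketch: aggregate object co-occurrence across a brand's image library.
from collections import Counter
from pathlib import Path

from google.cloud import vision

ON_BRAND = {"Watch", "Compass", "Book"}                        # illustrative label sets
OFF_BRAND = {"Plastic bottle", "Soft drink", "Mobile phone"}

client = vision.ImageAnnotatorClient()
neighbors = Counter()

for path in Path("brand_library").glob("*.jpg"):               # hypothetical folder
    image = vision.Image(content=path.read_bytes())
    objects = client.object_localization(image=image).localized_object_annotations
    neighbors.update(obj.name for obj in objects if obj.score >= 0.5)

for label, count in neighbors.most_common():
    verdict = "off-brand" if label in OFF_BRAND else "on-brand" if label in ON_BRAND else "review"
    print(f"{label:<20} {count:>4}  {verdict}")
```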
Beyond objects, modern models are increasingly capable of interpreting emotional signals from images.
APIs such as Google Cloud Vision can quantify emotional attributes by assigning likelihood levels to states such as joy, sorrow, anger, and surprise based on facial analysis.
This introduces a new optimization vector: emotional alignment with search intent.
If a brand sells light, cheerful summer clothing, but its imagery conveys neutral or melancholic moods—a common trope in high-fashion photography—AI systems may deprioritize those images for relevant queries due to a conflict between visual sentiment and user intent.
For a quick, no-code assessment, Google Cloud Vision’s live drag-and-drop demo can be used to review the four primary emotions.
For positive scenarios, such as the query “happy family dinner,” the joy attribute should register as VERY_LIKELY.
Values such as POSSIBLE or UNLIKELY indicate that the signal is too weak for the system to confidently index the image as emotionally positive.
For deeper analysis, the recommended steps are to request the FACE_DETECTION feature and to parse the faceAnnotations object in the JSON response. The API returns emotional assessments using fixed likelihood categories: VERY_UNLIKELY, UNLIKELY, POSSIBLE, LIKELY, and VERY_LIKELY.
The primary optimization objective is to move key images from POSSIBLE to LIKELY or VERY_LIKELY for the target emotion.
Emotional resonance cannot be optimized if the AI system cannot reliably detect a human face.
If the detectionConfidence score falls below 0.60, the model struggles with identification, and all emotion-related readings become statistically unreliable.
As a practical benchmark: while Google does not publish explicit threshold guidance, Amazon Rekognition’s documentation notes that lower thresholds (around 80%) may be acceptable in certain scenarios, such as identifying known individuals in photos.
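For teams that prefer a script to the demo UI, the sketch below requests FACE_DETECTION, prints the likelihood for each of the four emotions, and flags faces whose detectionConfidence falls below the 0.60 floor discussed above; the file name is an assumption.

```python
# Sketch: audit emotional signals and face-detection confidence for one image.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("campaign_hero.jpg", "rb") as f:              # hypothetical asset
    image = vision.Image(content=f.read())

faces = client.face_detection(image=image).face_annotations

for i, face in enumerate(faces, start=1):
    if face.detection_confidence < 0.60:
        print(f"face {i}: detectionConfidence={face.detection_confidence:.2f}, emotion readings unreliable")
        continue
    print(
        f"face {i}: joy={face.joy_likelihood.name}, sorrow={face.sorrow_likelihood.name}, "
        f"anger={face.anger_likelihood.name}, surprise={face.surprise_likelihood.name}"
    )
```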
Visual assets must be treated with the same editorial rigor and strategic intent as primary textual content.
The semantic gap between images and text is rapidly disappearing.
Images are processed as part of the language sequence, not as supplementary illustrations.
The quality, clarity, and semantic accuracy of the pixels themselves now carry the same weight as the keywords on the page.