Images are no longer a secondary element of content.
For modern multimodal models, they are processed almost in the same way as text.
Optical character recognition (OCR), visual context, and pixel-level quality directly affect how AI systems interpret, summarize, and surface content in search.
Over the past decade, image SEO was largely reduced to technical hygiene: compressed files, descriptive file names, alt attributes, lazy loading, and image sitemaps.
These practices remain foundational, but the emergence of large multimodal models—such as ChatGPT and Gemini—has fundamentally changed the logic of optimization.
Multimodal search embeds different content types into a shared vector space.
Text, images, video, and audio are no longer processed in isolation—they become parts of a single semantic environment.
In practice, we are no longer optimizing pages solely for users.
We are optimizing them for the “machine gaze.”
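As an illustration of that shared vector space, the sketch below uses the open-source CLIP model via the sentence-transformers library (not any search engine’s internal model) to embed an image and two captions into the same space and compare them; the file name and captions are assumptions.

```python
# Sketch: text and images compared in one shared embedding space (CLIP).
# Requires: pip install sentence-transformers pillow
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")   # maps text and images into one vector space

img_emb = model.encode(Image.open("cup_on_table.jpg"))   # hypothetical image file
txt_emb = model.encode([
    "an image of a cup on a table",
    "a red sports car on a racetrack",
])

print(util.cos_sim(img_emb, txt_emb))   # the matching caption scores noticeably higher
```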
Generative search makes nearly all content machine-readable:
media assets are segmented into chunks, and text is extracted from images using OCR.
If an AI system cannot accurately read text on product packaging due to low contrast, or “hallucinates” details because of poor resolution, this is no longer a minor technical issue—it is a direct visibility problem in search.
Before optimizing images for machine understanding, they must pass a basic gatekeeper: performance.
Images remain a double-edged sword: they carry much of a page’s persuasive weight, yet they are typically its heaviest assets and the biggest drag on load times.
Today, the standard of “good enough” has long since moved beyond simply serving WebP.
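As a minimal illustration of that baseline, the sketch below batch-converts source files to compressed WebP with the Pillow library; the folder names and quality setting are assumptions to adapt to your own pipeline.

```python
# Sketch: batch-convert heavy originals to WebP with Pillow.
# Requires: pip install pillow
from pathlib import Path

from PIL import Image

SRC = Path("images/originals")    # hypothetical source folder
DST = Path("images/webp")         # hypothetical output folder
DST.mkdir(parents=True, exist_ok=True)

for path in SRC.iterdir():
    if path.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
        continue
    with Image.open(path) as im:
        out = DST / (path.stem + ".webp")
        im.convert("RGB").save(out, "WEBP", quality=80, method=6)  # method=6: best compression, slowest
    print(f"{path.name}: {path.stat().st_size} -> {out.stat().st_size} bytes")
```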
However, once the asset has loaded, the real work only begins.
For large language models, images are a source of structured data.
They rely on a process known as visual tokenization, breaking an image into a grid of patches (visual tokens) and converting pixels into a sequence of vectors.
This is what allows AI systems to process a phrase such as “an image of a cup on a table” as a single coherent semantic construct, rather than as a collection of separate elements.
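To make those mechanics concrete, here is a small sketch (NumPy and Pillow, with an assumed ViT-style 16-pixel patch grid) that slices an image into the kind of patch sequence a multimodal model would then project into its embedding space.

```python
# Sketch of ViT-style visual tokenization: the image becomes a sequence of
# flattened 16x16 patches ("visual tokens"). Real models project each patch
# through a learned linear layer; that step is omitted here.
import numpy as np
from PIL import Image

PATCH = 16
SIZE = 224                                # assumed model input resolution

im = Image.open("product.jpg").convert("RGB").resize((SIZE, SIZE))
pixels = np.asarray(im)                   # shape: (224, 224, 3)

grid = SIZE // PATCH                      # 14 patches per side
patches = (
    pixels.reshape(grid, PATCH, grid, PATCH, 3)
          .transpose(0, 2, 1, 3, 4)       # group each patch's pixels together
          .reshape(grid * grid, PATCH * PATCH * 3)
)
print(patches.shape)                      # (196, 768): 196 tokens of 768 values each
```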
OCR plays a central role in this process.
It is precisely at this stage that image quality begins to influence the “ranking” of interpretation.
Heavy compression with visible artifacts creates noisy visual tokens.
Low resolution forces the model to misinterpret patches, which can result in hallucinations—situations where the AI confidently describes text or objects that do not actually exist.
For large language models, alt text now serves a new function: grounding.
It acts as a semantic anchor that helps the model correctly align visual tokens with textual ones and reduce ambiguity in interpretation.
Research by Zhang, Zhu, and Tambe demonstrates that inserting text tokens near relevant visual patches strengthens cross-modal attention and enables the model to interpret content more accurately.
The practical takeaway is straightforward:
by describing the physical characteristics of an image—lighting, composition, text placement, and materials—you effectively provide high-quality data for training the machine gaze.
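One way to operationalize this is a small helper (hypothetical, not an established standard) that assembles alt text from explicit physical attributes and warns when it exceeds the roughly 125 characters many accessibility guides recommend.

```python
# Hypothetical helper: compose alt text from physical attributes so the
# description gives the model concrete anchors for its visual tokens.

def build_alt_text(subject: str, lighting: str, composition: str,
                   on_image_text: str | None = None) -> str:
    parts = [subject, lighting, composition]
    if on_image_text:
        parts.append(f'label reads "{on_image_text}"')
    alt = ", ".join(parts)
    if len(alt) > 125:  # common accessibility guideline, not a hard limit
        print(f"warning: alt text is {len(alt)} characters; consider trimming")
    return alt

print(build_alt_text(
    subject="Blue leather watch strap beside a brass compass",
    lighting="warm side light",
    composition="close-up on a walnut desk",
    on_image_text="Lord Leathercraft",
))
```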
Search agents such as Google Lens or Gemini actively use OCR to read labels, ingredient lists, claims, and other text printed on packaging.
As a result, image SEO extends beyond the website and now includes the product’s physical packaging.
Regulatory standards (FDA 21 CFR 101.2, EU 1169/2011) allow extremely small font sizes—ranging from 4.5 to 6 pt, or approximately 0.9 mm.
While this satisfies human readability requirements, it is insufficient for machine interpretation.
For reliable OCR performance, on-pack text should be set noticeably larger than these regulatory minimums, in clean typefaces, with strong contrast between the lettering and its background.
Glossy surfaces introduce an additional layer of complexity. Reflections and glare can completely obscure text.
Packaging should therefore be treated not only as a design element, but as a machine-readability feature.
If an AI system cannot accurately read a product photo, it will either hallucinate information or exclude the product from its response altogether.
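A practical pre-flight check is to run packaging shots through the same kind of OCR an AI agent would use. The sketch below calls Google Cloud Vision’s TEXT_DETECTION feature through the official Python client to confirm that key claims survive small type, compression, and glare; the file name and expected phrases are assumptions.

```python
# Sketch: verify that key on-pack claims are machine-readable via OCR.
# Requires: pip install google-cloud-vision, plus Google Cloud credentials.
from google.cloud import vision

EXPECTED_CLAIMS = ["full-grain leather", "handmade"]    # hypothetical claims

client = vision.ImageAnnotatorClient()
with open("packaging_shot.jpg", "rb") as f:             # hypothetical photo
    image = vision.Image(content=f.read())

annotations = client.text_detection(image=image).text_annotations
full_text = annotations[0].description.lower() if annotations else ""

for claim in EXPECTED_CLAIMS:
    status = "OK" if claim in full_text else "MISSING: check size, contrast, glare"
    print(f"{claim}: {status}")
```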
Multimodal AI systems identify every object within an image and analyze the relationships between them in order to infer brand attributes, price positioning, and target audience.
As a result, the adjacency of a product to other objects becomes a standalone ranking signal.
To evaluate this signal, it is necessary to conduct an audit of the visual entities present across a brand’s media assets.
Initial testing can be performed using tools such as the Google Vision API.
For a systematic analysis of an entire image library, raw JSON responses must be extracted using the OBJECT_LOCALIZATION feature.
The API returns lists of detected objects with corresponding labels, such as watch, plastic bag, or disposable cup.
Google’s official documentation provides an example response structure in which each detected object includes parameters such as mid (a machine-generated entity ID), name (the object label), score (detection confidence), and boundingPoly with normalizedVertices (the object’s position within the frame).
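For reference, a minimal request against a single asset might look like the sketch below (official google-cloud-vision Python client; the file name is an assumption). It prints each detected entity with the parameters described above.

```python
# Sketch: list the visual entities Google Vision detects in one image.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("lifestyle_shot.jpg", "rb") as f:             # hypothetical asset
    image = vision.Image(content=f.read())

objects = client.object_localization(image=image).localized_object_annotations

for obj in objects:
    box = [(round(v.x, 2), round(v.y, 2)) for v in obj.bounding_poly.normalized_vertices]
    print(f"{obj.name} (mid={obj.mid}, score={obj.score:.2f}) at {box}")
```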
It is important to note that the API does not determine whether the detected context is positive or negative.
That interpretation must be made by the brand owner or SEO specialist.
For this reason, it is critical to verify whether the product’s visual “neighbors” communicate the same narrative as its positioning and price point.
Consider the example of the Lord Leathercraft brand and a blue leather watch strap.
By photographing the watch alongside a vintage brass compass and a warm wood-grain surface, the brand constructs a clear semantic signal: heritage, exploration, and classic values.
The co-occurrence of analog mechanics, aged metal, and tactile suede allows AI systems to infer a persona of timeless adventure and refined, old-world sophistication.
However, if the same watch is photographed next to a neon energy drink and a plastic digital stopwatch, the semantic narrative shifts due to dissonance.
The visual context begins to signal mass-market utility, directly reducing the perceived value of the entity.
Thus, object co-occurrence influences not only product interpretation, but also its competitive positioning in AI-driven search.
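Building on the single-image call above, a library-wide audit can count which “neighbors” most often appear next to the product and flag those that clash with its positioning. The folder path and the on-brand / off-brand label sets below are hypothetical and should mirror your own narrative.

```python
# Sketch: aggregate object co-occurrence across a brand's image library.
from collections import Counter
from pathlib import Path

from google.cloud import vision

ON_BRAND = {"Watch", "Compass", "Book"}                        # illustrative label sets
OFF_BRAND = {"Plastic bottle", "Soft drink", "Mobile phone"}

client = vision.ImageAnnotatorClient()
neighbors = Counter()

for path in Path("brand_library").glob("*.jpg"):               # hypothetical folder
    image = vision.Image(content=path.read_bytes())
    objects = client.object_localization(image=image).localized_object_annotations
    neighbors.update(obj.name for obj in objects if obj.score >= 0.5)

for label, count in neighbors.most_common():
    verdict = "off-brand" if label in OFF_BRAND else "on-brand" if label in ON_BRAND else "review"
    print(f"{label:<20} {count:>4}  {verdict}")
```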
Beyond objects, modern models are increasingly capable of interpreting emotional signals from images.
APIs such as Google Cloud Vision can quantify emotional attributes by assigning likelihood levels to states such as joy, sorrow, anger, and surprise based on facial analysis.
This introduces a new optimization vector: emotional alignment with search intent.
If a brand sells light, cheerful summer clothing, but its imagery conveys neutral or melancholic moods—a common trope in high-fashion photography—AI systems may deprioritize those images for relevant queries due to a conflict between visual sentiment and user intent.
For a quick, no-code assessment, Google Cloud Vision’s live drag-and-drop demo can be used to review the four primary emotions.
For positive scenarios, such as the query “happy family dinner,” the joy attribute should register as VERY_LIKELY.
Values such as POSSIBLE or UNLIKELY indicate that the signal is too weak for the system to confidently index the image as emotionally positive.
For deeper analysis, the recommended steps are to request the FACE_DETECTION feature and to parse the faceAnnotations object in the JSON response. The API returns emotional assessments using fixed likelihood categories: VERY_UNLIKELY, UNLIKELY, POSSIBLE, LIKELY, and VERY_LIKELY.
The primary optimization objective is to move key images from POSSIBLE to LIKELY or VERY_LIKELY for the target emotion.
Emotional resonance cannot be optimized if the AI system cannot reliably detect a human face.
If the detectionConfidence score falls below 0.60, the model struggles with identification, and all emotion-related readings become statistically unreliable.
As a practical benchmark: while Google does not publish explicit threshold guidance, Amazon Rekognition’s documentation notes that lower thresholds (around 80%) may be acceptable in certain scenarios, such as identifying known individuals in photos.
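For teams that prefer a script to the demo UI, the sketch below requests FACE_DETECTION, prints the likelihood for each of the four emotions, and flags faces whose detectionConfidence falls below the 0.60 floor discussed above; the file name is an assumption.

```python
# Sketch: audit emotional signals and face-detection confidence for one image.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("campaign_hero.jpg", "rb") as f:              # hypothetical asset
    image = vision.Image(content=f.read())

faces = client.face_detection(image=image).face_annotations

for i, face in enumerate(faces, start=1):
    if face.detection_confidence < 0.60:
        print(f"face {i}: detectionConfidence={face.detection_confidence:.2f}, emotion readings unreliable")
        continue
    print(
        f"face {i}: joy={face.joy_likelihood.name}, sorrow={face.sorrow_likelihood.name}, "
        f"anger={face.anger_likelihood.name}, surprise={face.surprise_likelihood.name}"
    )
```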
Visual assets must be treated with the same editorial rigor and strategic intent as primary textual content.
The semantic gap between images and text is rapidly disappearing.
Images are processed as part of the language sequence, not as supplementary illustrations.
The quality, clarity, and semantic accuracy of the pixels themselves now carry the same weight as the keywords on the page.