Multimodal & Search Everywhere
Search is more than text: image, video, audio and voice count — and AI aggregates visibility across YouTube, Reddit and more. What that means in practice.
by Jean Pierre Kolb
Search moved beyond text long ago, and visibility moved beyond Google with it. According to the GEO guide that also underpins the SEO & GEO Analyzer, more than one in six AI Mode queries in the US arrive without any text: via voice, image, video or real-time conversation (as of May 2026). A site that exists only as a wall of text becomes invisible to this growing slice. I pass that figure on as an order of magnitude from the guide, not as my own measurement. The GEO pillar "What is GEO?" provides the framing.
Why multimodal search matters now
Multimodal search matters because people increasingly show instead of type — they photograph a product, speak a question or feed in a clip. The guide names visual queries as a particularly fast-growing segment: image-generation prompts more than tripled in early 2026. That, too, is a figure from the guide, not a value I collected. The consequence is the same for every format: AI can only cite what it can grasp as text. An image, a video or an audio track without accompanying, marked-up text gets shown — but the source stays unnamed.
Every format needs extractable text
Every non-text medium needs a textual bridge so the AI can attribute it to your source. The table below sums up which markup makes which format citable:
| Format | Bridge to citation | Schema |
|---|---|---|
| Image | descriptive filename, meaningful alt, caption | ImageObject (caption, contentUrl, creditText, license) |
| Video | full transcript on the same page | VideoObject (transcript, thumbnailUrl, uploadDate, hasPart for chapters) |
| Audio/podcast | show notes plus a searchable transcript | PodcastEpisode or AudioObject |
- Image and visual entry points — Provide a high-resolution, well-cropped image for every key topic. Without ImageObject schema, a visual hit lacks the back-reference that would attribute you as the source.
- Video with transcript, not video alone — An AI system cannot watch a video. What gets quoted is the transcript text on the same page, so embed the transcript, not just the player.
- Audio and podcasts — AI Modes increasingly generate spoken answers. The source becomes whoever provides extractable text around the audio file.
- Voice-friendly lead sentences — Voice queries are longer and more conversational. Lead each section with the answer in one declarative sentence; text-to-speech often reads only the first one or two sentences aloud.
- Real-time and Lens — New entry points (Search Live, Lens overlays) send mixed text-plus-image queries. Name central concepts in text near the image so the model can ground what it sees against your wording.
- Visual identity — Same logo, same product imagery, same brand colors across every platform. Visual entity recognition treats recurring imagery as a brand signal.
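The markup named in the table can be sketched in a few lines of Python. This is a minimal illustration, not the Analyzer's own implementation: the helper names (`image_object`, `video_object`, `as_script_tag`) and all example.com URLs are hypothetical, and modelling chapters as schema.org `Clip` entries under `hasPart` is my assumption about how the "hasPart for chapters" hint would be realized.

```python
import json


def image_object(content_url, caption, credit_text, license_url):
    """Build a schema.org ImageObject with the properties named in the table."""
    return {
        "@context": "https://schema.org",
        "@type": "ImageObject",
        "contentUrl": content_url,
        "caption": caption,
        "creditText": credit_text,
        "license": license_url,
    }


def video_object(name, description, thumbnail_url, upload_date, transcript, chapters):
    """Build a schema.org VideoObject; chapters is a list of (name, start_s, end_s)."""
    return {
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": name,
        "description": description,
        "thumbnailUrl": thumbnail_url,
        "uploadDate": upload_date,
        "transcript": transcript,  # full transcript text, also shown on the page
        # Assumption: chapters expressed as Clip parts with offsets in seconds.
        "hasPart": [
            {"@type": "Clip", "name": n, "startOffset": s, "endOffset": e}
            for n, s, e in chapters
        ],
    }


def as_script_tag(data):
    """Wrap a JSON-LD dict in the script element that goes into the page."""
    body = json.dumps(data, indent=2, ensure_ascii=False)
    # Escape "</" so text such as a transcript cannot close the script element early.
    body = body.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    # Hypothetical example values.
    print(as_script_tag(image_object(
        "https://example.com/img/product-front.jpg",
        "Front view of the product on a white background",
        "Example Studio",
        "https://example.com/image-license",
    )))
```

The JSON-LD only supplies the structured back-reference. The caption and the transcript still need to appear as visible text on the same page.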
Search Everywhere: visibility beyond the search engine
Search Everywhere means AI assembles information from across the whole web — not just Google. Your visibility therefore depends on whether you are present where people talk:
- Platform presence — Be on YouTube, Reddit, TikTok, LinkedIn and industry forums. AI systems crawl exactly these sources.
- Social proof — Reviews, mentions and discussions on third-party platforms raise your citation likelihood.
- Consistent identity — Same brand name, same descriptions and key messages everywhere — that is the precondition for clean entity recognition.
- Community engagement — Active participation in relevant communities (Reddit, Stack Overflow, industry forums) builds mention-based authority.
The practical yardstick: test every important page with three questions. Is there a quotable image with ImageObject schema? Does a transcript for video or audio sit on the same URL? Does the first sentence read cleanly out loud? Three yeses mean the page is visible to multimodal search.
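The three questions can be roughly automated. The sketch below uses only Python's standard library and rests on several assumptions of mine: that the schema is embedded as JSON-LD, that a non-empty `transcript` property on a VideoObject or AudioObject stands in for "transcript on the same URL", and that "reads cleanly out loud" can be approximated by a first sentence that ends in a period, stays under a word limit (`max_words`, an arbitrary threshold) and contains no brackets or URLs. The names `PageAudit` and `audit` are hypothetical.

```python
import json
import re
from html.parser import HTMLParser


class PageAudit(HTMLParser):
    """Collect JSON-LD blocks and the text of the first paragraph."""

    def __init__(self):
        super().__init__()
        self.jsonld = []             # parsed JSON-LD objects
        self.first_paragraph = None  # whitespace-normalized text of the first <p>
        self._in_jsonld = False
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_jsonld, self._buf = True, []
        elif tag == "p" and self.first_paragraph is None and not self._in_p:
            self._in_p, self._buf = True, []

    def handle_data(self, data):
        if self._in_jsonld or self._in_p:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                parsed = json.loads("".join(self._buf))
            except json.JSONDecodeError:
                return  # ignore malformed blocks in this sketch
            self.jsonld.extend(parsed if isinstance(parsed, list) else [parsed])
        elif tag == "p" and self._in_p:
            self._in_p = False
            self.first_paragraph = " ".join("".join(self._buf).split())


def audit(html, max_words=30):
    """Answer the three yardstick questions for one HTML page (heuristic)."""
    parser = PageAudit()
    parser.feed(html)

    # Flatten top-level objects and @graph containers into a list of nodes.
    nodes = []
    for item in parser.jsonld:
        if isinstance(item, dict):
            nodes.extend(item.get("@graph", [item]))
    nodes = [n for n in nodes if isinstance(n, dict)]

    def has_type(node, *names):
        t = node.get("@type")
        return bool(set(t if isinstance(t, list) else [t]) & set(names))

    image_ok = any(
        has_type(n, "ImageObject")
        and all(n.get(k) for k in ("caption", "contentUrl", "creditText"))
        for n in nodes
    )
    transcript_ok = any(
        has_type(n, "VideoObject", "AudioObject") and bool(n.get("transcript"))
        for n in nodes
    )

    # First sentence of the first paragraph, split at sentence-ending punctuation.
    first = re.split(r"(?<=[.!?])\s+", parser.first_paragraph or "", maxsplit=1)[0]
    lead_ok = (
        bool(first)
        and first.endswith(".")
        and len(first.split()) <= max_words
        and not re.search(r"[()\[\]]|https?://", first)
    )
    return {"image": image_ok, "transcript": transcript_ok, "lead_sentence": lead_ok}
```

A result with three `True` values corresponds to the "three yeses" above. Because the lead-sentence check is only a heuristic, reading the sentence aloud remains the more reliable test.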
FAQ
Do I now have to produce videos and podcasts everywhere for GEO?
No. You do not have to fill every format, but every format you do use should be made citable. The textual bridge matters more than sheer volume: a transcript for the video, show notes for the podcast, a marked-up caption for the image. A few cleanly marked-up media files beat many that sit on the page without text.
Is a good alt attribute enough to get images cited?
No. A good alt attribute is the minimum requirement, not the finishing touch. It describes the image but gives the AI no structured back-reference to source, creator and license. Only ImageObject schema with caption, contentUrl and creditText turns a displayed image into an attributable source. Without schema, your image appears in the answer but your name does not.
Is Reddit or TikTok worth it for a small specialist business?
It depends on where your audience asks questions — but do not underestimate it. Because AI actively draws on forums and social platforms as sources, a well-founded answer in a specialist forum can be cited long before anyone visits your website. The key is consistent identity: same name, same key messages, so the mentions feed your entity profile.
Further reading
The GEO pillar "What is GEO?" provides the framing. The schema markup for image, video and audio is covered in more depth in Structured Data and Technical GEO. Voice-friendly lead sentences are covered in Writing for AI. Check the state of your site with the SEO & GEO Analyzer.