Why 2026 belongs to multimodal AI
For the past three years, AI’s breakout has played out almost entirely through text. We type a prompt, get a response, and move to the next task. While this intuitive interaction style turned chatbots into a household tool overnight, it barely scratches the surface of what the most advanced technology of our time can actually do.
This has created a significant gap between what AI can do and how consumers actually use it. While the underlying models are rapidly becoming multimodal—capable of processing voice, visuals, and video in real time—most consumers still treat them like a search engine. Looking toward 2026, I believe the next wave of adoption won’t be about utility alone, but about evolving beyond static text into dynamic, immersive interactions. This is AI 2.0: not just retrieving information faster, but experiencing intelligence through sound, visuals, motion, and real-time context.
AI adoption has reached a tipping point. In 2025, ChatGPT’s weekly user base doubled from roughly 400 million in February to 800 million by year’s end. Competitors like Google’s Gemini and Anthropic’s Claude saw similar growth, yet most users still engage with LLMs primarily via text chatbots. In fact, Deloitte’s Connected Consumer Survey shows that despite over half (53%) of consumers experimenting with generative AI, most people still relegate it to administrative tasks like writing, summarizing, and researching.
Yet when you look at consumers’ digital behavior outside of AI, it’s clear they crave immersive experiences. According to Activate Consulting’s Tech & Media Outlook 2026, 43% of Gen Z prefer user-generated platforms like TikTok and YouTube over traditional TV or paid streaming, and they spend 54% more time on social video platforms than the average consumer, trading traditional media for interactive social experiences.
This creates a fundamental mismatch: Consumers live in a multi-sensory world, but their AI tools are stuck delivering plain text. While the industry recognizes this gap and is investing to close it, I predict we’ll see a fundamental shift in how people use and create with AI. In AI 2.0, users will no longer simply consume AI-generated content but will instead leverage multimodal AI to bring voice, visuals, and text together, allowing them to shape and direct their experiences in real time.
MULTIMODAL AI UNLOCKS IMMERSIVE STORYTELLING
If AI 1.0 was about efficiency, AI 2.0 is about engagement. While text-based AI is limited in how deeply it can engage audiences, multimodal AI allows the user to become an active participant. Instead of reading a story, you can interact with a main character and take the plot in a new direction or build your own world where narratives and characters evolve with you.
We can look to the $250 billion gaming industry as the blueprint for multimodal AI’s potential. Video games combine visuals, audio, narrative, and real-time agency, creating an immersive experience that traditional entertainment can’t replicate. Platforms like Roblox and Minecraft let players inhabit content. Roblox alone reaches over 100 million daily users, who collectively spend tens of billions of hours a year immersed in these worlds—a level of engagement that text alone could never generate.
With the rise of multimodal AI, users everywhere will be able to create the kinds of experiences they’ve loved participating in through gaming. By removing technical barriers, multimodal AI lets everyone build worlds that feel authentic and step inside them as active participants. Legacy media is also responding to this trend. Disney recently announced a $1 billion investment in OpenAI and a licensing deal that will let users create short clips with characters from Marvel, Pixar, and Star Wars through the Sora platform.
WHY MULTIMODAL AI CAN BE SAFER FOR YOUNGER USERS
As AI becomes part of everyday life, safety—particularly for younger users—has become one of the most critical issues facing the industry.
Moving from open-ended chat to structured, multimodal worlds allows us to design guardrails within the gameplay. Instead of relying on continuous unstructured prompts, these environments are built around characters, visuals, voices, and defined story worlds. Interaction is guided by the experience itself. That structure changes how and where safety is designed into the system.
Educational AI demonstrates this approach. Platforms like Khan Academy Kids and Duolingo combine visuals, audio, and structured prompts to guide learning. The AI isn’t trying to be everything; it focuses on doing one thing well. As multimodal AI evolves, one of its most meaningful opportunities may be this ability to balance creative freedom with thoughtful constraint. AI 2.0 presents a design shift that could give builders, educators, and families new ways to shape safer, more intentional digital spaces for the next generation.
WHY MULTIMODAL AI IS THE NEXT FRONTIER
In 2026, I predict that consumers won’t just be prompting AI; they’ll be engaging with it through more immersive, interactive experiences. This excites me because users won’t passively receive outputs; they’ll actively shape experiences and influence how AI evolves in real time. We could see users remixing the series finale of their favorite TV show, or students learning history not by reading a textbook, but by actively debating a historically accurate AI simulation.
For founders and creators, the next step is to stop building tools only for efficiency and start building environments for immersion and exploration. The winners of the next cycle won’t be the ones with the smartest models, but the ones who make AI feel less like a utility and more like a destination for rich, interactive experiences.
Karandeep Anand is CEO of Character.AI