Google AI Introduces Natively Adaptive Interfaces (NAI): An Agentic Multimodal Accessibility Framework Built on Gemini for Adaptive UI Design

Google Research proposes a new way to build accessible software with Natively Adaptive Interfaces (NAI), an agentic framework in which a multimodal AI agent becomes the primary interaction layer and adapts the application in real time to the abilities and needs of each user.
Instead of shipping a fixed UI and bolting accessibility on as a separate layer, NAI pushes accessibility into the core structure. The agent observes, reasons, and adjusts the interface itself, moving from one-size-fits-all design to context-aware decisions.
What Does Natively Adaptive Interfaces (NAI) Change in the Stack?
NAI starts with a simple premise: if the interface is mediated by a multimodal agent, accessibility can be handled by that agent instead of static menus and settings.
Key features include:
- A multimodal AI agent is the starting point of the UI. It can see text, images, and UI structure, listen to speech, and respond in text, speech, or other modalities.
- Accessibility is built into this agent from the beginning rather than bolted on later. The agent is responsible for adjusting navigation, content density, and presentation style for each user.
- The design process is explicitly user-centric: people with disabilities are treated as primary users whose requirements shape the experience for everyone else, not as an afterthought.
The framework addresses what the Google team calls the ‘accessibility gap’: the lag between shipping new product features and making them accessible to people with disabilities. Embedding the agent in the interface is meant to narrow this gap by letting the system adapt without waiting for custom retrofits.
Agent Architecture: Orchestrator and Specialized Sub-Agents
Under NAI, the UI is supported by a multi-agent system. The main pattern is:
- An Orchestrator agent maintains shared context about the user, the task, and the state of the application.
- Specialized sub-agents handle focused skills, such as summarizing content or adapting settings.
- A set of configuration patterns describes how to identify user intent, add relevant context, adjust settings, and repair malformed queries.
For example, in NAI’s work on accessible video, the Google team describes key agent capabilities such as:
- Understanding user intent.
- Refining queries and carrying relevant context across turns.
- Delivering information and tool calls in a consistent format.
From a systems perspective, this replaces static navigation trees with dynamic, agent-driven modules. The ‘navigation model’ is effectively a policy that decides which sub-agent should run, in which context, and how its result should be returned to the UI, as sketched below.
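To make the pattern concrete, here is a minimal Python sketch of an Orchestrator that maintains shared context and routes requests to two sub-agents. All class names, the intent classifier, and the sub-agent behaviors are hypothetical illustrations; Google has not published an NAI API.

```python
from dataclasses import dataclass, field

@dataclass
class SharedContext:
    """Context the Orchestrator maintains across turns."""
    user_profile: dict                       # e.g. {"vision": "low", "prefers": "speech"}
    app_state: dict                          # current screen, media position, etc.
    history: list = field(default_factory=list)

class SummarizerAgent:
    def run(self, query: str, ctx: SharedContext) -> str:
        # Would call a multimodal model to condense the current view.
        return f"[summary of {ctx.app_state.get('screen')} for: {query}]"

class SettingsAgent:
    def run(self, query: str, ctx: SharedContext) -> str:
        # Would adjust content density, captions, narration speed, and so on.
        ctx.user_profile["captions"] = True
        return "Enabled captions and reduced content density."

class Orchestrator:
    """Identifies intent, enriches the query with shared context, and routes it."""
    def __init__(self):
        self.sub_agents = {"summarize": SummarizerAgent(), "configure": SettingsAgent()}

    def classify_intent(self, query: str) -> str:
        # Stand-in for a model call that labels the user's intent.
        keywords = ("caption", "slower", "bigger")
        return "configure" if any(w in query.lower() for w in keywords) else "summarize"

    def handle(self, query: str, ctx: SharedContext) -> str:
        intent = self.classify_intent(query)
        ctx.history.append((intent, query))
        return self.sub_agents[intent].run(query, ctx)

ctx = SharedContext(user_profile={"vision": "low"}, app_state={"screen": "video player"})
orchestrator = Orchestrator()
print(orchestrator.handle("Turn on captions, please", ctx))
print(orchestrator.handle("What is happening on this page?", ctx))
```

In a real system the intent classification and sub-agent bodies would themselves be model calls, with the Orchestrator owning the conversation history and the user profile.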
Multimodal Gemini and RAG for Video and Navigation
NAI is explicitly built on multimodal models such as Gemini and Gemma that can process voice, text, and images in a single context.
In the case of accessible video, Google describes a two-step pipeline:
- Offline indexing
  - The system generates dense visual descriptions and captions across the video.
  - These descriptions are stored in an index keyed by timestamp and content.
- Online retrieval-augmented generation (RAG)
  - During playback, if the user asks a question like “What is the character wearing right now?”, the system retrieves the relevant descriptions.
  - The multimodal model conditions on these descriptions and the question to generate a short, descriptive answer.
This design supports interactive queries during playback, not just pre-recorded audio description tracks. The same pattern generalizes to real-world navigation scenarios, where an agent needs to reason over a stream of observations and user queries; a sketch of the video pipeline follows below.
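The following Python sketch illustrates the two steps under simplifying assumptions: timestamped descriptions in a small in-memory index, time-window retrieval, and a stubbed function standing in for the Gemini call. None of these names or structures come from the NAI papers; they only show the shape of the pipeline.

```python
from bisect import bisect_right

class VideoDescriptionIndex:
    """Offline stage: store dense visual descriptions keyed by timestamp (seconds)."""
    def __init__(self):
        self.timestamps: list[float] = []
        self.descriptions: list[str] = []

    def add(self, t: float, description: str) -> None:
        # Assumes descriptions are added in playback order, keeping timestamps sorted.
        self.timestamps.append(t)
        self.descriptions.append(description)

    def retrieve(self, playback_time: float, window: float = 30.0) -> list[str]:
        # Return descriptions within `window` seconds before the current position.
        hi = bisect_right(self.timestamps, playback_time)
        return [d for t, d in zip(self.timestamps[:hi], self.descriptions[:hi])
                if playback_time - t <= window]

def answer_with_model(question: str, context: list[str]) -> str:
    # Placeholder for a multimodal model call that conditions on the retrieved text.
    prompt = "Scene descriptions:\n" + "\n".join(context) + f"\nQuestion: {question}"
    return f"[model answer grounded in {len(context)} descriptions; prompt {len(prompt)} chars]"

# Online stage: the user pauses at 95 seconds and asks a question.
index = VideoDescriptionIndex()
index.add(80.0, "A woman in a red coat steps onto a rainy street.")
index.add(92.0, "Close-up: she opens a black umbrella with white dots.")
print(answer_with_model("What is the character wearing right now?",
                        index.retrieve(playback_time=95.0)))
```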
Concrete NAI Prototypes
Google’s NAI research is grounded in several deployed or user-tested prototypes developed with partner organizations such as RIT/NTID, The Arc of the United States, RNID, and Team Gleason.
StreetReaderAI
- Designed for blind and partially sighted users navigating urban environments.
- It combines an AI Describer, which processes camera and geospatial data, with an AI Chat natural-language query interface.
- It maintains a temporal model of the environment, allowing questions such as ‘Where was that bus stop?’ and replies like ‘It’s behind you, about 12 meters’ (a rough sketch of this idea follows below).
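The ‘behind you, about 12 meters’ style of reply implies some form of spatial memory over past observations. The Python sketch below is a purely illustrative guess at that idea, using simple local coordinates and a relative-bearing calculation; it is not based on StreetReaderAI's published implementation.

```python
import math
from dataclasses import dataclass

@dataclass
class Observation:
    label: str   # e.g. "bus stop"
    x: float     # local coordinates in meters
    y: float

@dataclass
class Pose:
    x: float
    y: float
    heading_deg: float  # 0 = facing +y ("north"), clockwise positive

def locate(label: str, memory: list[Observation], pose: Pose) -> str:
    """Answer 'where was that X?' from the most recent matching observation."""
    obs = next((o for o in reversed(memory) if o.label == label), None)
    if obs is None:
        return f"I haven't seen a {label} yet."
    dx, dy = obs.x - pose.x, obs.y - pose.y
    dist = math.hypot(dx, dy)
    bearing = math.degrees(math.atan2(dx, dy))            # compass bearing to the object
    rel = (bearing - pose.heading_deg + 180) % 360 - 180  # relative to facing direction
    if abs(rel) < 45:
        side = "ahead"
    elif abs(rel) > 135:
        side = "behind you"
    elif rel > 0:
        side = "to your right"
    else:
        side = "to your left"
    return f"The {label} is {side}, about {dist:.0f} meters."

# The user walked past a bus stop 12 meters ago and is still facing "north".
memory = [Observation("bus stop", x=0.0, y=-12.0)]
print(locate("bus stop", memory, Pose(x=0.0, y=0.0, heading_deg=0.0)))
# -> "The bus stop is behind you, about 12 meters."
```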
Multimodal Agent Video Player (MAVP)
- It focuses on the accessibility of online video.
- It uses the Gemini-based RAG pipeline above to generate audio descriptions at varying levels of detail.
- It lets users control description density, pause playback to ask questions, and get answers grounded in the indexed visual content.
The Grammar Laboratory
- Bilingual learning platform (American Sign Language and English) created by RIT/NTID with support from Google.org and Google.
- It uses Gemini to generate personalized multiple-choice questions (a hypothetical prompt sketch follows this list).
- It presents content through ASL video, English captions, spoken narration, and transcripts, adapting presentation and difficulty to each student.
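As a rough illustration of how such adaptive question generation could be wired up, the Python sketch below builds a prompt from a learner's recent weak spots and asks a stubbed model for a multiple-choice item plus caption and narration text. The prompt wording, JSON schema, and difficulty scheme are all assumptions, not the Grammar Laboratory's actual design.

```python
import json

def build_mcq_prompt(topic: str, difficulty: str, prior_mistakes: list[str]) -> str:
    # Hypothetical prompt; a real system would tune this carefully with educators.
    return (
        f"Create one multiple-choice English grammar question on '{topic}' "
        f"at {difficulty} difficulty for a bilingual ASL/English learner.\n"
        f"The learner recently struggled with: {', '.join(prior_mistakes) or 'nothing yet'}.\n"
        "Return JSON with keys: question, options (4 strings), answer_index, "
        "caption_text, and narration_text so the UI can render ASL video, "
        "captions, and spoken narration from the same content."
    )

def generate_question(prompt: str) -> dict:
    # Placeholder for a Gemini call; a real system would parse the model's JSON reply.
    return json.loads(
        '{"question": "...", "options": ["a", "b", "c", "d"], '
        '"answer_index": 0, "caption_text": "...", "narration_text": "..."}'
    )

prompt = build_mcq_prompt("subject-verb agreement", "medium",
                          prior_mistakes=["plural nouns"])
print(generate_question(prompt)["options"])
```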
Design Process and the Curb-Cut Effect
The NAI documents describe a structured process: investigate, build and refine, then iterate based on feedback. In the video accessibility study, for example, the team:
- Defined target users across the spectrum from totally blind to sighted.
- Conducted co-design and testing sessions with approximately 20 participants.
- Went through more than 40 iterations informed by 45 rounds of feedback.
The resulting interfaces are expected to show a curb-cut effect: features designed for users with disabilities, such as better navigation, voice interaction, and flexible summarization, often improve usability for a wider population, including non-disabled users facing time pressure, cognitive load, or environmental constraints.
Key Takeaways
- The agent is the UI, not a plugin: Natively Adaptive Interfaces (NAI) treats the multimodal AI agent as the primary interaction layer, so accessibility is handled by the agent directly in the core UI, not as a separate overlay or post-hoc feature.
- Orchestrator + sub-agent architecture: NAI uses a central Orchestrator that maintains shared context and routes requests to specialized sub-agents (for example, summarization or settings configuration), turning static navigation trees into dynamic, agent-driven modules.
- Multimodal Gemini + RAG for rich media: Prototypes such as the Multimodal Agent Video Player generate dense visual descriptions and use retrieval-augmented generation with Gemini to support interactive, grounded Q&A during video playback and other rich media scenarios.
- Real systems: StreetReaderAI, MAVP, Grammar Laboratory: NAI is anchored in working prototypes, with StreetReaderAI for navigation, MAVP for video accessibility, and Grammar Laboratory for ASL/English learning, all powered by multimodal agents.
- Accessibility as a core design constraint: The framework encodes accessibility into its configuration patterns (identify intent, add context, adjust settings) and leans on the curb-cut effect, where designing for disabled users improves robustness and usability for the wider user base.



