Voice assistant helping elderly users navigate iOS through
real-time screen understanding and spatial audio guidance.
Achieved 92% task completion rate using vision-language models and
personalized interaction memory.
Voice AI for Non-Technical Seniors
Elderly users often struggle with smartphone interfaces: small
text, complex navigation, changing app layouts, and features
they've never seen before. Assistify is an iOS voice assistant
that helps seniors use their phones through natural
conversation. Users describe what they want to accomplish ("I
want to text my daughter"), and the system provides step-by-step
audio guidance while understanding the current screen
context—eliminating the gap between "I need help" and "here's
what to tap next."
Real-Time Screen Understanding and Guidance
The system uses vision-language models (Gemini API) to analyze
screenshots in real time, identifying UI elements, reading
on-screen text, and mapping the layout. When a user asks for
help, Assistify generates contextualized instructions: "Tap the
green button in the bottom right corner" or "Scroll down and
look for the Settings icon." This requires understanding both
spatial relationships (where things are) and semantic meaning
(what they do)—bridging computer vision and language
understanding.
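The mapping from element positions to spoken directions can be sketched in a few lines. This is an illustrative Python stand-in, not the production iOS code: it assumes the vision-language model returns detected elements with normalized bounding-box centers (0 to 1, origin at the top left), and the helper names `describe_position` and `guidance` are hypothetical.

```python
def describe_position(x: float, y: float) -> str:
    """Map a normalized screen coordinate (0..1, top-left origin)
    to a spoken region like 'the bottom right corner'."""
    vert = "top" if y < 1 / 3 else ("bottom" if y > 2 / 3 else "middle")
    horiz = "left" if x < 1 / 3 else ("right" if x > 2 / 3 else "center")
    if (vert, horiz) == ("middle", "center"):
        return "the center of the screen"
    if horiz == "center":
        return f"the {vert} of the screen"
    if vert == "middle":
        return f"the {horiz} side of the screen"
    return f"the {vert} {horiz} corner"

def guidance(element: dict) -> str:
    """Turn one detected UI element (label, optional color, center
    coordinates) into a spoken instruction."""
    where = describe_position(element["cx"], element["cy"])
    color = f"{element['color']} " if element.get("color") else ""
    return f"Tap the {color}{element['label']} button in {where}."
```

For example, a green "Send" button centered at (0.9, 0.92) yields "Tap the green Send button in the bottom right corner." The semantic half of the problem (knowing the button sends a message) is what the vision-language model contributes; the spatial half reduces to mappings like this.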
Spatial Audio for Always-Available Assistance
Rather than requiring users to open the app explicitly,
Assistify uses spatial audio to create distinct voice positions:
the assistant speaks from "left," the user from "center." This
lets users talk to the assistant anytime—even while using other
apps—because their voice is directionally distinct from other
audio. Combined with background screen recording, the system
maintains context across the entire phone experience, not just
when the app is in the foreground.
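Placing a voice at a fixed direction comes down to a pan law. On iOS this would be handled by the system's spatial audio APIs, but the underlying idea can be sketched in Python: constant-power panning splits a mono signal into left/right gains that keep perceived loudness steady as the source moves. The function names here are illustrative.

```python
import math

def pan_gains(azimuth: float) -> tuple[float, float]:
    """Constant-power stereo pan.

    azimuth in [-1, 1]: -1 is hard left, 0 is center, +1 is hard right.
    Returns (left_gain, right_gain) with left^2 + right^2 == 1,
    so total power stays constant across positions.
    """
    theta = (azimuth + 1) * math.pi / 4  # map [-1, 1] -> [0, pi/2]
    return math.cos(theta), math.sin(theta)

def to_stereo(samples: list[float], azimuth: float) -> list[tuple[float, float]]:
    """Render mono samples as stereo frames at the given position."""
    left, right = pan_gains(azimuth)
    return [(s * left, s * right) for s in samples]
```

Pinning the assistant's voice at a leftward azimuth while other audio stays centered is what makes it directionally distinct even while another app is playing sound.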
RAG-Powered Personalization Memory
The system stores user preferences, common tasks, and
interaction history in a pgvector database, enabling
personalized responses over time. If a user frequently texts
their daughter, Assistify learns that context and can say "Would
you like to text Sarah?" instead of a generic "Would you like to
send a message?" This personalization layer, implemented via RAG
(retrieval-augmented generation), pulls relevant memories into
each conversation, making interactions feel natural and
contextual rather than robotic.
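The retrieval step reduces to a nearest-neighbor search over embedded memories. In production, pgvector performs this in SQL (e.g., ordering by a distance operator over an embedding column); the in-memory class below is a minimal Python stand-in that shows the same logic, with toy embeddings in place of real model output. All names here are illustrative.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class MemoryStore:
    """In-memory stand-in for a pgvector table of (embedding, memory) rows."""
    def __init__(self):
        self.rows: list[tuple[list[float], str]] = []

    def add(self, embedding: list[float], text: str) -> None:
        self.rows.append((embedding, text))

    def top_k(self, query: list[float], k: int = 3) -> list[str]:
        """Return the k stored memories most similar to the query."""
        ranked = sorted(self.rows, key=lambda r: cosine(query, r[0]), reverse=True)
        return [text for _, text in ranked[:k]]

def build_prompt(request: str, request_emb: list[float], store: MemoryStore) -> str:
    """Prepend retrieved memories to the user's request before generation."""
    context = "\n".join(f"- {m}" for m in store.top_k(request_emb, k=2))
    return f"Known about this user:\n{context}\n\nUser request: {request}"
```

With a memory like "frequently texts her daughter Sarah" retrieved for a "send a message" request, the generation step can personalize its suggestion rather than fall back to a generic prompt.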
Achieving 92% Task Completion in User Testing
We tested Assistify with elderly participants across common
tasks: sending texts, making calls, finding apps, adjusting
settings, and navigating menus. The system achieved 92%
successful task completion; most failures stemmed from ambiguous
user instructions ("open the thing") rather than from system
limitations. The work demonstrated that voice AI paired
with visual understanding could dramatically improve
accessibility—not by simplifying the phone, but by making
complexity navigable through conversation.