
voice experience design for smart glasses
project overview
Context: Conducting user research to optimize voice user experience (VUX) for a social media client's augmented reality (AR) smart glasses.
Objective: Enhancing voice-visual integration and conversational experiences for international users across different cultures and contexts.
Role: UX Researcher
Duration: 10 months
Methods: Surveys, Usability Testing
(This project is under NDA so all research artifacts have been modified to preserve confidentiality.)
challenge
Our initial research scope covered multiple areas, including conversational experience, wear comfort, and mobile app usability.
Analysis revealed that optimizing VUX was crucial for user success, leading us to prioritize this aspect of the experience.
"ficus, which means tree, was interpreted by the assistant as 'focus'."
-australian participant
“the assistant doesn’t recognize foreign names like ‘mohammed’, who is my supervisor. I need to reach him.”
-u.s. participant
process
Each week, I collected over 400 survey responses from study participants who were creating voice transcripts as training data for speech-recognition software. While the survey format had limitations, participants provided remarkably rich and nuanced descriptions of their device interactions.
Participants rated communication across various environments using a 5-point Likert scale. Open-ended questions gathered specific details about:
Accent and pronunciation
Cadence
Emphasis
Volume
Pitch
Tone
To analyze the large volume of text data, I used Natural Language Processing (NLP) tools like topic modeling to identify patterns. This statistical method uses unsupervised Machine Learning (ML) to discover word clusters or "topics" in document collections. The tree graph below visualizes how these topics ranked in importance to our participants.
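To give a feel for the approach, here is a minimal topic-modeling sketch. The actual pipeline is under NDA, so the library choice (scikit-learn's LDA) and the sample responses below are stand-ins rather than the real study data.

# Minimal topic-modeling sketch: the library and sample text are assumptions,
# not the study's actual corpus or toolchain.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

responses = [
    "the assistant mispronounces my supervisor's name",
    "i have to repeat contact names every single time",
    "confirming every step out loud is exhausting",
    # ...the real study collected 400+ responses per week
]

# Convert free-text survey answers into a bag-of-words matrix.
vectorizer = CountVectorizer(stop_words="english")
doc_term_matrix = vectorizer.fit_transform(responses)

# Fit an unsupervised LDA model to discover latent "topics".
lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(doc_term_matrix)

# Print the top words per topic, which can then be labeled by hand
# (e.g. "pronunciation", "system memory", "cognitive load").
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {idx}: {', '.join(top_words)}")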
I pinpointed a few recurring problems within these larger topics to tackle with the product team.
pronunciation
Participants were often bothered that the assistant used phonetic pronunciation without considering cultural or linguistic norms.
Ironically, while the assistant relied too rigidly on phonetic rules, it also struggled with homonyms: words that share a spelling or pronunciation but differ in meaning and origin.
system memory
Users expressed frustration at having to repeat information, particularly proper names, due to a system memory that inconsistently tracked relevant metadata.
cognitive load
The most significant pain point was the need to verbally confirm every step and listen to lengthy readouts of dictated messages.
Monotonous speech made task confirmation even more challenging, creating a cognitive load that undermined the intuitive nature of voice interaction.
“when I would say ‘call betty gram’, the assistant would respond with ‘what is your grandmother’s name?’”
-canadian participant
“the lack of variation in tone of voice made it hard to distinguish all the contact names.”
-u.s. participant
the results
My analysis and recommendations enabled the software team to improve device interactions through:
Implementing localization for regional dialect customization.
Training the language model for better word differentiation.
Optimizing Speech Synthesis Markup Language (SSML) parameters for more natural speech patterns (see the sketch below).
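To make the SSML point concrete, here is a minimal, hypothetical sketch of how prosody and pauses can be tuned around contact names. The <prosody> and <break> tags are standard SSML; the specific values and the confirmation wording are illustrative, not the team's shipped settings.

# Hypothetical SSML snippet for a message-send confirmation.
# The tag names are standard SSML; the rate, pitch, and pause values
# are illustrative assumptions only.
ssml_confirmation = """\
<speak>
  Sending your message to
  <break time="200ms"/>
  <prosody pitch="+15%" rate="95%">Mohammed</prosody>.
</speak>"""

print(ssml_confirmation)

Small pitch shifts and short pauses around contact names speak directly to the feedback that monotone output made names hard to tell apart.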
I additionally collaborated with engineers to understand technical constraints around proper name storage and pronoun handling, informing prototype improvements.
one-shot messaging
My research led to designing a streamlined messaging approach called one-shot messaging, which enables users to compose and send messages in a single voice command. Instead of going through multiple confirmation steps, users can simply say:
"hey assistant, send a text to x saying meet me at the shops tomorrow".
For future iterations, I recommended adding a graphical interface for visual message confirmation. While one-shot messaging works well for brief interactions, an audiovisual experience can better facilitate longer message composition.
This recommendation stemmed from clear evidence that users benefit from visual task confirmation. Though AR smart glasses with a voice user interface (VUI) proved more wearable than VR headsets, audio-only interactions showed clear limitations.
After all, we can speak faster than we can type, but we can read faster than we can listen.