Watch and Learn
CMU’s PrISM-Q&A uses a motion-sensing smartwatch and AI to give context-aware task help, making voice assistants smarter and easier to use.
How did anyone ever get anything done before the internet era? Fixing a leaky faucet is simple enough when you can get some pointers from a video on YouTube. And changing the oil in your car is a snap after reading through a step-by-step guide. But can you imagine having to rely on nothing more than word of mouth or incomprehensible user manuals for answers to your questions? Fortunately, we can avoid that pain because we have an endless source of information to help us with everything (until AWS crashes the entire internet again).
With the rise of generative artificial intelligence, it is easier than ever to get help. Large language models, for instance, can walk us through a complex procedure one step at a time. But that still requires some context switching to keep track of where you are in the process so that no steps are missed. A trio of researchers at Carnegie Mellon University thinks that is an unnecessary distraction, so they have developed an approach that helps us stay focused on the task at hand.
Their system, called PrISM-Q&A, is a step-aware voice assistant that uses a smartwatch to give a large language model context as it delivers task instructions. Traditional voice assistants can respond only to the words a user says, which can lead to vague or incorrect answers when the question lacks context. PrISM-Q&A solves this problem by continuously monitoring the user's activity through the smartwatch's built-in sensors, such as its accelerometer and microphone, and using that information to infer which step of a procedure the user is currently performing.
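The team's implementation details are not reproduced here, but the basic flow is easy to picture. The sketch below is a minimal, hypothetical illustration of that idea: the recipe steps, the `SensorWindow` structure, and the `infer_current_step` and `query_llm` names are all placeholders I've assumed for illustration, not the project's actual code.

```python
# Minimal sketch of a step-aware assistant loop (hypothetical names throughout).
# A real system like PrISM-Q&A would run trained human-activity-recognition
# models on the watch data; here the step classifier is a stand-in.

from dataclasses import dataclass

# Assumed procedure definition: an ordered list of step descriptions.
LATTE_STEPS = [
    "Grind the coffee beans",
    "Tamp the grounds into the portafilter",
    "Pull the espresso shot",
    "Steam and froth the milk",
    "Empty and rinse the portafilter",
]

@dataclass
class SensorWindow:
    """A short window of smartwatch data (accelerometer samples + audio features)."""
    accel: list
    audio_features: list

def infer_current_step(window: SensorWindow) -> int:
    """Placeholder for an activity-recognition model that maps a sensor window
    to the most likely step index. A real implementation would run a trained
    classifier here."""
    return 4  # hypothetical: pretend the watch just saw the portafilter being emptied

def build_prompt(question: str, step_index: int) -> str:
    """Ground the user's (possibly vague) question in the inferred step."""
    return (
        "You are assisting with a latte-making procedure.\n"
        f"Steps: {'; '.join(LATTE_STEPS)}\n"
        f"The user appears to be on step {step_index + 1}: '{LATTE_STEPS[step_index]}'.\n"
        f"User question: {question}\n"
        "Answer in the context of that step."
    )

def answer(question: str, window: SensorWindow, query_llm) -> str:
    """query_llm is any callable that sends a prompt to a language model."""
    step_index = infer_current_step(window)
    return query_llm(build_prompt(question, step_index))
```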
For example, imagine you’re making coffee and ask, “What should I do with this?” A typical assistant would have no idea what “this” means. But PrISM-Q&A, recognizing through motion data that you’ve just emptied the portafilter, can infer that you’re cleaning up after brewing and suggest, “You can wash the portafilter with water.” By combining human activity recognition with the reasoning power of large language models, the system can provide answers that make sense in the moment, even when the question is vague or ambiguous.
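Continuing the sketch above, that coffee scenario might look like the following, with a dummy stand-in for the language model call:

```python
# Hypothetical usage of the sketch above: the watch has just recognized the
# "empty the portafilter" motion, so the vague question gets grounded in that step.
def fake_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; returns the kind of grounded answer described above.
    return "You can wash the portafilter with water."

window = SensorWindow(accel=[(0.1, -0.2, 9.8)], audio_features=[0.3, 0.7])
print(answer("What should I do with this?", window, fake_llm))
# -> "You can wash the portafilter with water."
```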
To test how well the system worked, the team compared their smartwatch-based assistant to two other setups: a voice-only system that used no additional context, and a vision-based system similar to what one might find in smart glasses, which used visual information to aid responses. Participants performed real-world tasks such as cooking or making lattes while asking questions under each condition.
Users preferred the step-aware smartwatch system, finding it intuitive, convenient, and less intrusive than the camera-based alternative. Many appreciated that they didn't have to describe exactly what they were doing to get a helpful answer, and several noted that wearing a watch was far more comfortable than putting on AR glasses.
By grounding AI understanding in the user’s physical actions, PrISM-Q&A makes voice assistants far more useful than traditional options. Instead of needing to describe what we’re doing, our devices may soon already know, and that could make the future of task assistance a lot less frustrating.