When robots need to perform physical tasks, they must collect information about their surroundings to build an understanding of their environment. This typically involves gathering visual and tactile data from sensors and then processing it in a way appropriate for the task. A team from Carnegie Mellon University has now given us reason to add sound to this data collection.
The sound of an object being struck depends on the force of the strike, the structure of the object, and the positioning of the microphone. This combination of factors has made collecting useful data difficult and has left a gap in available sound-action datasets. The researchers sought to close that gap by creating the largest sound-action dataset available: 15,000 interactions on over 60 objects, captured with their custom Tilt-Bot robot.
Using the newly assembled dataset, the researchers devised methods to accomplish three tasks by observing the audio signatures produced when actions are applied to objects. First, they used the audio data to determine which object made the observed sound. Next, they determined what action was applied to the object. Finally, they predicted the object's future movements.
Object classification was performed with a convolutional neural network, which achieved 79.2% accuracy on the 60 objects in the dataset. The main failure cases were objects that differed only visually (e.g., a blue cube vs. a red cube) and interactions where the sound was too quiet. The latter suggests that classification will be unreliable when actions are subtle and cause only small movements.
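To make the pipeline concrete, here is a minimal sketch of audio-based object classification: turn each recorded interaction into a log-magnitude spectrogram, then match it against per-object reference features. The frame sizes, the synthetic tones, and the nearest-centroid matcher are all illustrative assumptions; the researchers' actual system feeds spectrogram-like features to a convolutional network, which a dependency-free sketch cannot reproduce.

```python
import numpy as np

def log_spectrogram(audio, frame=256, hop=128):
    """Split a 1-D audio signal into overlapping frames and take the
    log-magnitude FFT of each -- a stand-in for the spectrogram features
    a convolutional network would consume."""
    frames = [audio[i:i + frame] for i in range(0, len(audio) - frame + 1, hop)]
    mags = np.abs(np.fft.rfft(np.stack(frames) * np.hanning(frame), axis=1))
    return np.log1p(mags)

def classify(spec, centroids):
    """Nearest-centroid classification over time-averaged spectrograms.
    (Illustrative only: the paper uses a CNN, not centroids.)"""
    v = spec.mean(axis=0)  # average spectrum over time frames
    dists = [np.linalg.norm(v - c) for c in centroids]
    return int(np.argmin(dists))

# Two hypothetical "objects" that ring at different pitches when struck.
t = np.arange(4096) / 16000
obj_a = np.sin(2 * np.pi * 440 * t)
obj_b = np.sin(2 * np.pi * 1200 * t)
centroids = [log_spectrogram(obj_a).mean(axis=0),
             log_spectrogram(obj_b).mean(axis=0)]
```

A quiet strike shrinks the spectral peaks toward the noise floor, which is why low-volume interactions were the model's weak spot.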
An inverse model was trained to infer what action was applied to an object, given observations of the object before and after the action. Such information has applications in determining the motor controls that must be applied to reach a desired state. This model performed an impressive 42% better than similar models trained only on visual data.
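The idea of an inverse model can be sketched in a few lines. In this toy setup (entirely my assumption, not the researchers' formulation) the "state" is an object's 2-D position and the "action" is the 2-D push applied to it; a linear model fit by least squares recovers the action from the before/after observations, standing in for the learned, audio-augmented network described above.

```python
import numpy as np

# Toy dynamics: pushing the object moves it by 0.8x the push vector.
rng = np.random.default_rng(0)
before = rng.standard_normal((500, 2))
action = rng.standard_normal((500, 2))
after = before + 0.8 * action

# Fit the inverse model: action ~ W @ [state_before, state_after].
X = np.hstack([before, after])
W, *_ = np.linalg.lstsq(X, action, rcond=None)

def infer_action(s_before, s_after):
    """Recover the motor command that moved the object between two states."""
    return np.hstack([s_before, s_after]) @ W
```

This is exactly the planning use case the article mentions: given where the object is and where you want it, the inverse model proposes the motor command to apply.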
To predict future movements of objects, a forward model was trained. Such models have applications in error detection, flagging cases where predicted and actual feedback do not match. Compared with visual-only models, the new model had a 24% lower error rate.
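The error-detection use of a forward model can be sketched as follows. Again the linear dynamics, names, and threshold are illustrative assumptions rather than the researchers' actual model: predict the next state from the current state and action, then flag feedback that deviates from the prediction (e.g., the object was blocked mid-push).

```python
import numpy as np

# Toy dynamics matching a simple push: next = state + 0.8 * action.
rng = np.random.default_rng(1)
state = rng.standard_normal((500, 2))
action = rng.standard_normal((500, 2))
next_state = state + 0.8 * action

# Fit the forward model: next_state ~ W @ [state, action].
X = np.hstack([state, action])
W, *_ = np.linalg.lstsq(X, next_state, rcond=None)

def detect_error(s, a, observed_next, tol=0.1):
    """Return True when actual feedback deviates from the model's prediction."""
    predicted = np.hstack([s, a]) @ W
    return np.linalg.norm(predicted - observed_next) > tol
```

Whereas the inverse model answers "what action caused this change?", the forward model answers "what should happen next?", and a mismatch between the two views of reality signals an execution error.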
The researchers hope their work will inspire future advances in the sound-action domain. To that end, they have publicly released their data for anyone interested.