Human Activity Recognition (HAR) has applications in human-computer interaction, healthcare, security monitoring, and user authentication, among others. Unfortunately, HAR research is currently impeded by a lack of available datasets, and models trained on these sparse datasets unsurprisingly tend not to generalize well. Creating datasets requires humans to perform specific activities while wearing on-body sensors, a time-consuming and expensive process that also involves error-prone manual annotation.
Recognizing this problem, a group at the Georgia Institute of Technology had the idea to leverage the existing massive databases of human activity (e.g. YouTube) and automatically generate virtual streams of sensor data to pair with them. The virtual Inertial Measurement Unit (IMU) streams that are generated represent accelerometry readings from a wide variety of locations on the body.
The pipeline, named IMUTube, begins by using OpenPose to estimate 2D skeletons for subjects in the videos. Since OpenPose works only on a frame-by-frame basis, the SORT tracking algorithm is then employed to track each person across the entire video sequence. Real-world IMU measurements contain 3D information, so VideoPose3D is next used to lift each 2D sequence into three dimensions. By tracking the position and rotation of joints in the body over time, it is possible to generate virtual IMU measurements.
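The final step can be sketched in a few lines. The core idea is that accelerometry is the second derivative of position, rotated into the sensor's local frame, with gravity added. Note this is an illustrative sketch only: the function name, the constant frame rate, and the use of simple finite differences are assumptions, not IMUTube's actual implementation, which involves considerably more signal conditioning.

```python
import numpy as np

def virtual_accelerometer(positions, rotations, fs=30.0, g=9.81):
    """Derive a virtual accelerometer stream from one tracked joint.

    positions: (T, 3) joint positions in metres, world frame.
    rotations: (T, 3, 3) rotation matrices mapping the world frame
               into the local sensor frame at each timestep.
    fs:        video frame rate in Hz (assumed constant here).
    Returns a (T, 3) array of virtual accelerometer readings.
    """
    dt = 1.0 / fs
    # Linear acceleration via repeated central differences; np.gradient
    # pads the ends so output length matches the input.
    accel_world = np.gradient(np.gradient(positions, dt, axis=0), dt, axis=0)
    # A real accelerometer also senses gravity; add it in the world frame.
    accel_world = accel_world + np.array([0.0, 0.0, g])
    # Rotate every sample into the local sensor frame.
    return np.einsum('tij,tj->ti', rotations, accel_world)

# Sanity check: a stationary joint with a level sensor should read ~[0, 0, g].
T = 100
positions = np.zeros((T, 3))
rotations = np.tile(np.eye(3), (T, 1, 1))
readings = virtual_accelerometer(positions, rotations)
```

In a full pipeline this would run per joint per subject, followed by resampling and filtering to match the sampling characteristics of a physical IMU.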
To assess the performance of the virtual IMU data, the researchers used an existing HAR dataset that contained both physical IMU measurements and accompanying videos. The videos were processed through the IMUTube pipeline to extract virtual IMU measurements, and each dataset was then used to train an activity classification model. The virtual data achieved an F1-score within 2% of that achieved by learning from real IMU data. Furthermore, mixing the real and virtual data together surpassed the performance of real data alone by 12%.
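The comparison protocol can be mimicked in outline: train one classifier on real data alone and another on real plus virtual data, then compare macro F1-scores on a held-out real test set. Everything here is a hypothetical stand-in, since the summary does not specify the dataset, features, or model: the synthetic feature generator, the random-forest classifier, and all numbers are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_split(n, signal):
    """Stand-in for windowed IMU features (e.g. per-window mean/variance).

    Injects a class-dependent shift so the labels are learnable;
    'signal' loosely models how faithful the data is to the activity.
    """
    X = rng.normal(size=(n, 12))
    y = rng.integers(0, 4, size=n)           # four hypothetical activities
    X += signal * y[:, None]
    return X, y

X_real, y_real = make_split(400, 0.5)
X_virt, y_virt = make_split(400, 0.45)       # virtual data: similar, not identical

X_tr, X_te, y_tr, y_te = train_test_split(X_real, y_real, random_state=0)

# Model 1: real data only.  Model 2: real + virtual combined.
clf_real = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
clf_mix = RandomForestClassifier(random_state=0).fit(
    np.vstack([X_tr, X_virt]), np.concatenate([y_tr, y_virt]))

# Both are evaluated on the same held-out *real* test windows.
f1_real = f1_score(y_te, clf_real.predict(X_te), average='macro')
f1_mix = f1_score(y_te, clf_mix.predict(X_te), average='macro')
```

Evaluating both models on real test data is the key design point: it measures whether virtual data helps recognize genuine sensor signals, not merely other virtual ones.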
Data collection is a bottleneck in many areas of machine learning, so it is exciting to see a new method that removes some of these barriers. The team believes IMUTube will make significant contributions to HAR in the next few years, but notes two areas needing improvement: better signal processing to condition the virtual IMU data so that it more closely matches the features and distributions of real data, and computer vision advances to reduce the need for human curation of videos.