Human pose estimation seeks to locate key points on the human body in images or videos. Much effort has been made in this area, and some very sophisticated tools exist. However, these tools tend to be trained on idealized videos that contain a fully visible subject, and as such, they generally perform poorly when only part of a person is visible. This excludes a vast amount of footage in online video repositories from being used to teach machines what various body positions mean and how people interact with their environment. Only about four percent of videos in online repositories contain an entire person, head to toe, in the frame. Further complicating the situation is the fact that these repository videos are unlabeled, which hinders large-scale automated learning. Pose estimation tools are typically trained on labeled data, which supplies the training algorithm with ground-truth pose information to learn from.
A pair of researchers from the University of Michigan have put forth some new ideas to help deal with these issues. In a move that will make many of you say “Why didn’t I think of that?”, they first took the idealized video datasets used to train existing models, then cropped them in various ways so that only a portion of the body was visible in each. Retraining the existing models on this data significantly improved their performance on videos in the wild. Before retraining, the models gave more or less random pose results; afterwards, they could clearly indicate where key points of the body were located.
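The cropping idea can be sketched in a few lines. The function below is an illustrative assumption of how such an augmentation might look, not the researchers' actual code: given a full-body frame, it generates top, bottom, and side crops so a retrained model sees truncated subjects.

```python
# Illustrative sketch: generate partial-visibility crops from a full-body
# frame, simulating a person cut off at the frame edges. Crop fractions
# and function names are assumptions for demonstration.
import numpy as np

def partial_crops(frame, rng, fractions=(0.5, 0.7)):
    """Return crops that each keep only a fraction of the frame."""
    h, w, _ = frame.shape
    crops = []
    for frac in fractions:
        keep_h = int(h * frac)
        # Top crop: head and torso visible, legs cut off.
        crops.append(frame[:keep_h, :, :])
        # Bottom crop: legs visible, upper body cut off.
        crops.append(frame[h - keep_h:, :, :])
        # Random side crop: person partially out of frame horizontally.
        keep_w = int(w * frac)
        x0 = rng.integers(0, w - keep_w + 1)
        crops.append(frame[:, x0:x0 + keep_w, :])
    return crops

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
crops = partial_crops(frame, rng)
print(len(crops), crops[0].shape)  # 6 (240, 640, 3)
```

Each cropped frame inherits the original's ground-truth labels (with out-of-frame key points dropped), which is what makes this retraining cheap: no new annotation is needed.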
To overcome the problem of missing data labels, the team ran the model on several versions of each frame, each shifted slightly in a different direction. The model's predictions for the shifted versions were then aggregated. When the model returns the same prediction consistently across versions, that agreement gives confidence that it is accurately classifying what it sees; if it were making low-confidence guesses, small perturbations of the image would be expected to change the answer between versions.
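The consistency check described above can be sketched as follows. The `toy_model` here is a stand-in for a real pose network, and the spread metric is one plausible way to score agreement; both are assumptions for illustration:

```python
# Sketch of consistency-based confidence: run a model on slightly
# shifted copies of a frame, undo each shift on the prediction, and
# use the spread of the results as a confidence signal. Low spread
# means the predictions agree, so the aggregate can serve as a label.
import numpy as np

def shifted_predictions(model, frame, shifts):
    preds = []
    for dx, dy in shifts:
        shifted = np.roll(frame, shift=(dy, dx), axis=(0, 1))
        keypoints = model(shifted)                    # (n_keypoints, 2) pixels
        preds.append(keypoints - np.array([dx, dy]))  # undo the shift
    return np.stack(preds)

def spread(preds):
    """Mean per-keypoint standard deviation across shifted versions."""
    return preds.std(axis=0).mean()

# Toy 'model' that finds a single bright marker, so its predictions
# track the shift exactly and should agree perfectly once un-shifted.
def toy_model(frame):
    y, x = np.unravel_index(frame[..., 0].argmax(), frame.shape[:2])
    return np.array([[x, y]], dtype=float)

frame = np.zeros((64, 64, 3))
frame[30, 20, 0] = 1.0
shifts = [(0, 0), (2, 0), (0, 2), (-2, -1)]
preds = shifted_predictions(toy_model, frame, shifts)
pseudo_label = preds.mean(axis=0)  # aggregated guess across versions
print(spread(preds))               # 0.0 — perfectly consistent
```

Frames where the spread stays small can be kept as pseudo-labeled training examples, while inconsistent ones are discarded, which is what lets the pipeline learn from unlabeled footage.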
Combining these techniques produced a pipeline that can train on unlabeled, real-world videos. The researchers see this as opening the door to teaching models to better understand humans and how they interact with various objects.