A properly trained deep learning model can seem practically magical, because many deep learning systems are able to make connections that even their human programmers have difficulty understanding. But that capability and the accuracy of a deep learning model rely entirely on the size and quality of the data set it is trained on. Modern object recognition models are usually trained on many thousands of images. If you want to detect a hot dog, for example, you need to train your model using as many pictures as you can gather, each taken from a slightly different angle and with minor lighting changes. That is incredibly time consuming, which is why Hugo Ponte wrote an article explaining how a Raspberry Pi recognition model was trained using synthetic CGI data.
The idea here is pretty simple: load a 3D model and have a computer automatically render thousands of images of that model with slight variations. That requires far less human labor than manually snapping the photos needed to build a decent training data set. The problem, however, is that deep learning models don’t “see” an image the same way we do. A model can easily fixate on some small detail in the synthetic images that we don’t even perceive and that isn’t present in a real photograph. This reduces the model’s accuracy, and can even result in a model that only recognizes CGI renders and not their real-world counterparts. This problem is known as the “sim2real gap,” and it is what Ponte’s article focuses on overcoming.
The deep learning model used as a demonstration in Ponte’s article is intended to identify the pins and ports on a Raspberry Pi single-board computer. The idea is that someone could point their phone at their Raspberry Pi, and this model would tell them which port is which. Ponte created the synthetic data set by loading a model of the Raspberry Pi into a video game engine and automatically capturing thousands of images. Those images were then used to train a model built with PyTorch’s torchvision library. To help narrow the sim2real gap, Ponte used a technique called domain randomization. This technique randomly changes the visual properties of the Raspberry Pi 3D model, such as its color and texture, between renders. Randomizing those properties reduces the likelihood that the model will latch onto some unintended aspect of the training images.
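To make the idea concrete, here is a minimal sketch of domain randomization in Python with NumPy. This is not Ponte’s actual pipeline (which randomizes materials inside a game engine); it just illustrates the core trick, using a flat green image as a stand-in for one CGI render of the board, with the per-channel color cast, brightness shift, and noise ranges chosen arbitrarily for demonstration:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def render_randomized(base: np.ndarray) -> np.ndarray:
    """Apply domain randomization to one rendered RGB image.

    `base` is an (H, W, 3) float array in [0, 1]. Each call applies a
    different random color cast, brightness shift, and noise, so no two
    training images share the same surface appearance.
    """
    img = base.copy()
    img *= rng.uniform(0.6, 1.4, size=3)          # random per-channel color cast
    img += rng.uniform(-0.2, 0.2)                 # random overall brightness shift
    img += rng.normal(0.0, 0.02, size=img.shape)  # sensor-like pixel noise
    return np.clip(img, 0.0, 1.0)

# Stand-in for a single CGI render (a flat green 64x64 image).
base_render = np.zeros((64, 64, 3)) + np.array([0.1, 0.5, 0.1])

# Every image in the training set gets its own random appearance.
dataset = [render_randomized(base_render) for _ in range(4)]
```

Because the object’s color and texture change on every render, the only thing that stays constant across the data set is the object’s shape and layout, which is exactly what we want the model to learn.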
To test this, Ponte built four training data sets: two with 15,000 images each and two with 6,000 images each. One set of each size included domain-randomized images. Predictably, the larger data sets performed better than their smaller counterparts. But the interesting result is that the sets containing domain-randomized data performed far better than either of the sets that didn’t. What does this mean? First, it shows that domain-randomized training sets can help overcome the sim2real gap. Second, it suggests that focusing on domain randomization pays off more than simply gathering a greater number of “normal” synthetic images.