Yacht TV’s inspiration came originally from the waterway just next to our office. I’d stare at it in the minutes when I felt brain-tired from working, watching the boats go by and listening to the call and response between the taller ones and the drawbridge operator. Just upstairs, my colleague Christiana Caro and her desk-neighbors had created Yacht TV, a simple cardboard frame, through which she shot a video series of boats passing by, which she shared on Instagram. It was a hit.
It should surprise no one that we discussed automating it. I even went to the trouble of bringing a Raspberry Pi and camera into the office to give it a shot, with the idea of using machine learning to detect the boats as they sailed down the canal. Building such a machine learning workflow seemed like a lot of work. Months went by with my project hardware sitting accusingly idle, until I heard that one of our partner teams had released the AIY Vision Kit. Suddenly, an automated Yacht TV seemed within the realm of possibility. Christiana and I decided to create it as an AMI project in the vein of slow television, using the Vision Kit as a platform.
With AIY Vision Kit's ability to run machine learning models locally, neural network inference is suddenly widely accessible. Computer vision tasks that were previously the purview of large organizations with server farms can now be performed with less than a hundred dollars of hardware. My goal for this project was to demonstrate how powerful and easy this newly available on-device machine intelligence can be.
With that in mind, the plan was to use one of the off-the-shelf neural network models included with the AIY Vision Kit to detect boats in the canal. On detection, we’d save a loop of video, modify it to our purposes, then upload it to social media outlets (YouTube and Twitter).
After some quick experimentation, I settled on an obvious choice for the network model: the MobileNet model trained on the ImageNet dataset. The network is included with the Vision Kit, runs well on its hardware, and the training data contains 1000 classes, many of which cover different kinds of boats. On the principle of picking the low-hanging fruit first, I tried running the image classifier demo supplied by AIY on images directly from the camera. The code in that example provides a list of labels and confidence scores for the classes that the model has been trained on, thresholded by a given confidence level. I identified the following labels as boat-related, which I planned to look for in images from the camera:
- container ship/containership/container vessel
- paddle/boat paddle
- pirate/pirate ship
- paddlewheel/paddle wheel
With AIY, this sort of detection is incredibly simple to set up. Here’s a code snippet that sketches how it’s done:
Open a PIL Image, here from the Raspberry Pi camera:
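    import io

    import picamera
    from PIL import Image

    # Capture one JPEG frame into memory and open it as a PIL Image.
    with picamera.PiCamera(resolution=(1920, 1080)) as camera:
        stream = io.BytesIO()
        camera.capture(stream, format='jpeg')
        stream.seek(0)
        image = Image.open(stream)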
Then feed it to the image classifier:
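    from aiy.vision.inference import ImageInference
    from aiy.vision.models import image_classification

    # model_type selects which bundled network to load (MobileNet here).
    with ImageInference(image_classification.model(model_type)) as inference:
        infer_classes = image_classification.get_classes(
            inference.run(image),
            max_num_objects=5,
            object_prob_threshold=0.05)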
Here are some of the results:
As we can see, this isn’t likely to consistently detect our yachts: it completely missed all but the second boat, which it classified as a canoe with a confidence of only 11.5 percent. (Note that the boathouse class seen in the fourth pane refers to a land structure for boats rather than a houseboat or floating home.) Thinking about how the model was trained, it’s not hard to imagine why the boats were missed; if you were to look for some example images from the ImageNet set, you’d see that each image consists of one object that belongs to a well-defined class, cropped to take up most of the scene. This is rarely the case in our application: most of the boats that we’re interested in cover only a small region of the overall field of view, so it makes sense that the model would have missed them.
The simplest solution to this problem is to crop. I wrote some code to read an image from the camera and generate a bunch of fifty-percent-overlapping crops at half scale and quarter scale over the field of view. The code would then push these cropped images to the Vision Kit’s daughter board for inference. If a boat were to appear anywhere in the full-size image, the model would hopefully be able to identify it in at least one of the crops. The first iteration produced over 80 crops. At an average of slightly more than a second per inference round trip, detection for a single image took more than a minute, which is a problem: most boats transit the field of view in under 20 seconds, so a given boat would have less than a 1-in-3 chance of even being seen.
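As a rough illustration of the cropping idea, here is a minimal sketch that tiles a frame with fifty-percent-overlapping boxes at a given scale. The function name `overlapping_crops` and the (left, top, width, height) box layout are assumptions made for this example, not the project's actual code; the 1920×1080 frame size comes from the camera snippet earlier. This simple tiling gives 58 boxes for such a frame, fewer than the 80-plus of the first iteration, so the original tiling evidently differed in its details.

```python
from typing import Iterator, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height), an assumed layout


def overlapping_crops(frame_w: int, frame_h: int, scale: float,
                      overlap: float = 0.5) -> Iterator[Box]:
    """Yield crop boxes of size (frame * scale), each overlapping its neighbor."""
    crop_w, crop_h = int(frame_w * scale), int(frame_h * scale)
    # With 50% overlap, each step advances by half a crop.
    step_x = max(1, int(crop_w * (1 - overlap)))
    step_y = max(1, int(crop_h * (1 - overlap)))
    for top in range(0, frame_h - crop_h + 1, step_y):
        for left in range(0, frame_w - crop_w + 1, step_x):
            yield (left, top, crop_w, crop_h)


# Half-scale and quarter-scale crops over a 1920x1080 frame.
boxes = (list(overlapping_crops(1920, 1080, 0.5)) +
         list(overlapping_crops(1920, 1080, 0.25)))
print(len(boxes))
```

Each box can be turned into an image with PIL via `image.crop((left, top, left + width, top + height))` before it is sent for inference. At roughly one second per inference, even this smaller set would take close to a minute per frame, which shows the same timing problem.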
I let this run for a day or two, and collected statistics over the detections that were made. Here’s a snippet of the model’s output, which includes the bounding box of the crop in addition to class name and confidence probability:
    (0, 0) + (960, 540) Result 0: fireboat (prob=0.288086)
    (0, 0) + (960, 540) Result 1: lakeside/lakeshore (prob=0.181763)
    (0, 0) + (960, 540) Result 2: quill/quill pen (prob=0.073425)
    (960, 0) + (960, 540) Result 0: fireboat (prob=0.179688)
    (960, 0) + (960, 540) Result 1: lakeside/lakeshore (prob=0.074341)
    (0, 540) + (960, 540) Result 0: lakeside/lakeshore (prob=0.311279)
    (960, 540) + (960, 540) Result 0: tricycle/trike/velocipede (prob=0.159058)
    (960, 540) + (960, 540) Result 1: jinrikisha/ricksha/rickshaw (prob=0.150635)
    (960, 540) + (960, 540) Result 2: bicycle-built-for-two/tandem bicycle/tandem (prob=0.126831)
    (960, 540) + (960, 540) Result 3: horse cart/horse-cart (prob=0.092102)
    (960, 540) + (960, 540) Result 4: mountain bike/all-terrain bike/off-roader (prob=0.075439)
    (0, 270) + (960, 540) Result 0: lakeside/lakeshore (prob=0.300049)
    (0, 270) + (960, 540) Result 1: fireboat (prob=0.106995)
    ...
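To collect statistics from output like this, each line first has to be turned into structured data. The following is a small sketch of one way to do that with a regular expression. The `Detection` tuple and `parse_line` helper are hypothetical names for this example, and reading each line as "crop origin + crop size" is an assumption based on the values shown.

```python
import re
from typing import NamedTuple, Optional


class Detection(NamedTuple):
    left: int      # crop origin x (assumed meaning of the first pair)
    top: int       # crop origin y
    width: int     # crop size (assumed meaning of the second pair)
    height: int
    rank: int      # the "Result N" index within the crop
    label: str
    prob: float


LINE_RE = re.compile(
    r"\((\d+), (\d+)\) \+ \((\d+), (\d+)\) "
    r"Result (\d+): (.+) \(prob=([0-9.]+)\)")


def parse_line(line: str) -> Optional[Detection]:
    """Parse one log line; return None for lines that do not match."""
    match = LINE_RE.fullmatch(line.strip())
    if match is None:
        return None
    left, top, width, height, rank = (int(g) for g in match.groups()[:5])
    return Detection(left, top, width, height, rank,
                     match.group(6), float(match.group(7)))


print(parse_line("(0, 0) + (960, 540) Result 0: fireboat (prob=0.288086)"))
```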
The image above shows a heat map of the crop bounding boxes in which a boat object was detected as determined by the neural network, overlaid on a typical clear image of the waterway. Many of these are false detections. For instance, we clearly wouldn’t expect boats to be found in the sky. That said, there’s a band of boat detections running right through the center of the image.
This is a similar heatmap, but limited to boat-class detections in which the model’s confidence is over 25%. The band of detections across the middle of the image is even more evident.
I made this heatmap by collecting all of the images in the first 200 that actually have boats in them (regardless of confidence level). Note that half- and quarter-scale crops are represented in all of these heatmaps, all with the previously mentioned 50% overlaps. The dominance of their boundaries in the heatmaps suggests that ¼ scale crops are better for detection than ½ scale ones. More obviously, only the crops over roughly the center of the image, corresponding to the outline of the canal itself, generate useful detections. The next iteration of my code used only the seven ¼ scale crops from that region rather than the 80+ cropped images used before. These 7 can be processed in under 8 seconds altogether, so we typically get around 2-3 opportunities to classify any given boat, which is much better.
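Heat maps of this kind can be approximated by counting, for each region of the frame, how many boat-detecting crops covered it. Below is a minimal pure-Python sketch under stated assumptions: boxes are (left, top, width, height) tuples, and counts are kept on a coarse grid of 15-pixel cells (a hypothetical choice that happens to divide the crop sizes used here evenly). Rendering the grid as a colored overlay is left to whatever plotting tool is at hand.

```python
from typing import Iterable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height), an assumed layout


def accumulate_heatmap(boxes: Iterable[Box], frame_w: int, frame_h: int,
                       cell: int = 15) -> List[List[int]]:
    """Count how many boxes cover each cell of a coarse grid over the frame."""
    cols, rows = frame_w // cell, frame_h // cell
    heat = [[0] * cols for _ in range(rows)]
    for left, top, width, height in boxes:
        for r in range(top // cell, min(rows, (top + height) // cell)):
            for c in range(left // cell, min(cols, (left + width) // cell)):
                heat[r][c] += 1
    return heat


# Two overlapping quarter-scale crops in which a boat was detected.
detections = [(720, 405, 480, 270), (960, 405, 480, 270)]
heat = accumulate_heatmap(detections, 1920, 1080)
print(max(max(row) for row in heat))
```

Filtering the input boxes by confidence before accumulating gives the thresholded variant of the map.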
With the basic workflow designed, it was time to determine how to make the detections themselves. As mentioned before, the existing AIY image classifier yields a set of class labels and confidence probabilities for each image. We should be able to detect a yacht by looking for outputs that contain our identified boat classes with a reasonable degree of confidence. In the past I’ve seen many machine learning algorithms be overconfident in their output. To mitigate such false positives, I originally designed a detector that looked for consistency in the classifier output over neighboring frames. This approach, as it turned out, was over-engineered: it resulted in too high a rate of false negatives and, to my surprise, of false positives as well. I took a deeper look into the classification data and confidence outputs. First I plotted the single-highest confidence for a boat-related class in any cropped image frame:
This is suggestive, but still not quite good enough. By thresholding at a probability of 0.33, we do capture some good images of yachts, but we also capture many empty images, and miss some boats that are pretty obvious. Inspecting the classifier output, it appeared that a lot of the boat images were being confused among a handful of different boat-related classes. It wasn’t uncommon to see an image that was clearly of a boat classified as, say, a fireboat at 30% confidence, a container ship at 15% confidence, and a speedboat at 5% confidence. So, I created a new plot:
The lower plot shows the maximum summed confidence over all boat classes in a given crop. In other words, for each of the 7 crops taken from a single frame, I computed the sum of confidences over all boat classes, for a total of 7 values, one per crop. I then took the maximum of those sums as the representative for the whole frame, as plotted above. This has the effect of “boosting” the previous signal by adding in less-confident boat predictions. In this plot, a threshold of about 0.5 looks like it should reliably filter actual boats, and indeed, this has been borne out pretty well. The three example images above are now classified correctly.
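The summed-confidence score is simple to express in code. Here is a minimal sketch, assuming the classifier output for each crop is available as a list of (label, confidence) pairs. The names `crop_score` and `frame_score` and the contents of `BOAT_LABELS` are illustrative assumptions, not the project's actual class list.

```python
from typing import Iterable, List, Set, Tuple

Prediction = Tuple[str, float]  # (label, confidence)

# Illustrative subset of boat-related labels; the real list is longer.
BOAT_LABELS: Set[str] = {
    "fireboat",
    "speedboat",
    "canoe",
    "container ship/containership/container vessel",
}


def crop_score(predictions: Iterable[Prediction], labels: Set[str]) -> float:
    """Sum the confidences of all boat-related predictions for one crop."""
    return sum(prob for label, prob in predictions if label in labels)


def frame_score(per_crop: Iterable[List[Prediction]], labels: Set[str]) -> float:
    """Take the best crop's summed confidence as the score for the frame."""
    return max((crop_score(preds, labels) for preds in per_crop), default=0.0)


# One crop sees a boat split across several classes; the other sees water.
frame = [
    [("fireboat", 0.30), ("speedboat", 0.20), ("canoe", 0.05)],
    [("lakeside/lakeshore", 0.31), ("fireboat", 0.10)],
]
print(frame_score(frame, BOAT_LABELS) > 0.5)
```

In this made-up frame no single boat class exceeds 0.33, so the earlier single-class threshold would have missed it, while the summed score clears 0.5.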
A rough analysis shows that we now detect more than three quarters of the boats that pass by. Misses usually happen because the boat passes outside of our bounding boxes, doesn’t look much like a typical boat (floating homes, for instance), or is so large that only a portion of it fits in view. Also, we appear to have a false positive rate of 2% or less. These false positives are usually caused by raindrops that coincide with the remnant wake of an undetected boat.
So, this is how the detector works:
- Capture an image from the Raspberry Pi’s Camera.
- Take fifty-percent-overlapping, ¼ scale crops from the center band of the image.
- For each of these crops:
  - Run the AIY-supplied image inference function.
  - Sum the confidence probabilities for only the supplied classes of interest.
  - If this sum is above the threshold of 50%, call it a detection.
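The steps above can be tied together in a short sketch. The inference call is passed in as a `classify` function so the example runs without Vision Kit hardware; on a real kit, that function would crop the captured image and run the AIY classifier on it. The names `center_band_boxes` and `detect`, and the choice of a single row of crops at mid-height, are assumptions for illustration.

```python
from typing import Callable, Iterable, List, Set, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height), an assumed layout
Prediction = Tuple[str, float]   # (label, confidence)


def center_band_boxes(frame_w: int, frame_h: int, scale: float = 0.25,
                      overlap: float = 0.5) -> List[Box]:
    """One row of overlapping crops, vertically centered (assumed band position)."""
    crop_w, crop_h = int(frame_w * scale), int(frame_h * scale)
    step_x = max(1, int(crop_w * (1 - overlap)))
    top = (frame_h - crop_h) // 2
    return [(left, top, crop_w, crop_h)
            for left in range(0, frame_w - crop_w + 1, step_x)]


def detect(classify: Callable[[Box], Iterable[Prediction]],
           boxes: Iterable[Box], labels: Set[str],
           threshold: float = 0.5) -> bool:
    """True if any crop's summed confidence over `labels` exceeds the threshold."""
    return any(
        sum(prob for label, prob in classify(box) if label in labels) > threshold
        for box in boxes)


# Stand-in classifier: pretends a boat is visible in the crop starting at x=480.
def fake_classify(box: Box) -> List[Prediction]:
    if box[0] == 480:
        return [("fireboat", 0.4), ("speedboat", 0.2)]
    return [("lakeside/lakeshore", 0.3)]


boxes = center_band_boxes(1920, 1080)
print(len(boxes), detect(fake_classify, boxes, {"fireboat", "speedboat"}))
```

For a 1920×1080 frame this produces seven quarter-scale crops, matching the count used in the second iteration described earlier.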
The rest of Yacht TV is a set of scripts that eventually result in videos being posted to our YouTube and Twitter channels, complete with a Yacht TV frame. The full code for the object detection pipeline is available below so you can easily create your own video-capture detector over the object classes that you think are interesting.