Published April 16, 2018

Yacht TV

Automatically detect boats and save video clips as they pass. Based on a Google AIY Vision Kit.

Intermediate | Full instructions provided | 2 hours

Things used in this project

Hardware components

Google AIY Vision

Story

The Origin Story

Yacht TV’s inspiration came originally from the waterway just next to our office. I’d stare at it in the minutes when I felt brain-tired from working, watching the boats go by and listening to the call and response between the taller ones and the drawbridge operator. Just upstairs, my colleague Christiana Caro and her desk-neighbors had created Yacht TV, a simple cardboard frame, through which she shot a video series of boats passing by, which she shared on Instagram. It was a hit.

It should surprise no one that we discussed automating it. I even went to the trouble of bringing a Raspberry Pi and camera into the office to give it a shot, with the idea of using machine learning to detect the boats as they sailed down the canal. Building such a machine learning workflow seemed like a lot of work. Months went by with my project hardware sitting accusationally idle, until I heard that one of our partner teams had released the AIY Vision Kit. Suddenly, an automated Yacht TV seemed within the realm of possibility. Christiana and I decided to create it as an AMI project in the vein of slow television, using the Vision Kit as a platform.

The Approach

With AIY Vision Kit's ability to run machine learning models locally, neural network inference is suddenly widely accessible. Computer vision tasks that were previously the purview of large organizations with server farms can now be performed with less than a hundred dollars of hardware. My goal for this project was to demonstrate how powerful and easy this newly available on-device machine intelligence can be.

With that in mind, the plan was to use one of the off-the-shelf neural network models included with the AIY Vision Kit to detect boats in the canal. On detection, we’d save a loop of video, modify it to our purposes, then upload it to social media outlets (YouTube and Twitter).

First Attempts

After some quick experimentation, I settled on an obvious choice for the network model: the MobileNet model trained on the ImageNet dataset. The network is included with the Vision Kit, runs well on its hardware, and the training data contains 1000 classes, many of which cover different kinds of boats. On the principle of picking the low-hanging fruit first, I tried running the image classifier demo supplied by AIY on images directly from the camera. The code in that example provides a list of labels and confidence scores for the classes that the model has been trained on, thresholded by a given confidence level. I identified the following labels as boat-related, which I planned to look for in images from the camera:

catamaran

container ship/containership/container vessel

lifeboat

speedboat

paddle/boat paddle

pirate/pirate ship

paddlewheel/paddle wheel

submarine/pigboat/sub/U-boat

fireboat

With AIY, this sort of detection is incredibly simple to set up. Here’s a code snippet that sketches how it’s done:

Open a PIL Image, here from the Raspberry Pi camera:

with picamera.PiCamera(resolution=(1920, 1080)) as camera:
    stream = io.BytesIO()
    camera.capture(stream, format='jpeg')
    stream.seek(0)
    image = Image.open(stream)

Then feed it to the image classifier:

with ImageInference(image_classification.model(model_type)) as inference:
    infer_classes = image_classification.get_classes(
        inference.run(image), max_num_objects=5, object_prob_threshold=0.05)

Here are some of the results:

Result 0: lakeside/lakeshore (prob=0.702637)

Result 0: lakeside/lakeshore (prob=0.854004)

Result 0: lakeside/lakeshore (prob=0.441406) Result 1: canoe (prob=0.115173)

Result 0: boathouse (prob=0.687988) Result 1: lakeside/lakeshore (prob=0.086182)

As we can see, this isn’t likely to consistently detect our yachts: it completely missed all but the second boat, which it classified as a canoe with a confidence of only 11.5 percent. Note that the boathouse class seen in the fourth pane refers to a land structure for boats rather than a houseboat or floating home. Thinking about how the model was trained, it’s not hard to imagine why the boats were missed: if you look at some example images from the ImageNet set, you’ll see that each image consists of one object that belongs to a well-defined class, cropped to take up most of the scene. This is rarely the case in our application. Most of the boats that we’re interested in cover only a small region of the overall field of view, so it makes sense that the model would have missed them.
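To make the miss concrete, here is a minimal sketch that checks the four results above against the boat labels listed earlier. The helper name `boat_hits` is hypothetical and not part of the AIY API; the label and probability pairs are copied from the output above.

```python
# Boat-related ImageNet labels identified earlier in this write-up.
BOAT_LABELS = {
    'catamaran',
    'container ship/containership/container vessel',
    'lifeboat',
    'speedboat',
    'paddle/boat paddle',
    'pirate/pirate ship',
    'paddlewheel/paddle wheel',
    'submarine/pigboat/sub/U-boat',
    'fireboat',
}

# (label, probability) pairs copied from the four results shown above.
RESULTS = [
    [('lakeside/lakeshore', 0.702637)],
    [('lakeside/lakeshore', 0.854004)],
    [('lakeside/lakeshore', 0.441406), ('canoe', 0.115173)],
    [('boathouse', 0.687988), ('lakeside/lakeshore', 0.086182)],
]


def boat_hits(result, boat_labels=BOAT_LABELS):
    """Return the (label, probability) pairs whose label is in boat_labels."""
    return [(label, prob) for label, prob in result if label in boat_labels]


for i, result in enumerate(RESULTS):
    print('image %d: boat hits = %s' % (i, boat_hits(result)))
```

None of the four frames produces a hit. The one boat the classifier did notice came back as canoe, which is not among the labels listed earlier, and boathouse is a building.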

The simplest solution to this problem is to crop. I wrote some code to read an image from the camera and generate a bunch of fifty-percent-overlapping crops at half scale and quarter scale over the field of view. The code would then push these cropped images to the Vision Kit’s daughter board for inference. If a boat were to appear anywhere in the full-size image, the model would hopefully be able to identify it in at least one of the crops. The first iteration produced over 80 crops. At an average rate of slightly more than a second per inference round-trip, detection for a single image took more than a minute, which is a problem: most boats transit the field of view in under 20 seconds, so a given boat would have less than a 1-in-3 chance of even being seen.
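The arithmetic behind that estimate can be sketched in a few lines. The function name `detection_odds` is hypothetical, and the inputs are the round figures quoted above (80 crops, about one second per inference, a 20-second transit), so the result is a rough bound rather than a measurement.

```python
def detection_odds(num_crops, seconds_per_inference, transit_seconds):
    """Return (seconds to process one frame, frames processed per transit).

    When the second value is below 1, it can be read as a rough chance that
    a boat is in view while any frame is being captured at all.
    """
    frame_seconds = num_crops * seconds_per_inference
    looks_per_transit = transit_seconds / frame_seconds
    return frame_seconds, looks_per_transit


frame_seconds, looks = detection_odds(
    num_crops=80, seconds_per_inference=1.0, transit_seconds=20.0)
print('seconds per frame: %.1f' % frame_seconds)
print('looks per transit: %.2f' % looks)
```

With these optimistic inputs a frame takes 80 seconds and a boat gets about a quarter of a look. Since the real figures were more crops, slower inference, and shorter transits, the true odds were lower still.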

I let this run for a day or two, and collected statistics over the detections that were made. Here’s a snippet of the model’s output, which includes the bounding box of the crop in addition to class name and confidence probability:

(0, 0) + (960, 540) Result 0: fireboat (prob=0.288086)
(0, 0) + (960, 540) Result 1: lakeside/lakeshore (prob=0.181763)
(0, 0) + (960, 540) Result 2: quill/quill pen (prob=0.073425)
(960, 0) + (960, 540) Result 0: fireboat (prob=0.179688)
(960, 0) + (960, 540) Result 1: lakeside/lakeshore (prob=0.074341)
(0, 540) + (960, 540) Result 0: lakeside/lakeshore (prob=0.311279)
(960, 540) + (960, 540) Result 0: tricycle/trike/velocipede (prob=0.159058)
(960, 540) + (960, 540) Result 1: jinrikisha/ricksha/rickshaw (prob=0.150635)
(960, 540) + (960, 540) Result 2: bicycle-built-for-two/tandem bicycle/tandem (prob=0.126831)
(960, 540) + (960, 540) Result 3: horse cart/horse-cart (prob=0.092102)
(960, 540) + (960, 540) Result 4: mountain bike/all-terrain bike/off-roader (prob=0.075439)
(0, 270) + (960, 540) Result 0: lakeside/lakeshore (prob=0.300049)
(0, 270) + (960, 540) Result 1: fireboat (prob=0.106995)
...
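Collecting statistics from output like this mostly means parsing each line and tallying boat detections per crop. The following is a minimal sketch under the assumption that every line follows exactly the format shown above; the names `parse_line` and `tally_boat_crops` are hypothetical, and the sample lines are copied from the snippet.

```python
import re
from collections import Counter

# Matches lines such as "(0, 0) + (960, 540) Result 0: fireboat (prob=0.288086)".
LINE_RE = re.compile(
    r'\((\d+), (\d+)\) \+ \((\d+), (\d+)\) '
    r'Result (\d+): (.+) \(prob=([0-9.]+)\)$')

BOAT_LABELS = {
    'catamaran', 'container ship/containership/container vessel', 'lifeboat',
    'speedboat', 'paddle/boat paddle', 'pirate/pirate ship',
    'paddlewheel/paddle wheel', 'submarine/pigboat/sub/U-boat', 'fireboat',
}

SAMPLE_LINES = [
    '(0, 0) + (960, 540) Result 0: fireboat (prob=0.288086)',
    '(0, 0) + (960, 540) Result 2: quill/quill pen (prob=0.073425)',
    '(960, 0) + (960, 540) Result 0: fireboat (prob=0.179688)',
    '(0, 540) + (960, 540) Result 0: lakeside/lakeshore (prob=0.311279)',
    '(0, 270) + (960, 540) Result 1: fireboat (prob=0.106995)',
    '...',
]


def parse_line(line):
    """Return ((x, y, w, h), rank, label, prob), or None if the line is not a result."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    x, y, w, h, rank = (int(g) for g in match.groups()[:5])
    return (x, y, w, h), rank, match.group(6), float(match.group(7))


def tally_boat_crops(lines, boat_labels, min_prob=0.0):
    """Count boat-class detections per crop box, keeping only prob > min_prob."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        box, _rank, label, prob = parsed
        if label in boat_labels and prob > min_prob:
            counts[box] += 1
    return counts


all_counts = tally_boat_crops(SAMPLE_LINES, BOAT_LABELS)
confident_counts = tally_boat_crops(SAMPLE_LINES, BOAT_LABELS, min_prob=0.25)
print(all_counts)
print(confident_counts)
```

Each counted box is a candidate rectangle to accumulate into a picture of where detections occur, and raising `min_prob` keeps only the more confident ones.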

All Crop Bounds in Which a Boat Class Was Detected

The image above shows a heat map of the crop bounding boxes in which a boat object was detected as determined by the neural network, overlaid on a typical clear image of the waterway. Many of these are false detections. For instance, we clearly wouldn’t expect boats to be found in the sky. That said, there’s a band of boat detections running right through the center of the image.

All Crop Bounds in Which a Boat Class Was Detected With Confidence > 0.25

This is a similar heatmap, but limited to boat-class detections in which the model’s confidence is over 25%. The band of detections across the middle of the image is even more evident.

All Crop Bounds in Which a Boat Class Was Detected for Images That Actually Contain Boats

I made this heatmap from only those images among the first 200 that actually have boats in them (regardless of confidence level). Note that half- and quarter-scale crops are represented in all of these heatmaps, all with the previously mentioned 50% overlaps. The dominance of their boundaries in the heatmaps shows that ¼ scale crops are better for detection than ½ scale ones. More obviously, only the crops over roughly the center of the image, corresponding to the outline of the canal itself, generate useful detections. The next iteration of my code used only the seven ¼ scale crops from that region rather than the 80+ cropped images used before. These 7 can be processed in under 8 seconds altogether, so we typically get around 2-3 opportunities to classify any given boat, which is much better.
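As an illustration of how a single band yields seven crops, here is a minimal pure-Python sketch that follows the same arithmetic as the `crop_parameters` function in the full script: quarter-scale crops stepped by half a crop. The function name `quarter_scale_crops` is hypothetical, and the vertical band of 0.375 to 0.625 is an assumed value chosen to give one centered row; the actual band used for Yacht TV is not stated here.

```python
def quarter_scale_crops(width, height, range_x=(0.0, 1.0), range_y=(0.0, 1.0)):
    """Return (left, top, right, bottom) boxes for overlapping quarter-scale crops."""
    crop_w, crop_h = width // 4, height // 4
    step_x, step_y = crop_w // 2, crop_h // 2  # fifty-percent overlap
    x_start = int(range_x[0] * width)
    x_end = int(range_x[1] * width - crop_w) + 1
    y_start = int(range_y[0] * height)
    y_end = int(range_y[1] * height - crop_h) + 1
    return [(x, y, x + crop_w, y + crop_h)
            for y in range(y_start, y_end, step_y)
            for x in range(x_start, x_end, step_x)]


# Assumed band: the middle quarter of a 1920x1080 frame.
band = quarter_scale_crops(1920, 1080, range_y=(0.375, 0.625))
full = quarter_scale_crops(1920, 1080)
print(len(band), 'crops in the band,', len(full), 'over the full frame')
for box in band:
    print(box)
```

At a little over a second per inference, seven crops take just under 8 seconds, and a 20-second transit divided by 8 seconds gives about 2.5 looks at each boat.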

Arriving at a Working Version

With the basic workflow designed, it was time to determine how to make the detections themselves. As mentioned before, the existing AIY image classifier yields a set of class labels and confidence probabilities for each image. We should be able to detect a yacht by looking for outputs that contain our identified boat classes with a reasonable degree of confidence. In the past I’ve seen many machine learning algorithms be overconfident in their output. To mitigate such false positives, I originally designed a detector that looked for consistency in the classifier output over neighboring frames. This approach, as it turned out, was over-engineered, and resulted in too high a rate of false negatives and, to my surprise, also of false positives. I took a deeper look into the classification data and confidence outputs. First I plotted the single-highest confidence for a boat-related class in any cropped image frame:

Single-Highest Confidence for Boat-Related Classes With 0.33 Threshold

Image of a Yacht That Meets the Threshold

Empty Image That Meets the Threshold

Image of a Yacht That Does Not Meet the Threshold

This is suggestive, but still not quite good enough. By thresholding at a probability of 0.33, we do capture some good images of yachts, but we also capture many empty images, and miss some boats that are pretty obvious. Inspecting the classifier output, it appeared that a lot of the boat images were being confused among a handful of different boat-related classes. It wasn’t uncommon to see an image that was clearly of a boat classified as, say, a fireboat at 30% confidence, a container ship at 15% confidence and a speedboat at 5% confidence. So, I created a new plot:

Single-Highest (blue) and Sum-Highest (cyan) Confidence for Boat-Related Classes With 0.5 Threshold

The lower plot shows the maximum summed confidence over all boat classes in a given crop. In other words, for each of the 7 crops taken from a single frame, I computed the sum of confidences over all boat classes, for a total of 7 values, one per crop. I then took the maximum of those sums as the representative for the whole frame, as plotted above. This has the effect of “boosting” the previous signal by adding in less-confident boat predictions. In this plot, a threshold of about 0.5 looks like it should reliably filter actual boats, and indeed, this has been borne out pretty well. The three example images above are now classified correctly.
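The difference between the two signals can be sketched as follows. The function names are hypothetical and the confidence values are made up for illustration (they are not measurements from Yacht TV); each crop's result is assumed to be a list of (label, score) pairs like the classifier output shown earlier.

```python
BOAT_LABELS = {'fireboat', 'speedboat',
               'container ship/containership/container vessel'}  # abbreviated list


def single_highest(crop_results, boat_labels):
    """Largest single boat-class confidence found in any crop of the frame."""
    scores = [score for results in crop_results
              for label, score in results if label in boat_labels]
    return max(scores, default=0.0)


def sum_highest(crop_results, boat_labels):
    """Largest per-crop sum of boat-class confidences in the frame."""
    sums = [sum(score for label, score in results if label in boat_labels)
            for results in crop_results]
    return max(sums, default=0.0)


# Illustrative frame with two crops; the first one contains a boat.
frame = [
    [('fireboat', 0.30),
     ('lakeside/lakeshore', 0.20),
     ('container ship/containership/container vessel', 0.18),
     ('speedboat', 0.07)],
    [('lakeside/lakeshore', 0.60), ('fireboat', 0.10)],
]

best_single = single_highest(frame, BOAT_LABELS)
best_sum = sum_highest(frame, BOAT_LABELS)
print('single-highest: %.2f -> detected at 0.33: %s' % (best_single, best_single >= 0.33))
print('sum-highest:    %.2f -> detected at 0.50: %s' % (best_sum, best_sum >= 0.5))
```

In this made-up frame the best single class falls short of the 0.33 threshold, while the summed score clears 0.5. The full script below differs slightly in that it stops at the first crop whose sum reaches the threshold instead of scanning every crop.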

A rough analysis shows that we now detect more than three quarters of the boats that pass by. Misses usually happen because the boat passes outside of our bounding box, doesn’t look much like a typical boat (floating homes, for instance), or is so large that only a portion of it fits in view. Also, we appear to have a false positive rate of 2% or less. These are usually caused by raindrops that coincide with the remnant wake of an undetected boat.

So, this is how the detector works:

Capture an image from the Raspberry Pi’s Camera.

Take fifty-percent-overlapping, ¼ scale crops from the center band of the image.

For each of these crops:

Run the AIY-supplied image inference function.

Sum the confidence probabilities for only the supplied classes of interest.

If this sum is above the threshold of 50%, call it a detection.

The rest of Yacht TV is a set of scripts that eventually result in videos posting to our YouTube and Twitter channels, complete with a Yacht TV frame. The full code for the object detection pipeline is available below so you can easily create your own video-capture detector over the object classes that you think are interesting.

Video Detection Capture

#!/usr/bin/env python3
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Video capture by class detection demo.

This script continuously monitors the Raspberry Pi camera and tries to detect
instances of a set of specified classes/categories. When one is detected, a
short video file is written capturing briefly before and after the detection.

Example usage:
python video_capture.py -c boat_classes.txt --out_dir my_captures/

The file boat_classes.txt contains the desired set of classes to look for.
It is simply a text file containing one class per line:

catamaran
container ship/containership/container vessel
lifeboat
speedboat
paddle/boat paddle
pirate/pirate ship
paddlewheel/paddle wheel
submarine/pigboat/sub/U-boat
fireboat

A full list of possible categories can be obtained here:
<url>
"""
import argparse
import io
import numpy as np
import os
import picamera
import pickle
import sys
import time
from PIL import Image
from aiy.vision.inference import ImageInference
from aiy.vision.models import image_classification


def crop_parameters(im, range_x=[0, 1], range_y=[0, 1]):
  """Yields crop parameters for the given x- and y-ranges"""
  # np.int was removed in NumPy 1.24; the builtin int is equivalent here.
  size = np.array(im.size).astype(int)
  crop_size = (size / 4).astype(int)  # quarter-scale crops
  step = (crop_size / 2).astype(int)  # half a crop, giving 50% overlap
  x_start = int(range_x[0] * size[0])
  x_end = int(range_x[1] * size[0] - crop_size[0]) + 1
  y_start = int(range_y[0] * size[1])
  y_end = int(range_y[1] * size[1] - crop_size[1]) + 1
  for y in range(y_start, y_end, step[1]):
    for x in range(x_start, x_end, step[0]):
      # Box is (left, top, right, bottom), as expected by PIL's Image.crop.
      yield (x, y, x + step[0] * 2, y + step[1] * 2)


debug_idx = 0


def debug_output(image, debug_data, out_dir, filename=None):
  """Writes the image and its classifier output (text and pickle) to out_dir."""
  global debug_idx
  if debug_idx == 0:
    # Resume numbering after any image_NNNNNN files already in out_dir.
    for filepath in [f for f in os.listdir(out_dir) if f.startswith('image_')]:
      try:
        debug_idx = max(debug_idx, int(filepath[6:12]) + 1)
      except ValueError:
        pass
  print('debug_idx:', debug_idx)
  if filename is None:
    output_path = os.path.join(out_dir, 'image_%06d.jpg' % (debug_idx))
    debug_idx += 1
  else:
    output_path = os.path.join(out_dir, filename)
  image.save(output_path)
  with open(output_path + '_classes.txt', 'w') as f:
    for debug_tuple in debug_data:
      f.write('%s + %s Result %d: %s (prob=%f)\n' % debug_tuple)
  with open(output_path + '_classes.pkl', 'wb') as f:
    pickle.dump(debug_data, f, protocol=0)


def detect_object(inference, camera, classes, threshold, out_dir,
                  range_x=[0, 1], range_y=[0, 1]):
  """Detects objects belonging to given classes in camera stream."""
  stream = io.BytesIO()
  camera.capture(stream, format='jpeg')
  stream.seek(0)
  image = Image.open(stream)
  # Every so often, we get an image with a decimated green channel
  # Skip these.
  rgb_histogram = np.array(image.histogram()).reshape((3, 256))
  green_peak = np.argmax(rgb_histogram[1, :])
  if green_peak < 3:
    time.sleep(1.0)
    return False, None, None
  debug_data = []
  detection = False
  max_accumulator = 0.
  print ('Inferring...')
  for p in crop_parameters(image, range_x, range_y):
    im_crop = image.crop(p)
    accumulator = 0.
    infer_classes = image_classification.get_classes(
      inference.run(im_crop), max_num_objects=5, object_prob_threshold=0.05)
    corner = [p[0], p[1]]
    print (corner)
    for idx, (label, score) in enumerate(infer_classes):
      debug_data.append((corner, im_crop.size, idx, label, score))
      if label in classes:
        accumulator += score
    if accumulator > max_accumulator:
      max_accumulator = accumulator
    if accumulator >= threshold:
      detection = True
      break
  if out_dir != '':
    debug_output(image, debug_data, out_dir)
  print ('Accumulator: %f' % (max_accumulator))
  print ('Detection!' if detection else 'Non Detection')
  return detection, image, debug_data


def main():
  parser = argparse.ArgumentParser()
  parser.add_argument('--classfile', '-c', dest='classfile', required=True)
  parser.add_argument('--threshold', '-t', dest='threshold', required=False, type=float, default=0.5)
  parser.add_argument('--out_dir', '-o', dest='out_dir', required=False, type=str, default='./')
  parser.add_argument('--capture_delay', dest='capture_delay', required=False, type=float, default=5.0)
  parser.add_argument('--capture_length', dest='capture_length', required=False, type=int, default=20)
  parser.add_argument('--debug', '-d', dest='debug', required=False, action='store_true')
  ## Crop box in fraction of the image width. By default full camera image is processed.
  parser.add_argument('--cropbox_left', dest='cropbox_left', required=False, type=float, default=0.0)
  parser.add_argument('--cropbox_right', dest='cropbox_right', required=False, type=float, default=1.0)
  parser.add_argument('--cropbox_top', dest='cropbox_top', required=False, type=float, default=0.0)
  parser.add_argument('--cropbox_bottom', dest='cropbox_bottom', required=False, type=float, default=1.0)
  parser.set_defaults(debug=False)
  args = parser.parse_args()
  # There are two models available for image classification task:
  # 1) MobileNet based (image_classification.MOBILENET), which has 59.9% top-1
  # accuracy on ImageNet;
  # 2) SqueezeNet based (image_classification.SQUEEZENET), which has 45.3% top-1
  # accuracy on ImageNet;
  model_type = image_classification.MOBILENET
  # Read the class list from a text file
  with open(args.classfile) as f:
    classes = [line.strip() for line in f]
  print('Starting camera detection, using the following classes:')
  for label in classes: print ('  ', label)
  print('Threshold:', args.threshold)
  print('Debug mode:', args.debug)
  print('Capture Delay:', args.capture_delay)
  debug_out = args.out_dir if args.debug else ''
  with ImageInference(image_classification.model(model_type)) as inference:
    with picamera.PiCamera(resolution=(1920, 1080)) as camera:
      stream = picamera.PiCameraCircularIO(camera, seconds=args.capture_length)
      camera.start_recording(stream, format='h264')
      while True:
        detection, image, inference_data = detect_object(
          inference, camera, classes, args.threshold, debug_out,
          [args.cropbox_left, args.cropbox_right],
          [args.cropbox_top, args.cropbox_bottom])
        if detection:
          detect_time = int(time.time())
          camera.wait_recording(args.capture_delay)
          # Note: the recording is raw H.264 despite the .mpeg extension.
          video_file = 'capture_%d.mpeg' % (detect_time)
          image_file = 'capture_%d.jpg' % (detect_time)
          stream.copy_to(os.path.join(args.out_dir, video_file))
          stream.flush()
          debug_output(image, inference_data, args.out_dir, image_file)
          print('Wrote video file to', os.path.join(args.out_dir, video_file))
          camera.wait_recording(max(args.capture_length - args.capture_delay, 0))


if __name__ == '__main__':
  main()