Handwritten Digit Classification with Arduino and MicroML

How to recognize handwritten digits in real-time directly on an ESP32 camera.

We continue exploring the endless possibilities on the MicroML (Machine Learning for Microcontrollers) framework on Arduino and ESP32 boards. In this post, we're back to image classification; in particular, we'll distinguish handwritten digits using an ESP32 camera.

If this is the first time you're reading my posts, you may have missed that I'm on a journey to push the limits of machine learning on embedded devices like the Arduino boards and ESP32. I started with accelerometer data classification, then did WiFi indoor positioning as a proof of concept.

In the last couple of weeks, though, I undertook a more difficult path that is image classification.

Image classification is where Convolutional Neural Networks really shine, but I'm here to question this settlement and demostrate that it is possible to come up with much lighter alternatives.

In this article, we continue with the examples, replicating a "benchmark" dataset in machine learning: the handwritten digits classification.

The task

The objective of this example is to be able to tell what an handwritten digit is, taking a photo as input from the ESP32 camera.

In particular, we have three handwritten numbers and the task of our model will be to distinguish which image is what number.

Handwritten digits example

I only have a single image per digit, but you're free to draw as many samples as you like: it should help improve the performance of your classifier.

1. Feature extraction

When dealing with images, if you use a CNN this step is often overlooked: CNNs are made on purpose to handle raw pixel values, so you just throw the image in and it is handled properly.

When using other types of classifiers, it could help add a bit of feature engineering to help the classifier doing its job and achieve high accuracy.

But not this time.

I wanted to be as "light" as possible in this demo, so I only took a couple steps during the feature acquisition:

Use a grayscale image
Downsample to a manageable size
Convert it to black/white with a threshold

I would hardly call this feature engineering.

This is an example of the result of this pipeline.

Handwritten digit feature extraction

The code for this pipeline is really simple and is almost the same from the example on motion detection.

#include "esp_camera.h"

#define PWDN_GPIO_NUM     -1
#define RESET_GPIO_NUM    15
#define XCLK_GPIO_NUM     27
#define SIOD_GPIO_NUM     22
#define SIOC_GPIO_NUM     23
#define Y9_GPIO_NUM       19
#define Y8_GPIO_NUM       36
#define Y7_GPIO_NUM       18
#define Y6_GPIO_NUM       39
#define Y5_GPIO_NUM        5
#define Y4_GPIO_NUM       34
#define Y3_GPIO_NUM       35
#define Y2_GPIO_NUM       32
#define VSYNC_GPIO_NUM    25
#define HREF_GPIO_NUM     26
#define PCLK_GPIO_NUM     21

#define FRAME_SIZE FRAMESIZE_QQVGA
#define WIDTH 160
#define HEIGHT 120
#define BLOCK_SIZE 5
#define W (WIDTH / BLOCK_SIZE)
#define H (HEIGHT / BLOCK_SIZE)
#define THRESHOLD 127

double features[H*W] = { 0 };

void setup() {
    Serial.begin(115200);
    Serial.println(setup_camera(FRAME_SIZE) ? "OK" : "ERR INIT");
    delay(3000);
}

void loop() {
    if (!capture_still()) {
        Serial.println("Failed capture");
        delay(2000);
        return;
    }

    print_features();
    delay(3000);
}

bool setup_camera(framesize_t frameSize) {
    camera_config_t config;

    config.ledc_channel = LEDC_CHANNEL_0;
    config.ledc_timer = LEDC_TIMER_0;
    config.pin_d0 = Y2_GPIO_NUM;
    config.pin_d1 = Y3_GPIO_NUM;
    config.pin_d2 = Y4_GPIO_NUM;
    config.pin_d3 = Y5_GPIO_NUM;
    config.pin_d4 = Y6_GPIO_NUM;
    config.pin_d5 = Y7_GPIO_NUM;
    config.pin_d6 = Y8_GPIO_NUM;
    config.pin_d7 = Y9_GPIO_NUM;
    config.pin_xclk = XCLK_GPIO_NUM;
    config.pin_pclk = PCLK_GPIO_NUM;
    config.pin_vsync = VSYNC_GPIO_NUM;
    config.pin_href = HREF_GPIO_NUM;
    config.pin_sscb_sda = SIOD_GPIO_NUM;
    config.pin_sscb_scl = SIOC_GPIO_NUM;
    config.pin_pwdn = PWDN_GPIO_NUM;
    config.pin_reset = RESET_GPIO_NUM;
    config.xclk_freq_hz = 20000000;
    config.pixel_format = PIXFORMAT_GRAYSCALE;
    config.frame_size = frameSize;
    config.jpeg_quality = 12;
    config.fb_count = 1;

    bool ok = esp_camera_init(&config) == ESP_OK;

    sensor_t *sensor = esp_camera_sensor_get();
    sensor->set_framesize(sensor, frameSize);

    return ok;
}

bool capture_still() {
    camera_fb_t *frame = esp_camera_fb_get();

    if (!frame)
        return false;

    // reset all the features
    for (size_t i = 0; i < H * W; i++)
      features[i] = 0;

    // for each pixel, compute the position in the downsampled image
    for (size_t i = 0; i < frame->len; i++) {
      const uint16_t x = i % WIDTH;
      const uint16_t y = floor(i / WIDTH);
      const uint8_t block_x = floor(x / BLOCK_SIZE);
      const uint8_t block_y = floor(y / BLOCK_SIZE);
      const uint16_t j = block_y * W + block_x;

      features[j] += frame->buf[i];
    }

    // apply threshold
    for (size_t i = 0; i < H * W; i++) {
      features[i] = (features[i] / (BLOCK_SIZE * BLOCK_SIZE) > THRESHOLD) ? 1 : 0;
    }

    return true;
}

void print_features() {
    for (size_t i = 0; i < H * W; i++) {
        Serial.print(features[i]);

        if (i != H * W - 1)
          Serial.print(',');
    }

    Serial.println();
}

2. Samples recording

To create your own dataset, you need a collection of handwritten digits.

You can perform this part as you like, by using pieces of paper or a monitor. I used a tablet because it was well illuminated and I could open a bunch of tabs to keep a record of my samples.

As in the apple vs. orange, keep in mind that you should be consistent during both the training phase and the inference phase.

This is why I used tape to fix my ESP32 camera to the desk and kept the tablet in the exact same position.

If you desire, you could experiment varying slightly the capturing setup during the training and see if your classifier still achieves good accuracy: this is a test I didn't make.

3. Train and export the SVM classifier

For a detailed guide refer to the tutorial.

from sklearn.svm import SVC
from micromlgen import port

# put your samples in the dataset folder
# one class per file
# one feature vector per line, in CSV format
features, classmap = load_features('dataset/')
X, y = features[:, :-1], features[:, -1]
classifier = SVC(kernel='linear').fit(X, y)
c_code = port(classifier, classmap=classmap)
print(c_code)

At this point, you have to copy the printed code and import it in your Arduino project, in a file called model.h.

4. The result

Okay, by now you should have all the working pieces to do handwritten digit image classification on your ESP32 camera. Include your model in the sketch and run the classification.

#include "model.h"

void loop() {
    if (!capture_still()) {
        Serial.println("Failed capture");
        delay(2000);

        return;
    }

    Serial.print("Number: ");
    Serial.println(classIdxToName(predict(features)));
    delay(3000);
}

Done!

You can see a demo of my results in the video below.

Project figures

My dataset is composed of 25 training samples in total and the SVM with linear kernel produced 17 support vectors.

On my M5Stick camera board, the overhead for the model is 6.8 Kb of flash and the inference takes 7ms: not that bad!

Check the full project code on GitHub.

artificial intelligence

computer vision

Eloquent Arduino