Voice detection appears in many everyday scenarios: "Hey Siri, how is the weather today?" "OK Google, set an alarm for tomorrow morning at 8 AM." "Hey Alexa, play music on the Amazon Echo speaker." This feature makes our lives more convenient, but what happens behind the scenes? In this project, I trained a neural network model to recognize a few spoken words and ran it on a resource-limited IoT device, the Arduino Nano 33 BLE.
Method
Train model
I trained a machine learning model on Google Colab and deployed it on an Arduino Nano 33 BLE, where it classifies spoken input into three outputs: "YES" for the green light, "NO" for the red light, and "UNKNOWN" for the blue light. The model was trained on the Speech Commands Dataset, which consists of 65,000 one-second-long utterances of 30 short words, crowdsourced online.
Convert model
Convert the trained model into a C byte array (for example, with xxd -i) and paste it into our micro_features_model.cpp:
const unsigned char g_model[] DATA_ALIGN_ATTRIBUTE = {
0x20, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x12, 0x00, 0x1c, 0x00, 0x04, 0x00, 0x08, 0x00, 0x0c, 0x00,
0x10, 0x00, 0x14, 0x00, 0x00, 0x00, 0x18, 0x00, 0x12, 0x00, 0x00, 0x00,
0x03, 0x00, 0x00, 0x00, 0x94, 0x48, 0x00, 0x00, 0x34, 0x42, 0x00, 0x00,
0x1c, 0x42, 0x00, 0x00, 0x3c, 0x00, 0x00, 0x00, 0x04, 0x00, 0x00, 0x00,
0x01, 0x00, 0x00, 0x00, 0x0c, 0x00, 0x00, 0x00, 0x08, 0x00, 0x0c, 0x00,
0x04, 0x00, 0x08, 0x00, 0x08, 0x00, 0x00, 0x00, 0x08, 0x00, 0x00, 0x00,
0x0b, 0x00, 0x00, 0x00, 0x13, 0x00, 0x00, 0x00, 0x6d, 0x69, 0x6e, 0x5f,
0x72, 0x75, 0x6e, 0x74, 0x69, 0x6d, 0x65, 0x5f, 0x76, 0x65, 0x72, 0x73,
0x69, 0x6f, 0x6e, 0x00, 0x0c, 0x00, 0x00, 0x00, 0xd4, 0x41, 0x00, 0x00,
0x00, 0x00 ...... };
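For other source files to reference this array, it also needs matching declarations in a header. The following is a minimal sketch in the style of the micro_speech example; the exact file name and the g_model_len value depend on your project.

// micro_features_model.h (sketch)
#ifndef MICRO_FEATURES_MODEL_H_
#define MICRO_FEATURES_MODEL_H_

// The converted TFLite flatbuffer and its size in bytes,
// defined in micro_features_model.cpp.
extern const unsigned char g_model[];
extern const int g_model_len;

#endif  // MICRO_FEATURES_MODEL_H_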
Deploy on Arduino
We deployed the model to the device and ran several tests. As the demo video below shows, the board successfully detected the keywords "yes" and "no".
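On the device, the converted model is loaded by the TensorFlow Lite Micro interpreter. The following is a minimal sketch of that setup, not our exact sketch: the operator list follows the micro_speech example, the tensor arena size is an assumption to be tuned per model, and the constructor signature varies somewhat across TFLM releases.

#include <TensorFlowLite.h>
#include "micro_features_model.h"  // declares the g_model byte array
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Scratch memory for the model's tensors; 10 KB is an assumption.
constexpr int kTensorArenaSize = 10 * 1024;
static uint8_t tensor_arena[kTensorArenaSize];
static tflite::MicroInterpreter* interpreter = nullptr;

void setup() {
  const tflite::Model* model = tflite::GetModel(g_model);

  // Register only the operators this model uses, to save flash.
  static tflite::MicroMutableOpResolver<4> resolver;
  resolver.AddDepthwiseConv2D();
  resolver.AddFullyConnected();
  resolver.AddSoftmax();
  resolver.AddReshape();

  static tflite::MicroInterpreter static_interpreter(
      model, resolver, tensor_arena, kTensorArenaSize);
  interpreter = &static_interpreter;
  interpreter->AllocateTensors();
}

void loop() {
  // Audio capture, feature extraction, and inference run here;
  // see "How it works?" below.
}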
How it works?
1. The microphone on the Arduino board captures the voice input.
2. Features are extracted from the audio and fed to the model.
3. The model runs and outputs a control command.
4. The LED responds according to the command (a simplified sketch follows).
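As a concrete example of step 4, here is a simplified sketch of a command responder in the spirit of the micro_speech example's arduino_command_responder.cpp (the real function takes extra arguments such as the score). The LEDR, LEDG, and LEDB pin names come from the Nano 33 BLE board definition, where the built-in RGB LED is active-low; the pins are assumed to have been set to OUTPUT in setup().

#include <Arduino.h>

// Map a recognized label ("yes", "no", "unknown", ...) to the RGB LED.
void RespondToCommand(const char* found_command) {
  // Turn all three channels off first (HIGH = off, active-low LED).
  digitalWrite(LEDR, HIGH);
  digitalWrite(LEDG, HIGH);
  digitalWrite(LEDB, HIGH);

  if (found_command[0] == 'y') {         // "yes"     -> green
    digitalWrite(LEDG, LOW);
  } else if (found_command[0] == 'n') {  // "no"      -> red
    digitalWrite(LEDR, LOW);
  } else if (found_command[0] == 'u') {  // "unknown" -> blue
    digitalWrite(LEDB, LOW);
  }
}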
Demo Video
Result
During testing, the LED turned green when we said "yes" and red when we said "no"; if we said any other word, it turned blue. When we said two words within a short time, the LED showed a mix of two colors: for example, saying "yes" and "no" almost at the same time turned the LED purple. There is also a problem: we have to say the words with some emphasis. If we said "yes" in a flat tone, the LED turned blue, meaning "unknown". I think the reason is that the training data needs to include different tones to make word detection more accurate.
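Besides adding more varied training data, one thing worth experimenting with is the recognizer's detection threshold. In the micro_speech example, RecognizeCommands only reports a detection when a label's averaged score exceeds a threshold (out of 255); lowering it should accept flatter utterances at the cost of more false positives. A sketch, assuming one version of the example's constructor (the argument order and an error-reporter parameter vary across TFLM releases):

#include "recognize_commands.h"  // from the micro_speech example

// 180 is an assumption, relaxed from the example's default of 200.
static RecognizeCommands recognizer(
    /*average_window_duration_ms=*/1000,
    /*detection_threshold=*/180,
    /*suppression_ms=*/1500,
    /*minimum_count=*/3);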