SmartListener is a low-cost, Wi-Fi-enabled edge AI device that listens for real-world sounds — like a baby crying, dog barking, or glass breaking — and sends MQTT alerts when specific sound events are detected. It uses a PSoC6 AI Kit running a lightweight ML model trained using DeepCraft Studio and features a custom 3D-printed enclosure for deployment in any room.
Key Features:
- Edge AI: Runs TensorFlow Lite models directly on the PSoC™ 6.
- 5+ Sound Classes: Detects baby_crying, fire_alarm, glass_breaking, footsteps, and dog_barks.
- MQTT Integration: Publishes JSON alerts to any broker (HiveMQ/Mosquitto).
- Low Latency: Real-time processing with minimal hardware.
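As a sketch only: the firmware described later in this write-up publishes the raw label string, so the JSON wrapper below is one hypothetical way the "JSON alerts" could be structured on a gateway or host, not the deployed payload format:

```python
import json
import time

def build_alert(label, device_id="smartlistener-01"):
    """Wrap a detected label in a JSON alert payload (schema is an assumption)."""
    return json.dumps({
        "device": device_id,      # hypothetical device identifier
        "event": label,           # detected sound class, e.g. "baby_crying"
        "timestamp": int(time.time()),
    })

payload = build_alert("baby_crying")
print(payload)
```

Any MQTT subscriber can then parse the payload with a standard JSON library.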
ESC-50 is a free, well-labeled, and widely used dataset in the field of environmental sound classification (ESC). We used custom Python scripts to split, resample, and convert the dataset to match the input format expected by DeepCraft Studio.
🧾 Script 1: Split ESC-50 by Class
```python
import pandas as pd
import shutil
import os

# Paths
ESC50_CSV = "./meta/esc50.csv"
ESC50_AUDIO_DIR = "./audio/"
OUTPUT_DIR = "./split_data/"

# Load CSV
df = pd.read_csv(ESC50_CSV)

# Create the base output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Loop through each unique category in the dataset
for category in df['category'].unique():
    # Create a folder for each category inside the output folder
    category_folder = os.path.join(OUTPUT_DIR, category)
    os.makedirs(category_folder, exist_ok=True)

    # Filter the filenames for the current category
    category_files = df[df['category'] == category]['filename'].tolist()

    # Copy each file into its respective category folder
    for file_name in category_files:
        src_path = os.path.join(ESC50_AUDIO_DIR, file_name)
        dst_path = os.path.join(category_folder, file_name)
        shutil.copy(src_path, dst_path)

    print(f"Copied {len(category_files)} files to folder: {category_folder}")

print("Finished splitting the dataset!")
```

🧾 Script 2: Convert to WAV + Resample to 16 kHz
```python
from pydub import AudioSegment
import os

# ---- SETTINGS ----
input_folder = "raw_audio_files"    # Folder where your audio files are
output_folder = "processed_audio"   # Where converted files will be saved
target_sample_rate = 16000          # 16 kHz
target_format = "wav"               # DeepCraft needs WAV

# ---- FUNCTION ----
def process_audio(file_path, output_path):
    audio = AudioSegment.from_file(file_path)
    audio = audio.set_frame_rate(target_sample_rate)
    audio = audio.set_channels(1)      # Mono
    audio = audio.set_sample_width(2)  # 16-bit PCM
    audio.export(output_path, format=target_format)

# ---- RUN ----
os.makedirs(output_folder, exist_ok=True)

for root, dirs, files in os.walk(input_folder):
    for file in files:
        if file.lower().endswith((".wav", ".mp3", ".flac", ".ogg", ".m4a")):
            input_path = os.path.join(root, file)
            output_path = os.path.join(output_folder, os.path.splitext(file)[0] + ".wav")
            print(f"Processing {input_path} -> {output_path}")
            process_audio(input_path, output_path)

print("\n✅ All files processed!")
```

These scripts ensure each audio file is 16 kHz, mono, 16-bit PCM, as required for PSoC 6 inference.
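To double-check the converted files before importing them into DeepCraft Studio, a quick sanity check with Python's built-in `wave` module can confirm the 16 kHz / mono / 16-bit requirements (a sketch; the demo file name is arbitrary):

```python
import wave

def is_deepcraft_ready(path):
    """Return True if the WAV file is 16 kHz, mono, 16-bit PCM."""
    with wave.open(path, "rb") as wf:
        return (wf.getframerate() == 16000
                and wf.getnchannels() == 1
                and wf.getsampwidth() == 2)

# Demo: write a tiny compliant file and verify it
with wave.open("demo.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)       # 2 bytes = 16-bit
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 1600)  # 0.1 s of silence

print(is_deepcraft_ready("demo.wav"))
```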
DEEPCRAFT Studio
DEEPCRAFT™ Studio, by Infineon Technologies, is an end-to-end platform for developing and deploying Edge AI applications. It supports audio, radar, time-series, and computer vision data. With an intuitive, graph-based interface, it simplifies the machine learning workflow, from data collection to deployment. The platform also offers open-source Starter Models, enabling developers to quickly begin AI projects and deploy them to edge devices.
DEEPCRAFT Studio : Data Labeling
Before training the model, or even running preprocessing, the data needs to be imported into DEEPCRAFT Studio so it can be labeled. We created a Generic Graph UX project and added a "wav file" node to read the training data.
🎯 Selected Labels
- Baby crying
- Dog barking
- Glass breaking
- Door knock
- Background noise ("unknown")
- Other events (unlabeled)
After labeling all the data that will be used to train the model, it needs to be imported into the DEEPCRAFT classification project. The data is then split into three sets: a training dataset, a validation dataset, and a test dataset.
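DeepCraft Studio performs this split in its UI; for reference, an equivalent offline split could look like the sketch below (the 80/10/10 ratios and file names are assumptions for illustration):

```python
import random

random.seed(42)  # reproducible split

def split_dataset(files, train=0.8, val=0.1):
    """Shuffle and split a list of files into train/val/test subsets."""
    files = sorted(files)       # deterministic starting order
    random.shuffle(files)
    n_train = int(len(files) * train)
    n_val = int(len(files) * val)
    return (files[:n_train],
            files[n_train:n_train + n_val],
            files[n_train + n_val:])

files = [f"clip_{i:03d}.wav" for i in range(40)]   # dummy file list
train_set, val_set, test_set = split_dataset(files)
print(len(train_set), len(val_set), len(test_set))  # 32 4 4
```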
The raw audio recordings were preprocessed using a carefully crafted feature extraction pipeline designed in DeepCraft Studio. The input audio clips were mono-channel PCM WAV files sampled at 16 kHz, and the preprocessing chain was optimized for both real-time inference and classification accuracy.
- Sliding Window: Segments the incoming audio stream into overlapping windows of ~32 ms for frame-based processing.
- Hann Smoothing: Reduces spectral leakage by applying a Hann window to each frame.
- Real Discrete Fourier Transform (RDFT): Converts the time-domain signal into the frequency domain using a real-valued FFT.
- Frobenius Norm: Reduces the FFT output to a single power spectrum per frame using the Frobenius norm.
- Mel Filterbank: Converts the power spectrum to a perceptually inspired Mel-scale representation, compressing the high-dimensional FFT output into 30 meaningful frequency bins.
- Clip: Ensures numerical stability by clipping extreme values.
- Logarithm: Applies log-compression to simulate human loudness perception and further normalize dynamic range.
- Final Sliding Window (Spectrogram Block): Creates spectrogram "snapshots" that are passed into the neural network for classification.
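Assuming a 512-sample (~32 ms at 16 kHz) window with 50% overlap, the chain above can be sketched in NumPy. This is a simplified illustration of the pipeline, not the code DeepCraft Studio generates, and the hop size and filterbank details are assumptions:

```python
import numpy as np

SAMPLE_RATE = 16000
WIN_LEN = 512     # ~32 ms at 16 kHz
HOP_LEN = 256     # 50% overlap (assumed)
N_MELS = 30

def mel_filterbank(n_mels, n_fft, sr):
    """Build triangular filters spaced evenly on the Mel scale."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        if c > l:
            fb[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)   # rising slope
        if r > c:
            fb[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)   # falling slope
    return fb

def extract_features(audio):
    """Windowed log-Mel features: Hann -> RFFT -> power -> Mel -> clip -> log."""
    n_frames = 1 + (len(audio) - WIN_LEN) // HOP_LEN
    window = np.hanning(WIN_LEN)
    fb = mel_filterbank(N_MELS, WIN_LEN, SAMPLE_RATE)
    feats = []
    for i in range(n_frames):
        frame = audio[i * HOP_LEN:i * HOP_LEN + WIN_LEN] * window
        power = np.abs(np.fft.rfft(frame)) ** 2        # power spectrum per frame
        mel = fb @ power                               # 30-bin Mel energies
        feats.append(np.log(np.clip(mel, 1e-10, None)))  # clip then log-compress
    return np.array(feats)

one_second = np.random.randn(SAMPLE_RATE).astype(np.float32)
features = extract_features(one_second)
print(features.shape)  # (61, 30)
```

The resulting frames of 30 log-Mel values are what the final sliding window stacks into spectrogram snapshots for the classifier.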
To train the audio classification model, I used DeepCraft Studio's Model Wizard, which streamlines neural network architecture selection and tuning. The model was optimized for deployment on the PSoC 6 AI Kit with a focus on size, latency, and generalization.
The trained Conv1DLSTM model achieved 76.13% accuracy and a comparable F1-score of 76.08%, solid performance for a lightweight architecture. The confusion matrix (cells expressed as percentages of all test samples) shows strengths on distinct sounds like fire (10.25% of samples correctly classified) and baby_crying (11.32%), but difficulties with ambiguous or rare classes. For example, correct glass_breaking predictions accounted for only 1.27% of samples, with the class often misclassified as unknown (1.91%), likely due to limited training examples or subtle acoustic features. The model's tendency to label 39.46% of samples as unknown suggests conservative confidence thresholds or noise sensitivity, while confusion between dog and baby_crying (1.37%) hints at overlapping frequency patterns.
To improve robustness, future work could focus on data augmentation for underrepresented classes (e.g., synthetic glass_breaking samples) and adjusting confidence thresholds to reduce unknown predictions. Despite these challenges, the model outperforms a baseline CNN (~68% accuracy), making it a promising candidate for edge deployment with further optimization.
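Two simple augmentation transforms of the kind suggested above (white-noise mixing at a target SNR and random time shifting) can be sketched in NumPy; the SNR and shift parameters here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(audio, snr_db=20.0):
    """Mix in white noise at a given signal-to-noise ratio (in dB)."""
    rms = np.sqrt(np.mean(audio ** 2))
    noise_rms = rms / (10.0 ** (snr_db / 20.0))
    return audio + rng.normal(0.0, noise_rms, size=audio.shape)

def time_shift(audio, max_shift=1600):
    """Circularly shift the clip by up to +/- max_shift samples (0.1 s at 16 kHz)."""
    return np.roll(audio, rng.integers(-max_shift, max_shift + 1))

clip = rng.normal(size=16000).astype(np.float32)   # dummy 1-second clip
augmented = [add_noise(clip), time_shift(clip)]
print([a.shape for a in augmented])
```

Applying such transforms to underrepresented classes like glass_breaking multiplies the effective training data without new recordings.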
The DeepCraft Studio was configured to generate optimized C code (model.h/model.c) for deploying the audio classification model on Infineon's PSoC 6 microcontroller. Targeting the dual-core Cortex-M4/M0+ architecture, the setup enables CMSIS-accelerated floating-point operations while maintaining full precision (Float32) without quantization.
Deploying a machine learning model on the PSoC™ 6 AI Kit involves integrating the trained model into an embedded application using Infineon's development tools, ModusToolbox™ and DEEPCRAFT™ Studio. The model is embedded into the firmware running on the PSoC 6's dual-core ARM Cortex-M4F/M0+ microcontroller. Audio data from the onboard MEMS microphone is collected in real time and passed to the model for inference directly on the device, enabling low-latency, on-device AI processing.
To enable communication, the PSoC 6 AI Kit uses the AIROC™ CYW43439 module for Wi-Fi and MQTT connectivity. By integrating the MQTT client library within the application, the kit can publish inference results to an MQTT broker, making it ideal for edge-to-cloud use cases. This setup allows the application to perform inference locally and send results wirelessly, striking an efficient balance between edge intelligence and connected communication.
Deployment : PSoC 6 AI Kit Overview
The PSoC™ 6 AI Evaluation Kit (CY8CKIT-062S2-AI) by Infineon Technologies is a compact development platform for edge AI applications. It features a dual-core ARM Cortex-M4F/M0+ microcontroller, integrated sensors like a 6-axis motion sensor, radar, and a MEMS microphone, as well as wireless connectivity via the AIROC™ CYW43439 Wi-Fi/Bluetooth® combo module. Compatible with DEEPCRAFT™ Studio, the kit supports machine learning model training and deployment, making it ideal for applications like smart home devices, wearables, and industrial monitoring.
The project contains an RTOS task that reads the PDM (mic) input and passes it to the model for preprocessing and classification.
```c
/* Initialize the audio_buffer to zeroes and read data
 * from the PDM mic into it */
audio_count = AUDIO_BUFFER_SIZE;
memset(audio_buffer, 0, AUDIO_BUFFER_SIZE * sizeof(uint16_t));
result = cyhal_pdm_pcm_read(&pdm_pcm, (void *) audio_buffer, &audio_count);

/* Convert each integer sample to float and pass it to the model */
sample = SAMPLE_NORMALIZE(audio_buffer[i]) * DIGITAL_BOOST_FACTOR;
result = IMAI_enqueue(&sample);
```

The inferencing task checks the model's result and applies some debouncing before sending the detected label to the MQTT task for publishing.
```c
if (best_label == last_seen_label)
{
    debounce_counter++;
}
else
{
    debounce_counter = 1;
    last_seen_label = best_label;
}

// Confirm label after debounce threshold
if (debounce_counter >= DEBOUNCE_THRESHOLD && confirmed_label != best_label)
{
    confirmed_label = best_label;
    debounce_counter = 0; // Reset to avoid retriggering

    // Filter out "unlabeled" and "unknown"
    if (confirmed_label != 0 && confirmed_label != 6)
    {
        printf("✅ Debounced Output: %-30s\r\n", label_text[confirmed_label]);

        // 🔔 Send to MQTT or trigger action here
        publisher_q_data.cmd = PUBLISH_MQTT_MSG;
        publisher_q_data.data = (char *)label_text[confirmed_label];
        xQueueSend(publisher_task_q, &publisher_q_data, 0);
    }
    else
    {
        printf("⛔ Ignored Label (Unlabeled/Unknown): %s\r\n", label_text[confirmed_label]);
    }
}
```

Deployment : MQTT Communication
The MQTT communication uses the following configuration to select the broker (HiveMQ) and the topic used by the publisher:
```c
#define MQTT_BROKER_ADDRESS "broker.hivemq.com"
#define MQTT_PORT 1883
#define MQTT_PUB_TOPIC "SmartListener"
```

Once a label is confirmed by the model, it is sent to the MQTT task for publishing.
```c
/* Wait for commands from other tasks and callbacks. */
if (pdTRUE == xQueueReceive(publisher_task_q, &publisher_q_data, portMAX_DELAY))
{
    ...
    case PUBLISH_MQTT_MSG:
    {
        /* Publish the data received over the message queue. */
        publish_info.payload = publisher_q_data.data;
        publish_info.payload_len = strlen(publish_info.payload);

        printf("\nPublisher: Publishing '%s' on the topic '%s'\n",
               (char *) publish_info.payload, publish_info.topic);

        result = cy_mqtt_publish(mqtt_connection, &publish_info);
        break;
    }
}
```

Deployment : Final Testing
After flashing the board with the firmware and connecting the Li-Ion battery, we can place the device in the custom-made enclosure and let it listen to the surrounding environment, detecting the predefined sounds and reporting them to the cloud.
We can check what the board is detecting using any MQTT client. In my case, I use the "IoT MQTT Panel" application to see what is happening around the board when I'm away from home.