Image processing has been researched extensively, and developing object detection and image classification models is now easier than ever. For example, with AI platforms such as Edge Impulse, we can train and deploy an image classification model in under 5 minutes! Computer vision is no longer stuck at the "how do we build this?" question. The real challenge now is finding practical, useful ways to apply these image processing AI models. As an embedded systems enthusiast, I wanted to push beyond image classification and object detection by creating a model that can look at an image and describe it in natural language, such as "No operator near the running machine", all at the edge on constrained hardware.
In the end, I was able to build a simple image-to-text model of only 9 MB, small enough to run on devices such as the Raspberry Pi (RPi). The longer-term goal of this project, however, is to push image-to-text onto tiny microcontrollers, unlocking new applications where these small devices shine. This can enhance space monitoring, automation, context-aware IoT devices, and a wide range of edge-AI experiments.
Researchers and tech companies have been exploring image captioning, a technique that combines computer vision and natural language processing to create meaningful captions for images. Thanks to its many potential uses, such as accessibility for the visually impaired and content production and comprehension for search engines, image captioning is widely used and an important field of study.
The main objective of image captioning is to generate captions that are not only grammatically accurate and meaningful, but also contextually fitting and descriptive. This requires a complex process of identifying objects and their interrelationships within an image, followed by the generation of a cohesive description. An encoder first extracts visual features from an image; a vision transformer or a convolutional neural network (CNN) can serve as the encoder. A decoder, typically a recurrent neural network (RNN), long short-term memory (LSTM) network, or another deep learning model designed for sequential data, then produces descriptive text from the encoded features.
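To make the encoder-decoder idea concrete, here is a minimal, untrained Keras sketch of that classic architecture: a CNN encoder feeding an LSTM decoder that predicts the next caption word. The vocabulary size, caption length, and layer sizes are illustrative assumptions, and note that this is not the approach used later in this project:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 5000      # hypothetical caption vocabulary size
MAX_CAPTION_LEN = 20   # hypothetical maximum caption length

# Encoder: a frozen CNN turns an image into a fixed-length feature vector.
cnn = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg",
                                        input_shape=(224, 224, 3))
cnn.trainable = False
image_in = layers.Input(shape=(224, 224, 3))
img_features = layers.Dense(256, activation="relu")(cnn(image_in))

# Decoder: an LSTM predicts the next caption word from the encoded features.
caption_in = layers.Input(shape=(MAX_CAPTION_LEN,))
embedded = layers.Embedding(VOCAB_SIZE, 256, mask_zero=True)(caption_in)
lstm_out = layers.LSTM(256)(embedded, initial_state=[img_features, img_features])
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(lstm_out)

captioner = Model(inputs=[image_in, caption_in], outputs=next_word)
captioner.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```

Training and serving a decoder like this is exactly the kind of cost this project avoids by using a retrieval-based design instead.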
In this project, we will explore retrieval-based image captioning to create a lightweight image to text model and use it to describe specific use case scenarios of:
- personal productivity monitoring: detect if a person is working or distracted by their phone
- smart space monitoring for energy conservation: detect if lights are on but there is no person in the room
- smart agriculture: monitor plant wellbeing through activities such as watering
The project uses two Python scripts: one for collecting training images with the Raspberry Pi camera, and the other for running the image-to-keywords model, capturing camera frames and generating semantic descriptions of them. Training the model is done on Google Colab, and Edge Impulse is then used to deploy the entire inferencing pipeline (preprocessing and keyword prediction) to the Raspberry Pi. We will explore the following high-level parts of the project:
- Collecting images using a Raspberry Pi camera
- Preparing tags for each image
- Training a multi-label classification model in Google Colab
- Deploying the trained model to Edge Impulse
- Running real-time environment description with the Raspberry Pi camera
Unlike generative AI (GenAI) models such as ChatGPT Vision that require massive GPU clusters, complex training pipelines, and extensive computational budgets, this project adopts a far more frugal design. Retrieval-based image captioning is a technique that generates tags/captions for images by selecting and adapting pre-existing tags/captions from a list, rather than creating new ones. This approach leverages the idea that many images share similar features and contexts, allowing us to reuse tags for contextually similar images, especially in application-specific use cases. The technique is simple and, depending on the architecture used, the resulting model can be lightweight enough to run on resource-constrained devices. However, this method is domain-specific and relies heavily on the pre-constructed tags, which may not be appropriate for other deployment environments.
Each image is manually paired with descriptive tags, creating a lightweight but expressive dataset that avoids the complexity of full sentence generation. The dataset of images and a tags text file is then uploaded to Google Colab, where preprocessing pipelines convert all images into uniform tensors and transform the tags into multi-label vectors. Using transfer learning on MobileNetV2, a model is trained to associate visual features with the keyword vectors, effectively learning a high-speed retrieval mechanism that maps a new image to its closest set of learned descriptors (words). The resulting model is then uploaded to Edge Impulse, enabling on-device performance analysis (flash usage, RAM usage, latency), model optimization, and seamless deployment to any edge AI hardware (including MCUs) with sufficient resources to run it.
That means that:
- The model does not generate new sentences
- Instead, it detects keywords (tags) from a list that best describe what's in an image
- These predicted tags are then combined into a simple meaningful sentence
All the source code for the project is available in this GitHub repository: lightweight_image_description. With everything running correctly, you will have a trained multi-label MobileNetV2-derived model that maps images to keyword vectors.
Also, special thanks to Edge Impulse for all their support during this project. Being able to seamlessly deploy custom AI models to a wide range of hardware such as Arduino, ESP32, Raspberry Pi, etc., is incredible. Enabling image to keywords on these devices can open new pathways for experimentation and use case solutions. With their powerful tools and developer-friendly setup, Edge Impulse really stands out as a leader in edge AI development.
01. Hardware setup and dataset preparation
In this project, the Raspberry Pi 4 with the 8-megapixel V2.1 camera module is used to capture images for training the model and running inference. However, you can still use other devices for this. Note that you will need to change the input data collection method in the Python scripts, since they use the Picamera library to capture images.
Edge Impulse has documentation for setting up a RPi and installing the Edge Impulse for Linux SDK, a collection of tools that lets you collect data from sensors, microphones, and cameras, and also run impulses with full hardware acceleration on your Pi.
Once your Raspberry Pi has been set up, use a CLI terminal with an SSH connection to the Pi to download the project repository, either via git clone or an SCP transfer from your PC to the Pi, whichever option works for you. Next, run the command below to install the dependencies on the Pi:
pip install -r requirements.txt
We can now use the first Python script, a video streaming webserver, to collect images. This script starts a webserver that streams the live camera feed and provides a 'Capture' button for saving frames as images. Note that if you are using different hardware, you can use other methods to collect the images. Start the streaming webserver UI with the command:
python camera_server.py
From the logs, copy and paste the webserver's IP address into the URL field of a web browser (mobile or PC). You will see the live video feed coming from the RPi camera and a 'Capture' button to save images.
Within your domain-specific use case environment, create a dataset for training the model by clicking the 'Capture' button; the images will be saved to a 'saved_images' folder inside the repository folder. In my case, I set up my environment to show: a person working and using their phone, a small table-side plant being watered, and an empty room with no one present but with the lights turned on.
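If you are using different hardware or prefer capturing frames without the web UI, a few lines with the Picamera library are enough to grab still images for the dataset. This is only a minimal sketch with an arbitrary filename and resolution, not the repository's camera_server.py script:

```python
# Minimal still-image capture with the legacy Picamera library.
# Filename and resolution are arbitrary choices for this sketch.
from time import sleep
from picamera import PiCamera

camera = PiCamera()
camera.resolution = (640, 480)
camera.start_preview()
sleep(2)  # give the sensor time to adjust exposure
camera.capture('saved_images/sample_001.jpg')
camera.stop_preview()
camera.close()
```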
Once the images have been captured, we need to create a tags text file with an entry for each image file. These tags are words that best describe the image. I collected 150 images and manually assigned tags to each one. The list of semantic tags was: lights, no_person, person, phone, laptop, plant, watering. These tags describe the specific use case scenarios mentioned earlier:
- personal productivity monitoring: detect if a person is working or distracted by their phone. Tags used: lights, no_person, person, laptop and phone. 60 images were captured for this case: 30 showing a person working on their computer and 30 showing the person working but using their phone.
- smart space monitoring for energy conservation: detect if the lights are on but there is no person in the room. Tags used: lights, no_person and person. 30 images were captured for this case.
- smart agriculture: monitor plant wellbeing through activities such as watering. Tags used: plant and watering. 60 images were captured for this case.
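For reference, the tags file pairs each image filename with its space-separated tags on one line. The filenames below are made up for illustration; match them to your own captured images:

```
image_001.jpg lights person laptop
image_034.jpg lights person phone laptop
image_072.jpg lights no_person
image_105.jpg plant watering
```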
The dataset preparation step is almost complete. After collecting images and generating the tags text file, we need to copy them to a Google Drive folder that will be accessed by the Colab notebook to train the model. The directory paths for the images folder and the tags text file are defined by the images_folder and image_tags_file variables in the notebook. Using a Google Drive folder is optional, as you can directly upload your dataset folder to the notebook, but Colab deletes uploaded files when the session expires. Also note that, as an alternative to Google Colab, you can use Jupyter Notebook or run the training pipeline locally as a Python file, with the dependencies installed on the host machine.
Before running the model training and deployment notebook, we first need to create an Edge Impulse project and obtain its API key. To do this, create a new project in Edge Impulse Studio (or use an existing one) and navigate to the 'Keys' section under 'Dashboard'. Next, open the notebook and paste your key into the ei.API_KEY variable.
On the Google Colab notebook, click 'Connect' to start a hosted cloud session followed by 'Run all'. You will be requested to give access to your Google Drive.
The notebook starts by loading the images and tags text file using the defined directory paths. The required libraries for data loading, preprocessing, model development, and evaluation, such as NumPy, Pandas, and TensorFlow, are then imported. Global variables and helper functions are defined for preprocessing and building the model. The tags (lights, no_person, person, phone, laptop, plant, watering) are defined in a VOCAB list variable.
A function, preprocess_image, loads an image, resizes it to a fixed pixel size (128x128, as defined by IMAGE_SIZE), converts it to a NumPy array, and scales pixel values to the range 0 to 1. The notebook then reads tags.txt into a pandas DataFrame. This file is expected to contain an image filename and the space-separated tags for that image, one image per line. This DataFrame becomes the ground truth annotation source for training. Afterwards, the function encode_keywords turns the space-separated tags text into a multi-hot binary vector, where each index corresponds to a vocabulary word. This is the multi-label encoding step: since an image can carry multiple tags simultaneously, we use a vector of independent binary outputs rather than a single categorical label (as in image classification).
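As a reference, here is a minimal sketch of how these preprocessing helpers and the tags file loading could look; the notebook's exact implementation may differ, and the saved_images path and tags.txt filename are assumptions:

```python
import numpy as np
import pandas as pd
import tensorflow as tf

IMAGE_SIZE = (128, 128)
VOCAB = ["lights", "no_person", "person", "phone", "laptop", "plant", "watering"]

def preprocess_image(path):
    """Load an image, resize it to IMAGE_SIZE and scale pixel values to [0, 1]."""
    img = tf.keras.utils.load_img(path, target_size=IMAGE_SIZE)
    return tf.keras.utils.img_to_array(img) / 255.0

def encode_keywords(tags_text):
    """Turn space-separated tags into a multi-hot vector over VOCAB."""
    tags = set(tags_text.split())
    return np.array([1.0 if word in tags else 0.0 for word in VOCAB], dtype=np.float32)

# tags.txt: one line per image -> "<filename> <tag> <tag> ..."
rows = []
with open("tags.txt") as f:
    for line in f:
        filename, _, tags_text = line.strip().partition(" ")
        rows.append({"filename": filename, "tags": tags_text})
df = pd.DataFrame(rows)

X = np.stack([preprocess_image(f"saved_images/{name}") for name in df["filename"]])
y = np.stack([encode_keywords(t) for t in df["tags"]])
```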
To easily identify class imbalance, the notebook computes tag frequencies and prints them, so that we can see which tags dominate and which ones are rare. This is essential for understanding dataset biases and deciding whether to collect more images for underrepresented tags or to rebalance during training.
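Assuming the y label matrix and VOCAB list from the sketch above, the frequency check can be as simple as:

```python
# Count how many images carry each tag (y is the multi-hot label matrix).
tag_counts = y.sum(axis=0).astype(int)
for word, count in zip(VOCAB, tag_counts):
    print(f"{word}: {count} images")
```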
Coming to the model architecture, I used a MobileNetV2 model, owing to its strong feature extraction performance, low computational cost, and relatively small footprint. We freeze the base layers so that only the final added layers are updated during training (the transfer learning technique), which speeds up training. A final sigmoid layer with 7 outputs (matching the number of defined tags in my case) makes the network a multi-label probabilistic tag predictor. Unlike softmax, which forces the output probabilities to sum to 1, sigmoid lets us treat each tag independently. So for any given image, the model outputs seven independent probabilities, and given the model's accuracy we can apply a threshold to pick the most appropriate words describing the image. A minimal sketch of this setup is shown after the example below. This means the model does not try to classify an image into only one label. Instead, it asks:
- Does this image contain lights?
- Does this image contain a plant?
- Does this image contain a person?
- Does this image contain a phone or a laptop?
For example, the model outputs seven independent probabilities, such as: person = 0.91, laptop = 0.88, phone = 0.74, plant = 0.07, watering = 0.01.
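As referenced above, here is a minimal sketch of this transfer-learning setup with a frozen MobileNetV2 base and a sigmoid multi-label head; the pooling/dropout head and training hyperparameters are my assumptions, and the notebook's exact layers may differ:

```python
import tensorflow as tf

IMAGE_SIZE = (128, 128)
NUM_TAGS = 7  # len(VOCAB)

# Frozen MobileNetV2 feature extractor (transfer learning).
base = tf.keras.applications.MobileNetV2(
    input_shape=IMAGE_SIZE + (3,), include_top=False, weights="imagenet")
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    # One independent sigmoid output per tag -> multi-label prediction.
    tf.keras.layers.Dense(NUM_TAGS, activation="sigmoid"),
])

# Binary cross-entropy treats every tag as its own yes/no question.
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

# X and y come from the preprocessing sketch above.
# model.fit(X, y, epochs=20, batch_size=16, validation_split=0.2)
```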
Running the notebook will load the images and tags, preprocess them, train a model, and finally upload the model to the Edge Impulse project. In my case, with 150 training images and 7 tags, the model was trained for 20 epochs and the resulting precision was 100%. For a relatively simple and controlled environment such as this one, I considered this performance acceptable. However, considering the diverse dynamics of real-world environments, a model would need to be trained with more images and tags for better performance.
After the model training, I developed two functions to test the model, predict_keywords and predict_top_keywords (a sketch of both follows the list below):
- predict_keywords: runs the model on an image and returns all keywords whose predicted probability exceeds the defined threshold.
- predict_top_keywords: returns a requested number of the highest-probability keywords regardless of the threshold.
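A minimal sketch of these two helpers, assuming the trained model, VOCAB list, and preprocess_image function from the earlier sketches:

```python
import numpy as np

def predict_keywords(model, image_path, threshold=0.5):
    """Return all tags whose predicted probability exceeds the threshold."""
    x = np.expand_dims(preprocess_image(image_path), axis=0)
    probs = model.predict(x, verbose=0)[0]
    return [word for word, p in zip(VOCAB, probs) if p > threshold]

def predict_top_keywords(model, image_path, top_n=3):
    """Return the top_n highest-probability tags regardless of threshold."""
    x = np.expand_dims(preprocess_image(image_path), axis=0)
    probs = model.predict(x, verbose=0)[0]
    top_idx = np.argsort(probs)[::-1][:top_n]
    return [VOCAB[i] for i in top_idx]
```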
During testing, the model was given an image of an empty living room: the function predict_keywords returned 'lights' and 'no_person', whereas predict_top_keywords returned 'no_person', 'lights', 'plant' when asked for the top 3 most probable tags. Different applications call for different behavior: conservative threshold-based detection versus always showing the most probable tags. I preferred the predict_keywords function, and its logic is reused by the inference code.
Further testing of the model showed impressive results, and I judged this performance acceptable for deploying the model.
04. Deploying the model
After testing the model, the notebook uses the Edge Impulse Python SDK to profile it. This gives us the RAM and ROM usage and the inference time of the model on a wide range of hardware, from MCUs and CPUs to GPUs and AI-accelerated boards, which is incredibly fascinating! We can see the model's performance estimates for the Raspberry Pi 4 in the image below. During this profiling step, the model is also uploaded to the Edge Impulse project.
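In the notebook, this profiling step looks roughly like the snippet below; the device identifier string is an assumption, so check the SDK's list of supported profiling targets for your hardware:

```python
import edgeimpulse as ei

ei.API_KEY = "ei_..."  # your project's API key from the Studio dashboard

# Estimate RAM, ROM and latency on a specific target before deploying.
profile = ei.model.profile(model=model, device="raspberry-pi-4")
print(profile.summary())
```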
You can clone my public Edge Impulse project using this link: Light-weight image description.
Note that the model will likely not be accurate when you test it, since the training images were specific to my controlled environment.
Once the notebook has finished running, we can go to the Edge Impulse project we used, where we will see 'Upload model' under 'Impulse design'.
We first need to configure some parameters in the Edge Impulse project. Click 'Upload model' and a new interface will open on the right side of the page. Here, we select 'Image' for the model input, 'Pixels ranging 0..1 (not normalized)' for the input scaling, 'Squash' for the resize mode, and 'Classification' for the model output. Finally, in the output labels input box, enter the tags used to train the model in the same order they were given during training. Click 'Save model' to finish the configuration. In the Studio snapshot below, we can see that the model will take up 9 MB of flash storage, and the predicted inference time is 106 ms, although the actual on-device inference time turned out to be faster at around 25 ms.
We can also test the model directly in Edge Impulse Studio by clicking 'Choose File' under the 'Upload an image' section. In my tests the model also performs well, and the per-tag probability predictions are nicely visualized in the Studio's UI. For example, in the test image shown below, we can see that the model accurately and confidently presents the words (person, laptop, phone) describing an image of a person with a laptop and a phone in their hand. During inference, we can set a threshold for the probability scores and construct a sentence using the words that meet the threshold.
On the Studio UI, navigate to 'Deployment' and select 'Linux (AARCH64)' from the 'Search deployment options'. This is the architecture of the Raspberry Pi 4; you will need to select a different target based on your deployment hardware. Click 'Build' and the Studio will download an Edge Impulse Model (.eim) binary to your computer. EIM files are native Linux and macOS binary applications that contain your full impulse created in Edge Impulse Studio. The impulse consists of the signal processing block(s) along with any learning and anomaly block(s) you added and trained. EIM files are compiled for your particular system architecture and are used to run inference natively on your system.
Unzip the downloaded file and copy the .eim file to the eim-modelfile folder on the Pi. Next, in the CLI terminal, enter the following commands to make the .eim file executable:
cd eim-modelfile
chmod u+x ./modelfile.eim
cd ..
Finally, run the Python image description application with the command:
python ai_app.py
Once the application starts, use a web browser (mobile or PC) to access it. On the web page, you will see the live video feed coming from the camera as well as a textbox showing the tags that best describe the video feed.
Describing images is now possible through this optimized, domain-specific image-to-keywords design. The deployment application captures frames from the Raspberry Pi camera and uses the Edge Impulse Linux SDK to preprocess the images and generate the probability that each tag relates to the given image. The top 3 most probable words are then combined to form a sentence, which is shown on a web UI together with the video feed.
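Stripped of the web-streaming parts, the core of such an inference loop could look like the following sketch using the Edge Impulse Linux Python SDK; the model path and the single-image frame source are placeholders standing in for the live camera feed:

```python
import cv2
from edge_impulse_linux.image import ImageImpulseRunner

MODEL_PATH = "eim-modelfile/modelfile.eim"  # the downloaded .eim binary

with ImageImpulseRunner(MODEL_PATH) as runner:
    model_info = runner.init()
    labels = model_info["model_parameters"]["labels"]

    frame = cv2.imread("saved_images/sample_001.jpg")  # placeholder frame source
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    # The SDK resizes/crops the frame and extracts features for the impulse.
    features, cropped = runner.get_features_from_image(frame_rgb)
    result = runner.classify(features)

    # Keep the three most probable tags and join them into a simple sentence.
    scores = result["result"]["classification"]
    top3 = sorted(scores, key=scores.get, reverse=True)[:3]
    print("Scene description:", " and ".join(top3))
```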
Because each tag has its own output neuron, the model is completely free to output any combination of tags, whether or not we trained with that specific combination. As a result, the model can recombine learned features and produce new keyword combinations. During deployment, I observed this when the model was shown an image of a living room with a plant: it was incredible to see that, from training, the model could go beyond the combinations present in the training images and describe that there is a plant and also no person in the image.
This project demonstrates how effective and practical image understanding can be when it is designed with efficiency in mind. Instead of classifying an image into a single label (such as car, person, dog, or hardhat), we can enable computers to describe what they are seeing.
However, most image captioning models are expensive, do not promote global sustainability, and are often impractical for edge AI applications. By leveraging domain-specific image-to-keyword models, we can reduce the required training and deployment resources, since the computational load is significantly smaller. As technology advances, the future of image captioning aims to improve computer vision and understanding, allowing AI systems to interact with the environment in more meaningful ways.