Build Your Own Fully Offline AI Assistant

A multimodal AI assistant that runs entirely offline to protect your privacy is within reach. All you need is a Raspberry Pi 5 and a camera.

Nick Bild
AI & Machine Learning
A multimodal AI assistant (📷: Suhas Telkar)

Now that the dust has settled, people have had time to assess what the latest generation of AI tools is capable of. They may not fully live up to the hype, but even so, many find that these tools offer a lot of help with tasks like research and coding. However, continuously sending private information to a cloud-based service is something that few people are comfortable with.

This has led many people to wonder how to get the benefits of these tools while maintaining privacy. A common solution is an edge AI system that runs a pared-down model completely offline. One such system was recently developed by hardware hacker Suhas Telkar. It is interesting because it not only runs on inexpensive hardware, but also hosts a powerful multimodal AI assistant.

The device is built around a Raspberry Pi 5 with 4GB of RAM. Rather than relying on cloud APIs, the assistant performs all computation locally. A quantized version of Google’s Gemma 3 4B Instruct model runs through llama.cpp, enabling conversational responses without exceeding the Pi’s limited memory. While performance is modest compared to desktop GPUs, the system can generate around 5 to 10 tokens per second, with first-token latency typically under eight seconds.
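The throughput and first-token figures above can be measured with a thin wrapper around any streaming generator. The sketch below is a minimal illustration, not the project's actual code; the commented llama-cpp-python wiring assumes a hypothetical GGUF file name for the quantized Gemma model.

```python
import time

def stream_with_timing(generate, prompt):
    """Stream tokens from a text generator, measuring first-token
    latency (seconds) and overall throughput (tokens per second)."""
    start = time.monotonic()
    first_token_latency = None
    tokens = []
    for token in generate(prompt):
        if first_token_latency is None:
            first_token_latency = time.monotonic() - start
        tokens.append(token)
    elapsed = time.monotonic() - start
    tok_per_sec = len(tokens) / elapsed if elapsed > 0 else 0.0
    return "".join(tokens), first_token_latency, tok_per_sec

# With llama-cpp-python installed, wiring a quantized Gemma GGUF
# (file name is an assumption) might look like:
#
#   from llama_cpp import Llama
#   llm = Llama(model_path="gemma-3-4b-it-Q4_K_M.gguf", n_ctx=2048)
#   generate = lambda p: (chunk["choices"][0]["text"]
#                         for chunk in llm(p, stream=True, max_tokens=256))
#   reply, ttft, tps = stream_with_timing(generate, "Hello!")
```

Measuring around the generator, rather than inside it, keeps the timing logic independent of whichever inference backend is used.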

Voice interaction is handled entirely offline. Audio captured from a USB microphone is transcribed using Vosk, and responses are spoken back through the speaker using eSpeak. For visual intelligence, the assistant integrates YOLOv8 Nano, which analyzes images from a Raspberry Pi Camera Module. After initial model loading, object detection runs in just a few seconds, identifying items in the scene and announcing the first detected label.
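The "announce the first detected label" step can be reduced to a small function that scans detector output for the first confident hit. This is a sketch under assumptions (the confidence threshold and phrasing are invented, and the commented ultralytics/eSpeak wiring is illustrative, not the project's code):

```python
def announce_first_label(detections, min_confidence=0.5):
    """Given (label, confidence) pairs from an object detector,
    return a spoken announcement for the first confident detection,
    or a fallback phrase when nothing qualifies."""
    for label, conf in detections:
        if conf >= min_confidence:
            return f"I see a {label}."
    return "I don't see anything I recognize."

# Hedged wiring with the ultralytics YOLOv8 API and eSpeak invoked
# via subprocess (model file and image path are assumptions):
#
#   from ultralytics import YOLO
#   import subprocess
#   model = YOLO("yolov8n.pt")
#   result = model("frame.jpg")[0]
#   dets = [(result.names[int(b.cls)], float(b.conf))
#           for b in result.boxes]
#   subprocess.run(["espeak", announce_first_label(dets)])
```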

A small 0.96-inch SSD1306 OLED display provides a simple user interface. Tokens stream to the screen in real time as the language model generates text, while idle animations give the device a personality when not in use. Interaction is entirely hardware-based: three physical buttons control push-to-talk conversation, object detection, and image capture. No keyboard, monitor, or terminal is required after startup.
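Streaming tokens to a 128x64 OLED means wrapping text into short lines and showing only the most recent screenful. The buffer below is a minimal sketch: the 16-column, 8-row geometry assumes an 8-pixel font, and the commented luma.oled driver wiring is an assumption about the display library, not confirmed by the source.

```python
class OledTextBuffer:
    """Accumulate streamed LLM tokens into lines sized for a 128x64
    SSD1306 (assuming ~16 characters per line at an 8-pixel font,
    with 8 lines visible on screen)."""

    def __init__(self, cols=16, rows=8):
        self.cols, self.rows = cols, rows
        self.text = ""

    def feed(self, token):
        # Tokens arrive incrementally as the model generates them.
        self.text += token

    def visible_lines(self):
        # Simple word wrap, then keep only the last `rows` lines.
        lines, line = [], ""
        for word in self.text.split():
            if line and len(line) + 1 + len(word) > self.cols:
                lines.append(line)
                line = word
            else:
                line = f"{line} {word}".strip()
        if line:
            lines.append(line)
        return lines[-self.rows:]

# With the luma.oled library (an assumption -- the project may use a
# different driver), each refresh could be drawn like:
#
#   from luma.core.interface.serial import i2c
#   from luma.oled.device import ssd1306
#   from luma.core.render import canvas
#   device = ssd1306(i2c(port=1, address=0x3C))
#   with canvas(device) as draw:
#       for i, row in enumerate(buf.visible_lines()):
#           draw.text((0, i * 8), row, fill="white")
```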

To give the assistant memory, Telkar implemented a retrieval-augmented generation pipeline using ChromaDB along with the all-MiniLM-L6-v2 embedding model. Conversations and local knowledge files are embedded and stored, allowing the assistant to recall relevant context in future interactions. A rolling window limits the number of stored entries to prevent uncontrolled growth in disk and RAM usage.
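A rolling window over a vector store can be as simple as deleting all but the newest N entries. The sketch below assumes IDs that sort in insertion order (e.g. "mem-0001"); that ID scheme, and the commented chromadb wiring, are illustrative assumptions rather than the project's actual implementation.

```python
def prune_to_window(collection, max_entries):
    """Keep only the newest `max_entries` items in a ChromaDB-style
    collection, assuming IDs carry a monotonically increasing index
    (e.g. 'mem-0001'). Returns the IDs that were deleted."""
    ids = sorted(collection.get()["ids"])
    stale = ids[:-max_entries] if len(ids) > max_entries else []
    if stale:
        collection.delete(ids=stale)
    return stale

# Hedged wiring with chromadb (collection name and DB path are
# assumptions; Chroma's default embedder is all-MiniLM-L6-v2):
#
#   import chromadb
#   client = chromadb.PersistentClient(path="memory_db")
#   col = client.get_or_create_collection("conversations")
#   col.add(ids=["mem-0005"],
#           documents=["User asked about the weather."])
#   prune_to_window(col, max_entries=200)
```

Pruning by ID keeps the window logic independent of the embedding model, so memory stays bounded no matter how chatty the user is.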

The entire project has been made available under a permissive MIT license. You can grab the source code from GitHub if you’d like to take it for a spin yourself.
