Published August 19, 2025 © MIT

The Curious Frame: An Offline AI Tutor for Kids

This screen-free device uses a Jetson Nano Orin to identify everyday objects and explain them to curious kids in their own language.

BeginnerFull instructions provided2 hours869

The Curious Frame: An Offline AI Tutor for Kids

Things used in this project

Hardware components

NVIDIA Jetson Orin Nano Developer Kit

Flash Memory Card, MicroSD Card

Crucial P3 500GB PCIe M.2 2280 SSD

Raspberry Pi Camera Module V2

Raspberry Pi Zero

Jabra speak2 55 UC

Software apps and online services

Ubuntu

Story

Demo time

Like any parent of a toddler, I have a two-year-old who is endlessly curious about everything. To help satisfy that curiosity in a healthy, screen-free way, I wanted to create an AI-powered device that could act as a personal tutor, explaining the world on his terms.

The project, which I call "The Curious Frame, " had to meet a few key requirements to be successful for a child between two and seven years old:

No Screen: To avoid the negative effects of screen time on young children. All feedback had to be vocal.
Simple Input: Since a young child's speech is still developing and they can't read, the input method needed to be visual and intuitive.
Portable & Offline: It had to be usable anywhere, from the living room to the backyard, without relying on a Wi-Fi connection.
Multilingual: It needed to speak our family's language (French) but also be adaptable.

The solution is a handheld cardboard frame that the child points at an object. A camera, connected to a powerful single-board computer, analyzes what's in the frame, and a speaker provides a simple explanation.

To achieve this, I chose the NVIDIA Jetson Orin Nano for its offline AI processing power in a small form factor. For the "brains, " Google's Gemma3n model was the perfect choice. It's efficient enough to run on the Nano, it can handle translation tasks for providing French output, and it has been evaluated for child safety.

The final piece of the puzzle was the interaction. Instead of complex voice commands, the child simply "frames" an object of interest. As a bonus feature, I added a small, foldable French flag to the frame. When the camera sees the flag, the device switches its language from English to French, turning it into a fun language-learning tool.

This project was a journey of problem-solving, particularly in getting powerful AI models to run smoothly on a compact, offline device. I'm proud of the working prototype and excited to share how you can build one, too.

Schematics

The system architecture is composed of three main software services running on the Jetson Orin Nano, orchestrated by a central Python application.

Architecture showing the relationship between the curious_frame app, Ollama, Piper, and the hardware.

Sequence of Operations:

The main application loop follows these steps:

1. Capture Image: The curious_frame Python application captures a snapshot from the USB camera using OpenCV.

2. Detect Objects: The raw image is sent to the moondream2 model to find objects within the cardboard frame's view. (Note: I used moondream2 for this step because I was unable to create a quantized Gemma3n model that supported image prompts and fit on the Nano's memory alongside the other services).

3. Check for Language Flag: The application checks if a French flag is among the detected objects. If so, it sets the output language to French.

4. Generate Description: For the first two objects found, the application asks the gemma-3n model (running in Ollama) to describe the objects and their use.

5. Translate (if needed): If the language is set to French, the English description is passed back to gemma-3n with a translation prompt.

6. Synthesize Speech: The final text description is sent to the Piper TTS server, which converts it into a WAV audio file.

7. Play Audio: The application plays the generated WAV file on the USB speaker.

8. Loop/Timeout: The process repeats. If no new objects are detected for 10 minutes, the application stops.

Assembling the Project

Step 1: Building the Frame

The frame is the primary interface for the child. It helps guide the camera's focus.

1. Cut the Cardboard: Using the dimensions below, cut out the main frame from a piece of cardboard. You can adjust these dimensions to fit your materials as the software will adapt.

Cardboard Dimensions

2. Print and Attach the Flag: Print the French flag image provided in the project's GitHub repository. Glue it to a foldable tab on the frame so it can be hidden or shown to the camera at will.

Example of French flag attachment. The position can be changed as it will be detected using a Vision Language Model.

Step 2: Setting up the Jetson Nano Orin

1. Flash JetPack: Follow the official NVIDIA tutorial to flash JetPack 6.2.1 onto your microSD card and boot the Jetson Nano Orin for the first time.

2. Mount NVMe SSD: Install the NVMe SSD onto the board. Once the system is running, format and mount the SSD. Configure Docker to use the SSD for its data directory to save space on the microSD card.

3. Optimize RAM: The 8GB of RAM on the board is a key constraint. To manage it, create and activate a SWAP file on the fast NVMe SSD. Additionally, disable the desktop GUI to free up several hundred megabytes of RAM. See more tips there.

# To turn off the desktop GUI
sudo systemctl set-default multi-user.target
# To re-enable it if needed
# sudo systemctl set-default graphical.target

Step 3: Setting up the USB Camera

For this prototype, I converted a Raspberry Pi Camera v2 into a standard USB Webcam (UVC) using a Raspberry Pi Zero. This is a project in itself, and you can follow that excellent online tutorial to achieve this. Alternatively, any standard UVC-compatible USB webcam will work.

Code

The complete code, including Dockerfiles and setup scripts, is available in the project's GitHub repository.

Installation and Deployment

To make reproducibility as simple as possible, I have created a setup script that automates the entire software installation.

1. Clone the Repository:

git clone https://github.com/webscit/curious-frame.git
cd curious-frame

2. Run the Setup Script:

sudo ./setup.sh

This script will:

Install Ollama on the Jetson board.
Build the Docker containers for the Piper TTS server and the main curious_frame application.
Set up a systemd service to automatically start the application on boot.

3. Reboot: Once the script is finished, reboot the Jetson board. The Curious Frame application will start automatically.

Key Code Snippet

The core logic is in the src/curious_frame/main.py file. Here is a simplified look at the main loop:

# (Simplified for clarity)
from curious_frame.audio import Audio
from curious_frame.camera import Camera
from curious_frame.language import Language
from curious_frame.vision import Vision

def main_loop():
    camera = Camera()
    vlm = Vision()
    llm = Language()
    tts = Audio(
        language_model=llm
    )
    language = "en"  # English
    while True:
        frame = camera.get_frame()
        objects = vlm.find_objects(frame)
        if "french flag" in objects:
            language = "fr"  # French
        else:
            language = "en"

        if objects:
            description = llm.chat(", ".join(objects))

            # `audio.speak` will translate the description if needed
            # It will also take care of playing the sound.
            audio.speak(description, language)

This snippet shows the orchestration: capturing an image, detecting objects, setting the language, generating the description, and finally converting it to speech after translating it optionally.

Final Thoughts and Next Steps

I am thrilled that this prototype successfully fulfills the initial vision. The biggest challenges were entirely performance-related: first, finding a way to get good object detection from the camera feed, and second, squeezing the various AI models and services onto the Jetson Nano's 8GB of RAM. The switch from a single-model (Gemma3n with vision) to a dual-model approach (moondream2 + Gemma3n) was the key breakthrough.

With more time, I would focus on two main improvements:

1. Consolidate to a single Gemma3n model: As more efficient, quantized vision-capable models become available, I would love to simplify the architecture back to using only Gemma3n.

2. Add "Action Cards": I imagine creating other cards the child could attach to the frame, with symbols that trigger different requests. For example, a "storybook" card could ask the LLM to generate a short story based on the objects in the frame, or a "gears" card could ask how the object is made.

This project was incredibly rewarding. If you find it interesting, please feel free to raise an issue or pull request on GitHub. Let's join forces and make it even better.