This project uses a Jetson AGX Orin to run multiple models together as a storytelling agent. I built it for a hackathon run by OpenAI for projects utilizing their model.
The major elements to this are:
- Seeed's MIDI Synthesizer device is used as the remote, communicating over serial. Two buttons provide control: one starts and stops the audio input state, and the other exits the story early. The story itself also has cues that trigger predetermined repeating sounds matched to the tone of the story (e.g. action or mystery).
- USB microphone for recording during the initial story setup
- Faster Whisper for transcribing the captured audio to text for model prompting
- A projector for displaying the story and playing the audio.
- Jetson AGX Orin for running the various models
- ComfyUI and SDXL (with HyperSD Lora) for creating visuals
- Piper TTS for generating wavs to play for the audio components
- OpenAI's gpt-oss-20b for creating the story from the prompt
I struggled to come up with a use for gpt-oss-20b that would stand out in a hackathon. At its core it's just another LLM, and while it is pretty powerful, I felt that using it as a simple local chatbot wouldn't be enough to make my entry stand out.
My children love stories, and over the years we've collected several devices where you insert an image disc and project frames of a story. Years ago Google Home also had a feature where it would read along and play audio cues for Golden Book stories, but they killed it. I thought I would combine the two to create an interactive storytelling experience where my children could shape the direction of the storyteller.
Seeed MIDI Synthesizer
One of the first struggles I ran into when building this was having some mechanism to trigger the microphone on the device. I've used wake words for previous projects like Infinite Sands, but noticed that without careful calibration they can cut you off too early or just keep recording long after you've gone silent to indicate you're done talking (especially a problem with kids making a lot of noise in the background). Then it hit me: instead of trying to layer background noises with the text being spoken, I could use my MIDI Synthesizer as a local audio player for sound effects and also rely on its existing buttons for triggering behavior.
The MIDI Synthesizer is a handheld device with a built-in speaker, an ESP32-C3, and four buttons. Additionally, since its USB port is connected to the ESP32-C3, you can use it for serial communication with a host device.
You can see from the schematic of the device that there are four available buttons, connected to D0, D1, D2, and D3.
The device communicates with the SAM2695 MIDI IC over serial using the TX pin on the XIAO ESP32-C3, but luckily USB serial can run in tandem on this board, so you can still communicate with a host at the same time.
I used PlatformIO to flash the device with my code. In my repository you can see the code for the wand under the StorytellerWand directory.
Once that's set up you can experiment with pyserial to confirm everything is working as intended. I included a demo script for this very purpose: it plays the start and end recording chimes, cycles through each mood, and then listens for button presses. I used it during development to confirm everything was ready for my actual storyteller script.
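For reference, talking to the wand from Python only needs pyserial. The sketch below is a minimal stand-in for the demo script; the port name, baud rate, and the "CHIME_START" command string are placeholder assumptions, so check the StorytellerWand firmware for the commands it actually understands.

```python
# Minimal pyserial sketch. PORT, BAUD, and the command string are assumptions;
# see the StorytellerWand firmware for the real protocol.
import serial

PORT = "/dev/ttyACM0"   # adjust to wherever the wand enumerates
BAUD = 115200           # assumed baud rate

with serial.Serial(PORT, BAUD, timeout=0.1) as wand:
    wand.write(b"CHIME_START\n")   # hypothetical: play the "start recording" chime
    print("Press the wand buttons; Ctrl+C to stop.")
    try:
        while True:
            line = wand.readline().decode(errors="ignore").strip()
            if line:               # assumes the firmware reports presses as text lines
                print("wand:", line)
    except KeyboardInterrupt:
        pass
```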
Further preparation
Once you have the Jetson AGX Orin set up with JetPack, you'll likely need an NVMe drive attached, as otherwise you'll run out of room very quickly downloading the various machine learning models. I set mine up as `/mnt/nvme` with a workspace under that at `/mnt/nvme/workspace`. If you're following along to do something similar but use a different directory, you may need to modify my script.
The setup process here assumes the files will be placed directly in the workspace directory and that ComfyUI is set up in a folder under it.
Add yourself to the dialout group for serial access:
sudo usermod -a -G dialout $USER
newgrp dialout
Ensure storyteller.sh is set to be executable:
chmod +x storyteller.sh
Clone ComfyUI: git clone https://github.com/comfyanonymous/ComfyUI.git
Inside the ComfyUI folder create a venv:
python -m venv venv
source venv/bin/activate
pip install --pre torch torchvision torchaudio --index-url https://pypi.jetson-ai-lab.io/jp6/cu126
pip install -r requirements.txt
You'll want to install some additional plugins:
- ComfyUI Manager - makes it easier to install nodes
- ComfyUI Tooling Nodes - provides API-based nodes such as loading a base64 image and sending images over a websocket (used in the sketch after the model list below)
Model files for ComfyUI:
- SDXL - place in the ComfyUI/models/checkpoints folder
- Hyper SDXL 2-Steps LoRA - place in the ComfyUI/models/loras folder
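With those plugins and models in place, the storyteller can drive ComfyUI entirely over its HTTP and websocket API. Below is a hedged sketch of that flow, not my exact code; "workflow_api.json" is a placeholder for a workflow exported with "Save (API Format)", and the 8-byte header skipped on binary frames follows ComfyUI's own websocket image example.

```python
# Hedged sketch: queue a ComfyUI workflow and receive the generated image back
# over the websocket (relies on the Tooling Nodes' websocket image output).
import json, uuid, requests, websocket  # pip: requests, websocket-client

SERVER = "127.0.0.1:8188"
client_id = str(uuid.uuid4())

with open("workflow_api.json") as f:      # placeholder: exported SDXL + Hyper-SD workflow
    workflow = json.load(f)

ws = websocket.WebSocket()
ws.connect(f"ws://{SERVER}/ws?clientId={client_id}")
requests.post(f"http://{SERVER}/prompt",
              json={"prompt": workflow, "client_id": client_id})

while True:
    msg = ws.recv()
    if isinstance(msg, bytes):            # binary frame = image data from the tooling node
        with open("frame.png", "wb") as out:
            out.write(msg[8:])            # skip the event/format header bytes
        break
ws.close()
```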
In the main workspace directory, also create a venv there:
python -m venv venv
source venv/bin/activate
pip install --pre torch torchvision torchaudio --index-url https://pypi.jetson-ai-lab.io/jp6/cu126
You'll need some other libraries (I believe I haven't missed any):
sudo apt-get install libgtk2.0-dev
sudo apt-get install portaudio19-dev
sudo apt install screen
sudo apt install git
sudo apt install ffmpeg
pip install soundfile
pip install screeninfo
pip install websocket-client
pip install flask
pip install openai-whisper sounddevice numpy
pip install transformers
pip install opencv-python-headless
pip install opencv-python pillow websocket-client requests pyserial numpy
pip install wheel setuptools pybind11 packaging
wget https://developer.download.nvidia.com/compute/cudss/0.6.0/local_installers/cudss-local-tegra-repo-ubuntu2204-0.6.0_0.6.0-1_arm64.deb
sudo dpkg -i cudss-local-tegra-repo-ubuntu2204-0.6.0_0.6.0-1_arm64.deb
sudo cp /var/cudss-local-tegra-repo-ubuntu2204-0.6.0/cudss-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cudss
Ollama install:
curl -fsSL https://ollama.com/install.sh | sh
ollama model location modification:
https://www.reddit.com/r/ollama/comments/1c4zg15/does_anyone_know_how_to_change_where_your_models/
sudo vim /etc/systemd/system/ollama.service
Environment="OLLAMA_MODELS=/mnt/nvme/workspace/ollama"
sudo systemctl daemon-reload
sudo systemctl restart ollama
export OLLAMA_MODELS=/mnt/nvme/workspace/ollama
nohup ollama serve > ~/ollama.log 2>&1 &
Testing Ollama with gpt-oss-20b:
curl http://localhost:11434/api/generate \
-d '{
"model": "gpt-oss:20b",
"prompt": "You are a magical and friendly storyteller for children between the ages of 4 and 7. Your purpose is to take a simple idea from a child and turn it into a short, enchanting, and positive story. Story is about a dog.",
"stream": true
}'
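The storyteller script talks to the same endpoint from Python. This is a sketch (not the exact code from storyteller.py) of streaming the NDJSON response from Ollama's /api/generate and accumulating the story text:

```python
# Stream a story from Ollama's /api/generate endpoint.
# The prompt here is shortened; the full prompt used is shown later in this post.
import json, requests

payload = {
    "model": "gpt-oss:20b",
    "prompt": "You are a magical and friendly storyteller... The story is about a dog.",
    "stream": True,
}
story = []
with requests.post("http://localhost:11434/api/generate",
                   json=payload, stream=True, timeout=600) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)           # one JSON object per streamed line
        story.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
print("".join(story))
```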
Faster Whisper - the plain pip install of Faster Whisper pulls in a CPU-only build of CTranslate2 on the Jetson. If you want faster transcription you'll need to build CTranslate2 from source with CUDA enabled.
sudo apt-get update
sudo apt-get install -y libcudnn9-dev-cuda-12
sudo apt-get install -y git cmake ninja-build build-essential python3-dev
cd /mnt/nvme/workspace
git clone --recursive https://github.com/OpenNMT/CTranslate2.git
cd CTranslate2
mkdir build && cd build
cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DWITH_CUDA=ON \
-DWITH_CUDNN=ON \
-DWITH_MKL=OFF \
-DWITH_OPENBLAS=ON \
-DCMAKE_CUDA_ARCHITECTURES=87 \
-DBUILD_CLI=OFF \
-DOPENMP_RUNTIME=COMP \
-G Ninja
ninja -j"$(nproc)"
cmake -DCMAKE_INSTALL_PREFIX=$HOME/.local .
ninja install
export CPLUS_INCLUDE_PATH=$HOME/.local/include:${CPLUS_INCLUDE_PATH}
export LIBRARY_PATH=$HOME/.local/lib:${LIBRARY_PATH}
export LD_LIBRARY_PATH=$HOME/.local/lib:${LD_LIBRARY_PATH}
export CMAKE_PREFIX_PATH=$HOME/.local:${CMAKE_PREFIX_PATH}
cd ../python
python setup.py bdist_wheel
pip install dist/*.whl
pip install faster-whisper
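Once the wheel is installed you can sanity-check that the GPU build gets picked up. A small example (the model size and file name are just placeholders, not what storyteller.py uses):

```python
# Quick GPU transcription check with faster-whisper.
from faster_whisper import WhisperModel

model = WhisperModel("small.en", device="cuda", compute_type="float16")
segments, info = model.transcribe("prompt.wav", beam_size=5)
print(f"Detected language {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.1f}s -> {segment.end:.1f}s] {segment.text}")
```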
Note: subsequent shell sessions will need to run the following unless you add it to your shell startup script (.bash_profile, etc.):
export CPLUS_INCLUDE_PATH=$HOME/.local/include:${CPLUS_INCLUDE_PATH}
export LIBRARY_PATH=$HOME/.local/lib:${LIBRARY_PATH}
export LD_LIBRARY_PATH=$HOME/.local/lib:${LD_LIBRARY_PATH}
export CMAKE_PREFIX_PATH=$HOME/.local:${CMAKE_PREFIX_PATH}
Piper TTS install (and voice):
sudo apt-get install -y alsa-utils curl jq
pip install --upgrade piper-tts
export VOICE_ID=en_US-lessac-high
export FAMILY=lessac
export QUALITY=high
BASE=/mnt/nvme/workspace/models/piper/voices/en/en_US/$VOICE_ID
mkdir -p "$BASE"
curl -fL --retry 3 -o "$BASE/$VOICE_ID.onnx" \
"https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/$FAMILY/$QUALITY/$VOICE_ID.onnx"
curl -fL --retry 3 -o "$BASE/$VOICE_ID.onnx.json" \
"https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/$FAMILY/$QUALITY/$VOICE_ID.onnx.json"
Testing Piper TTS (MODEL, CFG, and SR point at the voice files downloaded above; 22050 Hz is the sample rate for the high-quality Piper voices, and you may need to adjust the aplay -D device for your audio output):
MODEL="$BASE/$VOICE_ID.onnx"
CFG="$BASE/$VOICE_ID.onnx.json"
SR=22050
{
echo "Good evening. Let's read a story together." \
| piper --model "$MODEL" --config "$CFG" \
--length_scale 1.12 --noise_scale 0.20 --noise_w 0.70 \
--output-raw
# 0.15s of silence (16-bit mono = 2 bytes per sample)
dd if=/dev/zero bs=2 count=$(( SR * 15 / 100 )) status=none
} | aplay -q -D plughw:0,3 -r "$SR" -f S16_LE -c 1 -t raw -
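In the storyteller itself the narration goes to wav files in the workspace rather than straight to aplay. A hedged sketch of that step, reusing the same voice files and flags as the shell test above (the 22050 Hz rate matches the high-quality voice):

```python
# Generate a wav page with Piper by capturing its raw 16-bit output.
# Paths mirror the en_US-lessac-high voice downloaded above.
import subprocess
import numpy as np
import soundfile as sf

VOICE = "/mnt/nvme/workspace/models/piper/voices/en/en_US/en_US-lessac-high/en_US-lessac-high"
SR = 22050
text = "Once upon a time, a small dog found a glowing bone."

raw = subprocess.run(
    ["piper", "--model", f"{VOICE}.onnx", "--config", f"{VOICE}.onnx.json",
     "--output-raw"],
    input=text.encode(), capture_output=True, check=True,
).stdout

samples = np.frombuffer(raw, dtype=np.int16)   # raw S16_LE mono samples
sf.write("page1.wav", samples, SR)
```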
Additional test: microphone-test.py - used to find and test the attached USB microphone; it uses Faster Whisper for transcription.
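The core of such a microphone check can be done with sounddevice: list the input devices, pick the USB one, and record a short clip to feed into the transcription snippet above. This is a rough sketch rather than the contents of microphone-test.py, and matching the device on the string "USB" is an assumption.

```python
# Find a USB input device, record a few seconds, and save it for transcription.
import sounddevice as sd
import soundfile as sf

SR, SECONDS = 16000, 5
usb = next(i for i, d in enumerate(sd.query_devices())
           if "USB" in d["name"] and d["max_input_channels"] > 0)
print("Using device:", sd.query_devices(usb)["name"])

audio = sd.rec(int(SR * SECONDS), samplerate=SR, channels=1,
               dtype="int16", device=usb)
sd.wait()                      # block until recording finishes
sf.write("mic_test.wav", audio, SR)
```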
Running the Scripts
I opted to run the scripts over SSH. This let me monitor the status of the device via its logging, and I didn't want to modify the Jetson AGX Orin to autostart into the script or have to attach a keyboard and mouse. I set the device to never sleep, use max power, and auto-login to my user on boot. That way I just needed to bring the device to the room to showcase it, plug it in, and turn it on.
From there I ran the following:
nohup ./storyteller.sh &
screen -S display_session
source venv/bin/activate
export DISPLAY=:0
export CPLUS_INCLUDE_PATH=$HOME/.local/include:${CPLUS_INCLUDE_PATH}
export LIBRARY_PATH=$HOME/.local/lib:${LIBRARY_PATH}
export LD_LIBRARY_PATH=$HOME/.local/lib:${LD_LIBRARY_PATH}
export CMAKE_PREFIX_PATH=$HOME/.local:${CMAKE_PREFIX_PATH}
export OLLAMA_MODELS=/mnt/nvme/workspace/ollama
nohup ollama serve > ~/ollama.log 2>&1 &
python storyteller.py
The first command starts ComfyUI in the background (using its own venv that we set up earlier). You can test that yourself by forwarding the ComfyUI port over SSH with `-L localhost:8188:localhost:8188` and then accessing ComfyUI at `http://localhost:8188` on your connecting machine.
Screen then opens the display session on the server; since it's its own console, the venv is activated there and DISPLAY is set to :0.
After that I exported the variables used by CTranslate2, exported the OLLAMA_MODELS location (Ollama wasn't starting automatically for me), and started Ollama in the background to act as an API.
Finally, the storyteller script runs in that venv.
Note: it's a bit slower than the attached video lets on. It takes a while for gpt-oss-20b to finish thinking, so there may be a delay while it sorts that out before the story begins. The actual generation is pretty quick though, and it places the story files in a tmp folder in the workspace so it can access them as it works through the story.
Prompt Used:
You are a magical and friendly storyteller for children between the ages of 4 and 7. Your purpose is to take a simple idea from a child and turn it into a short, enchanting, and positive story.
YOUR PERSONALITY:
You are gentle, kind, and always encouraging.
You use simple words and short, clear sentence structures.
Your stories always have a positive moral or a happy ending.
You are very imaginative and love whimsical details.
SPECIAL COMMANDS:
You have two special abilities. You must use these commands on their own lines to control the story's music and illustrations.
[MOOD: MoodName]: When the feeling of the story shifts, you must insert a mood tag. This tag controls the background music. The available moods are: Happy, Sad, Sneaky, Mysterious, Action, Calm.
[IMAGE: A detailed description of the entire scene]: When the scene changes significantly and a NEW illustration is needed, you must insert an image tag. This description must be a complete, detailed prompt for an AI artist, describing the setting, the main character, and the action. Names should not be used but rather enough information to generate a given character solely off a description.
CURRENT STORY REQUEST
The child's idea for the story is: "{user_prompt}"
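On the playback side, the script splits the generated story on those tags. A minimal sketch (not the exact storyteller.py code) of separating spoken text, [MOOD: ...] cues for the wand, and [IMAGE: ...] prompts for ComfyUI:

```python
# Split a generated story into say/mood/image events based on the tag lines
# the prompt asks the model to emit on their own lines.
import re

TAG = re.compile(r"^\[(MOOD|IMAGE):\s*(.+?)\]\s*$")

def parse_story(story_text):
    events = []                              # list of (kind, payload) tuples
    for line in story_text.splitlines():
        m = TAG.match(line.strip())
        if m:
            events.append((m.group(1).lower(), m.group(2)))
        elif line.strip():
            events.append(("say", line.strip()))
    return events

sample = "[MOOD: Happy]\nOnce upon a time...\n[IMAGE: a small dog in a sunny meadow]"
print(parse_story(sample))
```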
Closing Thoughts
I thought this was a neat project to put together. My children really love it and have asked to use it a few times since. It does have some rough edges that need to be improved, which I've documented below. I'm not sure if it's good enough to win a hackathon, but it was my best effort given the time constraints, so I'm pleased with how it turned out.
If you try to get it working yourself and run into any issues, feel free to write in the comments below and I'll try to help out. I think I included all of the libraries I used along the way, but I did go back and forth a bit, so it's possible I missed one or two; I'm happy to assist.
Future
I regret that I wasn't able to polish the project as much as I would have liked; I was working pretty much up until the hackathon deadline. There are various elements I would have liked to adjust and may in future revisions:
- Piper TTS occasionally cuts off words at the end, and it's not as smooth sounding as I'd have liked.
- Additional pause between pages would have been helpful.
- Using stories to fine tune the model and ensuring it created better image descriptions.
- More experimentation with image generation / using a more modern image model with fewer artifacts, like Flux (albeit speed was an important element of this setup).
Licenses:
- Faster-whisper: MIT License | URL
- SDXL: Stability AI CreativeML Open RAIL++-M License | URL
- Hyper SD: Bytedance Inc. License | URL
- Piper TTS: MIT License | URL
- CTranslate2: MIT License | URL
- Ollama: MIT License | URL
- gpt-oss-20b: Apache License v2.0 | URL