Published March 10, 2026 © CC BY-NC-SA

ESP32-S3 voice frontend that connects to live AI models

DIY voice input system for the ESP32-S3, connected to an AI voice model. Integrates wake-word detection, local script execution, MQTT, and direct GPIO control.

Beginner · Full instructions provided · 2 hours · 1,168

Things used in this project

Hardware components

ESP32-S3

MAX98357A

INMP441

Software apps and online services

Espressif ESP-IDF

voice-assistant.io

Story

What is My Project About?

I built this project as a lightweight middleware layer that connects modern AI voice models to a dynamic local function-calling engine. My goal was to provide a fully functional voice interface capable of interacting with the physical world, while keeping the execution logic, security, and network access strictly within my own local infrastructure.

I designed it specifically for home automation enthusiasts and makers. The system operates as a standalone node, meaning I can ask my assistant to turn on the lights or change the temperature, while keeping absolute control over the available function registry right on my device.

The Three-Tier Architecture

When designing the system, I focused on balancing ease of assembly with processing power. I broke it down into three key components:

1. Voice Input: I radically simplified the hardware for DIY assembly. My build consists of an ESP32-S3 controller, an I2S microphone (INMP441), and an external speaker amplifier (MAX98357A). This module handles preliminary noise filtering, Wake Word detection, and audio capture. If you want a more professional build, I've also provided ready-to-use Gerber files for a custom printed circuit board (PCB) in the GitHub repository.

2. Processor (Local Server): The central brain of my system is the local voice-assistant binary. I compiled it to run without external dependencies on Windows, Linux, macOS, and Raspberry Pi. It manages audio streams, routes data, and processes the AI model's requests to trigger my custom automations.

3. Proxy Server: To optimize traffic, I implemented a proxy server that compresses the audio stream down to 70–80 kbps. It also provides local Voice Activity Detection (VAD) using a lightweight AI model. This cuts out background noise before the data ever reaches the main LLM, which reduces both computational load and token costs.
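
To put the 70–80 kbps figure in context, here is a rough back-of-the-envelope calculation in Python. The capture format (16 kHz, 16-bit, mono) is an assumption, since it is not stated above, and pcm_bitrate_kbps is just an illustrative helper, not part of the project.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int) -> float:
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


# Assumed capture format (not stated in the article): 16 kHz, 16-bit, mono.
raw_kbps = pcm_bitrate_kbps(16_000, 16, 1)

# Compressed range quoted for the proxy server.
for target_kbps in (70, 80):
    ratio = raw_kbps / target_kbps
    print(f"{raw_kbps:.0f} kbps -> {target_kbps} kbps: {ratio:.1f}x smaller")
```

Under that assumption the raw stream is 256 kbps, so the proxy would be shrinking it by roughly 3.2x to 3.7x. A different capture format would change these numbers.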

How It Works: My Local Execution Pipeline

The core design principle I followed is straightforward: the AI model is the brain — the hands are local.

The model recognizes user intent and decides which function to call, while my local voice-assistant agent decides how to execute it. Out of the box, it supports four handlers:

WEBHOOK: Fires HTTP requests to local services (I use this for n8n, Node-RED, or Home Assistant).
MQTT: Publishes messages to my local MQTT broker.
GPIO: Provides direct pin control over the active client connection.
EXEC: Runs shell scripts directly on the host machine.
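
The split between "the model picks a function" and "the local agent executes it" can be sketched as a small dispatch table. This is only an illustration of the idea, not the voice-assistant binary's actual code: FunctionEntry, Dispatcher, make_stub, and the example function names and targets are all hypothetical, and the handlers are stubs that describe an action instead of performing it.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

# The four handler kinds listed above.
HANDLER_KINDS = ("WEBHOOK", "MQTT", "GPIO", "EXEC")


@dataclass(frozen=True)
class FunctionEntry:
    """One locally registered function (hypothetical structure)."""
    kind: str    # one of HANDLER_KINDS
    target: str  # URL, MQTT topic, pin label, or script path


# A handler receives the entry's target plus the model-supplied arguments.
Handler = Callable[[str, Dict[str, Any]], str]


class Dispatcher:
    """Maps function names chosen by the model to local handlers."""

    def __init__(self, registry: Dict[str, FunctionEntry],
                 handlers: Dict[str, Handler]) -> None:
        for name, entry in registry.items():
            if entry.kind not in HANDLER_KINDS:
                raise ValueError(f"{name}: unknown handler kind {entry.kind!r}")
            if entry.kind not in handlers:
                raise ValueError(f"{name}: no handler installed for {entry.kind}")
        self._registry = dict(registry)
        self._handlers = dict(handlers)

    def dispatch(self, name: str, args: Dict[str, Any]) -> str:
        entry = self._registry.get(name)
        if entry is None:
            # The model asked for something outside the local registry.
            return f"rejected: {name} is not registered"
        return self._handlers[entry.kind](entry.target, args)


def make_stub(kind: str) -> Handler:
    """Stub handler that only describes what a real one would do."""
    def handler(target: str, args: Dict[str, Any]) -> str:
        return f"{kind} -> {target} {sorted(args.items())}"
    return handler


# Hypothetical registry entries for illustration only.
registry = {
    "lights_on": FunctionEntry("MQTT", "home/livingroom/light"),
    "status_led": FunctionEntry("GPIO", "pin-5"),
}
dispatcher = Dispatcher(registry, {k: make_stub(k) for k in HANDLER_KINDS})

print(dispatcher.dispatch("lights_on", {"state": "on"}))
print(dispatcher.dispatch("format_disk", {}))
```

The point of the design is that the model only ever supplies a name and arguments. Anything not present in the local registry is rejected before a handler runs.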

Beyond executing commands, I made sure the system supports reverse audio feedback. It can directly play uploaded WAV files or synthesize speech (Text-to-Speech) using the AI model's voice. You can even embed emotional markers (e.g., "urgent" or "cheerful") into the text to add vocal expression to the synthesized speech.

Security and Two Operating Modes

Privacy was a huge priority for me, so security is baked into the core architecture. The AI model never gets direct access to the local network, MQTT broker, or GPIO pins. All traffic between my local controller and the AI is transmitted over an encrypted WSS/TLS channel, authenticated via unique device keys.
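
As a minimal sketch of what "encrypted WSS/TLS, authenticated via device keys" implies on the client side, the Python standard library can build a certificate-verifying TLS context and refuse non-WSS endpoints. The endpoint URL, the Authorization header layout, and the helper names below are assumptions made for illustration; the project's real authentication scheme is not described here.

```python
import ssl
from urllib.parse import urlsplit


def make_tls_context() -> ssl.SSLContext:
    """TLS context that verifies the server certificate and hostname."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def require_wss(url: str) -> str:
    """Refuse plain ws:// or any other non-WSS endpoint."""
    if urlsplit(url).scheme != "wss":
        raise ValueError(f"refusing non-WSS endpoint: {url}")
    return url


def auth_headers(device_key: str) -> dict:
    """Hypothetical header layout; the real scheme may differ."""
    if not device_key:
        raise ValueError("device key is empty")
    return {"Authorization": f"Bearer {device_key}"}


ctx = make_tls_context()
endpoint = require_wss("wss://example.invalid/stream")  # placeholder URL
headers = auth_headers("DEVICE_KEY_PLACEHOLDER")        # placeholder key
print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname, endpoint)
```

A WebSocket client library would then be given this context and these headers when opening the connection.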

For deployment flexibility, I created two operating modes:

PROD Mode: This is my stable connection to enterprise models featuring web search, traffic compression, and server-side VAD. Billing is strictly per second of model usage, and I ensure data is protected from being used for model training.

DEV Mode: I use this primarily for hardware debugging. It allows a direct connection via free Google AI Studio keys, bypassing the billing system, but without traffic optimization and with compromises regarding privacy policies.

Quick Setup

I wanted to make getting started as frictionless as possible, so I completely eliminated the need to fight with the Arduino IDE. Here is the streamlined workflow I came up with:

1. The Dashboard: Create an account at voice-assistant.io, generate an API key for your device, and configure a FunctionSet. This is where you tell the AI how to behave and what local tools it has permission to use.

2. The Local Server: Download the pre-compiled voice-assistant binary for your platform (Windows, macOS, Linux, or Raspberry Pi) from the GitHub releases. Just extract and run it—there are zero runtime dependencies to install.

3. Hardware Configuration: Clone my ESP32-S3 firmware repository and wire up your hardware. Run the configure_settings.sh script in your terminal. A simple text UI will pop up where you can drop in your Wi-Fi credentials, your new API key, and the IP address of your local voice-assistant instance.

4. Automated Flashing: Run the run_upload.sh script (or run_upload_firmware.sh if you want to use my pre-built image). The script automatically handles the toolchain, compiles the firmware, flashes your controller, and instantly opens a serial monitor so you can see the logs.

Once that’s done, just say the Wake Word and start talking!

Use coupon "HACKSTER" to get a discount.

With best regards, Roman Zolotarev.

More details on voice-assistant.io

GitHub repository