What is our project about?
- Motion Box is a 5-track, motion-triggered music sequencer that blurs the line between a musical instrument and a fitness activity. It allows anyone to compose and perform electronic music through natural body movements—waving hands, stomping feet, and physically "swapping" sound palettes using RFID tags. Unlike traditional MIDI controllers, Motion Box runs entirely in a web browser via the Web Serial API, requiring no complex DAW (Digital Audio Workstation) setup or plugins. It’s a plug-and-play hardware-to-web experience designed for interactive art, education, and experimental performance.
The Inspiration?
- The idea came from a high-energy party: people are dancing, the DJ is tweaking synthesizers, and the music is driving the room. We thought: what if the dancers were the DJ? Our goal was to bridge the gap between the audience and the music. Traditionally, people dance to the beat; with Motion Box, they create it. By turning hand waves and foot stomps into real-time triggers, we turn the body into a live music synthesizer. It is built to make the party more interactive: every dance move adds a layer to the track, and switching a physical RFID tag is like a DJ swapping a vinyl record.
How does it work?
- Motion Box operates as a synchronized ecosystem of hardware and web technology. At its core, an ESP32 microcontroller manages a hybrid sensor network to capture full-body expression: two MPU6050 IMUs strapped to the hands track sharp spatial acceleration via I2C (using addresses 0x68 and 0x69) to trigger snares and hi-hats, while two Force Sensitive Resistors (FSRs) tucked under the feet detect tactile stomps for kick and tom drums. To add a layer of physical interaction, an RC522 RFID reader scans "material tags"—each with a unique UID—to instantly remap the entire sound engine’s timbre to styles like Glass, Wood, or Metal. All this raw data is processed locally on the ESP32 and streamed via USB Serial to the browser. There, a real-time sequencer grid (inspired by BeepBox) intercepts the stream using the Web Serial API, translating physical movement into instant visual feedback and high-fidelity audio without the need for any external MIDI software or plugins.
High Level Architecture
This architecture is split cleanly into two main computation domains:
- Embedded domain: ESP32 firmware handles sensor polling, thresholding, and telemetry output.
- Browser domain: the UI handles parsing, event interpretation, sequencing, sound generation, visualization, and audio export.
Hardware and Wiring
The hardware layout is intentionally simple and centered on the ESP32 as the single controller. Both hand-mounted MPU6050 modules share one I2C bus, with `GPIO 21` used for SDA and `GPIO 22` used for SCL. The two sensors are distinguished by address rather than by separate wiring: the left-hand unit uses address `0x68` with AD0 tied to ground, while the right-hand unit uses address `0x69` with AD0 tied to `3.3V`. This allows both IMUs to operate on the same bus while still being read independently by the firmware.
The two foot-mounted FSRs are connected as analog pressure inputs on `GPIO 34` and `GPIO 35`. Each FSR is used with a `10k` pull-down resistor so that the ESP32 sees a stable low value when no force is applied and a higher voltage when the sensor is pressed. In practice, these two channels act as left-foot and right-foot stomp detectors for the low-percussion tracks.
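The divider arithmetic behind this can be sketched as follows. This is a rough illustration, not the project's code: it assumes the FSR sits between `3.3V` and the GPIO with the `10k` resistor to ground, a 12-bit ADC range of 0 to 4095, and an ideal linear ADC (the real ESP32 ADC is noticeably nonlinear near the rails). The function names are hypothetical.

```typescript
// Hypothetical helpers; only the 10k pull-down comes from the text.
const PULL_DOWN_OHMS = 10000;
const ADC_MAX = 4095; // assumed 12-bit reading

/** Ideal ADC count for a given FSR resistance (FSR on the high side of the divider). */
function adcCountForFsr(fsrOhms: number): number {
  // The divider output is a fraction R_pd / (R_fsr + R_pd) of the supply,
  // and the ADC count is the same fraction of full scale.
  return Math.round((ADC_MAX * PULL_DOWN_OHMS) / (fsrOhms + PULL_DOWN_OHMS));
}

/** FSR resistance at which the ideal divider produces a given ADC count. */
function fsrOhmsForAdcCount(count: number): number {
  return PULL_DOWN_OHMS * (ADC_MAX / count - 1);
}
```

Under those assumptions, an ADC count of `1200` corresponds to roughly 0.97 V at the pin, which the divider reaches when the FSR resistance drops to about 24 kΩ.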
The RC522 RFID reader is connected over SPI, using `GPIO 5` for slave select, `GPIO 18` for clock, `GPIO 19` for MISO, `GPIO 23` for MOSI, and `GPIO 4` for reset. All sensors and modules share the ESP32 `3.3V` supply and a common ground. Functionally, the hands provide motion-based percussion triggers, the feet provide pressure-based triggers, and the RFID reader provides a physical way to switch the sound material of the instrument.
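For quick reference, the wiring above can be collected into one lookup object. The pin numbers and I2C addresses are taken from the text; the object shape, the key names, and the assumption that the left foot is on `GPIO 34` and the right foot on `GPIO 35` are illustrative only.

```typescript
// Pin assignments copied from the wiring description; names are hypothetical.
const MOTION_BOX_PINS = {
  i2c: { sda: 21, scl: 22 },
  imuAddress: { leftHand: 0x68, rightHand: 0x69 },
  fsr: { leftFoot: 34, rightFoot: 35 }, // left/right order is an assumption
  rfidSpi: { ss: 5, sck: 18, miso: 19, mosi: 23, rst: 4 },
};

/** Returns any GPIO number that is assigned more than once (expected: none). */
function duplicateGpios(): number[] {
  const p = MOTION_BOX_PINS;
  const used: number[] = [
    p.i2c.sda, p.i2c.scl,
    p.fsr.leftFoot, p.fsr.rightFoot,
    p.rfidSpi.ss, p.rfidSpi.sck, p.rfidSpi.miso, p.rfidSpi.mosi, p.rfidSpi.rst,
  ];
  return used.filter((pin, index) => used.indexOf(pin) !== index);
}
```

A check like `duplicateGpios()` is a cheap way to catch a pin conflict when the layout changes.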
Firmware
The ESP32 firmware initializes I2C, SPI, serial, and sensor inputs, wakes both MPU6050 devices, and then loops continuously at about 50 Hz. For each IMU, it reads accelerometer and gyroscope values, scales them, and combines them into a single `motion` score:
`motion = |gx| + |gy| + |gz| + 20*|ax| + 20*|ay| + 20*|az - 1.0|`
That score is thresholded into `swing` and `shake` flags. The FSR channels are sampled with `analogRead()` and compared against a threshold of `1200` to derive `pressed` states. The RFID reader detects tags, converts their UIDs to uppercase colon-separated strings, and applies a `1000 ms` cooldown to avoid repeated triggers. Each loop iteration emits one JSON line containing both IMU states, both foot states, and RFID status.
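The scoring and thresholding described above can be sketched as below. This mirrors the logic in TypeScript rather than the Arduino C++ the ESP32 would actually run. Several details are assumptions, not taken from the project: the units (acceleration in g, angular rate in degrees per second), the two threshold values for `swing` and `shake`, the idea that `shake` is simply a higher threshold on the same score, the strict `>` comparison, and every JSON field name.

```typescript
interface ImuSample {
  ax: number; ay: number; az: number; // acceleration, assumed to be in g
  gx: number; gy: number; gz: number; // angular rate, assumed to be in deg/s
}

// Hypothetical thresholds: the text does not give the swing/shake values.
const SWING_THRESHOLD = 150;
const SHAKE_THRESHOLD = 400;
const FSR_THRESHOLD = 1200; // this one is stated in the text

/** Combined motion score; a sensor lying flat and still scores 0. */
function motionScore(s: ImuSample): number {
  return Math.abs(s.gx) + Math.abs(s.gy) + Math.abs(s.gz)
    + 20 * Math.abs(s.ax) + 20 * Math.abs(s.ay) + 20 * Math.abs(s.az - 1.0);
}

function imuFlags(s: ImuSample): { motion: number; swing: boolean; shake: boolean } {
  const motion = motionScore(s);
  return { motion, swing: motion > SWING_THRESHOLD, shake: motion > SHAKE_THRESHOLD };
}

/** Builds one telemetry line; the field names are illustrative only. */
function telemetryLine(
  leftHand: ImuSample,
  rightHand: ImuSample,
  leftFsr: number,
  rightFsr: number,
  rfidUid: string | null,
): string {
  return JSON.stringify({
    leftHand: imuFlags(leftHand),
    rightHand: imuFlags(rightHand),
    leftFoot: { pressed: leftFsr > FSR_THRESHOLD },
    rightFoot: { pressed: rightFsr > FSR_THRESHOLD },
    rfid: rfidUid,
  });
}
```

The `az - 1.0` term removes gravity for a sensor resting flat, so only deviation from rest contributes; the factor of 20 brings accelerometer readings onto a scale comparable to the gyroscope terms.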
Browser Logic and Sequencing
The browser controller opens the serial connection, parses each JSON line, and converts sensor changes into sequencer events. The sequencer has five tracks: a locked melody track plus right hand, left hand, left foot, and right foot tracks. The grid runs across four measures and supports `4/4` with `16` steps per measure or `6/8` with `12` steps per measure.
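Serial data arrives in arbitrary chunks, so a JSON line can be split across two reads. A minimal sketch of the line-buffering step is shown below; it is an assumption about how the parsing could be done, not the project's actual code. In a browser, the chunks would typically come from the Web Serial port's `readable` stream piped through a `TextDecoderStream`; here they are plain strings so the logic can be read on its own. The class name is hypothetical.

```typescript
/** Accumulates serial text chunks and returns every complete, valid JSON line. */
class JsonLineBuffer {
  private pending = "";

  push(chunk: string): unknown[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    // The last element is an incomplete line (or ""), so keep it for the next chunk.
    this.pending = lines.pop() || "";

    const messages: unknown[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed === "") continue;
      try {
        messages.push(JSON.parse(trimmed));
      } catch (_err) {
        // Skip lines that are not valid JSON, e.g. boot noise or a partial first line.
      }
    }
    return messages;
  }
}
```

Dropping malformed lines instead of throwing keeps the sequencer running through the garbage that often appears when a board resets or the port is opened mid-line.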
Sensor input is quantized to the current playhead position. The browser uses rising-edge detection so that a note is inserted only when a sensor changes from inactive to active. Right-hand motion writes to the snare track, left-hand motion to the hi-hat track, left-foot pressure to the kick track, and right-foot pressure to the bass-tom track. Once a trigger is accepted, the corresponding step is set active, the sound is previewed immediately, and the grid is re-rendered.
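The rising-edge and quantization rules can be illustrated with a small sketch. It is a simplified model under stated assumptions: the grid is a plain boolean array per track, the locked melody track is left out, and all type and function names are hypothetical. Sound preview and re-rendering are omitted.

```typescript
type SensorTrack = "rightHand" | "leftHand" | "leftFoot" | "rightFoot";
type SensorState = Record<SensorTrack, boolean>;
type Grid = Record<SensorTrack, boolean[]>;

// rightHand -> snare, leftHand -> hi-hat, leftFoot -> kick, rightFoot -> bass tom
const SENSOR_TRACKS: SensorTrack[] = ["rightHand", "leftHand", "leftFoot", "rightFoot"];

/** Creates an empty grid, e.g. 16 steps x 4 measures for 4/4 or 12 x 4 for 6/8. */
function createGrid(stepsPerMeasure: number, measures: number): Grid {
  const totalSteps = stepsPerMeasure * measures;
  const makeRow = (): boolean[] => {
    const row: boolean[] = [];
    for (let i = 0; i < totalSteps; i++) row.push(false);
    return row;
  };
  return { rightHand: makeRow(), leftHand: makeRow(), leftFoot: makeRow(), rightFoot: makeRow() };
}

/**
 * Activates the playhead step for every sensor that went from inactive to active
 * between two frames, and returns the tracks that fired.
 */
function applySensorFrame(
  grid: Grid,
  previous: SensorState,
  current: SensorState,
  playheadStep: number,
): SensorTrack[] {
  const fired: SensorTrack[] = [];
  for (const track of SENSOR_TRACKS) {
    if (current[track] && !previous[track]) {
      grid[track][playheadStep] = true; // quantize to the current playhead position
      fired.push(track);
    }
  }
  return fired;
}
```

Because only the inactive-to-active transition writes a note, a foot held down across several telemetry frames produces one kick rather than a run of them.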
RFID and Audio
RFID tags act as tangible timbre controls rather than identification devices. When a known UID is detected, the browser maps it to a material preset, updates all track material selectors, plays a confirmation sound, and updates the RFID banner and log. This preserves the rhythmic pattern while changing the synthesis behavior of the whole instrument.
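A sketch of the tag-handling path follows: formatting UID bytes as an uppercase colon-separated string, gating repeats with a cooldown, and looking up a material preset. In the project the formatting and cooldown happen in the firmware and the lookup in the browser; they are combined here only for illustration. The UID values in the table are made up, since the real tag UIDs are not given, and all names are hypothetical.

```typescript
type Material = "Glass" | "Wood" | "Metal";

// Hypothetical UIDs, for illustration only.
const MATERIAL_BY_UID: { [uid: string]: Material } = {
  "04:A1:B2:C3": "Glass",
  "04:D4:E5:F6": "Wood",
  "04:0A:0B:0C": "Metal",
};

/** Formats UID bytes as uppercase, colon-separated, two-digit hex. */
function formatUid(bytes: number[]): string {
  return bytes
    .map((b) => ("0" + (b & 0xff).toString(16)).slice(-2).toUpperCase())
    .join(":");
}

/** Accepts at most one tag event per cooldown window. */
class TagGate {
  private lastAcceptedMs = -Infinity;

  constructor(private cooldownMs: number) {}

  accept(nowMs: number): boolean {
    if (nowMs - this.lastAcceptedMs < this.cooldownMs) return false;
    this.lastAcceptedMs = nowMs;
    return true;
  }
}

/** Returns the preset for a tag read, or null if it is gated or unknown. */
function materialForTag(uidBytes: number[], nowMs: number, gate: TagGate): Material | null {
  if (!gate.accept(nowMs)) return null;
  const material = MATERIAL_BY_UID[formatUid(uidBytes)];
  return material === undefined ? null : material;
}
```

Only the material selection changes on a successful read; the grid of active steps is untouched, which is what keeps the rhythm intact across a swap.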
Audio is generated with the Web Audio API. The default mapping is melody, snare, hi-hat, kick, and bass tom, while the Glass, Wood, and Metal modes switch the synthesis recipe for each track rather than applying a simple effect. The melody uses a D pentatonic note set across two octaves, and playback scans active cells at the current playhead using a browser-timed loop. For export, the system renders the full pattern through an `OfflineAudioContext` and encodes the result as a stereo WAV file.
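The final encoding step of the export can be sketched as below. In a browser, the two channels would typically come from `getChannelData(0)` and `getChannelData(1)` on the `AudioBuffer` returned by `OfflineAudioContext.startRendering()`. The sketch assumes 16-bit PCM, which the text does not specify, and the function name is hypothetical.

```typescript
/** Encodes two float channels (range -1..1) as a 16-bit PCM stereo WAV file. */
function encodeStereoWav(left: Float32Array, right: Float32Array, sampleRate: number): ArrayBuffer {
  const channels = 2;
  const bytesPerSample = 2;
  const frameCount = Math.min(left.length, right.length);
  const dataBytes = frameCount * channels * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataBytes); // 44-byte canonical header
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataBytes, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for PCM
  view.setUint16(20, 1, true); // audio format 1 = PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * channels * bytesPerSample, true); // byte rate
  view.setUint16(32, channels * bytesPerSample, true); // block align
  view.setUint16(34, 8 * bytesPerSample, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataBytes, true);

  // Interleave samples as L, R, L, R, ... in little-endian signed 16-bit.
  let offset = 44;
  for (let i = 0; i < frameCount; i++) {
    for (const sample of [left[i], right[i]]) {
      const clamped = Math.max(-1, Math.min(1, sample));
      view.setInt16(offset, Math.round(clamped * 32767), true);
      offset += bytesPerSample;
    }
  }
  return buffer;
}
```

The returned buffer can be wrapped in a `Blob` with type `audio/wav` to offer it as a download.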