Creating Images and Videos with Sound by Using 96 MEMS Microphones
Learn how an image can be generated by processing the sound from 96 microphones in an array.
The Idea
As part of his senior capstone project for the ECE program at Carnegie Mellon, student John Duffy wanted a way to locate objects in a room by using sound. He intended it to be a platform for messing around with beamforming and general phased-array math, as well as a way to experiment with signal processing. The finished project reads sound data and converts it into a 9 x 13 image.
Designing the Hardware
The project's heart is an FPGA, which was chosen because it can take in the large amount of incoming data and process it. For sensing sound, Duffy went with 96 MEMS microphones, each of which sends its sound data as a pulse density modulation (PDM) wave. There are six hand-assembled panels that contain 16 microphones each, along with various other supporting components.
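To get a feel for what a PDM stream is, here is a minimal sketch of how 1-bit PDM data can be converted back into ordinary PCM audio samples. On the real FPGA this would typically be done with a CIC filter; the simple moving-average decimator below is an illustrative stand-in, and the function name and parameters are assumptions, not from the project.

```python
import numpy as np

def pdm_to_pcm(pdm_bits, decimation=64):
    """Convert a 1-bit PDM stream into multi-bit PCM samples.

    A PDM microphone encodes audio as the density of 1s in a
    high-rate bit stream; low-pass filtering and decimating
    recovers the audio. A boxcar (moving-average) filter is the
    simplest stand-in for the CIC filter an FPGA would use.
    """
    # Map bits {0, 1} to bipolar values {-1.0, +1.0}
    bipolar = 2.0 * np.asarray(pdm_bits, dtype=np.float64) - 1.0
    # Trim so the stream divides evenly into decimation windows
    n = (len(bipolar) // decimation) * decimation
    # Average each window of `decimation` bits -> one PCM sample
    return bipolar[:n].reshape(-1, decimation).mean(axis=1)

# A stream of all 1s decodes to a full-scale positive output
print(pdm_to_pcm(np.ones(128), decimation=64))  # -> [1. 1.]
```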
Processed sound data is then sent via Ethernet to a host PC for further analysis.
Digital Processing
As stated earlier, the FPGA takes in PDM microphone data by scheduling when each sensor should be read. Each raw waveform is then combined with those from the other microphones and sent to a Java program running on a PC.
The challenge of processing is figuring out where in 3D space a sound came from based on when it hits each microphone. The first algorithm Duffy used delays each microphone's waveform and sums the results into a single waveform, which helps reject external noise and also yields the angle of arrival.
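The summing approach described above is the classic delay-and-sum beamformer: each channel is time-shifted by its expected arrival delay for a chosen look direction, so sound from that direction adds coherently while sound from elsewhere partially cancels. A minimal numpy sketch, with assumed names and a simplified integer-sample alignment:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def delay_and_sum(signals, mic_positions, direction, fs):
    """Time-domain delay-and-sum beamformer (illustrative sketch).

    `signals` is (n_mics, n_samples); `mic_positions` is
    (n_mics, 3) in meters; `direction` is a vector toward the
    assumed source; `fs` is the sample rate in Hz.
    """
    direction = np.asarray(direction, dtype=float)
    direction /= np.linalg.norm(direction)
    # Arrival-time offset of each mic for the given look direction
    delays = mic_positions @ direction / SPEED_OF_SOUND
    shifts = np.round(delays * fs).astype(int)
    out = np.zeros(signals.shape[1])
    for sig, s in zip(signals, shifts):
        # Align each channel, then accumulate (integer-sample only)
        out += np.roll(sig, -s)
    return out / len(signals)
```

With the channels aligned toward the true source direction, the output reproduces the source waveform; steering elsewhere attenuates it.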
However, this method of determining direction is computationally inefficient, which is where the second algorithm comes into play: frequency-domain beamforming. The math involved is quite complex, so make sure to watch this explainer video for more information on it. In summary, the digital waveform from each microphone is multiplied by sine and cosine waves at a given target frequency. The summed cosine product is treated as the real component of a complex number, and the summed sine product as the imaginary component. The target resolution determines what each pixel represents in degrees from the center; for example, a 12-pixel-wide image with a 120-degree field of view makes each pixel correspond to 10 degrees. Then, for each pixel, the distance between each microphone and the center of the array is used to compute that microphone's expected complex vector. Finally, the actual received data array is compared to the expected complex array, and the pixel's complex number is set to the similarity between the two arrays.
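The steps above can be sketched in a few lines of numpy. This is a simplified planar model, not the project's actual Java code: the sine/cosine multiplication is a single-bin DFT producing one phasor per microphone, each pixel gets an expected "steering" phasor array from the array geometry, and the pixel value is the similarity (normalized correlation) between expected and measured phasors. All names, the planar angle model, and the default 9 x 13 / 120-degree numbers are assumptions based on the article.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def mic_phasors(signals, freq, fs):
    """One complex phasor per microphone at the target frequency.

    Multiplying each channel by cosine (real part) and sine
    (imaginary part) at `freq` and summing is a single-bin DFT.
    """
    t = np.arange(signals.shape[1]) / fs
    real = signals @ np.cos(2 * np.pi * freq * t)
    imag = signals @ np.sin(2 * np.pi * freq * t)
    return real + 1j * imag

def beamform_image(signals, mic_xy, freq, fs,
                   width=13, height=9, fov_deg=120.0):
    """Frequency-domain beamformer producing a (height, width) image.

    `mic_xy` is (n_mics, 2): each mic's offset from the array
    center in meters. Each pixel value is the magnitude of the
    normalized correlation between the expected and measured
    phasor arrays for that pixel's direction.
    """
    measured = mic_phasors(signals, freq, fs)
    measured /= np.linalg.norm(measured)
    k = 2 * np.pi * freq / SPEED_OF_SOUND  # wavenumber
    # Each pixel maps to an (azimuth, elevation) pair inside the FOV
    az = np.deg2rad(np.linspace(-fov_deg / 2, fov_deg / 2, width))
    el = np.deg2rad(np.linspace(-fov_deg / 2, fov_deg / 2, height))
    image = np.zeros((height, width))
    for i, e in enumerate(el):
        for j, a in enumerate(az):
            # Direction components for this pixel (planar model)
            dirx, diry = np.sin(a) * np.cos(e), np.sin(e)
            # Expected phase at each mic for a wave from that direction
            steering = np.exp(1j * k * (mic_xy @ np.array([dirx, diry])))
            steering /= np.linalg.norm(steering)
            # Similarity between expected and measured phasor arrays
            image[i, j] = np.abs(np.vdot(steering, measured))
    return image
```

A source dead ahead of the array produces identical phasors at every microphone, so the correlation peaks at the center pixel.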
Displaying an Image
Based on all of that complicated math from before, the image can be generated by computing each pixel's complex number and then taking its magnitude. Colors are then mapped to these values, with blue denoting low correlation and red denoting high correlation.
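The color mapping might look something like the following sketch: normalize the magnitudes, then blend linearly from blue at the low end to red at the high end. The article does not specify the exact colormap used, so this linear blue-to-red blend is an assumption.

```python
import numpy as np

def magnitudes_to_rgb(image):
    """Map beamformer magnitudes to RGB colors in [0, 1]:
    blue for low correlation, red for high correlation."""
    lo, hi = image.min(), image.max()
    # Normalize to [0, 1]; a flat image maps entirely to blue
    norm = (image - lo) / (hi - lo) if hi > lo else np.zeros_like(image)
    rgb = np.zeros(image.shape + (3,))
    rgb[..., 0] = norm        # red channel grows with correlation
    rgb[..., 2] = 1.0 - norm  # blue channel fades out
    return rgb
```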