How Does This Sound?

GeoCLAP uses a contrastive learning approach to construct soundscape maps that predict the sounds one will encounter in a particular area.

Nick Bild
2 years ago · AI & Machine Learning
Creating a soundscape map from multiple sources of information (📷: S. Khanal et al.)

In the quest to endow machines with the capacity to comprehend and navigate their surroundings, visual cues have traditionally been given precedence as a primary source of information. Computer vision, the field dedicated to enabling machines to interpret and make decisions based on visual data, has made significant progress in recent years. However, this exclusive focus on the visual domain tends to overlook a fundamental aspect of human perception: the importance of sound.

Sound, a rich and nuanced source of environmental information, has the potential to elevate machine understanding to a new level. In contrast to vision, which provides a snapshot of the immediate area, sound has the unique ability to bridge spatial gaps. For example, the distant hum of vehicles can reach our ears from blocks away, alerting us to the presence of automobile traffic even before it comes into view. Or consider how we can hear breaking waves long before the ocean itself is visible. This advance auditory information allows us to make anticipatory decisions, a capability that is often absent when relying solely on visual cues.

By embracing sound as a complementary modality, machines can better grasp the dynamic and evolving nature of the world, mirroring the multisensory depth that characterizes human perception. Engineers at Washington University in St. Louis, noting the recent advances in machine learning, devised a plan to leverage sound in a new way. Specifically, they developed a system called Geography-Aware Contrastive Language Audio Pre-training (GeoCLAP) that can map soundscapes and predict the most probable sounds to be heard at a particular geographic location.

Existing systems of this sort generally work by crowdsourcing annotation of the sounds people notice in their surroundings. While valuable information can be collected in this way, this approach tends to focus only on heavily trafficked areas of the world, and is also limited by the set of descriptors that the annotators choose to use. To overcome these shortcomings of present approaches, the team instead chose to train a model from three sources of data, namely geotagged audio, a textual description of the soundscape, and overhead images of the area.

The work leveraged the SoundingEarth dataset, which consists of more than 50,000 geotagged audio recordings from around the planet, paired with 1024 x 1024 pixel overhead images. Combined with textual descriptions of the audio in each sample, the data was used to train a model with contrastive learning. In this way, the model learned a shared embedding space for overhead images, sounds, and textual descriptions.
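
To make the idea of a shared embedding space more concrete, here is a minimal sketch of a contrastive objective over three modalities. It is not the team's implementation: the function names, the temperature of 0.07, and the choice to sum a symmetric InfoNCE-style loss over all three modality pairs are assumptions made for illustration. The embeddings are plain lists of numbers standing in for encoder outputs.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(queries, keys, temperature=0.07):
    """One-directional contrastive loss: queries[i] should match keys[i].

    The temperature value is an assumption, not a figure from the article.
    """
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Scaled similarity of this query against every key in the batch.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching key (index i) as the target.
        total += log_denominator - logits[i]
    return total / len(q)

def symmetric_loss(a, b, temperature=0.07):
    """Average the loss over both retrieval directions (a to b and b to a)."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))

def trimodal_loss(image_emb, audio_emb, text_emb, temperature=0.07):
    """Hypothetical combined objective: sum over the three modality pairs."""
    return (symmetric_loss(image_emb, audio_emb, temperature)
            + symmetric_loss(image_emb, text_emb, temperature)
            + symmetric_loss(audio_emb, text_emb, temperature))

# Toy batch of three samples with hand-made 3-D embeddings.
aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

loss_aligned = trimodal_loss(aligned, aligned, aligned)
loss_shuffled = trimodal_loss(aligned, shuffled, aligned)
```

In this toy batch the loss is close to zero when each image, sound, and description point the same way, and much larger when the audio embeddings are paired with the wrong samples. Minimizing a loss of this kind is what pulls matching items together in the shared space.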

With this training, GeoCLAP can create a soundscape map for any geographic area. In other words, the model can predict the sounds one would be most likely to hear at any given location on the planet.
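
One plausible way to turn a shared embedding space into a map is to score every overhead image tile against a query embedding and plot the scores by location. The sketch below assumes exactly that. The coordinates, the two-dimensional embeddings, and the `soundscape_map` helper are all hypothetical stand-ins, not part of GeoCLAP itself.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def soundscape_map(tile_embeddings, query_embedding):
    """Score each location by how well its tile embedding matches the query.

    tile_embeddings maps (latitude, longitude) to the embedding of the
    overhead image at that location. Returns a dict of location to score.
    """
    return {
        location: cosine_similarity(embedding, query_embedding)
        for location, embedding in tile_embeddings.items()
    }

# Hand-made embeddings; a real system would get these from trained encoders.
tiles = {
    (38.63, -90.20): [0.9, 0.1],  # hypothetical dense urban tile
    (38.60, -90.45): [0.2, 0.8],  # hypothetical wooded park tile
}
query = [1.0, 0.0]  # stands in for the embedding of a prompt like "car traffic"

scores = soundscape_map(tiles, query)
most_likely_location = max(scores, key=scores.get)
```

Here the urban tile scores far higher than the park tile for the traffic-like query. Repeating the scoring over a dense grid of tiles would yield a heat map for that sound.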

When compared with existing state-of-the-art techniques, GeoCLAP proved to be a significant step forward. For image-to-sound and sound-to-image retrieval tasks, gains of 55.71% and 57.95%, respectively, were seen in Recall@100, which measures how often the correct match appears among the top 100 retrieved results.
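
Cross-modal retrieval of this kind is commonly scored with Recall@K, the fraction of queries whose true match appears among the top K results. The sketch below shows how that metric can be computed from a similarity matrix. The `recall_at_k` helper and the toy scores are illustrative assumptions, not the team's evaluation code or data.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose true match ranks in the top k candidates.

    similarity[i][j] is the score between query i and candidate j, and the
    true match for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Toy image-to-sound scores for three queries and three candidates.
toy_similarity = [
    [0.9, 0.1, 0.0],  # query 0: true match ranked first
    [0.8, 0.2, 0.1],  # query 1: true match ranked second
    [0.1, 0.3, 0.2],  # query 2: true match ranked second
]

recall_at_1 = recall_at_k(toy_similarity, 1)
recall_at_2 = recall_at_k(toy_similarity, 2)
```

In this toy case only one of the three queries finds its match in first place, but all three find it within the top two, which is why a larger K always gives an equal or higher score.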

Using the global, high-resolution soundscape maps constructed by GeoCLAP, future intelligent systems may gain a more comprehensive understanding of their surroundings.
