Breaking the Sound Barrier
Stability AI's newly released Stable Audio tool seeks to do for audio generation what Stable Diffusion did for image generation.
Generative AI, a subset of artificial intelligence, has advanced rapidly in recent years, reshaping a number of domains, most visibly image generation and conversational chatbots. The technology has drawn wide attention because it uses deep learning algorithms to create content that closely emulates human patterns and creativity.
Good generative audio tools, however, are notably lacking. Sure, a number of options do exist, but they leave much to be desired. Current tools often struggle to deliver high-quality and diverse audio content, frequently falling short in terms of naturalness, variability, and adaptability. This deficiency hampers the creative potential and practical utility of generative audio technology across industries including music, voice synthesis, and interactive media. As the demand for sophisticated audio generation continues to rise, there is a clear need for advancements that push the boundaries of what generative AI can achieve in the auditory domain.
Stability AI, the company that helped to produce the wildly popular Stable Diffusion algorithm for image generation, has thrown its hat in the ring with a new tool called Stable Audio. Stable Audio leverages a diffusion-based generative model, of the same general type as the model used in Stable Diffusion, to produce high-quality audio clips of varying lengths. By supplying a text prompt, a user can create audio ranging from music to sound effects, and more.
In the past, using diffusion models for audio generation was challenging because they are trained to produce outputs of a fixed size, the same size as their training inputs. So, for example, a model trained on 20-second audio clips would only be able to generate 20-second outputs. Needless to say, that is a problem if you need to generate a full-length song.
In developing its new tool, Stability AI took a different approach that conditions the model on text metadata along with information about the duration and start time of each audio file. The resulting model architecture makes it possible to generate audio of varying lengths, within certain limitations. The maximum length of generated audio is still limited to the training window size. In the case of Stable Audio, the maximum length (for users paying $12 per month for the Pro plan) is 90 seconds, which is pretty reasonable, but falls short of being truly song-length. Users of the free service tier are artificially limited to creating audio clips of no more than 45 seconds.
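The duration and start-time conditioning described above can be illustrated with a small sketch. It is not Stability AI's implementation, and no model is involved: the names `TimingCondition`, `seconds_start`, `seconds_total`, `training_condition`, `inference_condition`, and `trim_to_length` are hypothetical, and the 95-second window is an assumption taken from the clip length quoted later in this article. The idea shown is only that each training chunk carries its position within the source file, while at generation time the requested duration is capped and the fixed-size output is trimmed to match.

```python
from dataclasses import dataclass

WINDOW_SECONDS = 95.0  # assumed fixed training window (not confirmed by the article)
SAMPLE_RATE = 44_100   # 44.1 kHz


@dataclass(frozen=True)
class TimingCondition:
    """Hypothetical timing values supplied to the model alongside the text prompt."""
    seconds_start: float  # where the chunk begins within the source file
    seconds_total: float  # duration of the whole source file (or requested clip)


def training_condition(file_seconds: float, chunk_start: float) -> TimingCondition:
    """Timing values for a training chunk cut from a longer audio file."""
    if not 0 <= chunk_start <= file_seconds:
        raise ValueError("chunk_start must lie inside the file")
    return TimingCondition(seconds_start=chunk_start, seconds_total=file_seconds)


def inference_condition(requested_seconds: float,
                        plan_limit: float = WINDOW_SECONDS) -> TimingCondition:
    """Timing values for generation: start at zero, cap the requested length."""
    if requested_seconds <= 0:
        raise ValueError("requested_seconds must be positive")
    limit = min(plan_limit, WINDOW_SECONDS)  # a plan can never exceed the window
    return TimingCondition(seconds_start=0.0,
                           seconds_total=min(requested_seconds, limit))


def trim_to_length(samples, condition: TimingCondition,
                   sample_rate: int = SAMPLE_RATE):
    """Cut a fixed-size window of samples down to the conditioned duration."""
    keep = round(condition.seconds_total * sample_rate)
    return samples[:keep]


# Stand-in for one channel of model output: a full window of silence.
window = [0.0] * round(WINDOW_SECONDS * SAMPLE_RATE)
clip = trim_to_length(window, inference_condition(30.0))
print(len(clip))  # 1323000 samples, i.e. 30 seconds at 44.1 kHz
```

Under these assumptions, a request for 120 seconds on the Pro plan would be written as `inference_condition(120, plan_limit=90)` and capped at 90 seconds; with `plan_limit=45` it would be capped at the free tier's 45 seconds.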
Stability AI has made a number of samples available, and they are quite impressive. These high-quality clips closely follow the user’s text prompt. The progress made by Stable Audio makes it easy to envision a future where tools such as this enable the development of all sorts of new creative applications.
The tool does have some limitations, however. The previously mentioned restrictions on length will certainly limit what the tool can be used for. Moreover, the model was trained on a dataset of 800,000 audio files containing music, sound effects, and single-instrument stems. While this is a lot of information, it is not Internet-scale, as the training sets of modern large language models are. So, you would not be able to, for example, ask the model to create a new song in the style of your favorite artist, because it has no concept of what your favorite artist sounds like.
Stable Audio is hot off the press, so to speak, so the website is dealing with heavy traffic. For the time being, you should expect any test you want to run to take quite a long time to complete. While the future direction of this project is unclear, it was noted that a 95-second clip at a 44.1 kHz sample rate could be generated in one second on an NVIDIA A100 GPU, which makes it a highly accessible tool, should the developers choose to open it up to the world as they did with Stable Diffusion.
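To put the quoted speed figure in perspective, the arithmetic can be worked out in a few lines. This is a back-of-the-envelope sketch; the function names are made up for illustration, and the count is per audio channel, since the figures quoted here do not state a channel count.

```python
def samples_per_channel(clip_seconds: float, sample_rate_hz: int) -> int:
    """Number of audio samples in one channel of a clip."""
    return round(clip_seconds * sample_rate_hz)


def real_time_factor(clip_seconds: float, generation_seconds: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    return clip_seconds / generation_seconds


# Figures quoted in the article: 95 seconds of 44.1 kHz audio in one second.
n_samples = samples_per_channel(95, 44_100)
speedup = real_time_factor(95, 1.0)
print(n_samples)  # 4189500 samples per channel
print(speedup)    # 95.0, i.e. 95 times faster than real time
```

In other words, the model would need to produce roughly 4.2 million sample values per channel in that one second of A100 time.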