Taking On the Text-To-Video Efficiency Problem

Text-to-video AI is extremely computationally intensive, but Hugging Face researchers think that future models can be made more efficient.

Nick Bild
14 days ago · AI & Machine Learning

Their practical applications are not always entirely clear, but generative artificial intelligence (AI) tools are surging in popularity all the same. One of the most popular of these tools is the text-to-video (T2V) generator. With just a short textual description of what you would like to see, these algorithms can serve up a heaping helping of AI slop to fill every video hosting and social media site with trash that you’ll wish you could unsee.

Or at least that’s how it started, but not how it’s going. These tools have matured beyond the point of creating people with extra fingers and legs and unnatural motions, and now many of them produce very convincing results. And with this newfound realism, it is easy to see how these T2V models could be used to produce a short film on a tight budget or create a high-quality advertising campaign without a high-priced agency in New York. For small businesses and individuals, this could break down numerous longstanding barriers.

But it’s not all roses and sunshine in the world of GenAI, my friends. Any time you start talking about video and AI in the same sentence, you can count on the computing and energy costs being huge. So these tools may be easy enough for anyone to use, but high costs still find a way to sneak in. In an effort to combat this problem, a pair of researchers at Hugging Face dug into existing T2V models. Their goal was to find the most computationally intensive aspects of these algorithms to give researchers insights into how future tools can be made more efficient — and more accessible.

The study takes a close look at several state-of-the-art, open-source T2V systems, analyzing how long they take to render video clips and how much energy they consume in the process. The researchers first built a theoretical model to predict how performance should scale with three main factors: the resolution of the video, its length, and the number of denoising steps (the repeated refinement process that gives diffusion-based models their realism). Then, they tested those predictions on WAN2.1-T2V, one of the most popular open-source text-to-video systems available.
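
To make that idea concrete, here is a minimal sketch of such a cost model in Python. Everything in it is illustrative rather than lifted from the paper: the function name, the patch and temporal-stride defaults, and the assumption that a DiT-style backbone's cost is dominated by self-attention over all spatio-temporal latent tokens (which gives a cost proportional to steps × tokens²).

```python
# A rough per-clip compute proxy for a diffusion-transformer T2V model.
# Assumes the dominant cost is self-attention over all spatio-temporal
# latent tokens, so cost ~ steps * tokens^2. All names and defaults here
# are illustrative assumptions, not taken from the study.

def predicted_cost(height, width, num_frames, denoise_steps,
                   patch=16, temporal_stride=4):
    """Return a relative compute cost (arbitrary units) for one clip."""
    spatial_tokens = (height // patch) * (width // patch)
    temporal_tokens = max(1, num_frames // temporal_stride)
    tokens = spatial_tokens * temporal_tokens
    return denoise_steps * tokens ** 2
```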

They found that the time and energy needed to produce a video clip grow quadratically with both spatial resolution and duration. This means that doubling the resolution or the number of frames makes the process roughly four times more expensive. Meanwhile, the number of denoising steps scales linearly, so halving the number of steps cuts the energy and time required nearly in half.
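
Plugging numbers into the sketch above reproduces those ratios; the clip settings here are arbitrary placeholders, not benchmark configurations from the study.

```python
base = predicted_cost(480, 832, 80, 50)
print(predicted_cost(960, 832, 80, 50) / base)   # 4.0 -- double the pixels
print(predicted_cost(480, 832, 160, 50) / base)  # 4.0 -- double the frames
print(predicted_cost(480, 832, 80, 25) / base)   # 0.5 -- halve the steps
```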

The team extended their analysis beyond just WAN2.1-T2V, benchmarking six leading open-source T2V models, including AnimateDiff, CogVideoX, Mochi-1, and LTX-Video. Across the board, they found similar trends. Most systems are compute-bound, meaning that performance is limited not by memory or bandwidth, but by the raw arithmetic horsepower of the GPU.
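
A quick way to see what "compute-bound" means here is a back-of-the-envelope roofline check: a kernel is limited by arithmetic rather than memory whenever its arithmetic intensity (FLOPs per byte of memory traffic) exceeds the GPU's ratio of peak compute to peak bandwidth. The sketch below is my own illustration; the hardware figures are H100 SXM spec-sheet values, and the workload figures are placeholders that a profiler would normally supply.

```python
# Roofline rule of thumb: compute-bound when arithmetic intensity
# (FLOPs per byte moved) exceeds peak-compute / peak-bandwidth.

H100_PEAK_FLOPS = 989e12   # dense BF16 tensor-core throughput, FLOP/s
H100_PEAK_BYTES = 3.35e12  # HBM3 memory bandwidth, bytes/s

def is_compute_bound(kernel_flops, kernel_bytes):
    ridge = H100_PEAK_FLOPS / H100_PEAK_BYTES  # ~295 FLOPs per byte
    return kernel_flops / kernel_bytes > ridge

# Example with placeholder numbers: big attention matmuls move relatively
# little data per FLOP, so they land on the compute-bound side of the ridge.
print(is_compute_bound(kernel_flops=2e12, kernel_bytes=4e9))  # True
```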

The researchers ran their tests on NVIDIA’s powerful H100 GPU, but found that the models achieved only about 45% of its theoretical maximum performance. Due to factors like tile misalignment, kernel overheads, and memory-bound operations, peak performance is never reached in practice. These factors only serve to make the problem of compute-bound algorithms worse.
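
That 45% figure is essentially what practitioners call model FLOPs utilization: achieved throughput divided by the spec-sheet peak. A sketch of the arithmetic, with made-up workload numbers chosen only to land near the reported figure:

```python
# Utilization = achieved throughput / peak throughput. The workload
# numbers below are placeholders, not measurements from the study.

def utilization(model_flops_per_clip, seconds_per_clip,
                peak_flops=989e12):  # H100 SXM dense BF16 peak
    achieved = model_flops_per_clip / seconds_per_clip
    return achieved / peak_flops

print(f"{utilization(2.5e16, 56.0):.0%}")  # ~45% with these placeholder values
```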

Video diffusion models are already hundreds or thousands of times more computationally demanding than text or image generation, and their appetite for power will only grow as users demand longer and higher-resolution clips. That means future work in this area must focus not just on visual fidelity, but on sustainability. The team suggests researchers turn to techniques like quantization, diffusion caching, and attention optimization, which could reduce costs by 20-60% without hurting quality.
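
As one concrete illustration, diffusion caching exploits the fact that intermediate activations change slowly between adjacent denoising steps, so expensive block outputs can sometimes be reused rather than recomputed. The toy sketch below captures only the core idea; all names are hypothetical, and real systems such as DeepCache are considerably more careful about what they cache and when.

```python
# A toy version of diffusion caching: recompute an expensive block's
# output only every `refresh` denoising steps, and reuse the cached
# result in between. Illustrative only, not from the paper.

class CachedBlock:
    def __init__(self, compute_fn, refresh=3):
        self.compute_fn = compute_fn  # the expensive block forward pass
        self.refresh = refresh
        self.cached = None

    def __call__(self, x, step):
        if self.cached is None or step % self.refresh == 0:
            self.cached = self.compute_fn(x)  # full recompute
        return self.cached                    # cheap reuse otherwise

# With refresh=3, roughly two-thirds of this block's invocations are
# skipped, trading a little fidelity for a large drop in compute.
```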

Nick Bild
R&D, creativity, and building the next big thing you never knew you wanted are my specialties.