Alibaba Cloud Releases the "Open" Qwen3-Omni, Its First "Natively End-to-End Omni-Modal AI"
The single model, released in three tailored versions at 30 billion parameters each, can handle text, audio, video, and imagery.
Chinese technology giant Alibaba Cloud has announced a new entry in its Qwen, or Tongyi Qianwen, family of large language models: Qwen3-Omni, which it claims is the first "natively end-to-end omni-modal AI" capable of working with text, video, audio, and image content in a single model — and it has released three variants, each with 30 billion parameters, for others to try.
"We release Qwen3-Omni, the natively end-to-end multilingual omni-modal foundation models," Alibaba Cloud's Xiong Wang says of the company's latest launch in the large language model field. "It is designed to process diverse inputs including text, images, audio, and video, while delivering real-time streaming responses in both text and natural speech."
Large language models are the technology underpinning the current "artificial intelligence" boom: statistical models that ingest vast quantities of, typically copyrighted, data and distill them into "tokens" — then turn input prompts into more tokens before responding with the most statistically likely continuation tokens, which are then decoded into something in the shape of an answer. When all has gone well, the answer-shape matches reality; otherwise the response is an answer in appearance only, divorced from reality.
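To picture that continuation step, consider a toy Python sketch in which a hypothetical, hand-written probability table stands in for the model: each step simply appends the most likely next token given the tokens so far. Real models like Qwen3-Omni learn those probabilities from their training data rather than a lookup table, but the mechanism is the same.

```python
# Toy stand-in for a language model: a hand-written table of continuation
# probabilities, used only to illustrate token-by-token generation.
TOY_MODEL = {
    ("the", "sky"): {"is": 0.7, "was": 0.2, "fell": 0.1},
    ("sky", "is"): {"blue": 0.6, "falling": 0.3, "green": 0.1},
}

def continue_prompt(tokens, steps=2):
    """Greedily append the most statistically likely next token, step by step."""
    for _ in range(steps):
        context = tuple(tokens[-2:])        # a tiny two-token context window
        candidates = TOY_MODEL.get(context)
        if not candidates:                  # no learned continuation: stop early
            break
        tokens.append(max(candidates, key=candidates.get))
    return tokens

print(continue_prompt(["the", "sky"]))      # ['the', 'sky', 'is', 'blue']
```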
"Qwen3-Omni adopts the Thinker-Talker architecture," the company's LLM development team says of the new model. "Thinker is tasked with text generation while Talker focuses on generating streaming speech tokens by receives high-level representations directly from Thinker. To achieve ultra–low-latency streaming, Talker autoregressively predicts a multi-codebook sequence. At each decoding step, an MTP module outputs the residual codebooks for the current frame, after which the Code2Wav renderer incrementally synthesizes the corresponding waveform, enabling frame-by-frame streaming generation."
The ability to "think" and "talk" at the same time, though, isn't what makes Qwen3-Omni interesting; rather, it's the promise of a true multi-modal model — a single model that can handle text, video, audio, and image inputs and outputs in one. "Mixing unimodal and cross-modal data during the early stage of text pretraining can achieve parity across all modalities, i.e., no modality-specific performance degradation," the company claims, "while markedly enhancing cross-modal capabilities."
The company's own technical report, however, pours a little cold water on this latter claim: while Qwen3-Omni shows strong performance across all media types by the standards of modern LLMs, its performance when handling text is noticeably weaker than that of the earlier Qwen3-Instruct model — suggesting there is, indeed, a trade-off in moving from a modality-specific model to a jack-of-all-trades.
Other features of the model include support for 119 languages in text mode, 19 for speech recognition, and 10 for speech generation; audio-only latency as low as 211ms and audio-video latency as low as 507ms; support for audio inputs of up to 30 minutes in length; and support for "tool-calling" — the ability to execute external programs in order to create an "agentic" AI assistant able to take action to complete tasks, rather than merely respond with instruction-shaped details on how to do it yourself.
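Tool-calling generally works by handing the model a machine-readable description of each function it is allowed to invoke. The sketch below shows what such a definition might look like in the OpenAI-compatible format that many Qwen deployments accept; the weather function and its fields are purely hypothetical examples, not part of the Qwen3-Omni release.

```python
# Hypothetical tool definition in the OpenAI-compatible "tools" format that
# many Qwen deployments accept; the weather function is an invented example.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",              # name the model may choose to call
        "description": "Fetch the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

# Given a prompt like "What's the weather in Paris?", a tool-calling model
# replies with a structured call such as
# {"name": "get_weather", "arguments": {"city": "Paris"}}, which the host
# application executes before handing the result back to the model.
```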
Alibaba Cloud has released three tailored variants of the model — Qwen3-Omni-30B-A3B-Instruct, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner — on GitHub, Hugging Face, and ModelScope under the permissive Apache 2.0 license; as is usual in the field of LLMs, though, these are not truly "open source" models, as not everything required to build them from scratch is provided. A demo is also available on Hugging Face.
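For anyone who wants to experiment with the released checkpoints locally, the minimal sketch below pulls the weights with the huggingface_hub library; the repository ID is assumed from the published model name, and the model card should be consulted for the actual supported inference pipeline and hardware requirements.

```python
# Download the released Qwen3-Omni weights for local experimentation.
# The repo ID below is assumed from the published model name; check the
# model card for the exact repository and recommended inference code.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="Qwen/Qwen3-Omni-30B-A3B-Instruct",  # assumed repository ID
    local_dir="qwen3-omni-instruct",             # where to store the checkpoint
)
print("Model files downloaded to:", local_path)
```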