Qwen Promises AI Agent Capabilities From Its Free-to-Use Qwen2.5-VL Vision Language Models
Alibaba-owned Chinese AI startup offers 72-billion-, seven-billion-, and three-billion-parameter models, but beware license restrictions.
Chinese AI startup Qwen has announced the launch, under free-to-use licenses, of its latest vision-language model (VLM) family, promising competitive performance and deeper image analysis capabilities than ever before: Qwen2.5-VL.
"We release Qwen2.5-VL, the new flagship vision-language model of Qwen and also a significant leap from the previous Qwen2-VL," Qwen, an Alibaba subsidiary, says of its latest model family. "In terms of the flagship model Qwen2.5-VL-72B-Instruct, it achieves competitive performance in a series of benchmarks covering domains and tasks, including college-level problems, math, document understanding, general question answering, video understanding, and visual agent [tasks]. Notably, Qwen2.5-VL achieves significant advantages in understanding documents and diagrams, and it is capable of playing as a visual agent without task-specific fine-tuning."
A multi-modal model, Qwen2.5-VL is designed to convert a textual input prompt and supporting image or video data into tokens, then predict the most statistically likely output tokens, forming a response that, as with all large language models (LLMs) and related systems, will sometimes but not always take the shape of a correct "answer" to the query. In the case of Qwen2.5-VL, its creators claim it can "understand things visually" (glossing over the fact that no understanding is actually taking place) and deliver responses based on images containing text, charts, and other graphics, as well as objects and scenes.
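For those looking to experiment locally, the models are designed to slot into the Hugging Face ecosystem. The minimal sketch below assumes a recent transformers build with Qwen2.5-VL support plus the qwen_vl_utils helper package published alongside Qwen's VLM releases, following the same pattern as the earlier Qwen2-VL integration; the image path is a placeholder.

```python
# Minimal image-question sketch for Qwen2.5-VL via Hugging Face transformers.
# Assumes a transformers build with Qwen2.5-VL support and the qwen_vl_utils
# helper package; "chart.png" is a placeholder input file.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# The prompt and supporting image are packed into a chat-style message list.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "chart.png"},
        {"type": "text", "text": "Summarize the trend shown in this chart."},
    ],
}]

# Tokenize the text and preprocess the image/video inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Predict output tokens, then decode only the newly generated portion.
output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```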
A major upgrade over previous models, Qwen says, is the model's ability to work with video content over one hour in length and to pinpoint particular events within the video with timestamps. Objects in images can be localized with bounding boxes, complete with accompanying JSON coordinates, and output can be delivered as structured data rather than plain text. Perhaps the biggest change, though, is the claim that Qwen2.5-VL is "agentic," or, in other words, capable of taking actions on behalf of its user rather than simply returning a response that describes the steps needed to complete a particular task.
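The localization and structured-output claims can be exercised simply by asking for JSON in the prompt. The sketch below reuses the pipeline from the previous example and assumes an output schema of pixel-coordinate bounding boxes with a label per object, in line with Qwen's own grounding examples; the exact key names, and how cleanly the model sticks to them, are assumptions.

```python
# Sketch: asking Qwen2.5-VL for object locations as structured JSON.
# Reuses model, processor, and process_vision_info from the previous example.
# The bbox_2d/label schema is an assumption about the output format.
import json
import re

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "street_scene.png"},  # placeholder image
        {"type": "text", "text": (
            "Locate every car in the image and output only JSON as a list of "
            '{"bbox_2d": [x1, y1, x2, y2], "label": "car"} objects.'
        )},
    ],
}]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, _ = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, padding=True, return_tensors="pt"
).to(model.device)
raw = processor.batch_decode(
    model.generate(**inputs, max_new_tokens=512)[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)[0]

# The reply may be wrapped in a code fence, so extract the JSON list first.
match = re.search(r"\[.*\]", raw, re.DOTALL)
detections = json.loads(match.group(0)) if match else []
for det in detections:
    print(det.get("label"), det.get("bbox_2d"))
```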
"Qwen2.5-VL directly plays as a visual agent that can reason and dynamically direct tools, which is capable of computer use and phone use," its creators claim, using examples including the booking of a flight in a separate airline app, using a browser to find a particular weather forecast, using an image editor to increase the color vibrancy in a photo, and even installing a Microsoft Visual Studio Code (VS Code) extension.
Qwen claims the largest version of its new model, Qwen2.5-VL-72B-Instruct with 72 billion parameters, performs competitively against Google's Gemini 2.0 Flash, OpenAI's GPT-4o, and Anthropic's Claude 3.5 Sonnet models across a range of tasks, outperforming them by a small margin on some, including document analysis. Its smaller Qwen2.5-VL-7B model, meanwhile, is competitive against GPT-4o mini, while the smallest Qwen2.5-VL-3B model with three billion parameters can match or exceed the company's own last-generation Qwen2-VL-7B model, which has more than twice the number of parameters.
Qwen has released the new models, in all three sizes, on Hugging Face under a trio of different licenses: the large 72-billion-parameter model uses the Qwen License, which allows free use and modification but restricts commercial use to services with fewer than 100 million monthly active users (MAUs); the small three-billion-parameter model uses the Qwen Research License, which blocks all commercial use; and the middle seven-billion-parameter model uses the permissive Apache License 2.0.