Alibaba Cloud Announces New AI Model Qwen2-VL that Supports Function Calling

Alibaba Cloud announces the release of Qwen2-VL, its newest vision-language model, designed to enhance visual understanding, video comprehension, and multilingual text-image processing.

Alibaba’s new model comes in three variants with different parameter counts: Qwen2-VL-72B (72 billion parameters), Qwen2-VL-7B (7 billion), and Qwen2-VL-2B (2 billion).

The 7B and 2B variants are released under the permissive, open-source Apache 2.0 license, allowing enterprises to use them freely for commercial purposes and making them practical options for enterprise decision-makers. They are designed to deliver competitive performance at a more accessible scale and are available on platforms such as Hugging Face and ModelScope.
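For teams that want to try the open-weight checkpoints, a minimal sketch of running the 7B instruct variant from Hugging Face might look like the following. It assumes a recent `transformers` release that ships built-in Qwen2-VL support; the model ID is the one published on the hub, while the image path and prompt are placeholders.

```python
# Minimal sketch: running the open-weight Qwen2-VL-7B-Instruct checkpoint.
# Assumes a recent `transformers` release with Qwen2-VL support; the image
# path and prompt below are placeholders.
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Chat-style message interleaving one image with a text instruction.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image = Image.open("photo.jpg")
inputs = processor(
    text=[prompt], images=[image], return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```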

However, the largest 72B model has not yet been publicly released; Alibaba says it will be made available later through a separate license and application programming interface (API).

Built on the foundation of the Qwen model family, this series brings significant advancements in several key areas:

The model can be integrated into devices such as mobile phones and robots, allowing for automated operations based on visual environments and text instructions. This feature highlights Qwen2-VL’s potential as a powerful tool for tasks that require complex reasoning and decision-making.

Additionally, Qwen2-VL supports function calling: the model can invoke external tools to retrieve real-time information such as “flight statuses, weather forecasts, or package tracking,” which Alibaba says makes it capable of “facilitating interactions similar to human perceptions of the world.”
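As a hedged illustration of what function calling looks like in practice, the sketch below uses the OpenAI-compatible chat API that servers such as vLLM expose. The base URL, model name, boarding-pass image, and the `get_flight_status` tool are illustrative assumptions, not details from the announcement.

```python
# Sketch of function calling against Qwen2-VL served behind an
# OpenAI-compatible endpoint (e.g., via vLLM). The endpoint, image URL,
# and get_flight_status tool are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_flight_status",  # hypothetical tool
        "description": "Look up the live status of a flight by its number.",
        "parameters": {
            "type": "object",
            "properties": {"flight_number": {"type": "string"}},
            "required": ["flight_number"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen2-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/boarding_pass.jpg"}},
            {"type": "text", "text": "Is this flight on time?"},
        ],
    }],
    tools=tools,
)

# If the model decides a tool is needed, it returns a structured call
# instead of free text; the application executes it and feeds the result back.
print(response.choices[0].message.tool_calls)
```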

Qwen2-VL introduces several architectural improvements aimed at enhancing the model’s ability to process and comprehend visual data.

The Naive Dynamic Resolution support allows the models to handle images of varying resolutions by mapping them into a dynamic number of visual tokens, ensuring consistency and accuracy in visual interpretation. Also, the Multimodal Rotary Position Embedding (M-RoPE) system enables the models to simultaneously capture and integrate positional information across text, images, and videos.
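A back-of-the-envelope sketch helps make the dynamic-resolution idea concrete: larger images are mapped to more visual tokens rather than being squashed to one canonical shape. The 14-pixel patch size and 2×2 patch merging below are taken from Qwen’s public technical report; treat the numbers as an approximation of the preprocessing, not an exact implementation.

```python
# Illustrative token budgeting under dynamic resolution. Patch size (14 px)
# and 2x2 patch merging follow Qwen's technical report; the arithmetic is
# an approximation, not the model's exact preprocessing.
def visual_token_count(width: int, height: int,
                       patch: int = 14, merge: int = 2) -> int:
    """Roughly how many visual tokens an image of this size maps to."""
    grid_w = -(-width // (patch * merge))   # ceiling division
    grid_h = -(-height // (patch * merge))
    return grid_w * grid_h

print(visual_token_count(448, 448))     # small image  -> few tokens
print(visual_token_count(1920, 1080))   # HD frame     -> many more tokens
```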

Qwen2-VL supports languages including English, Chinese, most European languages, Japanese, Korean, Arabic, and Vietnamese. Alibaba says the model can analyze videos longer than 20 minutes and answer questions about their contents.
