Alibaba Group Holding launched a group of math-specific large language models (LLMs) called Qwen2-Math, which, according to the e-commerce giant, can outperform the capabilities of OpenAI’s GPT-4o in that field.
Alibaba’s Qwen team published a post on GitHub that read, “Over the past year, we have dedicated significant efforts to researching and enhancing the reasoning capabilities of large language models, with a particular focus on their ability to solve arithmetic and mathematical problems.”
The latest LLMs were built on the Qwen2 models Alibaba released in June and comprise three models of varying parameter sizes.
Notably, the model with the largest parameter count, Qwen2-Math-72B-Instruct, outperformed proprietary US-developed LLMs in math benchmarks, including GPT-4o, Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, and Meta Platforms’ Llama-3.1-405B, as per the Qwen team’s post.
Further, the Qwen2-Math models were tested on both English and Chinese math benchmarks. These included GSM8K, a data set of 8,500 high-quality, linguistically diverse grade school math problems; OlympiadBench, a high-level bilingual multimodal scientific benchmark; and the gaokao, mainland China's daunting university entrance examination.
While the new models are currently limited by their English-only support, the team plans to release bilingual models shortly, with multilingual LLMs also in the development pipeline.