Salesforce has released a new suite of open-source large multimodal models, dubbed xGen-MM (also known as BLIP-3), positioned to accelerate the research and development of more capable AI systems. The models showcase significant advances in AI's ability to understand and generate content that combines text, images, and other data types.
In a paper published on arXiv, Salesforce AI researchers detail the xGen-MM framework, which includes pre-trained models, curated datasets, and code for fine-tuning. According to the paper, the largest model, with 4 billion parameters, achieves competitive performance on a range of benchmarks against similarly sized open-source models.
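For readers who want to experiment, the released checkpoints are distributed through the Hugging Face Hub and load with the standard transformers auto-classes. The snippet below is a minimal sketch of that starting point; the model identifier is an assumption and should be checked against the names published under Salesforce's Hub organization.

```python
# Minimal sketch: loading an xGen-MM checkpoint with Hugging Face transformers.
# The model identifier below is an assumption; verify the exact released names
# on the Salesforce organization page of the Hugging Face Hub.
from transformers import AutoModelForVision2Seq, AutoTokenizer, AutoImageProcessor

MODEL_ID = "Salesforce/xgen-mm-phi3-mini-instruct-r-v1"  # assumed identifier

model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True, use_fast=False)
image_processor = AutoImageProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
```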
According to the researchers, a key innovation in xGen-MM is its ability to handle "interleaved data", sequences that mix multiple images with text, which they describe as "the most natural form of multimodal data."
With this capability, the models can answer questions about multiple images simultaneously, a skill that would be invaluable in real-world applications ranging from medical diagnosis to autonomous vehicles.
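To make "interleaved" concrete, the sketch below shows one way such an input might be assembled: images and text alternate in a single sequence, and each image is replaced by a placeholder token in the text stream while the images themselves are collected in order. The placeholder token, helper function, and file names are illustrative assumptions, not the exact format expected by the released checkpoints.

```python
# Conceptual sketch of an interleaved multimodal prompt: images and text
# alternate in a single sequence rather than one image per request.
# The "<image>" placeholder and file names are illustrative assumptions.
from PIL import Image

def build_interleaved_prompt(segments):
    """Collect images in order and replace each one with a placeholder token."""
    images, text_parts = [], []
    for seg in segments:
        if isinstance(seg, Image.Image):
            images.append(seg)
            text_parts.append("<image>")
        else:
            text_parts.append(seg)
    return images, " ".join(text_parts)

segments = [
    "Here is the first chart:", Image.open("q1_sales.png"),   # hypothetical file
    "and here is the second:", Image.open("q2_sales.png"),    # hypothetical file
    "Which quarter shows the larger increase?",
]
images, prompt = build_interleaved_prompt(segments)
# `images` and `prompt` would then be passed to the image processor, tokenizer,
# and model following the format documented for the specific checkpoint.
```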
The release includes variants of the model optimized for different purposes: a base pretrained model, an "instruction-tuned" model for following directions, and a "safety-tuned" model designed to reduce harmful outputs. This range of variants reflects the AI community's growing awareness of the need to balance capability with safety and ethical considerations.
The xGen-MM models were trained on massive datasets curated by the Salesforce team, including a trillion-token-scale dataset of interleaved image and text data called “MINT-1T.” The researchers also created new datasets focused on optical character recognition and visual grounding.
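Because MINT-1T is published on the Hugging Face Hub in several subsets, it can be inspected without downloading the full trillion-token corpus by streaming a shard. The sketch below assumes one such subset name, which should be verified against the dataset card before use.

```python
# Minimal sketch: streaming a shard of an interleaved image-text dataset.
# The dataset path is an assumption; MINT-1T is released in multiple subsets
# whose exact Hub names should be checked on the dataset card.
from datasets import load_dataset

ds = load_dataset("mlfoundations/MINT-1T-HTML", split="train", streaming=True)  # assumed path

for sample in ds.take(2):
    # Each record interleaves text segments with references to the images
    # that appeared between them in the source document.
    print(sample.keys())
```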