AI News Bureau
Written by: CDO Magazine Bureau
Updated 7:41 AM UTC, Mon November 25, 2024
Representative image. Source: Microsoft
Tongji University and Microsoft Corporation researchers have proposed a novel approach called LLM2CLIP that embraces the power of Large Language Models (LLMs) to unlock Contrastive Language–Image Pre-training’s (CLIP’s) potential.
CLIP is one of the most important multimodal foundation models. It maps visual and textual signals into a shared feature space using a simple contrastive learning loss on large-scale image-text pairs.
The method takes a straightforward step: it replaces the original CLIP text encoder with an LLM and enhances the CLIP visual encoder with the LLM's extensive knowledge.
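To illustrate the idea, below is a minimal PyTorch-style sketch (not the authors' code) of a CLIP-like contrastive objective in which the text tower is an LLM-derived caption embedder rather than CLIP's original text encoder; the module names, dimensions, and the lightweight projection adapters are illustrative assumptions.

```python
# Minimal sketch: CLIP-style contrastive training with an LLM as the text tower.
# All module and parameter names here are illustrative placeholders, not the paper's code.
import torch
import torch.nn.functional as F
from torch import nn


class CLIPWithLLMText(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm_text_encoder: nn.Module,
                 vision_dim: int, text_dim: int, embed_dim: int = 512):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g., a ViT producing vision_dim features
        self.llm_text_encoder = llm_text_encoder  # frozen LLM used as a caption embedder
        self.vision_proj = nn.Linear(vision_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)   # adapter from LLM space to the shared space
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # log(1/0.07), as in CLIP

    def forward(self, images: torch.Tensor, caption_tokens: torch.Tensor) -> torch.Tensor:
        img = F.normalize(self.vision_proj(self.vision_encoder(images)), dim=-1)
        with torch.no_grad():                     # keep the LLM frozen; only the adapters train
            txt_feat = self.llm_text_encoder(caption_tokens)
        txt = F.normalize(self.text_proj(txt_feat), dim=-1)
        logits = self.logit_scale.exp() * img @ txt.t()
        targets = torch.arange(images.size(0), device=images.device)
        # symmetric InfoNCE: match each image to its caption and each caption to its image
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2
```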
Notably, LLM2CLIP effectively improved the CLIP model by integrating LLMs such as Llama. Initially, LLMs struggled as text encoders for CLIP due to their inability to clearly distinguish one image caption from another.
The researchers introduced a caption contrastive fine-tuning technique to address this, greatly improving the LLM's ability to separate captions. This fine-tuning led to a substantial performance boost, surpassing existing state-of-the-art models.
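As a rough illustration of caption contrastive fine-tuning, the sketch below pulls together embeddings of two captions describing the same image while treating captions of other images in the batch as negatives; the function signature, the pairing scheme, and the temperature value are assumptions for illustration, not the paper's exact recipe.

```python
# Sketch of a caption contrastive objective: paired captions of the same image
# are positives, captions of other images in the batch serve as negatives.
import torch
import torch.nn.functional as F


def caption_contrastive_loss(anchor_emb: torch.Tensor,
                             positive_emb: torch.Tensor,
                             temperature: float = 0.05) -> torch.Tensor:
    """anchor_emb / positive_emb: (batch, dim) embeddings of two captions of the
    same image, produced by the LLM being fine-tuned."""
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    # similarity of every anchor caption to every candidate caption in the batch
    logits = anchor @ positive.t() / temperature
    # the matching caption sits on the diagonal; all off-diagonal entries are negatives
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)
```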
According to the researchers, the LLM's presence enables the incorporation of longer and more complex captions, without the context-window and capability limitations of vanilla CLIP's text encoder.