Meta's research team has recently introduced Chameleon, a family of multimodal large language models. According to the company’s research paper, Chameleon uses an 'early-fusion token-based mixed-modal' architecture. Under this design, the model learns from sequences that can mix images, code, and text in any order, and it can both process and generate such mixed sequences.
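Because every modality is represented as tokens in one sequence from the start, the model can handle mixed inputs in ways that are not possible with most systems that rely on separate components for each modality.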
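The idea of early fusion over a shared token space can be illustrated with a small sketch. This is not Meta's implementation: the vocabulary sizes, the toy word hashing, the marker tokens `BOI` and `EOI`, and the function names are all hypothetical, and the sketch assumes images have already been converted into discrete codes by a separate quantizer.

```python
from typing import List, Tuple, Union

# Hypothetical sizes, chosen only for illustration.
TEXT_VOCAB_SIZE = 1000      # IDs 0..999 are reserved for text tokens
IMAGE_CODEBOOK_SIZE = 512   # IDs 1000..1511 are reserved for image codes
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # hypothetical begin-of-image marker
EOI = BOI + 1                                # hypothetical end-of-image marker
VOCAB_SIZE = EOI + 1                         # one shared vocabulary for everything

Segment = Tuple[str, Union[str, List[int]]]


def text_to_ids(text: str) -> List[int]:
    """Toy stand-in for a real tokenizer: one deterministic ID per word."""
    return [sum(word.encode("utf-8")) % TEXT_VOCAB_SIZE for word in text.split()]


def image_to_ids(codes: List[int]) -> List[int]:
    """Shift quantizer codes into the image range of the shared vocabulary."""
    for code in codes:
        if not 0 <= code < IMAGE_CODEBOOK_SIZE:
            raise ValueError(f"image code out of range: {code}")
    return [BOI] + [TEXT_VOCAB_SIZE + code for code in codes] + [EOI]


def fuse(segments: List[Segment]) -> List[int]:
    """Flatten interleaved text and image segments into one token sequence."""
    ids: List[int] = []
    for kind, payload in segments:
        if kind == "text":
            ids.extend(text_to_ids(payload))
        elif kind == "image":
            ids.extend(image_to_ids(payload))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return ids


if __name__ == "__main__":
    sequence = fuse([
        ("text", "a cat"),
        ("image", [0, 511]),   # pretend output of an image quantizer
        ("text", "sleeping"),
    ])
    print(sequence)
```

Every element of the resulting list is an ordinary integer below `VOCAB_SIZE`, so a single sequence model could in principle read or predict text and image tokens with the same machinery, which is the property the early-fusion design relies on.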
The researchers developed and used new training techniques to allow the model to work with multiple types of tokens. This involved two-stage learning and a massive dataset of approximately 4.4 trillion tokens of text, image-text pairs, and sequences of interleaved text and images.
Further, the team trained two versions of the model, one with 7 billion parameters and one with 34 billion parameters, for more than 5 million GPU hours on high-speed GPUs.
"Chameleon's unified token space allows it to seamlessly reason over and generate interleaved image and text sequences without the need for modality-specific components," the research paper stated.
The Chameleon team also reports that the model outperforms Llama-2 on text-only tasks and is competitive with models such as Mixtral 8x7B and Gemini-Pro.