Meta’s new state-of-the-art, versatile image model is trained solely on licensed data


Meta’s latest image model CM3leon can understand and generate both text and images. It can create images from text descriptions and compose text based on images, making it useful for many tasks.

CM3leon (pronounced “chameleon”) is a single foundation model capable of both text-to-image and image-to-text generation. It is the first multimodal model trained with a recipe adapted from text-only language models that can input and generate both text and images.

CM3Leon’s architecture uses a decoder-only tokenizer-based transformer network, similar to text-based models. It builds on previous work (RA-CM3), utilizing an external database during training with something called “retrieval augmentation”. While other models might only learn from the raw data fed to them, models with retrieval augmentation actively seek out the most relevant and diverse data for their learning process during training, making the training phase more robust and efficient.

Meta claims it requires five times less computation than previous transformer-based methods and less training data, making it as efficient to train as existing diffusion-based models.


A multitasking chameleon

Thanks to large-scale multitask instruction tuning, CM3leon can perform a variety of tasks, including text-guided image generation and editing, text-to-image generation, text-guided image editing, caption generation, visual question answering, and structure-guided image editing.

“Instruction tuning” means that the model is trained to follow instructions given in text format. For example, you could provide an instruction such as “describe an image of a sunset over the ocean,” and the AI ​​model will generate a description based on that instruction. The model has been trained on such examples in the wide variety of tasks mentioned above.

(1) A small cactus wearing a straw hat and neon sunglasses in the Sahara desert. (2) A close-up photo of a human hand, hand model. High quality. (3) A raccoon main character in an Anime preparing for an epic battle with a samurai sword. Battle stance. Fantasy Illustration. (4) A stop sign in a Fantasy style with the text “1991.”

Meta also says that scaling recipes developed for text-only models generalize directly to tokenization-based image generation models, which implies even better results with bigger models, trained longer on more data. CM3leon’s training included a large-scale retrieval-augmented pre-training phase on huge amounts of data, and then it undergoes a supervised fine-tuning (SFT) phase with instructions to get its multitasking capabilities.

On the image generation benchmark (zero-shot MS-COCO), CM3leon achieves a Fréchet Inception Distance (FID) score of 4.88, which is a new state-of-the-art result and beats Google’s Parti image model.

More consistency, more licensing, more metaverse

According to Meta, CM3leon excels at producing coherent images that better follow even complex input instructions. It can better recover global shapes and local details, generate text or numbers as they appear in the prompt, and solve tasks like text-guided image editing that previously required specialized models like Instruct Pix2Pix.


Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top