Researchers at Google Research and Google DeepMind have introduced PaLI-3, a vision language model (VLM) that is smaller and faster than comparable models and outperforms some that are ten times larger.
PaLI-3, a 5-billion-parameter model that processes both images and language, outperforms models ten times its size on several multimodal benchmarks, according to the research team.
VLMs can answer questions about images, describe videos, recognize objects, and read text in images. OpenAI offers such a model with GPT-4 with vision (GPT-4V), and companies like Nvidia also see VLMs as an important building block for future industrial AI applications.
Scaling improves VLM performance
VLMs typically consist of a pre-trained image model that has learned to associate text with images, and a language model. PaLI-3’s architecture follows the lead of its predecessors and includes a vision transformer that encodes the image into tokens. These tokens, along with text input, are passed to an encoder-decoder transformer that produces text output.
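The data flow described above can be sketched in a few lines of Python. This is a toy illustration, not Google's implementation: the `Token` class, the helper functions, the 224x224 image size, the 14-pixel patch size, and the choice to place image tokens before text tokens are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    kind: str   # "image" or "text"
    value: str  # placeholder payload; a real model would carry an embedding vector


def encode_image(height: int, width: int, patch: int) -> List[Token]:
    """Toy stand-in for the vision transformer: emit one token per image patch."""
    rows, cols = height // patch, width // patch
    return [Token("image", f"patch_{r}_{c}") for r in range(rows) for c in range(cols)]


def tokenize_text(prompt: str) -> List[Token]:
    """Toy stand-in for a text tokenizer: split the prompt on whitespace."""
    return [Token("text", word) for word in prompt.split()]


def build_encoder_input(height: int, width: int, patch: int, prompt: str) -> List[Token]:
    """Join image tokens and text tokens into one sequence for the encoder-decoder."""
    # Assumption: image tokens come first; the article does not specify the order.
    return encode_image(height, width, patch) + tokenize_text(prompt)


seq = build_encoder_input(224, 224, 14, "What is in the image?")
print(len(seq), seq[0].kind, seq[-1].kind)
```

In a real model, the combined sequence would be consumed by the encoder, and the decoder would generate the answer text token by token.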
With PaLI-3's predecessors, PaLI and PaLI-X, Google showed that a highly scaled vision transformer does not necessarily produce better results on image-only tasks such as ImageNet classification, but can deliver significant performance gains on multimodal tasks such as answering questions about images. With PaLI-X, Google scaled the model up to 55 billion parameters.
Google’s PaLI-3 relies on familiar architecture with new training method
While PaLI-X uses a vision transformer (ViT) pretrained for image classification on Google's JFT dataset, PaLI-3 uses a ViT that was pretrained contrastively on image-text pairs with the SigLIP method, an approach similar to CLIP. The ViT has only 2 billion parameters; together with the language model, PaLI-3 totals 5 billion parameters.
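To illustrate what contrastive pretraining means here, the following is a minimal sketch of a SigLIP-style pairwise sigmoid loss in plain Python. It is not code from the PaLI-3 team: the function names, the toy two-dimensional embeddings, and the `temperature` and `bias` values are assumptions chosen for the example, and the embeddings are assumed to be L2-normalized. Each image-text pair is scored independently, with matching pairs labeled +1 and all other pairs labeled -1.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_contrastive_loss(
    img: List[List[float]],
    txt: List[List[float]],
    temperature: float = 10.0,  # illustrative value, learned in practice
    bias: float = -10.0,        # illustrative value, learned in practice
) -> float:
    """Sum the per-pair sigmoid losses and average over the batch size."""
    n = len(img)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0  # +1 for the matching pair, -1 otherwise
            logit = temperature * dot(img[i], txt[j]) + bias
            total -= log_sigmoid(label * logit)
    return total / n


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]   # each text matches its image
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]  # each text matches the wrong image

good = sigmoid_contrastive_loss(images, aligned_texts)
bad = sigmoid_contrastive_loss(images, shuffled_texts)
print(good < bad)
```

The loss is lower when each image embedding lines up with its own caption than when the captions are swapped, which is the signal that pushes the vision transformer to learn representations that align with language.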
Such smaller models are more practical to train and deploy, more environmentally friendly, and allow faster research cycles for model design, the researchers said. Despite its small size, PaLI-3 performs on par with today's best VLMs on more than 10 vision-language benchmarks and, although it was not trained on video data, sets new best results on benchmarks in which VLMs must answer questions about video.
PaLI-3 could enable a new generation of larger models
The trend is nevertheless likely to return to larger models. PaLI-3's strong performance despite its small size demonstrates the potential of the SigLIP method used to train the vision transformer on unstructured web data. Given how much unstructured multimodal data of this kind is available, it is likely that Google will soon train a larger version of PaLI-3.
“We consider that PaLI-3, at only 5B parameters, rekindles research on fundamental pieces of complex VLMs, and could fuel a new generation of scaled-up models,” the team writes.