Researchers at Meta have developed new photorealistic synthetic datasets using Unreal Engine that enable more controlled and robust evaluation and training of AI vision systems.
Meta researchers have introduced PUG (Photorealistic Unreal Graphics), a family of synthetic image datasets that aims to provide new capabilities for evaluating and training AI vision systems. The datasets are rendered with Unreal Engine, a state-of-the-art real-time 3D graphics engine, to produce photorealistic image data.
While synthetic datasets have been created before, the researchers note that earlier ones often lacked realism, which limited their usefulness. By leveraging the photorealism of Unreal Engine, the PUG datasets aim to bridge the gap between synthetic and real-world data.
The researchers present four PUG datasets:
- PUG: Animals contains over 200,000 images of animals in various poses, sizes, and environments. It can be used to study out-of-distribution robustness and model representations.
- PUG: ImageNet contains over 90,000 images and serves as an additional robustness test set for ImageNet models, covering a rich set of factor changes such as pose, background, size, texture, and lighting.
- PUG: SPAR contains over 40,000 images for evaluating vision-language models on scene, position, attribute, and relation understanding.
- PUG: AR4T provides approximately 250,000 images for fine-tuning vision-language models on spatial relations and attributes.
PUG reveals weak robustness in leading ImageNet models
In addition to the datasets, researchers can use the PUG environments to create their own data, precisely specifying factors such as lighting and viewpoint that are difficult to control in real-world datasets. The ability to generate data covering a range of domains enables more reliable evaluation and training of vision-language models than existing benchmarks, the team writes.
In experiments, the researchers demonstrate PUG's ability to benchmark model robustness and representation quality: PUG showed that the top-performing models on ImageNet were not necessarily the most robust to factors such as pose and lighting. It also allowed the study of how different vision-language models capture relationships between images and text.
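Because every PUG image comes with known factor values (pose, lighting, and so on), robustness can be measured by comparing a model's accuracy across factor settings. The sketch below is not from the PUG codebase; it is a minimal, illustrative example of how per-factor accuracy could be tabulated once a model's predictions on such a dataset are in hand, using made-up factor values and labels.

```python
from collections import defaultdict

def accuracy_by_factor(records):
    """Compute accuracy per factor setting.

    `records` is a list of (factor_value, predicted_label, true_label)
    tuples, e.g. the lighting condition under which each image was rendered.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for factor, pred, true in records:
        total[factor] += 1
        correct[factor] += int(pred == true)
    return {f: correct[f] / total[f] for f in total}

# Hypothetical predictions for the same animals rendered under two
# lighting conditions (illustrative values, not real PUG results).
records = [
    ("day", "elephant", "elephant"),
    ("day", "zebra", "zebra"),
    ("night", "zebra", "horse"),
    ("night", "elephant", "elephant"),
]
print(accuracy_by_factor(records))  # {'day': 1.0, 'night': 0.5}
```

A large gap between factor settings, as in the toy output above, is the kind of signal the researchers use to argue that top ImageNet accuracy does not imply robustness to pose or lighting changes.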
More information and data are available on the PUG project website.