How DeepMind’s Genie AI could reshape robotics by generating interactive worlds from images


Researchers at DeepMind have developed Genie, a model that turns images into playable worlds and works out on its own how to move the video game characters in them. It sounds like a gimmick, but it could be the basis for something much bigger.

“What if, given a large corpus of videos from the Internet, we could not only train models capable of generating novel images or videos, but entire interactive experiences?” That’s the question researchers at Google DeepMind asked themselves as they developed their new AI model, Genie (Generative Interactive Environments).

Video: Google DeepMind

Genie can transform various types of image prompts into virtual worlds and move game characters within them in a plausible, consistent way. At first glance, this is particularly interesting for video games. However, the researchers believe that their approach with Genie could be an important step toward world models for robotics applications.



Foundation model for 2D platformers

In its largest form, Genie is an 11-billion-parameter AI model that has the properties of a foundation model for 2D platformers: Given an image the model has never seen before and a human-specified action that is roughly equivalent to pressing a gamepad button, Genie generates the next frames of a virtual world in which that action is carried out.
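The interaction described above can be pictured as a simple loop: start from a prompt image, then produce one new frame per action. The following is a minimal sketch of that loop, not DeepMind's implementation. The names `toy_step` and `rollout`, the action codes, and the one-row "frame" are all hypothetical stand-ins chosen for illustration, and conditioning only on the most recent frame is a simplification.

```python
from typing import Callable, List, Sequence

Frame = List[int]  # toy frame: a row of cells, 1 marks the character

# Hypothetical action codes, standing in for gamepad buttons.
STAY, RIGHT, LEFT = 0, 1, 2


def toy_step(frame: Frame, action: int) -> Frame:
    """Stand-in for the learned model: move the character one cell."""
    pos = frame.index(1)
    if action == RIGHT:
        pos = min(pos + 1, len(frame) - 1)
    elif action == LEFT:
        pos = max(pos - 1, 0)
    new_frame = [0] * len(frame)
    new_frame[pos] = 1
    return new_frame


def rollout(prompt: Frame, actions: Sequence[int],
            step: Callable[[Frame, int], Frame]) -> List[Frame]:
    """Generate one new frame per action, starting from the prompt image."""
    frames = [list(prompt)]
    for action in actions:
        # Each new frame depends on the previous frame and the chosen action.
        frames.append(step(frames[-1], action))
    return frames


frames = rollout([1, 0, 0, 0], [RIGHT, RIGHT, LEFT], toy_step)
```

In a real system, `toy_step` would be replaced by the trained model. The loop structure, with each output fed back in as the next input, is the part the sketch is meant to show.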

Image: Google DeepMind

The actual actor, for example the sword-wielding hero or a ball in a hand-drawn sketch, is not fixed: through training, the model has learned which elements in an image usually perform actions, and it then moves them independently. Another interesting observation: Genie even takes into account the parallax effect that occurs when the foreground and background move at different speeds in a game.

Image: Google DeepMind

Unlabeled gaming videos from the internet as training material

The special feature of the model is that it learns exclusively from videos, i.e. it does not receive any other information such as gamepad inputs during training. The starting point was a collection of 200,000 hours of freely available gaming videos from the Internet, which the researchers filtered down to 30,000 hours of material specifically showing 2D platformers.

Genie consists of three components: a video tokenizer that generates tokens from frames, a latent action model that predicts actions between frames, and a dynamics model that predicts the next frame of the video. For the latent action model, the team limits the number of predicted actions to a small, discrete set of codes to enable human playability and further improve controllability.
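To make the division of labor between the three components concrete, here is a toy sketch in plain Python. It is an illustration under heavy assumptions, not Genie's architecture: the real components are learned neural networks, whereas here a "frame" is a single row of pixel values, the tokenizer is simple quantization, and each latent action is a fixed shift. The function names and the codebook size `NUM_LATENT_ACTIONS = 4` are hypothetical.

```python
from typing import List, Sequence

NUM_LATENT_ACTIONS = 4  # assumed size of the small, discrete action set


def tokenize(frame: Sequence[int], levels: int = 4) -> List[int]:
    """Toy video tokenizer: quantize each pixel (0..255) into a token."""
    return [min(p * levels // 256, levels - 1) for p in frame]


def predict_next(tokens: Sequence[int], action: int) -> List[int]:
    """Toy dynamics model: latent action k rotates the tokens k steps right."""
    k = action % len(tokens)
    if k == 0:
        return list(tokens)
    return list(tokens[-k:]) + list(tokens[:-k])


def infer_latent_action(prev_tokens: Sequence[int],
                        next_tokens: Sequence[int]) -> int:
    """Toy latent action model: choose the discrete code whose predicted
    next frame best matches the frame that was actually observed."""
    def mismatch(action: int) -> int:
        predicted = predict_next(prev_tokens, action)
        return sum(a != b for a, b in zip(predicted, next_tokens))
    return min(range(NUM_LATENT_ACTIONS), key=mismatch)


# Two consecutive frames of an unlabeled "video": a bright pixel moves right.
frame_a = [0, 0, 255, 0, 0, 0]
frame_b = [0, 0, 0, 255, 0, 0]

tokens_a = tokenize(frame_a)
tokens_b = tokenize(frame_b)
action = infer_latent_action(tokens_a, tokens_b)  # inferred without labels
replayed = predict_next(tokens_a, action)
```

The point of the sketch is the data flow: no action label is ever supplied, yet an action code can be recovered from two consecutive frames, and the same code can later be chosen by a human to steer the dynamics model.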

Image: Google DeepMind

Genie uses “spatiotemporal (ST) transformers” for its components. As is often the case with transformers, the team found that Genie’s performance improved as the number of parameters increased.
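The article does not spell out how an ST transformer differs from a standard one. The sketch below assumes the common factorized layout, in which a spatial layer lets each token attend only to tokens in the same frame and a temporal layer lets it attend only to the same position in the current and earlier frames. The function names are hypothetical, and the code only enumerates which tokens may attend to which; it does not compute attention.

```python
from typing import List, Tuple

Token = Tuple[int, int]  # (time step, spatial position)


def spatial_neighbors(token: Token, num_positions: int) -> List[Token]:
    """Spatial layer: attend to every position within the same frame."""
    t, _ = token
    return [(t, s) for s in range(num_positions)]


def temporal_neighbors(token: Token) -> List[Token]:
    """Temporal layer: attend to the same position in this and earlier
    frames only (causal, so no future frames are visible)."""
    t, s = token
    return [(u, s) for u in range(t + 1)]


# Example: a 3-frame clip with 4 tokens per frame, looking at token (2, 1).
in_frame = spatial_neighbors((2, 1), num_positions=4)
over_time = temporal_neighbors((2, 1))
```

Under this assumption, each token looks at roughly "tokens per frame plus number of frames" other tokens across the two layers, rather than at every token in the clip, which is why such a layout scales better to longer videos.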

