STEVE-1 is a generative AI model that can perform tasks in Minecraft using text instructions.
AI models that can respond to natural language instructions have become incredibly popular, but creating models that can follow instructions for complex sequential tasks remains a challenge. Researchers have now introduced STEVE-1, an AI assistant that can follow a wide range of short-horizon text and visual instructions in Minecraft.
STEVE-1 builds on two existing AI models – VPT, a foundation model pre-trained on 70,000 hours of Minecraft gameplay, and MineCLIP, which aligns text captions with Minecraft videos. Using an approach inspired by DALL-E 2’s unCLIP method, the researchers fine-tuned VPT to follow visual goals encoded by MineCLIP, and then trained a module to translate text prompts into MineCLIP visual embeddings.
This two-step model allows STEVE-1 to follow both text and visual instructions in Minecraft with only $60 of computation and 2,000 labeled examples.
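The two-stage pipeline described above can be sketched in miniature. Everything below is a toy stand-in under stated assumptions: the prior is a fixed random linear map rather than the learned module that translates text prompts into MineCLIP visual embeddings, the policy head is random rather than the fine-tuned VPT network, and the 512-dimensional embedding size matches MineCLIP's but all weights and shapes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 512  # MineCLIP text/video embeddings are 512-dimensional

def text_to_visual_embedding(text_embedding: np.ndarray) -> np.ndarray:
    """Hypothetical 'prior': maps a MineCLIP text embedding to the visual
    embedding a video of the completed task would have (unCLIP-style).
    Stand-in: a fixed random linear map plus re-normalization."""
    W = rng.standard_normal((EMBED_DIM, EMBED_DIM)) / np.sqrt(EMBED_DIM)
    z = W @ text_embedding
    return z / np.linalg.norm(z)

def policy_step(observation: np.ndarray, goal_embedding: np.ndarray) -> int:
    """Hypothetical goal-conditioned policy head, standing in for the
    fine-tuned VPT model: scores a small discrete action set from raw
    pixels concatenated with the goal embedding."""
    features = np.concatenate([observation.ravel()[:64], goal_embedding[:64]])
    logits = rng.standard_normal((8, features.size)) @ features  # 8 toy actions
    return int(np.argmax(logits))

# Inference sketch: text instruction -> visual goal embedding -> action
text_emb = rng.standard_normal(EMBED_DIM)
text_emb /= np.linalg.norm(text_emb)
goal = text_to_visual_embedding(text_emb)
obs = rng.random((64, 64, 3))  # toy "frame" of raw pixels
action = policy_step(obs, goal)
```

The separation matters: because the policy only ever sees a MineCLIP visual embedding, the same fine-tuned VPT weights serve both visual prompts (embed the image directly) and text prompts (run the prior first).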
STEVE-1 outclasses previous AI agents in Minecraft
In their tests, STEVE-1 significantly outperformed previous Minecraft agents when given relevant instructions, gathering far more resources and traveling farther. It can perform a variety of short-horizon tasks, such as chopping trees, gathering resources, and exploring, when prompted with either text or images.
The researchers found that chaining prompts improved performance on longer-term tasks, such as crafting items or building structures, from near zero to a success rate of 50 to 70 percent. The team also shows STEVE-1 responding to human instructions in real time, demonstrating its potential as an interactive assistant.
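Prompt chaining as described above amounts to decomposing a long-horizon goal into short-horizon sub-goals the agent already handles well. A minimal sketch, with a hypothetical `execute` callback standing in for running STEVE-1 on one sub-goal prompt:

```python
def run_chained(prompts, execute):
    """Hypothetical prompt-chaining driver: condition the agent on each
    short-horizon sub-goal in sequence. Abort on the first failure, since
    later sub-goals depend on earlier ones (e.g. planks require logs).
    Returns the number of sub-goals completed."""
    for i, prompt in enumerate(prompts):
        if not execute(prompt):
            return i
    return len(prompts)

# Toy executor: pretend every sub-goal except the last succeeds.
plan = ["chop a tree", "collect the logs", "craft wooden planks"]
completed = run_chained(plan, lambda p: p != "craft wooden planks")
print(completed)  # -> 2
```

The sub-goal decomposition itself is still hand-written here; the future work mentioned below, using large language models for planning, would automate exactly this step.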
STEVE-1 is a blueprint for “instructable agents in domains beyond Minecraft”
As with image generation, switching to a longer, more specific prompt dramatically improves STEVE-1's performance on long-horizon tasks. But this prompt engineering is also similarly unintuitive and time-consuming, and more work needs to be done, the paper states.
Because STEVE-1 works directly from raw pixel input and low-level mouse and keyboard actions, the approach could be applied more broadly to create instructable agents in domains beyond Minecraft, the team said. Future work will focus on improving STEVE-1’s ability to handle longer, more complex instructions by incorporating large language models to help the agent plan and execute multistep tasks.
More information and the code are available on the STEVE-1 project page.