A bot that watched 70,000 hours of Minecraft could unlock AI’s next big thing

OpenAI has built the best Minecraft-playing bot yet by making it watch 70,000 hours of video of people playing the popular computer game. It showcases a powerful new technique that could be used to train machines to carry out a wide range of tasks by binging on sites like YouTube, a vast and untapped source of training data.

The Minecraft AI learned to perform complicated sequences of keyboard and mouse clicks to complete tasks in the game, such as chopping down trees and crafting tools. It’s the first bot that can craft so-called diamond tools, a task that typically takes good human players 20 minutes of high-speed clicking—or around 24,000 actions.

The result is a breakthrough for a technique known as imitation learning, in which neural networks are trained to perform tasks by watching humans do them. Imitation learning can be used to train AI to control robot arms, drive cars or navigate webpages.

There is a vast amount of video online showing people doing different tasks. By tapping into this resource, the researchers hope to do for imitation learning what GPT-3 did for large language models. “In the last few years we’ve seen the rise of this GPT-3 paradigm where we see amazing capabilities come from big models trained on enormous swathes of the internet,” says Bowen Baker at OpenAI, one of the team behind the new Minecraft bot. “A large part of that is because we’re modeling what humans do when they go online.”

The problem with existing approaches to imitation learning is that video demonstrations need to be labeled at each step: doing this action makes this happen, doing that action makes that happen, and so on. Annotating by hand in this way is a lot of work, and so such datasets tend to be small. Baker and his colleagues wanted to find a way to turn the millions of videos that are available online into a new dataset.

The team’s approach, called Video Pre-Training (VPT), gets around the bottleneck in imitation learning by training another neural network to label videos automatically. The researchers first hired crowdworkers to play Minecraft, and recorded their keyboard and mouse clicks alongside the video from their screens. This gave them 2,000 hours of annotated Minecraft play, which they used to train a model to match actions to onscreen outcomes. Clicking a mouse button in a certain situation makes the character swing its axe, for example.

The next step was to use this model to generate action labels for 70,000 hours of unlabeled video taken from the internet and then train the Minecraft bot on this larger dataset.
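The three steps described above (learn to label from a small annotated set, label a large unannotated set automatically, then imitate the result) can be sketched in a few lines of Python. This is a toy illustration, not OpenAI's code: frames are stood in for by positions on a line, the actions are "left", "right" and "noop", and the function names and lookup-table "models" are all assumptions made for the example. The real system uses neural networks trained on video.

```python
from collections import Counter, defaultdict

def train_inverse_dynamics(labeled_clips):
    """Learn which action most often explains each frame-to-frame change."""
    votes = defaultdict(Counter)
    for frames, actions in labeled_clips:
        for before, after, action in zip(frames, frames[1:], actions):
            votes[after - before][action] += 1
    return {change: c.most_common(1)[0][0] for change, c in votes.items()}

def pseudo_label(idm, frames):
    """Infer one action per transition in a clip that has no action labels."""
    return [idm.get(after - before, "noop")
            for before, after in zip(frames, frames[1:])]

def behavior_clone(clips):
    """Fit a policy: for each frame, pick the action most often taken there."""
    votes = defaultdict(Counter)
    for frames, actions in clips:
        for frame, action in zip(frames, actions):
            votes[frame][action] += 1
    return {frame: c.most_common(1)[0][0] for frame, c in votes.items()}

# Step 1: a small hand-recorded dataset (hypothetical toy data).
labeled_clips = [
    ([0, 1, 2, 1], ["right", "right", "left"]),
    ([5, 5, 4], ["noop", "left"]),
]
idm = train_inverse_dynamics(labeled_clips)

# Step 2: label a larger pile of unannotated clips automatically.
unlabeled_clips = [[0, 1, 2, 3], [3, 2, 2, 3]]
pseudo_labeled = [(frames, pseudo_label(idm, frames)) for frames in unlabeled_clips]

# Step 3: train the bot by imitation on the automatically labeled clips.
policy = behavior_clone(pseudo_labeled)
```

The point of the design is that only step 1 needs paid human annotation; steps 2 and 3 scale with however much raw video is available.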

“Video is a training resource with a lot of potential,” says Peter Stone, executive director of Sony AI America, who has previously worked on imitation learning. 

Imitation learning is an alternative to reinforcement learning, in which a neural network learns to perform a task from scratch via trial and error. This is the technique behind many of the biggest AI breakthroughs in the last few years. It has been used to train models that can beat humans at games, control a fusion reactor, and discover a faster way to do fundamental math.

The problem is that reinforcement learning works best for tasks that have a clear goal, where random actions can lead to accidental success. Reinforcement learning algorithms reward those accidental successes to make them more likely to happen again.
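The idea of rewarding accidental successes can be shown with a deliberately simple Python sketch. It is not one of the algorithms used in the work described here: the action names, the single rewarded action and the update rule (add the reward to the chosen action's weight) are all assumptions chosen for illustration.

```python
import random

def train_by_trial_and_error(actions, goal_action, trials=200, seed=0):
    """Try actions at random and reinforce whichever ones happen to succeed."""
    rng = random.Random(seed)
    weights = {a: 1.0 for a in actions}  # start with no preference
    for _ in range(trials):
        # Sample an action in proportion to its current weight.
        action = rng.choices(actions, weights=[weights[a] for a in actions])[0]
        # Hypothetical reward signal: only the goal action pays off.
        reward = 1.0 if action == goal_action else 0.0
        # Successful actions become more likely to be chosen again.
        weights[action] += reward
    return weights

weights = train_by_trial_and_error(["walk", "jump", "chop"], goal_action="chop")
```

Here a random action stumbles on the reward within a few tries. If success instead required thousands of correct actions in a row before any reward appeared, random exploration would almost never find it.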

But Minecraft is a game with no clear goal. Players are free to do what they like, wandering a computer-generated world, mining different materials and combining them to make different objects. 

Minecraft’s open-endedness makes it a good environment for training AI. Baker was one of the researchers behind Hide & Seek, a project in which bots were let loose in a virtual playground where they used reinforcement learning to figure out how to cooperate and use tools to win simple games. But the bots soon outgrew their surroundings. “The agents kind of took over the universe, there was nothing else for them to do,” says Baker. “We wanted to expand it and we thought Minecraft was a great domain to work in.”

They’re not alone. Minecraft is becoming an important testbed for new AI techniques. MineDojo, a Minecraft environment with dozens of predesigned challenges, won an award at this year’s NeurIPS, one of the biggest AI conferences. 

Using VPT, OpenAI’s bot was able to carry out tasks that would have been impossible using reinforcement learning alone, such as crafting planks and turning them into a table, which involves around 970 consecutive actions. Even so, the researchers found that the best results came from using imitation learning and reinforcement learning together. Taking a bot trained with VPT and fine-tuning it with reinforcement learning allowed it to carry out tasks involving more than 20,000 consecutive actions.

The researchers claim that their approach could be used to train AI to carry out other tasks. To begin with, it could be used to train bots that use a keyboard and mouse to navigate websites, book flights or buy groceries online. But in theory it could be used to train robots to carry out physical, real-world tasks by copying first-person video of people doing those things. “It’s plausible,” says Stone.

“This work is another testament to the power of scaling up models and training on massive datasets to get good performance,” says Natasha Jaques, who works on multi-agent reinforcement learning at Google and the University of California, Berkeley. 

Large internet-sized data sets will certainly unlock new capabilities for AI, says Jaques. “We’ve seen that over and over again, and it’s a great approach.” But OpenAI places a lot of faith in the power of large data sets alone, she says: “Personally, I’m a little more skeptical that data can solve any problem.”

Still, Baker and his colleagues think that collecting more than a million hours of Minecraft videos will make their AI even better. It’s probably the best Minecraft-playing bot yet, says Baker: “But with more data and bigger models I would expect it to feel like you’re watching a human playing the game, as opposed to a baby AI trying to mimic a human.”