Expert augmented reinforcement learning – agents of Montezuma’s Revenge

Reinforcement learning is gaining notice as a way to train neural networks to solve open problems that require a flexible, creative approach. As huge amounts of computing power and time are required to train a reinforcement learning agent, it is no surprise that researchers are looking for ways to shorten the process. Expert-augmented learning appears to be an interesting way to do that.

This article looks at:

  • Why the learning process in reinforcement learning is long and complex
  • The transfer of expert knowledge into neural networks to solve this challenge
  • Applying expert augmented reinforcement learning in practice
  • Possible use cases for the technique

Designing a system of rewards that motivates an RL agent to behave in the desired way is fundamental to the technique. While this is indeed effective, a number of drawbacks limit its usefulness. One is the complexity of the training process, which grows rapidly with the complexity of the problem to be solved. What’s more, the agent’s first attempts to solve a problem are usually entirely random. In Learning to Run, a project in which an agent was trained to move like a human, the agent would fall forward or backward during its first few million runs.
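The reward loop described above can be sketched with a toy example. `ToyEnv` here is a made-up stand-in for a real game environment, not part of our actual setup: an untrained agent acts at random, and only the reward signal, received after the fact, tells it which actions were any good. This is why early training takes so long.

```python
import random

class ToyEnv:
    """Walk on the integers; the episode ends at +5 (treasure) or -5 (pit)."""
    def __init__(self):
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):              # action: 0 = left, 1 = right
        self.pos += 1 if action == 1 else -1
        done = abs(self.pos) == 5
        reward = 1.0 if self.pos == 5 else 0.0   # reward only for the treasure
        return self.pos, reward, done

env = ToyEnv()
state, done = env.reset(), False
steps = 0
while not done:
    action = random.choice([0, 1])       # no policy yet: pure random play
    state, reward, done = env.step(action)
    steps += 1
```

Even in this tiny world, the random walker needs dozens of steps on average to stumble into either ending; scale the state space up to a full game screen and "a few million runs" stops sounding surprising.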

When both the environment and the task are complex, the possibilities for “doing it wrong” grow, and the data scientist may be unable to spot hidden flaws in the model.

Of course, the agent looks for ways to maximize the reward and reduce the penalties, usually without seeing the larger picture. That’s why any glitch in the environment will be exploited to the hilt once discovered. Here’s a good example from the game Q*bert:

Details about both the agent and the bug it found are covered in this paper: Arxiv.

The challenge in teaching neural networks to perform tasks humans do so effortlessly, like grabbing a can of Coke or driving a car, is transferring the knowledge required to perform the task. It would be awesome just to put the neural network in the seat next to Kimi Raikkonen and let it learn how to handle the car like a professional driver. Unfortunately, that isn’t possible.

Or is it?

Montezuma’s revenge on AI

The most common way to validate reinforcement learning algorithms is to let them play Atari’s all-time classics like Space Invaders or Breakout. These games provide an environment that is complex enough to test if the model can deal with numerous variables, yet simple enough not to burn up the servers providing the computing power.

Although the agents tend to crack those games relatively easily, games like classic Montezuma’s Revenge pose a considerable challenge.


For those who missed this classic, Montezuma’s Revenge is a platform game in which an Indiana Jones-like character (nicknamed Panama Joe) explores ancient Aztec pyramids riddled with traps, snakes, scorpions and sealed doors, the keys to which, of course, are hidden in other rooms. While similar to the Mario Bros games, it was one of the first examples of the “Metroidvania” subgenre, named after its best-known exemplars, the Metroid and Castlevania series.

Montezuma’s Revenge provides a different gaming experience than Space Invaders: the world it presents is more open, and not all objects on the map are hostile. The agent needs to figure out that a snake is deadly, while the key is required to open the door and stepping on it is not only harmless but crucial to finishing the level.

Currently, reinforcement learning alone struggles to solve Montezuma’s Revenge. Having a more experienced player provide guidance could be a huge time-saver.

The will chained, the mind unchained

To share human knowledge with a neural network, information must be provided about what experts do and how they behave in a given environment. In the case of Montezuma’s Revenge, this means providing a snapshot of the screen and the player’s reaction. If he or she is driving a car, any number of additional steps would have to be taken: the track would have to be recorded and information about the car and position of the steering wheel would also need to be provided.
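In code, the expert data described above boils down to logging (observation, action) pairs while the human plays. The sketch below is a hedged illustration, not our actual tooling: `capture_screen` and `read_joystick` are hypothetical callables standing in for whatever interface a real recording setup exposes.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Transition:
    observation: bytes   # raw screen snapshot (e.g. encoded pixels)
    action: int          # the joystick/key input the expert chose

def record_episode(capture_screen: Callable[[], bytes],
                   read_joystick: Callable[[], int],
                   n_steps: int) -> List[Transition]:
    """Collect (screen, action) pairs while the human expert plays."""
    trajectory = []
    for _ in range(n_steps):
        obs = capture_screen()           # what the expert saw
        act = read_joystick()            # what the expert did about it
        trajectory.append(Transition(obs, act))
    return trajectory
```

For a driving task, `Transition` would simply grow extra fields (car telemetry, steering-wheel angle), but the recording loop stays the same.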

At every stage of training, the agent is not only motivated to maximize rewards, but also to mimic the human. This is particularly helpful when there is no immediate reward coming from the game environment.

However, the drawback of following the expert is that the network doesn’t develop the ability to react to unexpected situations. Returning to the example of Raikkonen’s driving, the network would perform well on the track that was recorded, but racing in different weather conditions or against new opponents would leave it helpless. This is precisely where reinforcement learning shines.


In the case of Montezuma’s Revenge, our algorithm was trained to strike a balance between following the expert and maximizing the reward. Thus, if the expert never stepped on the snake, the agent wouldn’t either; if the expert had done something, the agent would likely do the same. When the agent found itself in a new situation, it would try to follow the expert’s behavior, but if the reward for ignoring the expert’s suggestions was high enough, it opted for the larger payoff.
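One common way to express this balance is to add a supervised imitation term to the usual policy-gradient objective. The sketch below is a minimal illustration of that idea, not our exact formulation; `expert_coeff` is a hypothetical knob: crank it up and the agent clings to the expert, shrink it and reward maximization wins.

```python
import math

def cross_entropy(policy_probs, expert_action):
    """Penalty for assigning low probability to the action the expert took."""
    return -math.log(policy_probs[expert_action])

def combined_loss(policy_probs, advantage, log_prob_taken,
                  expert_action, expert_coeff=0.5):
    """Mix the standard policy-gradient loss with an imitate-the-expert term."""
    rl_loss = -advantage * log_prob_taken                  # favor rewarding actions
    imitation_loss = cross_entropy(policy_probs, expert_action)
    return rl_loss + expert_coeff * imitation_loss
```

With `expert_coeff = 0` this collapses to plain reinforcement learning; with a very large coefficient it collapses to behavioral cloning. Keeping both terms active throughout training is what lets the agent lean on the expert in familiar situations yet chase bigger payoffs when it finds them.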

If you get lost, you find the road and stick to it until you reach a familiar neighborhood, right? Likewise, the agent is always motivated to mimic the expert’s actions. Methods that merely copy human behavior at the start and then let the agent explore randomly are too weak to deliver noteworthy results.

The idea of augmenting reinforcement learning with expert knowledge proved to be surprisingly effective. Our model performed well in Montezuma’s Revenge, beating level after level. Moreover, it didn’t stop exploiting the reward policy to maximize its rewards: the agent spotted a previously unknown bug in the game. This discovery led to a score of 804,900 points – a world record. Our agent was pushed on by the endless reward-maximization loop depicted here:

Although annoying, the loop itself is proof that the agent is not mindlessly following the expert. With enough motivation it is able to develop its own strategy to maximize its rewards, thus using the expert knowledge creatively.

Cloning and enhancing human behavior are among the ultimate goals of machine learning. Nevertheless, the expert doesn’t actually need to be a human, and this leads to interesting possibilities: a machine can mimic other machines programmed with methods that don’t employ artificial intelligence, and then build on top of them.

Summary – reducing costs

Empowering reinforcement learning with expert knowledge opens new avenues of development for AI-powered devices.

  • It takes the best of both worlds: it follows human behavior while retaining the superhuman knack reinforcement learning agents have for exploiting convenient opportunities and loopholes in the environment.
  • It increases safety by reducing randomness, especially in the early stage of learning.
  • It significantly reduces the time required for learning, as the agent gets hints from a human expert, thus reducing the need for completely random exploration.

As the cost of designing a reinforcement learning agent grows exponentially with the task’s complexity and the number of variables involved, using expert knowledge to train the agent is very cost-effective: it reduces not only the cost of data and computing power, but also the time required to get results. The technical details of our solution can be found here: Arxiv.org, and in our GitHub repository.

Special cooperation

In this project we cooperated with independent researcher Michał Garmulewicz (blog, github), who provided fundamental technical and conceptual input. We hope to continue such cooperation with Michał and other researchers.
