In 2013 the DeepMind team invented an algorithm called deep Q-learning. It learns to play Atari 2600 games using only the input from the screen. Following a call by OpenAI, we adapted this method to a setting in which the playing agent is given not the screen, but rather the RAM state of the Atari machine. Our work was accepted to the Computer Games Workshop accompanying the IJCAI 2016 conference. This post describes the original DQN method and the changes we made to it. You can re-create our experiments using the publicly available code.
Atari games
Atari 2600 is a game console released in the late 1970s. If you were a lucky teenager at that time, you would connect the console to the TV set, insert a cartridge containing a ROM with a game, and play using the joystick. Even though the graphics were not particularly magnificent, the Atari platform was popular and there are currently several hundred games available for it. This collection includes immortal hits such as Boxing, Breakout, Seaquest and Space Invaders.

Reinforcement Learning
We will approach the Atari games through a general framework called reinforcement learning. It differs from supervised learning (e.g. predicting what is represented in an image, as AlexNet does) and unsupervised learning (e.g. clustering, as in the k-means algorithm) because it utilizes two separate entities to drive the learning (a minimal interaction loop is sketched below):

- the agent (which sees states and rewards and decides on actions), and
- the environment (which sees actions, changes states and gives rewards).
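To make the division of roles concrete, here is a minimal sketch of this interaction loop in Python. The Environment and Agent classes are hypothetical placeholders (dummies that end the episode immediately), not part of our code; they only show which entity sees what.

    # A minimal sketch of the reinforcement learning loop.
    # Environment and Agent are hypothetical stand-ins used only to illustrate
    # which entity observes states/rewards and which one chooses actions.

    class Environment:
        def reset(self):
            """Start a new episode and return the initial state."""
            return 0

        def step(self, action):
            """Apply the action; return (next_state, reward, done)."""
            return 0, 1.0, True

    class Agent:
        def act(self, state):
            """Choose an action after seeing the current state."""
            return 0

        def observe(self, state, action, reward, next_state):
            """See the reward and update the internal strategy."""
            pass

    def run_episode(env, agent):
        state = env.reset()
        total_reward, done = 0.0, False
        while not done:
            action = agent.act(state)                    # the agent decides on an action
            next_state, reward, done = env.step(action)  # the environment changes state and gives a reward
            agent.observe(state, action, reward, next_state)
            total_reward += reward
            state = next_state
        return total_reward

    print(run_episode(Environment(), Agent()))   # prints 1.0 for this dummy environment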
Q-values
Q-value (also called action-value) measures how attractive a given action is in a given state. Let's assume that the agent's strategy (the choice of the action in a given state) is fixed. Then the Q-value of a state-action pair $(s, a)$ is the cumulative discounted reward the agent will get if it is in state $s$, executes action $a$ and follows its strategy from there on. The Q-value function has an interesting property: if a strategy is optimal, the following holds:

$$ Q(s_t, a_t) = r_t + \gamma \max_a Q(s_{t+1}, a) $$

One can mathematically prove that the reverse is also true. Namely, any strategy which satisfies the property for all state-action pairs is optimal. This fact is not restricted to deterministic strategies. For stochastic strategies, you have to add some expectation value signs and all the results still hold.
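To see the property in action, here is a small self-contained Python sketch. It uses a hypothetical two-state deterministic MDP (not an Atari game), computes the optimal Q-values by repeatedly applying the right-hand side of the equation, and then checks that the resulting values indeed satisfy it.

    # Hypothetical toy MDP for illustration: two states, two actions,
    # deterministic transitions next_state[s][a] and rewards reward[s][a].
    import itertools

    gamma = 0.95
    states, actions = [0, 1], [0, 1]
    next_state = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}
    reward     = {0: {0: 0.0, 1: 1.0}, 1: {0: 2.0, 1: 0.0}}

    # Compute optimal Q-values by iterating the update until it converges.
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(1000):
        Q = {(s, a): reward[s][a] + gamma * max(Q[next_state[s][a], b] for b in actions)
             for s in states for a in actions}

    # The converged Q-values satisfy Q(s, a) = r + gamma * max_b Q(s', b).
    for s, a in itertools.product(states, actions):
        rhs = reward[s][a] + gamma * max(Q[next_state[s][a], b] for b in actions)
        assert abs(Q[s, a] - rhs) < 1e-6
    print(Q)

In this toy problem the optimal strategy moves from state 0 to state 1 and then stays there, collecting a reward of 2 forever; the printed Q-values reflect exactly that.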
Q-learning

The above concept of Q-value leads to an algorithm that learns a strategy in a game. Let's slowly update the estimates of the Q-values of the state-action pairs towards values that locally satisfy the property, and change the strategy so that in each state it chooses the action with the highest sum of the expected reward (estimated as the average reward received in a given state after taking a given action) and the discounted Q-value of the subsequent state.

    for all (s, a): Q[s, a] = 0             # initialize Q-values of all state-action pairs
    for all s:      P[s] = random_action()  # initialize strategy

    # assume we know the expected rewards R[s, a] for state-action pairs and that
    # after making action a in state s the environment moves to the state next_state(s, a)

    # alpha : the learning rate - determines how quickly the algorithm learns;
    #         low values mean more conservative learning behaviors,
    #         typically close to 0, in our experiments 0.0002
    # gamma : the discount factor - determines how much we value immediate reward;
    #         higher gamma means more attention given to far-away goals,
    #         between 0 and 1, typically close to 1, in our experiments 0.95

    repeat until convergence:
        1. for all s:
               P[s] = argmax_a (R[s, a] + gamma * max_b Q[next_state(s, a), b])
        2. for all (s, a):
               Q[s, a] = alpha * (R[s, a] + gamma * max_b Q[next_state(s, a), b]) + (1 - alpha) * Q[s, a]

This algorithm is called Q-learning. It converges in the limit to an optimal strategy. For simple games like Tic-Tac-Toe, this algorithm, without any further modifications, solves them completely, not only in theory but also in practice.
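The pseudocode above translates almost line by line into Python. The sketch below applies it to the same kind of hypothetical two-state MDP as before; the values of alpha and gamma mirror the ones quoted in the comments, and "until convergence" is approximated by a fixed number of iterations.

    # A runnable sketch of the tabular Q-learning procedure described above,
    # applied to a small hypothetical deterministic MDP (not an Atari game).
    import random

    alpha, gamma = 0.0002, 0.95          # learning rate and discount factor from the text

    states, actions = [0, 1], [0, 1]
    next_state = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}          # known transitions
    R          = {0: {0: 0.0, 1: 1.0}, 1: {0: 2.0, 1: 0.0}}  # known expected rewards

    Q = {(s, a): 0.0 for s in states for a in actions}       # initialize Q-values
    P = {s: random.choice(actions) for s in states}          # initialize strategy

    for _ in range(200000):              # "repeat until convergence", approximated by a fixed budget
        # 1. make the strategy greedy with respect to the current Q-value estimates
        for s in states:
            P[s] = max(actions, key=lambda a: R[s][a]
                       + gamma * max(Q[next_state[s][a], b] for b in actions))
        # 2. move every Q-value a small step (alpha) towards its locally consistent target
        for s in states:
            for a in actions:
                target = R[s][a] + gamma * max(Q[next_state[s][a], b] for b in actions)
                Q[s, a] = alpha * target + (1 - alpha) * Q[s, a]

    print(P)   # the learned strategy
    print(Q)   # the learned Q-value estimates

With such a small alpha the Q-values move towards their optimal values only slowly, which is why the iteration budget is large; in a tabular toy problem one would normally pick a much larger learning rate, but we keep the value from the text for consistency.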