Deep Q Learning

Today, I want to show you how you can use deep Q learning to let an agent learn how to play a game. Deep Q learning is a method which was introduced by DeepMind in their 2015 Nature paper (pdf) to play Atari video games by just observing the pixels of the game. To keep things a bit simpler for this blog post, we will use an easier game: balancing a pole on a cart (CartPole). Basically, there is a little cart which holds an upright pole, and the pole is wobbly and might fall over to one of the sides. As a player, at any point, you can take one of two actions: push the cart a bit to the left or push it a bit to the right. Your goal, of course, is to keep the pole from falling over and to balance it for as long as possible. The name Q value comes from the quality of an action: given a state, we want to predict the quality of all possible actions and pick the best one to take.

Note that this is a more advanced post. For starters, I recommend training a simple neural network with plain numpy.

OpenAI Gym

Luckily, we don't have to program the whole game ourselves; instead, we can use the implementation provided by OpenAI Gym, which offers several different games for training your agents (you can also play Pacman!).

So, let's get started with CartPole. First, to get a feeling for the game, let's just take random actions:

seed = 1234

import random
import numpy as np
np.random.seed(seed)
random.seed(seed)

import time
import gym

delay = True

env = gym.make('CartPole-v1')
env.seed(seed)
stateSize = env.observation_space.shape[0]
actionSize = env.action_space.n

state = env.reset()
rewards = 0
for _ in range(1000):
    env.render()
    if (delay):
        time.sleep(0.1)
    action = env.action_space.sample()
    state, reward, gameOver, _ = env.step(action)
    rewards += reward
    if gameOver:
        env.close()
        env.reset()
        print(rewards)
        break

16.0 (i.e. 16 actions taken before the game ended)

Explanation of the code

We use gym to create our CartPole environment (env) and determine our stateSize and actionSize. It turns out that the stateSize is 4, so each state consists of 4 variables, and our actionSize is 2, meaning we have two possible actions (0 for left and 1 for right).
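
If you want to check this yourself, you can print the spaces directly (the exact string representation depends on your gym version):

print(env.observation_space)  # e.g. Box(4,)     -> 4 continuous state variables
print(env.action_space)       # e.g. Discrete(2) -> actions 0 (left) and 1 (right)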

We start a new game with env.reset() which returns the initial state of the game.

state

array([ 0.16803693, 0.7430099 , -0.2232899 , -1.26153707])

As you can see, the state is an array of 4 continuous variables which can be both negative and positive. They describe the position and velocity of the cart as well as the angle and angular velocity of the pole. The exact details are not important for us here, so I won't go into them further.

In the for loop, I take a random action (via the sample() method) and call env.step(action). Taking a step, i.e. executing an action, returns the next state of the game, a reward for arriving at that state, and gameOver - a boolean variable which indicates whether the game ends here.

I accumulate the rewards over time and print them once the game is over. Meanwhile, env.render() visualizes the game, which is nice to get an impression of what is happening.

It turns out that random actions don't work very well. In this game, you get 1 reward for each action taken, so a total of 16.0 rewards means the agent only survived for 16 steps.

Let’s make that much better!

Q learning - theory

Ok, so we want to learn how to play this game, but how do we do that? The idea is as follows: our agent initially has no idea what to do or how anything works, so we let it act randomly, but observe (and remember) what happens. In particular, we record the tuple (state, action, reward, nextState, gameOver) for each step taken. This lets us learn from our experience later on.
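
To make that concrete, a single recorded step could look like this (the numbers are made up for illustration):

# One hypothetical transition (state, action, reward, nextState, gameOver):
experience = (np.array([0.02, -0.01, 0.03, 0.01]),  # state before the action
              1,                                    # action taken (move right)
              1.0,                                  # reward received for that step
              np.array([0.02, 0.18, 0.03, -0.27]),  # resulting next state
              False)                                # the game did not end here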

The idea of Q learning is to learn a policy, i.e. an indicator of what action to take given a state of the game. Thus, we want to learn to predict the value of taking a particular action $a$ when we are in state $s$. Let's assume $Q(s, a_{left}) = 1.2$ and $Q(s, a_{right}) = 1.5$. That would mean, given our state $s$, the action to go left ($a_{left}$) has a value of 1.2, while the action to go right has a higher value of 1.5. Therefore, we would go right, as its value is higher.
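
In code, following such a policy is just an argmax over the predicted Q values; using the hypothetical numbers from above:

qValues = np.array([1.2, 1.5])   # index 0 = left, index 1 = right
bestAction = np.argmax(qValues)  # -> 1, i.e. move right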

How can we learn this Q function?

If you think about it, you can come to the following equation:

$$Q(s,a) = r + \gamma \max\limits_{a'} Q(s',a')$$

It basically says that the value of taking action $a$ in state $s$, i.e. $Q(s,a)$, should be the reward $r$ we receive for taking action $a$ plus the value we expect in the next state $s'$ when we take the best possible action $a'$ there. $\gamma$ is just a discount factor to discount the future - this is similar to money and interest rates: it is always better to have 100 dollars now than 100 dollars in 2 years. Usually, $\gamma$ is set to something like 0.95, meaning the future is important, but not quite as important as the current reward. Thus, whenever we get a reward $r$ and a follow-up state $s'$ after taking an action $a$ in state $s$, we can look at what our current Q function predicts and then update it to better reflect the equation above.
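
As a quick numerical sketch (the numbers are just for illustration): if we receive a reward of 1.0 for a step and the best predicted value in the next state is 1.5, then with $\gamma = 0.95$ the target for $Q(s,a)$ becomes:

reward = 1.0                        # CartPole gives 1 reward per step
gamma = 0.95                        # discount factor
maxNextQ = 1.5                      # hypothetical best predicted value in the next state
target = reward + gamma * maxNextQ  # 1.0 + 0.95 * 1.5 = 2.425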

Deep Q learning

Now what about deep learning and neural networks? To be honest, for our little CartPole here, we could also use regular Q learning without a neural network; the point of using a neural network is to approximate the Q function when the state space is so large that we cannot store and learn all states explicitly.
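
For reference, the tabular variant of Q learning would keep a table of Q values and, after every step, nudge the stored entry towards the target with some learning rate $\alpha$:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \max\limits_{a'} Q(s',a') - Q(s,a) \right)$$

This only works if we can enumerate (or discretize) all states; with a neural network, we instead fit the network's outputs towards the same target.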

Let’s dive in. We will construct a neural network which always picks the best action to play and update it according to our experiences.

We are using Keras to define our net and keep the architecture pretty simple. We have a 5-neuron input layer which expects 4 inputs (the state!), a 5-neuron hidden layer, and an output layer which produces 2 values (one Q value for action $a_{left}$ and one for $a_{right}$):

from keras.models import Sequential
from keras.optimizers import Adam
from keras.layers import Dense, Activation

numNeurons = 5
learningRate = 0.001

player = Sequential()
player.add(Dense(numNeurons, input_dim=stateSize, kernel_initializer='glorot_uniform'))
player.add(Activation('relu'))
player.add(Dense(numNeurons, kernel_initializer='glorot_uniform'))
player.add(Activation('relu'))
player.add(Dense(actionSize, activation='linear', kernel_initializer='glorot_uniform'))
player.compile(optimizer=Adam(lr=learningRate), loss='mse')

Using TensorFlow backend.
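
As a quick sanity check, we can already query the (still meaningless) Q values for a state. Note that player.predict expects a batch, hence the reshape - the same call shows up again in the training loop below:

qValues = player.predict(state.reshape(1, stateSize))[0]  # two outputs: Q(s, left) and Q(s, right)
print(qValues)  # untrained, so these values are essentially arbitrary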

Then we need some helper functions which we will use later on. The first one is chooseAction(). We pass it the qValues we calculated and an extra parameter $\epsilon$ which controls how likely we are to act randomly. This is useful because our qValues are not very good at first, so it makes sense to act randomly and explore a lot; later on, when we have better qValues, we reduce $\epsilon$ and act more according to our Q policy. If we follow the policy, we look at the qValues and take the action with the higher value:

def chooseAction(qValues, epsilon):
    numActions = len(qValues)
    if (np.random.random() <= epsilon):
        return np.random.choice(numActions, 1)[0]
    else:
        return np.argmax(qValues)

To get a better grasp, I want to illustrate the function. First, acting totally random ($\epsilon=1$):

chooseAction([0.2,1.2], 1)

0

You can see that it is random as it returned action 0, although the value for action 1 (1.2) is much higher than the value for action 0 (0.2). Next, we want to act according to the policy ($\epsilon=0$):

chooseAction([0.2,1.2], 0)

1

Sure enough, we get action 1, because 1.2 > 0.2. Our next helper function stores the current experience. It should be easy to understand; the only part that needs explaining is the reward = -50 for gameOver states. I added this to speed up learning: since we want the agent to avoid ending the game (the pole falling over), it helps to mark that event with a negative reward. The gameOver transition is also stored twice to give it more weight in the replay memory. The agent would learn without this as well, but it would take longer to converge (try it out!).

from collections import deque

queueSize = 100000
storage = deque([], queueSize)

def storeObservation(state, action, reward, nextState, gameOver):
    if (gameOver):
        reward = -50 # Treat finish as bad event and store more often to rebalance
        storage.append([state.reshape(4), action, reward, nextState, gameOver])
    storage.append([state.reshape(4), action, reward, nextState, gameOver])

Our last helper function retrieves our stored observations. We pass in how many samples we want to draw from the memory and get back arrays with the stored observations:

def getReplays(numPlays):
    replays = random.sample(storage, numPlays)
    cols = [[],[],[],[],[]]
    for memory in replays:
        for col, value in zip(cols, memory):
            col.append(value)
    cols = [np.array(col) for col in cols]
    return (cols[0], cols[1].reshape(-1, 1), cols[2].reshape(-1, 1), cols[3], cols[4].reshape(-1,1))

Let’s try this out really quickly:

storeObservation(np.array([0.1, 0.1, 0.1, 0.1]), 0, 1.0, np.array([0.2, 0.2, 0.2, 0.2]), 0)
storeObservation(np.array([0.1, 0.1, 0.1, 0.1]), 1, 1.0, np.array([0, 0, 0, 0]), 0)

getReplays(2)

(array([[0.1, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.1]]), array([[1], [0]]), array([[1.], [1.]]), array([[0. , 0. , 0. , 0. ], [0.2, 0.2, 0.2, 0.2]]), array([[0], [0]]))

Time to learn

We now have everything we need to train our agent to play CartPole.

Our agent gets a state of the system, which it uses to predict the qValues. Based on the qValues, we use our helper function chooseAction() to pick the best (or a random) action. We then apply that action and receive the nextState, a reward, and whether the game is over. We add up all the rewards of one episode (one game) and store our observation to learn from it later.

    qValues = player.predict(state.reshape(1, stateSize))[0]
    action = chooseAction(qValues, epsilon)
    nextState, reward, gameOver, _ = env.step(action)
    currentRewards += reward
    storeObservation(state, action, reward, nextState, gameOver)

Once we have collected a few of these observations, we can learn from them. To do so, we predict the qValues of the nextStates stored in our memory. We pick the maxQValues of those states, because we would take the best action. Now remember our equation $Q(s,a) = r + \gamma \max\limits_{a'} Q(s',a')$. We apply it here to calculate the expectedValues - these are the values we should predict. For comparison, we compute the actualValues by predicting Q(s,a) for our given states. As we took only a single action in each memory, we overwrite the actualValues with the expectedValues only at the action that was taken. Finally, we fit our player on these values so that it learns to predict closer to our expectations. One caveat is for gameOvers - we don't use the expectedValues here, but just the reward. This is because in a final game-over state, the value is simply the reward we received - there is no future.

    states, actions, rewards, nextStates, gameOvers  = getReplays(replayBatchSize)
    qValuesNext = player.predict(np.array(nextStates))
    maxQValues = np.max(qValuesNext, axis=1, keepdims=True)
    expectedValues = rewards + discountRate * maxQValues
    actualValues = player.predict(np.array(states))
    for idx, a in enumerate(actions): # Put the expected values at the taken actions
        if (gameOvers[idx]):
            actualValues[idx, a] = rewards[idx]
        else:
            actualValues[idx, a] = expectedValues[idx]
    player.fit(states, actualValues, verbose=0)

Putting it all together, we can now train our network over a longer time (a couple minutes on my machine). If you want to visualize the training, you can set render = True - this will slow down the training a bit.

gameOver = True
currentRewards = 0

minEpsilon = 0.05
maxEpsilon = 0.9
totalSteps = 20000
replayBatchSize = 32
trainingInterval = 1
discountRate = 0.95
maxScore = 500

render = False

for i in range(totalSteps):
    epsilon = max(minEpsilon, maxEpsilon - (maxEpsilon - minEpsilon) * i/totalSteps)
    if (gameOver):
        print("Game episode: {}/{}, rewards: {}, epsilon: {:.2}".format(i, totalSteps, currentRewards, epsilon))
        currentRewards = 0
        env.close()
        state = env.reset()
    else:
        state = nextState   
        
    if (render):
        env.render()
        
    qValues = player.predict(state.reshape(1, stateSize))[0]
    action = chooseAction(qValues, epsilon)
    nextState, reward, gameOver, _ = env.step(action)
    currentRewards += reward
    storeObservation(state, action, reward, nextState, gameOver) 
    
    if (i < replayBatchSize or i % trainingInterval != 0):
        continue
        
    states, actions, rewards, nextStates, gameOvers  = getReplays(replayBatchSize)
    qValuesNext = player.predict(np.array(nextStates))
    maxQValues = np.max(qValuesNext, axis=1, keepdims=True)
    expectedValues = rewards + discountRate * maxQValues
    actualValues = player.predict(np.array(states))
    for idx, a in enumerate(actions): # Put the expected values at the taken actions (a instead of i, to avoid shadowing the step counter)
        if (gameOvers[idx]):
            actualValues[idx, a] = rewards[idx]
        else:
            actualValues[idx, a] = expectedValues[idx]
    player.fit(states, actualValues, verbose=0)
        
    if (currentRewards >= maxScore):
        gameOver = True # stop if the agent played for a long time
    
if (render):
    env.close()

Game episode: 0/20000, rewards: 0, epsilon: 0.9
Game episode: 69/20000, rewards: 69.0, epsilon: 0.9
Game episode: 848/20000, rewards: 35.0, epsilon: 0.86
Game episode: 3483/20000, rewards: 119.0, epsilon: 0.75
Game episode: 3524/20000, rewards: 41.0, epsilon: 0.75
Game episode: 3560/20000, rewards: 20.0, epsilon: 0.75
Game episode: 3987/20000, rewards: 57.0, epsilon: 0.73
Game episode: 4153/20000, rewards: 166.0, epsilon: 0.72
Game episode: 4299/20000, rewards: 146.0, epsilon: 0.72
Game episode: 4356/20000, rewards: 57.0, epsilon: 0.71
Game episode: 8265/20000, rewards: 37.0, epsilon: 0.55
Game episode: 8291/20000, rewards: 26.0, epsilon: 0.55
Game episode: 9800/20000, rewards: 174.0, epsilon: 0.48
Game episode: 9978/20000, rewards: 172.0, epsilon: 0.48
Game episode: 10079/20000, rewards: 101.0, epsilon: 0.47
Game episode: 10245/20000, rewards: 166.0, epsilon: 0.46
Game episode: 10613/20000, rewards: 328.0, epsilon: 0.45
Game episode: 10771/20000, rewards: 158.0, epsilon: 0.44
Game episode: 11016/20000, rewards: 245.0, epsilon: 0.43
Game episode: 11261/20000, rewards: 248.0, epsilon: 0.42
Game episode: 18073/20000, rewards: 235.0, epsilon: 0.13
Game episode: 18472/20000, rewards: 355.0, epsilon: 0.11
Game episode: 18799/20000, rewards: 375.0, epsilon: 0.1
Game episode: 19286/20000, rewards: 492.0, epsilon: 0.08

As you can see, over time, the rewards are rising as the agent learns how to play the game :)

Finally, we have our trained agent. You can now lean back, enjoy and watch it play:

from gym import wrappers
#env = wrappers.Monitor(env, '/tmp/cartpole-experiment-1') # You can wrap it to record a video

state = env.reset()
rewards = 0
for _ in range(1000):
    env.render()
    qVals = player.predict(state.reshape(1, stateSize))[0] 
    action = chooseAction(qVals, 0)
    state, reward, gameOver, _ = env.step(action)
    rewards += reward
    if gameOver:
        env.close()
        env.reset()
        print(rewards)
        break

490.0 (compare that to 16.0 when acting randomly!)

If you want to get more involved or even train on a bigger game like Pacman, check out my deep Q learning implementation on GitHub: DQN library