# Monte Carlo in Reinforcement Learning, the Easy Way

In Dynamic Programming (DP) we have seen that in order to compute the value function of each state, we need to know the **transition matrix** as well as the **reward system**. But this is not always a realistic condition. It may be possible to have such information in some board games, but in video games and real-life problems like self-driving cars there is no way to know it beforehand.

If you recall the formula of the State-Value function from the “Math Behind Reinforcement Learning” article:

V(s) = Σ_a π(a|s) · Σ_{s’, r} p(s’, r | s, a) · [ r + γ·V(s’) ]

it is no longer possible to compute **V(s)**, because the transition probability *p(s’, r | s, a)* **is now unknown to us.**

Always keep in mind that our goal is to find the policy that maximizes the reward for an agent. We said in previous articles that an analytical solution is hard to obtain, so we fall back to iterative solutions such as Dynamic Programming. However, DP has its own problems, as mentioned above.

An alternative solution is to play a sufficient number of episodes of the game and extract the information needed. Notice that in DP we didn’t play the game, because we knew its dynamics: at each state we knew the probabilities of reaching every other state when taking a certain action, and we knew what the reward would be. Based on that, we were able to do our calculations. In this new scenario, we won’t know these data unless we play the game. This is a key difference between Monte Carlo and Dynamic Programming.

In Monte Carlo (MC) we play an episode of the game starting from some random state (not necessarily the beginning) until the end, record the states, actions, and rewards we encountered, then compute V(s) and Q(s, a) for each state we passed through.

We repeat this process by playing more episodes; after each episode we collect the states, actions, and rewards, and we average the values of the discovered V(s) and Q(s, a).

In Monte Carlo there is no guarantee that we will visit all the possible states. Another weakness of this method is that we need to wait until the game ends before we can update V(s) and Q(s, a), which is problematic in games that never end.

# Policy Evaluation using MC

As a reminder, a “policy” is a plan: a set of actions that the agent takes to move through the states. The optimal policy is the set of actions that allows the agent to maximize its reward.

Evaluating the policy is the process of calculating the value of each state when the agent follows the set of actions dictated by that policy.

The way we do that in MC is to play episodes of the game until we reach the end, then grab the rewards collected along the way and move backward to the start of the episode, accumulating the discounted return at each state.

We repeat this procedure a sufficient number of times and average the value obtained at each state.
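To make the backward pass concrete, here is a small sketch in Python. The reward list and the discount factor are made up for illustration:

```python
# Backward computation of discounted returns for one finished episode.
# The reward list below is a hypothetical three-step episode; gamma is
# the discount factor.
gamma = 0.9
rewards = [0, 0, 1]   # reward observed after each step; 1 at the end

G = 0.0
returns = []          # G_t for each step, filled from the end
for r in reversed(rewards):
    G = r + gamma * G
    returns.append(G)
returns.reverse()     # back to chronological order
```

Running this yields a return of 1.0 for the last step, 0.9 for the one before it, and about 0.81 for the first, since each step’s return is its reward plus the discounted return of the step after it.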

Let’s take a concrete example:

Suppose you already have a policy and you need to evaluate it.

The evaluation of the above policy consists of randomly selecting a state and following the policy until the end (terminal state), then computing the discounted reward for each step passed through. It goes like this:

At the end you average the values on each state and you will get the following:

Notice that there is no guarantee of passing through all states. For example, the value at state (**??**) is unknown because that state never happened to be visited during the random episodes.

```
Evaluate_Policy(policy):
    let s = select_random_state()
    let a = policy[s]
    let sr_list = []    # init empty list of (state, reward)

    Loop:
        # Play the episode until the end, record every state,
        # action and reward:
        let (s, r) = make_move(s, a)  # make a move from s using a and get
                                      # the new state s and the reward r
        if game_over():
            sr_list.add(s, r)
            break
        else:
            let a = policy[s]
            sr_list.add(s, r)
    End Loop

    # compute the return
    let G = 0
    let sg_list = []    # init empty list of (state, return)
    for s, r in sr_list.reverse():    # loop from the end
        G = r + gamma * G
        sg_list.add(s, G)

    # return the computed list
    return sg_list.reverse()
```
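The pseudocode above could be translated into Python roughly as follows. The 4-state chain environment (make_move, the terminal state, the reward of 1) is a made-up stand-in, since the article does not fix a particular game:

```python
import random

GAMMA = 0.9
TERMINAL = 3   # hypothetical chain: states 0..3, state 3 is terminal

def make_move(s, a):
    """Stand-in environment: 'right' moves forward; reward 1 on reaching the terminal."""
    s2 = min(s + 1, TERMINAL) if a == "right" else max(s - 1, 0)
    return s2, (1 if s2 == TERMINAL else 0)

def evaluate_policy(policy):
    """Play one episode from a random state, then return [(state, G), ...].

    Assumes the given policy eventually reaches the terminal state.
    """
    s = random.randint(0, TERMINAL - 1)    # start at a random non-terminal state
    sr_list = []
    while s != TERMINAL:
        a = policy[s]                      # follow the given policy
        s2, r = make_move(s, a)
        sr_list.append((s, r))             # record the state acted from and its reward
        s = s2
    # walk backward, accumulating the discounted return
    G, sg_list = 0.0, []
    for s, r in reversed(sr_list):
        G = r + GAMMA * G
        sg_list.append((s, G))
    sg_list.reverse()
    return sg_list
```

For example, with the always-right policy `{0: "right", 1: "right", 2: "right"}`, the last visited state (2) always gets return 1.0, the state before it 0.9, and so on down the chain.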

# Need for Action-Value in MC

However, there is a problem with using only **V(s)**: how would you select the action that leads from state *Si* to state *Sj*?

In Grid World you might be tempted to look around and select the action that leads to the state with the biggest value. In a deterministic Grid World this might work, because if the upper state has the biggest value among the states surrounding the current one, going up will be the logical choice.

But suppose that along with moving actions you also have jumping actions. Which one would you choose, moving up or jumping up? If there is a wall and you move up, you won’t go anywhere; you have to jump. Conversely, if there is a low ceiling and you jump, you will hit it and fall; you would rather move.

To push it even further, consider a windy Grid World where, whatever action you take, you drift to a state other than the intended one. For example, if you try to go up you might end up going to the right. So looking at the value of the upper cell and deciding to go up will not work. You might need to make a left move in order to be pushed into the upper cell.

The only way to know that is to have access to the **transition matrix P**. But we said from the beginning that this information is unknown to us, and this is a major reason why DP is not practical.

Once you convince yourself of this fact, you will see that using the Q(s, a) function instead of V(s) is a better fit, because we compute the value of each action at each state. This way we are able to check which action produces the maximum return and thus include it in the optimal policy.
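To make the point concrete, here is a tiny sketch with made-up Q values for a single state in a windy Grid World. With only V(s) we could not rank these actions without the transition matrix, but with Q(s, a) the choice is a direct lookup:

```python
# Hypothetical Q values for one state; "left" wins because the wind
# drifts the agent into the high-value upper cell.
Q_s = {"up": 0.2, "down": -0.1, "left": 0.7, "right": 0.1}

best_action = max(Q_s, key=Q_s.get)   # action with the largest Q value
print(best_action)                    # left
```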

The Q-Value function can be computed in two different ways:

- With Exploring Start
- Without Exploring Start

# MC with Exploring Start

Exploring Start is a method of estimating the Q-Value function by starting each episode at a random state with a random first action, then following the policy for the rest of the episode.

The algorithm of policy evaluation is like this:

```
Evaluate_Policy_With_Exploring_Start(policy):
    let s = select_random_state()
    let a = select_random_action_for_state(s)
    let sar_list = []   # init empty list of (state, action, reward)

    Loop:
        # Play the episode until the end, record every state,
        # action and reward:
        let (s, r) = make_move(s, a)  # make a move from s using a and get
                                      # the new state s and the reward r
        if game_over():
            sar_list.add(s, none, r)
            break
        else:
            let a = policy[s]
            sar_list.add(s, a, r)
    End Loop

    # compute the return
    let G = 0
    let sag_list = []   # init empty list of (state, action, return)
    for s, a, r in sar_list.reverse():    # loop from the end
        G = r + gamma * G
        sag_list.add(s, a, G)

    # return the computed list
    return sag_list.reverse()
```

# MC without Exploring Start

If you noticed an issue with the Exploring Start approach, you are right: in practice you can’t start at any state! In a video game, for example, there are only one or a few states to start from; you certainly can’t start at an arbitrary random state. So Exploring Start is an unrealistic assumption.

To deal with this constraint, instead of starting at a random state, we start at the beginning, then explore an “epsilon” fraction of the time and exploit a “(1 − epsilon)” fraction of the time. As you remember, exploration means acting randomly, while exploitation means acting greedily (taking the action that produces the highest result).

```
Evaluate_Policy_Without_Exploring_Start(policy):
    let s = select_start_state()  # select the starting state, not a
                                  # random state
    let a = select_random_action_for_state(s)
    let sar_list = []   # init empty list of (state, action, reward)

    Loop:
        # Play the episode until the end, record every state,
        # action and reward:
        let (s, r) = make_move(s, a)  # make a move from s using a, get
                                      # the new state s and the reward r
        if game_over():
            sar_list.add(s, none, r)
            break
        else:
            # select a random action epsilon of the time and policy[s]
            # (greedy) (1 - epsilon) of the time
            let a = epsilon_random_action(policy[s])
            sar_list.add(s, a, r)
    End Loop

    # compute the return
    let G = 0
    let sag_list = []   # init empty list of (state, action, return)
    for s, a, r in sar_list.reverse():    # loop from the end
        G = r + gamma * G
        sag_list.add(s, a, G)

    # return the computed list
    return sag_list.reverse()
```
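The epsilon_random_action helper could look like this in Python (the action set and the default epsilon value are illustrative):

```python
import random

ACTIONS = ["up", "down", "left", "right"]   # hypothetical action set

def epsilon_random_action(greedy_action, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the greedy action."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)   # explore: act randomly
    return greedy_action                # exploit: follow the policy
```

With epsilon = 0 this always returns the greedy action; with epsilon = 1 it always acts randomly.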

# Policy Improvement using MC

Policy improvement is the process of changing parts of a policy in order to get a better one. By “better” we mean that the new policy has a greater value for at least one state, while the values of the other states are no smaller than the values computed under the previous policy. We keep improving the policy iteratively until we reach the optimal policy (the policy that gives the best state values among all policies).

Policy improvement in the Exploring Start case goes by selecting a random policy, then calling Evaluate_Policy_With_Exploring_Start(policy) a large number of times; each time we get the list of states, actions, and returns and compute the average value of each state s and action a. Then we update the policy with the action that produces the largest Q value for each state.

```
Improve_Policy_With_Exploring_Start():
    # create a random starting policy: a random action for each state
    let policy = select_random_action_for_each_state()
    let Q = [][]
    let r = []

    Loop large number of times:
        # Evaluate the policy:
        let sag_list = Evaluate_Policy_With_Exploring_Start(policy)

        let visited_states_actions = []
        For Each s, a, G in sag_list:
            # first-visit: only count the first occurrence of (s, a)
            if not visited_states_actions.contains(s, a):
                r[s, a].append(G)
                Q[s][a] = average(r[s, a])
                visited_states_actions.add(s, a)
        End For

        # update the policy
        For Each s in policy:
            # select the action that produces the max Q value for state s
            policy[s] = action_with_max(Q[s])
        End For
    End Loop
```
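Putting evaluation and improvement together, here is a runnable Python sketch of the loop above on a made-up 4-state chain (states 0–3, actions left/right, reward 1 on reaching terminal state 3). The environment, the episode-length cap, and all constants are illustrative, not from the article:

```python
import random
from collections import defaultdict

GAMMA = 0.9
TERMINAL = 3
ACTIONS = ["left", "right"]

def make_move(s, a):
    """Stand-in deterministic environment for the chain."""
    s2 = min(s + 1, TERMINAL) if a == "right" else max(s - 1, 0)
    return s2, (1 if s2 == TERMINAL else 0)

def run_episode(policy, max_steps=50):
    """Exploring start: random state and random first action, then follow the policy."""
    s = random.randint(0, TERMINAL - 1)
    a = random.choice(ACTIONS)
    sar_list = []
    for _ in range(max_steps):          # cap length: a bad policy may never terminate
        s2, r = make_move(s, a)
        sar_list.append((s, a, r))
        if s2 == TERMINAL:
            break
        s, a = s2, policy[s2]
    # backward pass: accumulate discounted returns
    G, sag_list = 0.0, []
    for s, a, r in reversed(sar_list):
        G = r + GAMMA * G
        sag_list.append((s, a, G))
    sag_list.reverse()                  # chronological order, for first-visit checks
    return sag_list

def improve_policy(episodes=2000, seed=0):
    random.seed(seed)
    policy = {s: random.choice(ACTIONS) for s in range(TERMINAL)}
    returns = defaultdict(list)         # (s, a) -> observed returns
    Q = defaultdict(float)              # unseen pairs default to 0.0
    for _ in range(episodes):
        seen = set()
        for s, a, G in run_episode(policy):
            if (s, a) not in seen:      # first-visit averaging
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                seen.add((s, a))
        for s in policy:                # greedy policy update
            policy[s] = max(ACTIONS, key=lambda a: Q[(s, a)])
    return policy, Q
```

With enough episodes the policy should converge to taking right in every state; in particular Q for (2, right) is exactly 1.0, since that move always ends the episode immediately.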

Policy improvement in the case without Exploring Start is similar to the Exploring Start version, but this time we call Evaluate_Policy_Without_Exploring_Start(policy).

```
Improve_Policy_Without_Exploring_Start():
    # create a random starting policy: a random action for each state
    let policy = select_random_action_for_each_state()
    let Q = [][]
    let r = []

    Loop large number of times:
        # Evaluate the policy:
        let sag_list = Evaluate_Policy_Without_Exploring_Start(policy)

        let visited_states_actions = []
        For Each s, a, G in sag_list:
            # first-visit: only count the first occurrence of (s, a)
            if not visited_states_actions.contains(s, a):
                r[s, a].append(G)
                Q[s][a] = average(r[s, a])
                visited_states_actions.add(s, a)
        End For

        # update the policy
        For Each s in policy:
            # select the action that produces the max Q value for state s
            policy[s] = action_with_max(Q[s])
        End For
    End Loop
```

# Conclusion

The Monte Carlo method has an advantage over Dynamic Programming in that it does not need to know the transition probabilities and the reward system beforehand; requiring such information is often unrealistic.