Basics of Reinforcement Learning, the Easy Way

9 min readAug 29, 2018

Reinforcement Learning (RL) is the problem of studying an agent in an environment: the agent has to interact with the environment in order to maximize some cumulative reward.

An example of RL is an agent in a labyrinth trying to find its way out. The faster it finds the exit, the greater the reward it gets.

Markov Decision Process (MDP)

To describe this problem mathematically, we use a Markov Decision Process (MDP).
An MDP describes the environment as follows.

  • An MDP is a collection of States, Actions, Transition Probabilities, Rewards, and a Discount Factor: (S, A, P, R, γ)
  • S is a finite set of states that describes the environment.
  • A is a finite set of actions that the agent can take.
  • P is a transition probability matrix that gives the probability of moving from one state to another.
  • R is a set of rewards that depend on the state and the action taken. Rewards are not necessarily positive; they should be seen as the outcome of an action taken by the agent in a certain state. A negative reward indicates a bad result, whereas a positive reward indicates a good result.
  • γ is a discount factor that tells how important future rewards are to the current state. The discount factor is a value between 0 and 1. A reward R that occurs N steps in the future from the current state is multiplied by γ^N to describe its importance to the current state. For example, consider γ = 0.9 and a reward R = 10 that is 3 steps ahead of our current state. The importance of this reward to us from where we stand is equal to (0.9³)·10 = 7.29.
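The MDP tuple above can be sketched as plain Python data structures. The maze layout, the specific rewards, and the transition probabilities below are illustrative assumptions, not from the article; only the discount computation at the end reproduces the article's example.

```python
# A minimal sketch of an MDP (S, A, P, R, gamma) for a toy labyrinth.
# States, actions, probabilities, and rewards are made-up assumptions.

states = ["start", "corridor", "exit"]   # S: finite set of states
actions = ["left", "right"]              # A: finite set of actions

# P[state][action] -> {next_state: probability}; each row sums to 1.
P = {
    "start":    {"left": {"start": 1.0},  "right": {"corridor": 1.0}},
    "corridor": {"left": {"start": 1.0},  "right": {"exit": 1.0}},
    "exit":     {"left": {"exit": 1.0},   "right": {"exit": 1.0}},
}

# R[state][action] -> immediate reward; negative means a bad outcome.
R = {
    "start":    {"left": -1.0, "right": -1.0},
    "corridor": {"left": -1.0, "right": 10.0},  # stepping to the exit pays off
    "exit":     {"left": 0.0,  "right": 0.0},
}

gamma = 0.9  # discount factor, between 0 and 1

def discounted(reward, steps_ahead, gamma):
    """Importance of a reward N steps ahead: gamma**N * reward."""
    return gamma ** steps_ahead * reward

# The article's example: gamma = 0.9, R = 10, 3 steps ahead.
print(round(discounted(10, 3, 0.9), 2))  # 0.9^3 * 10 = 7.29
```

Representing P as nested dictionaries keeps the sketch readable; for larger state spaces one would normally use a |S|×|A|×|S| array instead.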

Value Functions

Now, with the MDP in place, we have a description of the environment, but we still don't know how the agent should act in it.
The rule we impose on the agent is that it must act in a way to…