Reinforcement learning

Reinforcement learning deals with sequential decision-making problems in which an agent:

  1. Observes the state of the environment
  2. Acts on the environment by performing some action
  3. Receives a reward from the environment, which evolves into a new state; this feedback lets the agent decide whether a specific action is good in a given state

The goal of the agent is to learn a policy (a prescription of actions: which action is best in every state) that maximizes the total reward.
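A minimal sketch of this interaction loop, using a made-up `ToyEnv` (not any particular library's API) and a purely random agent:

```python
import random

class ToyEnv:
    """A made-up two-state, two-action environment, used only to illustrate the loop."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Toy dynamics: the chosen action usually (80% of the time) becomes the next state,
        # and being in state 1 pays a reward of 1.
        self.state = action if random.random() < 0.8 else 1 - action
        reward = 1.0 if self.state == 1 else 0.0
        return self.state, reward


env = ToyEnv()
state = env.reset()                   # observe the initial state
total_reward = 0.0
for t in range(10):
    action = random.choice([0, 1])    # act on the environment (random policy, no learning yet)
    state, reward = env.step(action)  # the environment evolves and returns a reward
    total_reward += reward            # the feedback the agent would learn from
print("total reward collected:", total_reward)
```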

Agent behaviour (policies)

The agent's behaviour is modelled by the policy. A policy $\pi$ is a function that takes a state and an action and returns a probability, $\pi : S \times A \to [0, 1]$, such that $\pi(a \mid s)$ is the probability of performing action $a$ in state $s$.
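For finite state and action sets, such a stochastic policy can be stored as a table of probabilities. A tiny sketch (the numbers in `pi` are arbitrary; each row $\pi(\cdot \mid s)$ must sum to 1):

```python
import numpy as np

# pi[s, a] = probability of taking action a in state s (arbitrary example values)
pi = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
])

assert np.allclose(pi.sum(axis=1), 1.0)  # each row is a distribution over actions

rng = np.random.default_rng(0)

def sample_action(pi, s):
    """Draw an action a with probability pi[s, a]."""
    return rng.choice(pi.shape[1], p=pi[s])

print(sample_action(pi, s=0))
```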

Value function and action value function

The total reward is typically defined as the expected discounted cumulative reward. We can define the value function $V^\pi(s)$ of policy $\pi$ in state $s$:

$$V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{+\infty} \gamma^t R(s_t, a_t) \;\middle|\; s_0 = s,\ a_t \sim \pi(\cdot \mid s_t)\right]$$

  • $\mathbb{E}$ because it is an expectation (the process is stochastic)
  • $\sum_{t=0}^{+\infty}$ because it is a cumulative reward over a sequence of interactions
  • $\gamma^t$ because it is discounted: $\gamma \in [0,1]$, so $\gamma^t$ gets smaller and smaller
  • $R(s_t, a_t)$ is the reward for performing $a_t$ in $s_t$ (a small numeric sketch of this sum follows below)
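A quick numeric sketch of the discounted sum, using a made-up finite reward sequence:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute sum_t gamma^t * r_t for a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [1.0, 0.0, 1.0, 1.0]     # hypothetical rewards R(s_t, a_t)
print(discounted_return(rewards))  # 1*1 + 0.9*0 + 0.81*1 + 0.729*1 = 2.539
```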

The action value function $Q^\pi(s,a)$ is defined as:

$$Q^\pi(s,a) = \mathbb{E}\left[\sum_{t=0}^{+\infty} \gamma^t R(s_t, a_t) \;\middle|\; s_0 = s,\ a_0 = a,\ a_t \sim \pi(\cdot \mid s_t)\right]$$

Note that: $V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[Q^\pi(s,a)\right] = \sum_{a \in A} \pi(a \mid s)\, Q^\pi(s,a)$
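This identity is easy to check numerically on a tiny tabular example (the values in `Q` and `pi` below are arbitrary):

```python
import numpy as np

# Arbitrary action value table Q[s, a] and policy table pi[s, a]
Q = np.array([[1.0, 3.0],
              [0.5, 0.5]])
pi = np.array([[0.25, 0.75],
               [0.50, 0.50]])

V = (pi * Q).sum(axis=1)  # V[s] = sum_a pi(a|s) * Q(s, a)
print(V)                  # [2.5, 0.5]
```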

Reinforcement learning, in its tabular form, represents $Q^\pi(s,a)$ as a table, but the number of possible state-action pairs can be huge! We cannot afford to compute an exact $Q^\pi(s,a)$ table.

Policy optimality

A policy $\pi^*$ is optimal if and only if it maximizes the expected discounted cumulative reward in every state:

$$\pi^* \in \arg\max_\pi V^\pi(s)$$

Therefore, we denote:

  • $V^*(s) := V^{\pi^*}(s)$, the value function of the optimal policy
  • $Q^*(s,a) := Q^{\pi^*}(s,a)$, the action value function of the optimal policy

The optimal action to be played in a given state $s$, given by $\pi^*(s)$, can be defined in terms of the optimal Q-function: $\pi^*(s) \in \arg\max_a Q^*(s,a)$

In other words, we fix the state and then find the action that maximizes the Q-function.
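With a tabular $Q^*$, extracting this greedy policy is just a per-state argmax; the `Q_star` values below are hypothetical:

```python
import numpy as np

# Hypothetical optimal Q values, shape (n_states, n_actions)
Q_star = np.array([[1.0, 2.0, 0.5],
                   [0.1, 0.0, 3.0]])

pi_star = Q_star.argmax(axis=1)  # pi*(s) = argmax_a Q*(s, a)
print(pi_star)                   # [1, 2]
```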

Bellman equation

It is a recursive way to define the optimal Q-function:

$$Q^*(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \max_{a' \in A} Q^*(s', a')$$

The maximum amount of cumulative reward that you can collect starting from a state $s$ and an action $a$ is the immediate reward $R(s,a)$ plus the discount factor times "what you will collect by playing an optimal policy starting from the next state".
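When the transition model $P(s' \mid s, a)$ and the reward function $R(s,a)$ are known, the Bellman equation can be turned into a fixed-point iteration (Q-value iteration). A sketch on a made-up two-state, two-action MDP:

```python
import numpy as np

n_states, n_actions = 2, 2
gamma = 0.9

# P[s, a, s'] = transition probabilities, R[s, a] = immediate rewards (made-up MDP)
P = np.array([[[0.8, 0.2], [0.2, 0.8]],
              [[0.9, 0.1], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

Q = np.zeros((n_states, n_actions))
for _ in range(500):
    # Bellman update: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')
    Q = R + gamma * P @ Q.max(axis=1)

print(Q)
print("greedy policy:", Q.argmax(axis=1))
```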