Reinforcement learning

Reinforcement learning deals with sequential decision-making problems in which an agent:

  1. Observes the state of the environment
  2. Acts on the environment by performing some action
  3. Receives a reward from the environment, which evolves into a new state; this feedback lets the agent decide whether a specific action is good in a given state

The goal of the agent is to learn a policy (a prescription of actions: which action is best in every state) that maximizes the total reward.
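A minimal sketch of this interaction loop, using a made-up `ToyEnv` (not any particular library's API) and a purely random agent:

```python
import random

class ToyEnv:
    """A made-up two-state, two-action environment, used only to illustrate the loop."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Toy dynamics: the chosen action usually (80% of the time) becomes the next state,
        # and being in state 1 pays a reward of 1.
        self.state = action if random.random() < 0.8 else 1 - action
        reward = 1.0 if self.state == 1 else 0.0
        return self.state, reward


env = ToyEnv()
state = env.reset()                   # observe the initial state
total_reward = 0.0
for t in range(10):
    action = random.choice([0, 1])    # act on the environment (random policy, no learning yet)
    state, reward = env.step(action)  # the environment evolves and returns a reward
    total_reward += reward            # the feedback the agent would learn from
print("total reward collected:", total_reward)
```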

Agent behaviour (policies)

The agent's behaviour is modelled by the policy. A policy $\pi$ is a function that takes a state and an action and returns a probability, $\pi : S \times A \to [0, 1]$, such that $\pi(a \mid s)$ is the probability of performing action $a$ in state $s$.
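For finite state and action sets, such a stochastic policy can be stored as a table of probabilities. A tiny sketch (the numbers in `pi` are arbitrary; each row $\pi(\cdot \mid s)$ must sum to 1):

```python
import numpy as np

# pi[s, a] = probability of taking action a in state s (arbitrary example values)
pi = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
])

assert np.allclose(pi.sum(axis=1), 1.0)  # each row is a distribution over actions

rng = np.random.default_rng(0)

def sample_action(pi, s):
    """Draw an action a with probability pi[s, a]."""
    return rng.choice(pi.shape[1], p=pi[s])

print(sample_action(pi, s=0))
```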

Value function and action value function

The total reward is typically defined as the expected discounted cumulative reward. We can define the value function $V^\pi(s)$ of policy $\pi$ in state $s$:

$$V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{+\infty} \gamma^t R(s_t, a_t) \;\middle|\; s_0 = s,\ a_t \sim \pi(\cdot \mid s_t)\right]$$

  • $\mathbb{E}$ because it is an expectation (the process is stochastic)
  • $\sum_{t=0}^{+\infty}$ because it is a cumulative reward over a sequence of interactions
  • $\gamma^t$ because it is discounted: $\gamma \in [0,1]$, so $\gamma^t$ gets smaller and smaller
  • $R(s_t, a_t)$ is the reward for performing $a_t$ in $s_t$ (a small numeric sketch of this sum follows below)
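A quick numeric sketch of the discounted sum, using a made-up finite reward sequence:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute sum_t gamma^t * r_t for a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [1.0, 0.0, 1.0, 1.0]     # hypothetical rewards R(s_t, a_t)
print(discounted_return(rewards))  # 1*1 + 0.9*0 + 0.81*1 + 0.729*1 = 2.539
```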

The action value function $Q^\pi(s,a)$ is defined as:

$$Q^\pi(s,a) = \mathbb{E}\left[\sum_{t=0}^{+\infty} \gamma^t R(s_t, a_t) \;\middle|\; s_0 = s,\ a_0 = a,\ a_t \sim \pi(\cdot \mid s_t)\right]$$

Note that: $V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[Q^\pi(s,a)\right] = \sum_{a \in A} \pi(a \mid s)\, Q^\pi(s,a)$
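This identity is easy to check numerically on a tiny tabular example (the values in `Q` and `pi` below are arbitrary):

```python
import numpy as np

# Arbitrary action value table Q[s, a] and policy table pi[s, a]
Q = np.array([[1.0, 3.0],
              [0.5, 0.5]])
pi = np.array([[0.25, 0.75],
               [0.50, 0.50]])

V = (pi * Q).sum(axis=1)  # V[s] = sum_a pi(a|s) * Q(s, a)
print(V)                  # [2.5, 0.5]
```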

Reinforcement learning, in its tabular form, represents $Q^\pi(s,a)$ as a table, but the number of possible state-action pairs can be huge! We cannot afford to compute an exact $Q^\pi(s,a)$ table.

Policy optimality

A policy $\pi^*$ is optimal if and only if it maximizes the expected discounted cumulative reward in every state:

$$\pi^* \in \arg\max_\pi V^\pi(s)$$

Therefore, we denote:

  • $V^*(s) := V^{\pi^*}(s)$, the value function of the optimal policy
  • $Q^*(s,a) := Q^{\pi^*}(s,a)$, the action value function of the optimal policy

The optimal action to be played in a given state $s$, given by $\pi^*(s)$, can be defined in terms of the optimal Q-function: $\pi^*(s) \in \arg\max_a Q^*(s,a)$

In other words, we fix the state and then find the action that maximizes the Q-function.
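With a tabular $Q^*$, extracting this greedy policy is just a per-state argmax; the `Q_star` values below are hypothetical:

```python
import numpy as np

# Hypothetical optimal Q values, shape (n_states, n_actions)
Q_star = np.array([[1.0, 2.0, 0.5],
                   [0.1, 0.0, 3.0]])

pi_star = Q_star.argmax(axis=1)  # pi*(s) = argmax_a Q*(s, a)
print(pi_star)                   # [1, 2]
```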

Bellman equation

It is a recursive way to define the optimal Q-function:

$$Q^*(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \max_{a' \in A} Q^*(s', a')$$

The maximum amount of cumulative reward that you can collect starting from a state $s$ and an action $a$ is the immediate reward $R(s,a)$ plus the discount factor times "what you will collect by playing an optimal policy starting from the next state".
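When the transition model $P(s' \mid s, a)$ and the reward function $R(s,a)$ are known, the Bellman equation can be turned into a fixed-point iteration (Q-value iteration). A sketch on a made-up two-state, two-action MDP:

```python
import numpy as np

n_states, n_actions = 2, 2
gamma = 0.9

# P[s, a, s'] = transition probabilities, R[s, a] = immediate rewards (made-up MDP)
P = np.array([[[0.8, 0.2], [0.2, 0.8]],
              [[0.9, 0.1], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

Q = np.zeros((n_states, n_actions))
for _ in range(500):
    # Bellman update: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')
    Q = R + gamma * P @ Q.max(axis=1)

print(Q)
print("greedy policy:", Q.argmax(axis=1))
```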