
Value function and Q-function

The total reward is typically defined as the expected discounted cumulative reward. We can define the value function $V^\pi (s)$ of policy $\pi$ in state $s$:

$$V^\pi (s) = \mathbb{E}[\sum_{t=0}^{+\infty}{\gamma^t R(s_t, a_t)} \mid s_0 = s, a_t \in \pi(., s_t)]$$

  • $\mathbb{E}$ because it is expected (stochastic)
  • $\sum_{t=0}^{+\infty}$ because it is a cumulative reward on a sequence of interactions
  • $\gamma^t$ because it is discounted. $\gamma \in [0,1]$, so $\gamma^t$ never grows with $t$, and for $\gamma < 1$ it shrinks toward zero, making distant rewards count less
  • $R(s_t, a_t)$ is the reward for performing $a_t$ in $s_t$
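
To make the discounted sum concrete, here is a minimal sketch that computes it for a single, finite list of rewards. The function name `discounted_return` and the sample rewards are illustrative assumptions; the value function itself is the expectation of this quantity over many trajectories, which this snippet does not estimate.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for one finite trajectory."""
    total = 0.0
    for t, reward in enumerate(rewards):
        # The reward received at step t is weighted by gamma**t.
        total += (gamma ** t) * reward
    return total


# Hypothetical trajectory: a reward of 1 at each of three steps.
sample_rewards = [1.0, 1.0, 1.0]
print(discounted_return(sample_rewards, 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With $\gamma = 1$ the same call returns the plain sum of rewards, and with $\gamma = 0$ only the first reward counts.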

The state-action value function (Q-function) $Q^\pi(s,a)$ additionally fixes the first action and is defined as:

$$Q^\pi (s,a) = \mathbb{E}[\sum_{t=0}^{+\infty}{\gamma^t R(s_t, a_t)} \mid s_0 = s, a_0=a, a_t \in \pi(., s_t)]$$

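Note that $V^\pi$ is the average of $Q^\pi$ over the actions chosen by the policy, so no additional expectation is needed: $V^\pi (s) = \sum_{a \in A} \pi(a,s)Q^\pi (s,a)$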
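
The policy-weighted sum over actions in this identity can be sketched for a single state as follows. The dictionaries `pi_s` and `q_s` hold made-up numbers and `state_value` is a hypothetical helper name, not something defined in the text.

```python
def state_value(action_probs, q_values):
    """Weight each Q(s, a) by the probability that the policy picks a in s."""
    return sum(action_probs[a] * q_values[a] for a in action_probs)


# Assumed policy and Q-values for one state s with two actions.
pi_s = {"left": 0.25, "right": 0.75}
q_s = {"left": 2.0, "right": 4.0}
print(state_value(pi_s, q_s))  # 0.25 * 2.0 + 0.75 * 4.0 = 3.5
```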

Classical (tabular) reinforcement learning assumes that $Q^\pi (s,a)$ is represented as a table with one entry per state-action pair, but the number of such pairs can be huge. In that case we cannot afford to store or compute an exact $Q^\pi (s,a)$ table.

Policy optimality

A policy $\pi^*$ is optimal if and only if it maximizes the expected discounted cumulative reward:

$$\pi^* \in \arg\max_\pi V^\pi(s) \quad \text{for all } s \in S$$

Therefore, we denote:

  • $V^*(s):=V^{\pi^*}(s)$ the value function of the optimal policy
  • $Q^*(s,a):=Q^{\pi^*}(s,a)$ the action-value function (Q-function) of the optimal policy

The optimal action to play in a given state $s$, given by $\pi^*(s)$, can be defined in terms of the optimal Q-function: $$\pi^*(s) \in \arg\max_a Q^*(s,a)$$

In other words, we fix the state and then pick the action that maximizes the Q-function.
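
As an illustration of this argmax, the sketch below reads a greedy action out of a small Q-table stored as nested dictionaries. The table `q_table` and its values are invented for the example, and ties are broken by dictionary order, which is an arbitrary choice.

```python
def greedy_action(q_table, state):
    """Fix the state, then return the action with the largest Q-value."""
    action_values = q_table[state]
    return max(action_values, key=action_values.get)


# Hypothetical Q-values for two states and two actions.
q_table = {
    "s0": {"left": 1.0, "right": 2.5},
    "s1": {"left": 0.0, "right": -1.0},
}
print(greedy_action(q_table, "s0"))  # right
print(greedy_action(q_table, "s1"))  # left
```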

Bellman equation

The Bellman equation gives a recursive definition of the optimal Q-function.

$$Q^*(s,a) = R(s,a) + \gamma \sum_{s' \in S}{P(s'|s, a) \max_{a' \in A}{Q^*(s',a')}}$$

The maximum amount of cumulative reward that you can expect to collect starting from a state $s$ and an action $a$ is the immediate reward $R(s,a)$ plus the discount factor times "what you will collect by playing an optimal policy starting from the next state", averaged over the possible next states $s'$.
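
One way to see the recursion at work is to apply the right-hand side of the Bellman equation repeatedly as an update until the table stops changing, a procedure usually called value iteration on Q. The text above does not prescribe this method. The sketch below runs it on an invented two-state MDP, where the states, actions, rewards `R`, transitions `P`, and discount `GAMMA` are all assumptions chosen so the result can be checked by hand.

```python
GAMMA = 0.5
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]

# Assumed rewards R(s, a) for the toy MDP.
R = {("A", "stay"): 0.0, ("A", "go"): 1.0,
     ("B", "stay"): 2.0, ("B", "go"): 0.0}

# Assumed transitions P(s' | s, a): "stay" keeps the state, "go" switches it.
P = {("A", "stay"): {"A": 1.0}, ("A", "go"): {"B": 1.0},
     ("B", "stay"): {"B": 1.0}, ("B", "go"): {"A": 1.0}}


def bellman_backup(q):
    """Apply the right-hand side of the Bellman equation once to every (s, a)."""
    new_q = {}
    for s in STATES:
        for a in ACTIONS:
            expected_next = sum(
                prob * max(q[(next_s, next_a)] for next_a in ACTIONS)
                for next_s, prob in P[(s, a)].items()
            )
            new_q[(s, a)] = R[(s, a)] + GAMMA * expected_next
    return new_q


def q_value_iteration(sweeps=100):
    """Start from an all-zero table and repeat the backup a fixed number of times."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(sweeps):
        q = bellman_backup(q)
    return q


q_star = q_value_iteration()
for key in sorted(q_star):
    print(key, round(q_star[key], 6))
```

In this toy problem, staying in B forever yields $2 / (1 - 0.5) = 4$, so the table should approach $Q^*(B, \text{stay}) = 4$, $Q^*(A, \text{go}) = 1 + 0.5 \cdot 4 = 3$, and $1.5$ for the two remaining entries. The fixed number of sweeps is a simplification; a practical implementation would more likely stop once the largest change between sweeps falls below a tolerance.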