Value function and Q-function
The total reward is typically defined as the expected discounted cumulative reward. We can then define the value function $V^\pi (s)$ of policy $\pi$ in state $s$:
$$V^\pi (s) = \mathbb{E}[\sum_{t=0}^{+\infty}{\gamma^t R(s_t, a_t)} \mid s_0 = s, a_t \in \pi(., s_t)]$$
- $\mathbb{E}$ because it is expected (stochastic)
- $\sum_{t=0}^{+\infty}$ because it is a cumulative reward on a sequence of interactions
- $\gamma^t$ because it is discounted: $\gamma \in [0,1]$, so $\gamma^t$ shrinks as $t$ grows and rewards far in the future count less
- $R(s_t, a_t)$ is the reward for performing $a_t$ in $s_t$
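To make the definition concrete, here is a minimal Monte Carlo sketch that estimates $V^\pi(s)$ by averaging discounted returns over rollouts; `sample_action` and `step` are hypothetical stand-ins for a stochastic policy $\pi(., s_t)$ and for the environment's transition and reward dynamics.

```python
# Minimal sketch: Monte Carlo estimate of V^pi(s0).
# `sample_action(s)` and `step(s, a)` are assumed, user-supplied functions.
def estimate_value(s0, sample_action, step, gamma=0.99, horizon=200, episodes=1000):
    total = 0.0
    for _ in range(episodes):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):        # truncate the infinite sum at `horizon`
            a = sample_action(s)        # a_t ~ pi(., s_t)
            s, r = step(s, a)           # next state and reward R(s_t, a_t)
            ret += discount * r         # accumulate gamma^t * R(s_t, a_t)
            discount *= gamma
        total += ret
    return total / episodes             # empirical estimate of the expectation
```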
The state-action value function (Q-function) $Q^\pi(s,a)$ is defined as:
$$Q^\pi (s,a) = \mathbb{E}[\sum_{t=0}^{+\infty}{\gamma^t R(s_t, a_t)} \mid s_0 = s, a_0=a, a_t \in \pi(., s_t)]$$
Note that the value function is the expectation of $Q^\pi$ over the actions chosen by the policy: $V^\pi (s) = \sum_{a \in A} \pi(a,s)\,Q^\pi (s,a)$
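As a quick illustration of this identity, assuming a tabular setting where the policy and the Q-function are stored as NumPy arrays (`pi[s, a]` for $\pi(a,s)$ and `q_table[s, a]` for $Q^\pi(s,a)$), the value function is the policy-weighted average of the Q-values:

```python
import numpy as np

# Assumed tabular setting: pi[s, a] stores pi(a, s), q_table[s, a] stores Q^pi(s, a).
def value_from_q(q_table: np.ndarray, pi: np.ndarray) -> np.ndarray:
    # V[s] = sum_a pi(a, s) * Q^pi(s, a)
    return (pi * q_table).sum(axis=1)
```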
In the tabular setting, $Q^\pi (s,a)$ is represented as a table with one entry per state-action pair, but the number of such pairs can be huge (e.g., $10^6$ states and $100$ actions already give $10^8$ entries). We often cannot afford to compute an exact $Q^\pi (s,a)$ table.
Policy optimality
A policy $\pi^*$ is optimal if and only if it maximizes the expected discounted cumulative reward:
$$\pi^* \in \arg\max_\pi V^\pi(s) \quad \forall s \in S$$
Therefore, we denote:
- $V^*(s):=V^{\pi^*}(s)$ the value function of the optimal policy
- $Q^*(s,a):=Q^{\pi^*}(s,a)$ the action-value function (Q-function) of the optimal policy
The optimal action to play in a given state $s$, given by $\pi^*(s)$, can be defined in terms of the optimal Q-function: $$\pi^*(s) \in \arg\max_a Q^*(s,a)$$
In other words, we fix the state and pick the action that maximizes the Q-function.
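In code, and assuming $Q^*$ is stored as a table `q_star[s, a]`, extracting the greedy action is a single `argmax` over the action axis:

```python
import numpy as np

# Assumed tabular setting: q_star[s, a] stores Q*(s, a).
def greedy_action(q_star: np.ndarray, s: int) -> int:
    # pi*(s) in argmax_a Q*(s, a)
    return int(np.argmax(q_star[s]))
```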
Bellman equation
The Bellman equation characterizes the optimal Q-function recursively:
$$Q^*(s,a) = R(s,a) + \gamma \sum_{s' \in S}{P(s' \mid s, a)\max_{a' \in A}{Q^*(s',a')}}$$
The maximum amount of cumulative reward that you can collect starting from a state $s$ and an action $a$ is the immediate reward $R(s,a)$ plus the discount factor times "what you will collect by playing an optimal policy starting from the next state".
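One way to read this recursion operationally is Q-value iteration: start from an arbitrary Q-table and repeatedly apply the right-hand side as an update until it stops changing. The sketch below assumes a known tabular model, with `R[s, a]` storing $R(s,a)$ and `P[s, a, s']` storing $P(s' \mid s, a)$.

```python
import numpy as np

# Assumed known tabular model: R has shape (S, A), P has shape (S, A, S).
def q_value_iteration(R: np.ndarray, P: np.ndarray, gamma: float = 0.9,
                      tol: float = 1e-8, max_iters: int = 10_000) -> np.ndarray:
    q = np.zeros_like(R)
    for _ in range(max_iters):
        v_next = q.max(axis=1)             # max_{a'} Q(s', a') for every next state s'
        q_new = R + gamma * P @ v_next     # Bellman optimality backup, shape (S, A)
        if np.max(np.abs(q_new - q)) < tol:
            return q_new
        q = q_new
    return q
```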