Reinforcement learning
Reinforcement learning deals with sequential decision-making problems where an agent:
- Observes the state of the environment
- Acts on the environment by performing some action
- The environment evolves to a new state and provides the agent with a reward: feedback that the agent uses to decide whether a specific action is good in a given state
The goal of the agent is to learn a policy (a prescription of actions, i.e., which action is best in every state) that maximizes the total reward.
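As an illustration, here is a minimal sketch of this interaction loop in Python. The `env` object with `reset`/`step` methods and the `policy` function are hypothetical placeholders (in the style of common RL environment interfaces), not something defined in these notes.

```python
# Minimal sketch of the agent-environment interaction loop.
# `env` and `policy` are hypothetical placeholders.
def run_episode(env, policy, max_steps=100):
    state = env.reset()                          # observe the initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # act on the environment
        state, reward, done = env.step(action)   # environment evolves, returns a reward
        total_reward += reward                   # feedback accumulated by the agent
        if done:
            break
    return total_reward
```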
Agent behaviour (policies)
The agent's behaviour is modelled by a policy. A policy is a function taking a state and an action and giving a probability, $\pi : S \times A \to [0, 1]$, such that $\pi(a \mid s)$ is the probability of performing action $a$ in state $s$.
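For a finite state and action space, such a stochastic policy can be stored as a table of probabilities. The sketch below, with made-up states and actions, samples an action according to $\pi(\cdot \mid s)$.

```python
import random

# Toy tabular stochastic policy: pi[s][a] is the probability of taking action a in state s.
pi = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.9, "right": 0.1},
}

def sample_action(pi, state):
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

sample_action(pi, "s0")  # returns "right" with probability 0.8
```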
Value function and action value function
The total reward is typically defined as the expected discounted cumulative reward. We can define the value function of policy $\pi$ in state $s$:

$$V^\pi(s) = \mathbb{E}\left[\sum_{t \geq 0} \gamma^t r_t \,\middle|\, s_0 = s, \pi\right]$$

- $\mathbb{E}[\cdot]$ because it is an expectation (the process is stochastic)
- $\sum_{t \geq 0}$ because it is a cumulative reward over a sequence of interactions
- $\gamma^t$ because it is discounted: since $0 \leq \gamma < 1$, the weight $\gamma^t$ gets smaller and smaller as $t$ grows
- $r_t$ is the reward obtained by performing action $a_t$ in state $s_t$
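To make the discounting concrete, the sketch below computes the discounted cumulative reward of a single trajectory of rewards; the value function is the expectation of this quantity over the trajectories generated by $\pi$. The reward values and $\gamma$ used here are arbitrary.

```python
# Discounted cumulative reward of one trajectory r_0, r_1, r_2, ...
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

discounted_return([1.0, 1.0, 1.0], gamma=0.9)  # 1 + 0.9 + 0.81 = 2.71
```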
The state-action value function (q-function) is defined as:

$$Q^\pi(s, a) = \mathbb{E}\left[\sum_{t \geq 0} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a, \pi\right]$$
Note that:

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \, Q^\pi(s, a)$$
Reinforcement learning assumes that $Q(s, a)$ is represented as a table, but the number of possible inputs can be huge! We cannot afford to compute an exact table.
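A tabular q-function can be sketched as a dictionary keyed by (state, action) pairs, as below; with $|S|$ states and $|A|$ actions the table needs $|S| \cdot |A|$ entries, which is why an exact table becomes infeasible for large or continuous state spaces. The entries shown are purely illustrative.

```python
from collections import defaultdict

# Exact tabular q-function: one entry per (state, action) pair.
Q = defaultdict(float)        # unseen (state, action) pairs default to 0.0
Q[("s0", "right")] = 1.5
Q[("s1", "left")] = 0.3
# With |S| states and |A| actions this table requires |S| * |A| entries.
```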
Policy optimality
A policy $\pi^*$ is optimal if and only if it maximizes the expected discounted cumulative reward:

$$\pi^* \in \arg\max_{\pi} \, \mathbb{E}\left[\sum_{t \geq 0} \gamma^t r_t \,\middle|\, \pi\right]$$
Therefore, we denote:
- $V^*(s)$ the value function of the optimal policy
- $Q^*(s, a)$ the action value function of the optimal policy
The optimal action to be played in a given state $s$, given by $\pi^*(s)$, can be defined in terms of the optimal q-function:

$$\pi^*(s) = \arg\max_{a} Q^*(s, a)$$
In other words, we fix the state and then find the action that maximizes the q-function.
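A sketch of this greedy extraction from a q-table, using a made-up table and action set:

```python
# Greedy policy extraction: pi*(s) = argmax_a Q*(s, a) for a fixed state s.
Q_star = {("s0", "left"): 0.3, ("s0", "right"): 1.5}

def greedy_action(Q, state, actions):
    return max(actions, key=lambda a: Q[(state, a)])

greedy_action(Q_star, "s0", ["left", "right"])  # -> "right"
```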
Bellman equation
The Bellman equation is a recursive way to define the optimal q-function.
The maximum amount of cumulative reward that you can collect starting from a state and an action is the immediate reward plus the discount factor times "what you will collect by playing an optimal policy starting from the next state".
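In symbols, using the notation above and assuming the expectation is taken over the next state $s'$ drawn from the environment dynamics, this is the standard Bellman optimality equation:

$$Q^*(s, a) = \mathbb{E}_{s'}\left[ r(s, a) + \gamma \max_{a'} Q^*(s', a') \right]$$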