Introduction and Markov decision processes
Reinforcement learning deals with sequential decision making problems where an agent:
- Observes the state of the environment
- Acts on the environment performing some action
- The environment evolves to a new state and provides the agent with a reward: a feedback signal the agent uses to judge whether a specific action is good in a given state
The goal of the agent is to learn a policy (a prescription of which action to take in every state) that maximizes the total reward.
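As a concrete picture of this interaction loop, here is a minimal Python sketch. The `env` object with `reset`/`step` methods and the `policy` function are hypothetical placeholders (a Gym-style interface assumed here, not defined in these notes).

```python
def run_episode(env, policy, max_steps=100):
    """Minimal sketch of the agent-environment loop (interface is assumed)."""
    state = env.reset()          # observe the initial state of the environment
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # act on the environment
        state, reward, done = env.step(action)   # environment evolves, returns a reward
        total_reward += reward   # feedback used to judge how good the action was
        if done:
            break
    return total_reward
```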
Markov decision processes
- S state space: the possible states
- A action space: the possible actions
- P transition model: determines the next state when a given action is taken in a given state. We allow the transition model to be stochastic: $P(s'|s,a) \in [0,1]$ is the probability of moving to state $s'$ after taking action $a$ in state $s$.
- R reward function: $R : S \times A \to \mathbb{R}$
- $\gamma$ discount factor: how much I care about the future rewards compared to the immediate rewards. $\gamma \in [0, 1]$
- $\mu_0$ initial state distribution: $\mu_0 : S \to [0,1]$, where $\mu_0(s)$ is the probability that the interaction starts in state $s$.
In this model we consider a single-agent environment, and the transition model satisfies the Markov property: the distribution of the next state depends only on the current state and action, and is independent of the history of states and actions observed so far.
P and R are unknown: to act well in the environment, the agent must interact with it and learn from experience.
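To make the tuple $(S, A, P, R, \gamma, \mu_0)$ concrete, here is a toy MDP written out as plain Python data; the two states, two actions, and all the numbers are invented purely for illustration.

```python
import random

# Toy MDP (S, A, P, R, gamma, mu_0) as plain Python data (values are made up).
S = ["s0", "s1"]
A = ["left", "right"]

# Transition model P(s'|s,a): for each (s, a), a distribution over next states.
P = {
    ("s0", "left"):  {"s0": 0.9, "s1": 0.1},
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
    ("s1", "left"):  {"s0": 0.7, "s1": 0.3},
    ("s1", "right"): {"s0": 0.0, "s1": 1.0},
}

# Reward function R(s, a) -> real number.
R = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
     ("s1", "left"): 0.5, ("s1", "right"): -1.0}

gamma = 0.9                      # discount factor
mu_0 = {"s0": 1.0, "s1": 0.0}    # initial state distribution

def sample(dist):
    """Draw one outcome from a {value: probability} dictionary."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def step(s, a):
    """One transition: sample the next state from P(.|s,a) and return R(s,a)."""
    return sample(P[(s, a)]), R[(s, a)]
```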
Agent behaviour (policies)
The agent behaviour is modelled with a policy. A policy $\pi$ is a function taking a state and an action and returning a probability, $\pi : S \times A \to [0, 1]$, such that $\pi(a|s)$ is the probability of performing action $a$ in state $s$.
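As a sketch, a stochastic policy over the toy MDP above can be stored as a table of per-state action distributions; the probabilities are again invented for illustration.

```python
import random

# pi(a|s) as a table: for each state, a distribution over actions (example values).
pi = {
    "s0": {"left": 0.3, "right": 0.7},
    "s1": {"left": 0.6, "right": 0.4},
}

def act(pi, s):
    """Sample an action a with probability pi(a|s)."""
    dist = pi[s]
    return random.choices(list(dist), weights=list(dist.values()))[0]
```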