Q-learning

We follow a policy to explore the environment and collect experience, while maintaining a progressively updated estimate of the optimal Q-function by applying a sample-based version of the Bellman equation.

At the beginning, the table $Q$ is initialized with random values; then, at each time step $t$, it is updated as follows:

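$$Q(s,a) \gets (1-\alpha)\,Q(s,a) + \alpha\left(r + \gamma \max_{a' \in A} Q(s',a')\right)$$

where $\alpha \in (0,1]$ is the learning rate, $\gamma$ is the discount factor, $r$ is the reward received after taking action $a$ in state $s$, and $s'$ is the resulting next state.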
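The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the chain environment in `step`, the epsilon-greedy exploration policy, and all names and default parameter values are assumptions introduced here for the example. The sketch also assumes the Q-values of the terminal state are fixed at zero, so that no future reward is bootstrapped from it.

```python
import random

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply one sample-based Bellman update to Q[s][a], in place."""
    best_next = max(Q[s_next])  # max over a' of Q(s', a')
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * best_next)
    return Q[s][a]

def epsilon_greedy(Q, s, epsilon, rng):
    """Exploration policy (an assumption): random action with probability epsilon."""
    n_actions = len(Q[s])
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[s][a])

def step(s, a, n_states):
    """Hypothetical chain environment: action 1 moves right, action 0 moves left.
    Reaching the last state gives reward 1 and ends the episode."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done

def train(n_states=4, episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    # Random initialization of the table, one row per state, one column per action.
    Q = [[rng.random() for _ in range(2)] for _ in range(n_states)]
    Q[n_states - 1] = [0.0, 0.0]  # assumption: terminal state has no future value
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = epsilon_greedy(Q, s, epsilon, rng)
            s_next, r, done = step(s, a, n_states)
            q_update(Q, s, a, r, s_next, alpha, gamma)
            s = s_next
    return Q
```

Calling `train()` returns the learned table. Note that the action is chosen by the exploration policy, while the update target always uses the maximum over next-state actions, regardless of which action is actually taken next.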