Q-learning
We follow an exploration policy to collect experience from the environment, and we maintain a progressively updated estimate of the optimal Q-function by applying a sample-based version of the Bellman optimality equation.
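At the beginning, the table $Q$ is initialized with random values. Then, at each time $t$, after taking action $a$ in state $s$ and observing the reward $r$ and the next state $s'$, we update:
$$Q(s,a) \gets (1-\alpha)Q(s,a) + \alpha\left(r + \gamma \max_{a' \in A} Q(s',a')\right)$$
where $\alpha$ is the learning rate, $\gamma$ is the discount factor, and $Q$ is the current estimate of the optimal Q-function $Q^*$.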
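
As a rough illustration, a single tabular update and a simple exploration policy might be sketched in Python as below. The function names `epsilon_greedy` and `q_update`, the list-of-lists table layout, and the toy numbers are assumptions made for this example. Epsilon-greedy is only one possible exploration policy, and the table is filled with fixed values here (rather than random ones) so the arithmetic is easy to follow.

```python
import random


def epsilon_greedy(Q, s, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise a greedy one.

    Q is assumed to be a list of rows, one row of action values per state.
    """
    n_actions = len(Q[s])
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    # Greedy choice: index of the largest action value in state s.
    return max(range(n_actions), key=lambda a: Q[s][a])


def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply one Q-learning update in place and return the new Q[s][a]."""
    # Sample-based Bellman target: reward plus the discounted best
    # action value in the next state s_next.
    target = r + gamma * max(Q[s_next])
    # Blend the old estimate and the target using the learning rate alpha.
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
    return Q[s][a]


# Hypothetical toy table: 3 states, 2 actions.
Q = [[0.0, 0.0], [0.5, 2.0], [0.0, 0.0]]

# One observed transition: in state 0, action 1 gave reward 1.0 and led to state 1.
new_value = q_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(new_value)
```

Here the target is $1.0 + 0.9 \times 2.0 = 2.8$, so the updated entry is $0.5 \times 0 + 0.5 \times 2.8$, roughly $1.4$. In a full training loop, `epsilon_greedy` would choose the action at each step and `q_update` would be called on every observed transition.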