I am studying the existing action-selection methods in reinforcement learning.
I found several methods, such as epsilon-greedy, softmax, upper confidence bound (UCB), and Thompson sampling.
I understand the principle behind each of them except Thompson sampling.
I can't follow how it works or what steps it takes to select an action.
Could someone explain the principle and mechanics of Thompson sampling with a simple example? I would be grateful.