
In some implementations of off-policy Q-learning, we need to know the action probabilities given by the behavior policy $\mu(a)$ (e.g., if we want to use importance sampling).
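For context, the reason $\mu(a)$ is needed for importance sampling is that it sits in the denominator of the per-step correction ratio (writing $\mu(a)$ for the state-conditioned probability, as above):

$$\rho = \frac{\pi(a)}{\mu(a)},$$

where $\pi$ is the target policy.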

In my case, I am using Deep Q-Learning and selecting actions with Thompson Sampling. I implemented this following the approach in "What My Deep Model Doesn't Know...": I added dropout to my Q-network, and I select actions by performing a single stochastic forward pass through the Q-network (i.e., with dropout enabled) and choosing the action with the highest Q-value.
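For reference, my action selection looks roughly like this minimal PyTorch sketch (just for illustration; the `QNetwork` layer sizes and dropout rate are placeholders, not my actual architecture):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Placeholder Q-network with dropout between hidden layers."""
    def __init__(self, state_dim, n_actions, p_drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(128, 128), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(128, n_actions),
        )

    def forward(self, state):
        return self.net(state)

def select_action(q_net, state):
    """One stochastic forward pass (dropout active), then greedy action."""
    q_net.train()          # keep dropout enabled at action-selection time
    with torch.no_grad():  # no gradients needed for acting
        q_values = q_net(state.unsqueeze(0))
    return int(torch.argmax(q_values, dim=1).item())
```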

So, how can I calculate $\mu(a)$ when using Thompson Sampling based on dropout?

nicolas

1 Answer


So, how can I calculate $\mu(a)$ when using Thompson Sampling based on dropout?

The only way I can see to calculate this is to iterate over all possible dropout masks or, as an approximation, to sample the chosen action with, say, 100 or 1000 different dropout masks and use the resulting frequencies as a rough estimate of the distribution.
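As a rough sketch of that approximation (assuming the same kind of dropout Q-network described in the question, with PyTorch-style dropout kept active; the sample count is arbitrary):

```python
import torch
from collections import Counter

def estimate_mu(q_net, state, n_actions, n_samples=1000):
    """Monte Carlo estimate of mu(a | s): run many stochastic forward
    passes (each with a different dropout mask) and count how often
    each action comes out as the argmax."""
    q_net.train()  # dropout stays active, giving a new mask per pass
    counts = Counter()
    with torch.no_grad():
        for _ in range(n_samples):
            q_values = q_net(state.unsqueeze(0))
            counts[int(torch.argmax(q_values, dim=1).item())] += 1
    return [counts[a] / n_samples for a in range(n_actions)]
```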

I don't think this is feasible for practical reasons: the extra computation would slow the agent's learning down so much that you might as well abandon Thompson Sampling and use epsilon-greedy instead. So, if you want to use action-selection techniques for which there is no easy way to calculate the action distribution, you will have to avoid importance sampling.

Many forms of Q-learning do not use importance sampling. These typically just reset the eligibility traces whenever the selected action differs from the maximising action.
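For example, tabular Watkins-style Q(λ) handles off-policy actions this way: no importance sampling, the traces are simply cut after an exploratory action. A rough sketch of one update step (all names here are illustrative):

```python
import numpy as np

def q_lambda_step(Q, E, s, a, r, s_next, a_next, alpha, gamma, lam):
    """One tabular Watkins-style Q(lambda) update: no importance sampling,
    traces are zeroed when the behaviour action is not the greedy one."""
    greedy_next = np.argmax(Q[s_next])
    delta = r + gamma * Q[s_next, greedy_next] - Q[s, a]
    E[s, a] += 1.0                      # accumulate trace for visited pair
    Q += alpha * delta * E              # update all traced state-action values
    if a_next == greedy_next:
        E *= gamma * lam                # decay traces while acting greedily
    else:
        E[:] = 0.0                      # cut traces after an exploratory action
    return Q, E
```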

Neil Slater