In some implementations of off-policy Q-learning, we need to know the action probabilities given by the behavior policy $\mu(a)$ (e.g., if we want to use importance sampling).
In my case, I am using Deep Q-Learning and selecting actions via Thompson Sampling. I implemented this following the approach in "What My Deep Model Doesn't Know...": I added dropout to my Q-network, and I select each action by performing a single stochastic forward pass through the Q-network (i.e., with dropout enabled) and choosing the action with the highest sampled Q-value.
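To make the setup concrete, here is a minimal sketch of that action-selection step (PyTorch; the `QNetwork` architecture, layer sizes, and dropout rate are illustrative assumptions, not my actual model):

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Toy Q-network with dropout; sizes are placeholders."""

    def __init__(self, state_dim: int, n_actions: int, p_drop: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Dropout(p=p_drop),
            nn.Linear(64, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


def thompson_action(q_net: QNetwork, state: torch.Tensor) -> int:
    """Pick an action from a single stochastic (dropout-enabled) forward pass."""
    q_net.train()  # keep dropout active so this pass samples one plausible Q-function
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0)).squeeze(0)
    return int(q_values.argmax().item())
```

Because the dropout mask is resampled on every forward pass, the same state can yield different greedy actions across calls, which is what makes the selection stochastic.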
So, how can I calculate $\mu(a)$ when using Thompson Sampling based on dropout?