
In some implementations of off-policy Q-learning, we need to know the action probabilities given by the behavior policy $\mu(a)$ (e.g., if we want to use importance sampling).
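For context, the reason $\mu(a)$ is needed for importance sampling is that it sits in the denominator of the per-step correction ratio (writing $\mu(a)$ for the state-conditioned probability, as above):

$$\rho = \frac{\pi(a)}{\mu(a)},$$

where $\pi$ is the target policy.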

In my case, I am using Deep Q-Learning and selecting actions with Thompson Sampling. I implemented this following the approach in "What My Deep Model Doesn't Know...": I added dropout to my Q-network, and I select actions by performing a single stochastic forward pass through the Q-network (i.e., with dropout enabled) and choosing the action with the highest Q-value.
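For reference, my action selection looks roughly like this minimal PyTorch sketch (just for illustration; the `QNetwork` layer sizes and dropout rate are placeholders, not my actual architecture):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Placeholder Q-network with dropout between hidden layers."""
    def __init__(self, state_dim, n_actions, p_drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(128, 128), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(128, n_actions),
        )

    def forward(self, state):
        return self.net(state)

def select_action(q_net, state):
    """One stochastic forward pass (dropout active), then greedy action."""
    q_net.train()          # keep dropout enabled at action-selection time
    with torch.no_grad():  # no gradients needed for acting
        q_values = q_net(state.unsqueeze(0))
    return int(torch.argmax(q_values, dim=1).item())
```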

So, how can I calculate $\mu(a)$ when using Thompson Sampling based on dropout?

nicolas

1 Answer


So, how can I calculate $\mu(a)$ when using Thompson Sampling based on dropout?

The only way I can see to calculate this is to iterate over all possible dropout masks or, as an approximation, to sample the chosen action with, say, 100 or 1000 different dropout masks and use the resulting frequencies as a rough estimate of the distribution.
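As a rough sketch of that approximation (assuming the same kind of dropout Q-network described in the question, with PyTorch-style dropout kept active; the sample count is arbitrary):

```python
import torch
from collections import Counter

def estimate_mu(q_net, state, n_actions, n_samples=1000):
    """Monte Carlo estimate of mu(a | s): run many stochastic forward
    passes (each with a different dropout mask) and count how often
    each action comes out as the argmax."""
    q_net.train()  # dropout stays active, giving a new mask per pass
    counts = Counter()
    with torch.no_grad():
        for _ in range(n_samples):
            q_values = q_net(state.unsqueeze(0))
            counts[int(torch.argmax(q_values, dim=1).item())] += 1
    return [counts[a] / n_samples for a in range(n_actions)]
```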

I don't think this is feasible for practical reasons: the extra computation would slow the agent's learning down so much that you might as well abandon Thompson Sampling and use epsilon-greedy instead. So, if you want to use action-selection techniques for which there is no easy way to calculate the action distribution, you will have to avoid importance sampling.

Many forms of Q-learning do not use importance sampling. These typically just reset the eligibility traces whenever the selected action differs from the maximising action.
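For example, tabular Watkins-style Q(λ) handles off-policy actions this way: no importance sampling, the traces are simply cut after an exploratory action. A rough sketch of one update step (all names here are illustrative):

```python
import numpy as np

def q_lambda_step(Q, E, s, a, r, s_next, a_next, alpha, gamma, lam):
    """One tabular Watkins-style Q(lambda) update: no importance sampling,
    traces are zeroed when the behaviour action is not the greedy one."""
    greedy_next = np.argmax(Q[s_next])
    delta = r + gamma * Q[s_next, greedy_next] - Q[s, a]
    E[s, a] += 1.0                      # accumulate trace for visited pair
    Q += alpha * delta * E              # update all traced state-action values
    if a_next == greedy_next:
        E *= gamma * lam                # decay traces while acting greedily
    else:
        E[:] = 0.0                      # cut traces after an exploratory action
    return Q, E
```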

Neil Slater