For simplicity, let's consider the discrete version of BCQ, for which both the paper and the code are available. In line 5 of Algorithm 1 we have the following:

$$ a' = \operatorname{argmax}_{a' \,\text{s.t.}\, G_{\omega}(a' \mid s') / \max_{\hat{a}} G_{\omega}(\hat{a} \mid s') \,>\, \tau} Q_{\theta}(s', a') $$

I have doubts about the generative model $G_{\omega}$. In behavior cloning, it would be an action obtained via supervised learning mapping states to actions. In BCQ, instead, it looks like a probability distribution over the available actions. Am I right? And what is the action $\hat{a}$?

EDIT: As far as I understand, we compare the probability of the action $a'$ to that of $\hat{a}$, and if the ratio is above the threshold $\tau$ we use the action when calculating the loss. The question is now how I should proceed if this is not true: should I take the next action with the highest Q-value?

1 Answer

The algorithm can be summarized by the following equation, as done in this post:

$$\mathcal{L}(\theta) = \ell_k \left( r + \gamma \cdot \left( \max_{a' \,\text{s.t.}\, \frac{G_\omega(a' \mid s')}{\max_{\hat{a}} G_\omega(\hat{a} \mid s')} > \tau} Q_{\theta'}(s', a') \right) - Q_{\theta}(s, a) \right)$$
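Before plugging in numbers, here is a minimal sketch of this loss for a single transition. It assumes NumPy, assumes $\ell_k$ denotes a Huber-style loss, and the names r, gamma, q_sa, q_next, probs, and tau are hypothetical placeholder inputs:

import numpy as np

def huber(x, k=1.0):
    # Huber-style loss; assumed here to be what l_k denotes
    return np.where(np.abs(x) <= k, 0.5 * x ** 2, k * (np.abs(x) - 0.5 * k))

def bcq_loss(r, gamma, q_sa, q_next, probs, tau):
    # keep only next actions whose relative probability under G_omega exceeds tau
    mask = probs / np.amax(probs) > tau
    # filtered-out actions are pushed to a large negative value before the max
    max_future_q = np.amax(mask * q_next + (1.0 - mask) * -1e8)
    td_error = r + gamma * max_future_q - q_sa
    return huber(td_error)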

Let's imagine that we have the following target network values $Q_{\theta'}(s', a')$ for 3 discrete actions: q = np.array([15.2, 15.0, 15.4])

and that the generative model $G_\omega$ outputs the probabilities: p = np.array([0.2, 0.5, 0.3])

Now let's consider $\tau = 0.3$; then we have:

import numpy as np  # with q and p as defined above
tau = 0.3
mask = p / np.amax(p) > tau  # relative probabilities [0.4, 1.0, 0.6] all exceed tau
max_future_q = np.amax(mask * q + (1.0 - mask) * -1e8)  # filtered-out actions get -1e8

Thus, max_future_q equals 15.4: all three relative probabilities (0.4, 1.0, 0.6) exceed 0.3, so no action is filtered out and the plain maximum is taken. Depending on the values of $Q_{\theta'}$ and $p$, we filter the actions according to $\tau$, such that $\tau = 0$ recovers standard Q-learning and $\tau = 1$ reduces to behavioral cloning.
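To see those two extremes with the same numbers, here is a small sketch (the helper name filtered_max_q is my own; note that with the strict inequality a $\tau$ of exactly 1 would filter out every action, so a value just below 1 stands in for the behavioral-cloning limit):

import numpy as np

q = np.array([15.2, 15.0, 15.4])  # target Q-values for the 3 actions
p = np.array([0.2, 0.5, 0.3])     # generative-model probabilities

def filtered_max_q(q, p, tau):
    # hypothetical helper: max Q-value over actions whose relative
    # probability under the generative model exceeds tau
    mask = p / np.amax(p) > tau
    return np.amax(mask * q + (1.0 - mask) * -1e8)

print(filtered_max_q(q, p, 0.0))   # 15.4: every action passes, plain max-Q (Q-learning limit)
print(filtered_max_q(q, p, 0.99))  # 15.0: only the most likely action survives (behavioral cloning limit)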
