
I'm trying to implement part of the paper *Active Learning for Reward Estimation in Inverse Reinforcement Learning* in code; I'm specifically referring to Section 2.3 of the paper.

Let's define $\mathcal{X}$ as the set of states, and $\mathcal{A}$ as the set of actions. We then sample a set of observations $\mathcal{D}$ from an agent which follows an optimal policy.

$$ \mathcal{D}=\left\{\left(x_{1}, a_{1}\right),\left(x_{2}, a_{2}\right), \ldots,\left(x_{n}, a_{n}\right)\right\} $$

Our goal is to find the reward vector $\mathbf{r}$ that maximises the total log-likelihood $\Lambda_{r}(\mathcal{D})$ (every time we compute a new $\mathbf{r}$, the likelihood is updated by recomputing the optimal action-value function $Q_{r}^{*}$ and taking the softmax over actions).

$$ L_{r}(x, a)=\mathbb{P}[(x, a) \mid r]=\frac{e^{\eta Q_{r}^{*}(x, a)}}{\sum_{b \in \mathcal{A}} e^{\eta Q_{r}^{*}(x, b)}} $$

$$ \Lambda_{r}(\mathcal{D})=\sum_{\left(x_{i}, a_{i}\right) \in \mathcal{D}} \log \left(L_{r}\left(x_{i}, a_{i}\right)\right) $$
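For context, this is how I currently compute the two quantities above in Python/NumPy. The value-iteration routine, the transition tensor `P` (shape `(n_actions, n_states, n_states)`) and the parameters `gamma`, `eta` are my own assumptions about the setup, not something taken from the paper:

```python
import numpy as np

def q_from_reward(r, P, gamma, n_iter=500):
    """Solve for Q*_r by value iteration, given a reward matrix r of shape
    (n_states, n_actions) and a transition tensor P of shape
    (n_actions, n_states, n_states)."""
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iter):
        V = Q.max(axis=1)                            # V*(x) = max_a Q*(x, a)
        Q = r + gamma * np.einsum('axy,y->xa', P, V)
    return Q

def likelihood(Q, eta):
    """L_r(x, a): softmax of eta * Q*_r(x, .) over the actions."""
    Z = np.exp(eta * Q)
    return Z / Z.sum(axis=1, keepdims=True)

def log_likelihood(L, D):
    """Lambda_r(D) = sum of log L_r(x_i, a_i) over the observed pairs."""
    return sum(np.log(L[x, a]) for (x, a) in D)
```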

Then, the paper suggests how to compute the derivatives w.r.t. $\mathbf{r}$ by defining the following quantities:

$$ \left[\nabla_{r} \Lambda_{r}(\mathcal{D})\right]_{x a}=\sum_{\left(x_{i}, a_{i}\right) \in \mathcal{D}} \frac{1}{L_{r}\left(x_{i}, a_{i}\right)} \frac{\partial L_{r}\left(x_{i}, a_{i}\right)}{\partial r_{x a}} $$

$$ \nabla_{r} L_{r}(x, a)=\frac{d L_{r}}{d Q^{*}}(x, a) \frac{d Q^{*}}{d r}(x, a) $$

Then, considering $\mathbf{T}=\mathbf{I}-\gamma \mathbf{P}_{\pi^{*}}$ (where $\mathbf{P}_{\pi^{*}}$ is the transition matrix induced by the optimal policy $\pi^{*}$),

$$ \frac{\partial Q^{*}}{\partial r_{z u}}(x, a)=\delta_{z u}(x, a)+\gamma \sum_{y \in \mathcal{X}} \mathrm{P}_{a}(x, y) \mathbf{T}^{-1}(y, z) \pi^{*}(z, u) $$

$$ \frac{d L_{r}}{d Q_{y b}^{*}}(x, a)=\eta L_{r}(x, a)\left(\delta_{y b}(x, a)-L_{r}(y, b) \delta_{y}(x)\right) $$

with $x, y \in \mathcal{X}$ and $a, b \in \mathcal{A}$. In the above expressions, $\delta$ denotes the Kronecker delta, e.g. $\delta_{yb}(x, a) = 1$ if $(x, a) = (y, b)$ and $0$ otherwise, and $\delta_{y}(x) = 1$ if $x = y$.
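This is how I read those two partial derivatives in code. Storing them as 4-D tensors indexed `[x, a, z, u]` and `[x, a, y, b]` is my own assumption (a sketch, not necessarily the layout the authors have in mind):

```python
import numpy as np

def dq_dr(P, pi, gamma):
    """dQ*/dr as a 4-D tensor G with G[x, a, z, u] = dQ*(x, a) / dr(z, u)."""
    n_actions, n_states, _ = P.shape
    # P_pi*(x, y) = sum_a pi*(x, a) P_a(x, y), then T = I - gamma * P_pi*
    P_pi = np.einsum('xa,axy->xy', pi, P)
    T_inv = np.linalg.inv(np.eye(n_states) - gamma * P_pi)
    G = np.zeros((n_states, n_actions, n_states, n_actions))
    for x in range(n_states):
        for a in range(n_actions):
            G[x, a, x, a] += 1.0                      # delta_{zu}(x, a)
    # gamma * sum_y P_a(x, y) T^{-1}(y, z) pi*(z, u)
    G += gamma * np.einsum('axy,yz,zu->xazu', P, T_inv, pi)
    return G

def dl_dq(L, eta):
    """dL_r/dQ* as a 4-D tensor H with H[x, a, y, b] = dL_r(x, a) / dQ*(y, b).
    The delta_y(x) factor makes H zero whenever y != x."""
    n_states, n_actions = L.shape
    H = np.zeros((n_states, n_actions, n_states, n_actions))
    for x in range(n_states):
        for a in range(n_actions):
            for b in range(n_actions):
                delta_ab = 1.0 if a == b else 0.0
                H[x, a, x, b] = eta * L[x, a] * (delta_ab - L[x, b])
    return H
```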

Finally, the update is trivially computed by $$ \mathbf{r}_{t+1}=\mathbf{r}_{t}+\alpha_{t} \nabla_{r} \Lambda_{r_{t}}(\mathcal{D}) $$
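In code I picture this as a plain gradient-ascent loop like the sketch below, where `grad_log_likelihood` is a hypothetical placeholder for whatever the correct combination of the two derivatives above turns out to be (which is exactly what I'm unsure about):

```python
import numpy as np

def estimate_reward(D, P, gamma, eta, alpha=0.1, n_steps=200):
    """Gradient ascent on the log-likelihood; r is an (n_states, n_actions) matrix."""
    n_actions, n_states, _ = P.shape
    r = np.zeros((n_states, n_actions))
    for _ in range(n_steps):
        # grad_log_likelihood is a hypothetical routine returning the gradient of
        # Lambda_r(D) with respect to r, with the same shape as r
        r = r + alpha * grad_log_likelihood(r, D, P, gamma, eta)
    return r
```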

Here I suppose that the paper's authors are considering $\mathbf{r}$ as a matrix of dimension (number of states) $\times$ (number of actions), i.e. each element of this matrix represents $R(s,a)$.

My question is: what are the dimensionalities of $\frac{d L_{r}}{d Q^{*}}(x, a)$ and $\frac{d Q^{*}}{d r}(x, a)$? (Is the product in the chain rule above a point-wise product, a matrix-matrix product, or a vector-matrix product?)

The most reasonable interpretation, dimensionally speaking, seems to me to be something like: $$ \begin{aligned} \nabla_{r} L_{r}(x, a) &= \frac{d L_{r}}{d Q^{*}}(x, a) \frac{d Q^{*}}{d r}(x, a) \\ &= \left(\sum_{s'\in\mathcal{X}}\sum_{a'\in\mathcal{A}}\frac{d L_{r}}{d Q^{*}_{s'a'}}(x, a)\right) \begin{bmatrix} \frac{d Q^{\star}}{d r_{s_1a_1}}(x, a) & \dots &\frac{d Q^{\star}}{d r_{s_1a_m}}(x, a) \\ \vdots& \ddots & \vdots \\ \frac{d Q^{\star}}{d r_{s_na_1}}(x, a) & \dots & \frac{d Q^{\star}}{d r_{s_na_m}}(x, a) \end{bmatrix} \end{aligned} $$

(where $n = |\mathcal{X}|$ and $m = |\mathcal{A}|$)
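Using the 4-D tensors `H` and `G` from the sketches above, my candidate interpretation would look like this: collapse $\frac{d L_{r}}{d Q^{*}}(x, a)$ to a scalar by summing over all entries, then scale the $n \times m$ matrix $\frac{d Q^{*}}{d r}(x, a)$ by it.

```python
def grad_L_candidate(H, G, x, a):
    """My candidate reading of the chain rule: collapse dL_r/dQ*(x, a) to a scalar
    and scale the (n_states, n_actions) matrix dQ*/dr(x, a) by it."""
    scalar = H[x, a].sum()        # sum over all (s', a') of dL_r(x, a) / dQ*(s', a')
    return scalar * G[x, a]       # shape (n_states, n_actions)
```

But I'm not sure this is what the authors intend, hence the question.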
