
I'm trying to implement part of the paper *Active Learning for Reward Estimation in Inverse Reinforcement Learning* in code; I'm specifically referring to Section 2.3 of the paper.

Let's define $\mathcal{X}$ as the set of states, and $\mathcal{A}$ as the set of actions. We then sample a set of observations $\mathcal{D}$ from an agent which follows an optimal policy.

$$ \mathcal{D}=\left\{\left(x_{1}, a_{1}\right),\left(x_{2}, a_{2}\right), \ldots,\left(x_{n}, a_{n}\right)\right\} $$

Our goal is to find the reward vector $\mathbf{r}$ that maximises the total log-likelihood $\Lambda_{r}(\mathcal{D})$ (every time we compute a new $\mathbf{r}$, the likelihood is updated by recomputing the optimal action-value function $Q_{r}^{*}$ and taking the softmax over actions).

$$ L_{r}(x, a)=\mathbb{P}[(x, a) \mid r]=\frac{e^{\eta Q_{r}^{*}(x, a)}}{\sum_{b \in \mathcal{A}} e^{\eta Q_{r}^{*}(x, b)}} $$

$$ \Lambda_{r}(\mathcal{D})=\sum_{\left(x_{i}, a_{i}\right) \in \mathcal{D}} \log \left(L_{r}\left(x_{i}, a_{i}\right)\right) $$
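For context, this is how I currently compute the two quantities above in Python/NumPy. The value-iteration routine, the transition tensor `P` (shape `(n_actions, n_states, n_states)`) and the parameters `gamma`, `eta` are my own assumptions about the setup, not something taken from the paper:

```python
import numpy as np

def q_from_reward(r, P, gamma, n_iter=500):
    """Solve for Q*_r by value iteration, given a reward matrix r of shape
    (n_states, n_actions) and a transition tensor P of shape
    (n_actions, n_states, n_states)."""
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iter):
        V = Q.max(axis=1)                            # V*(x) = max_a Q*(x, a)
        Q = r + gamma * np.einsum('axy,y->xa', P, V)
    return Q

def likelihood(Q, eta):
    """L_r(x, a): softmax of eta * Q*_r(x, .) over the actions."""
    Z = np.exp(eta * Q)
    return Z / Z.sum(axis=1, keepdims=True)

def log_likelihood(L, D):
    """Lambda_r(D) = sum of log L_r(x_i, a_i) over the observed pairs."""
    return sum(np.log(L[x, a]) for (x, a) in D)
```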

Then, the paper suggests how to compute the derivatives w.r.t. $\mathbf{r}$ by defining the following quantities:

$$ \left[\nabla_{r} \Lambda_{r}(\mathcal{D})\right]_{x a}=\sum_{\left(x_{i}, a_{i}\right) \in \mathcal{D}} \frac{1}{L_{r}\left(x_{i}, a_{i}\right)} \frac{\partial L_{r}\left(x_{i}, a_{i}\right)}{\partial r_{x a}} $$

$$ \nabla_{r} L_{r}(x, a)=\frac{d L_{r}}{d Q^{*}}(x, a) \frac{d Q^{*}}{d r}(x, a) $$

Then, considering $\mathbf{T}=\mathbf{I}-\gamma \mathbf{P}_{\pi^{*}}$ (where $\mathbf{P}_{\pi^{*}}$ is the transition matrix induced by the optimal policy $\pi^{*}$),

$$ \frac{\partial Q^{*}}{\partial r_{z u}}(x, a)=\delta_{z u}(x, a)+\gamma \sum_{y \in \mathcal{X}} \mathrm{P}_{a}(x, y) \mathbf{T}^{-1}(y, z) \pi^{*}(z, u) $$

$$ \frac{d L_{r}}{d Q_{y b}^{*}}(x, a)=\eta L_{r}(x, a)\left(\delta_{y b}(x, a)-L_{r}(y, b) \delta_{y}(x)\right) $$

with $x, y \in \mathcal{X}$ and $a, b \in \mathcal{A}$. In the above expressions, $\delta$ denotes the Kronecker delta, e.g. $\delta_{yb}(x, a) = 1$ if $(x, a) = (y, b)$ and $0$ otherwise, and $\delta_{y}(x) = 1$ if $x = y$.
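This is how I read those two partial derivatives in code. Storing them as 4-D tensors indexed `[x, a, z, u]` and `[x, a, y, b]` is my own assumption (a sketch, not necessarily the layout the authors have in mind):

```python
import numpy as np

def dq_dr(P, pi, gamma):
    """dQ*/dr as a 4-D tensor G with G[x, a, z, u] = dQ*(x, a) / dr(z, u)."""
    n_actions, n_states, _ = P.shape
    # P_pi*(x, y) = sum_a pi*(x, a) P_a(x, y), then T = I - gamma * P_pi*
    P_pi = np.einsum('xa,axy->xy', pi, P)
    T_inv = np.linalg.inv(np.eye(n_states) - gamma * P_pi)
    G = np.zeros((n_states, n_actions, n_states, n_actions))
    for x in range(n_states):
        for a in range(n_actions):
            G[x, a, x, a] += 1.0                      # delta_{zu}(x, a)
    # gamma * sum_y P_a(x, y) T^{-1}(y, z) pi*(z, u)
    G += gamma * np.einsum('axy,yz,zu->xazu', P, T_inv, pi)
    return G

def dl_dq(L, eta):
    """dL_r/dQ* as a 4-D tensor H with H[x, a, y, b] = dL_r(x, a) / dQ*(y, b).
    The delta_y(x) factor makes H zero whenever y != x."""
    n_states, n_actions = L.shape
    H = np.zeros((n_states, n_actions, n_states, n_actions))
    for x in range(n_states):
        for a in range(n_actions):
            for b in range(n_actions):
                delta_ab = 1.0 if a == b else 0.0
                H[x, a, x, b] = eta * L[x, a] * (delta_ab - L[x, b])
    return H
```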

Finally, the update is trivially computed by $$ \mathbf{r}_{t+1}=\mathbf{r}_{t}+\alpha_{t} \nabla_{r} \Lambda_{r_{t}}(\mathcal{D}) $$
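In code I picture this as a plain gradient-ascent loop like the sketch below, where `grad_log_likelihood` is a hypothetical placeholder for whatever the correct combination of the two derivatives above turns out to be (which is exactly what I'm unsure about):

```python
import numpy as np

def estimate_reward(D, P, gamma, eta, alpha=0.1, n_steps=200):
    """Gradient ascent on the log-likelihood; r is an (n_states, n_actions) matrix."""
    n_actions, n_states, _ = P.shape
    r = np.zeros((n_states, n_actions))
    for _ in range(n_steps):
        # grad_log_likelihood is a hypothetical routine returning the gradient of
        # Lambda_r(D) with respect to r, with the same shape as r
        r = r + alpha * grad_log_likelihood(r, D, P, gamma, eta)
    return r
```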

Here I suppose that the paper's authors are considering $\mathbf{r}$ as a matrix of dimension (number of states) $\times$ (number of actions), i.e. each element of this matrix represents $R(s,a)$.

My question is: what are the dimensionalities of $\frac{d L_{r}}{d Q^{*}}(x, a)$ and $\frac{d Q^{*}}{d r}(x, a)$? (Is the product in the chain rule above a point-wise product, a matrix-matrix product, or a vector-matrix product?)

The most reasonable interpretation, dimensionally speaking, seems to me to be something like: $$ \begin{aligned} \nabla_{r} L_{r}(x, a) &= \frac{d L_{r}}{d Q^{*}}(x, a) \frac{d Q^{*}}{d r}(x, a) \\ &= \left(\sum_{s'\in\mathcal{X}}\sum_{a'\in\mathcal{A}}\frac{d L_{r}}{d Q^{*}_{s'a'}}(x, a)\right) \begin{bmatrix} \frac{d Q^{\star}}{d r_{s_1a_1}}(x, a) & \dots &\frac{d Q^{\star}}{d r_{s_1a_m}}(x, a) \\ \vdots& \ddots & \vdots \\ \frac{d Q^{\star}}{d r_{s_na_1}}(x, a) & \dots & \frac{d Q^{\star}}{d r_{s_na_m}}(x, a) \end{bmatrix} \end{aligned} $$

(where $n = |\mathcal{X}|$ and $m = |\mathcal{A}|$)
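Using the 4-D tensors `H` and `G` from the sketches above, my candidate interpretation would look like this: collapse $\frac{d L_{r}}{d Q^{*}}(x, a)$ to a scalar by summing over all entries, then scale the $n \times m$ matrix $\frac{d Q^{*}}{d r}(x, a)$ by it.

```python
def grad_L_candidate(H, G, x, a):
    """My candidate reading of the chain rule: collapse dL_r/dQ*(x, a) to a scalar
    and scale the (n_states, n_actions) matrix dQ*/dr(x, a) by it."""
    scalar = H[x, a].sum()        # sum over all (s', a') of dL_r(x, a) / dQ*(s', a')
    return scalar * G[x, a]       # shape (n_states, n_actions)
```

But I'm not sure this is what the authors intend, hence the question.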
