
I was going through the paper "Algorithms for Inverse Reinforcement Learning" by Andrew Ng and Stuart Russell.

It states the following basics:

  • MDP $M$ is a tuple $(S,A,\{P_{sa}\},\gamma,R)$, where

    • $S$ is a finite set of $N$ states
    • $A=\{a_1,...,a_k\}$ is a set of $k$ actions
    • $\{P_{sa}(\cdot)\}$ are the state transition probabilities upon taking action $a$ in state $s$.
    • $R:S\rightarrow \mathbb{R}$ is the reinforcement function (I guess this is what is also called the reward function). For simplicity in exposition, the paper writes rewards as $R(s)$ rather than $R(s,a)$; the extension is said to be trivial.
  • A policy is defined as any map $\pi : S \rightarrow A$

  • Bellman Equation for Value function $V^\pi(s)=R(s)+\gamma \sum_{s'}P_{s\pi(s)}(s')V^\pi(s')\quad\quad...(1)$

  • Bellman Equation for Q function $Q^\pi(s,a)=R(s)+\gamma \sum_{s'}P_{sa}(s')V^\pi(s')\quad\quad...(2)$

  • Bellman Optimality: The policy $\pi$ is optimal iff, for all $s\in S$, $\pi(s)\in \text{argmax}_{a\in A}Q^\pi(s,a)\quad\quad...(3)$

  • All these can be represented as vectors indexed by state, for which we adopt boldface notation $\pmb{P,R,V}$. (A small numerical sketch of these quantities is given right after this list.)

  • Inverse Reinforcement Learning is: given the MDP without its reward, $(S,A,\{P_{sa}\},\gamma)$, together with a policy $\pi$, find $R$ such that $\pi$ is an optimal policy for $(S,A,\{P_{sa}\},\gamma,R)$

  • By renaming actions if necessary, we will assume without loss of generality that $\pi(s) = a_1$.
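
To make these definitions concrete for myself, here is a minimal numpy sketch (the 2-state, 2-action MDP below is just something I made up) that evaluates equations $(1)$, $(2)$ and the vector form $(5)$ for the policy $\pi(s) \equiv a_1$:

```python
import numpy as np

# Toy 2-state, 2-action MDP (made up purely for illustration).
# P[a][s, s'] = P_{sa}(s'): probability of landing in s' after taking action a in state s.
P = {
    "a1": np.array([[0.9, 0.1],
                    [0.5, 0.5]]),
    "a2": np.array([[0.6, 0.4],
                    [0.2, 0.8]]),
}
R = np.array([1.0, 0.0])   # state-only reward R(s), as in the paper
gamma = 0.9

# Policy pi(s) = a1 for all s, so equation (1) in vector form reads
#   V = R + gamma * P_a1 @ V   =>   V = (I - gamma * P_a1)^{-1} R   ...(5)
V_pi = np.linalg.solve(np.eye(2) - gamma * P["a1"], R)

# Equation (2): Q^pi(s, a) = R(s) + gamma * sum_{s'} P_{sa}(s') V^pi(s')
Q_pi = {a: R + gamma * P[a] @ V_pi for a in P}

print("V^pi =", V_pi)
for a, q in Q_pi.items():
    print(f"Q^pi(., {a}) =", q)
```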

The paper then states the following theorem, its proof, and a related remark:

Theorem: Let a finite state space $S$, a set of actions $A=\{a_1,..., a_k\}$, transition probability matrices $\{\pmb{P}_a\}$, and a discount factor $\gamma \in (0, 1)$ be given. Then the policy $\pi$ given by $\pi(s) \equiv a_1$ is optimal iff, for all $a = a_2, ... , a_k$, the reward $\pmb{R}$ satisfies $$(\pmb{P}_{a_1}-\pmb{P}_a)(\pmb{I}-\gamma\pmb{P}_{a_1})^{-1}\pmb{R}\succcurlyeq 0 \quad\quad ...(4)$$

Proof:
Equation (1) can be rewritten as
$\pmb{V}^\pi=\pmb{R}+\gamma\pmb{P}_{a_1}\pmb{V}^\pi$
$\therefore\pmb{V}^\pi=(\pmb{I}-\gamma\pmb{P}_{a_1})^{-1}\pmb{R}\quad\quad ...(5)$
Putting equation $(2)$ into $(3)$, we see that $\pi$ is optimal iff, for all $s\in S$,
$\pi(s)\in \arg\max_{a\in A}\sum_{s'}P_{sa}(s')V^\pi(s')$
$\iff \sum_{s'}P_{sa_1}(s')V^\pi(s')\geq\sum_{s'}P_{sa}(s')V^\pi(s')\quad\quad\forall s\in S,\ a\in A$
$\iff \pmb{P}_{a_1}\pmb{V}^\pi\succcurlyeq\pmb{P}_{a}\pmb{V}^\pi\quad\quad\forall a\in A\setminus\{a_1\}\quad\quad ...(6)$
$\iff \pmb{P}_{a_1}(\pmb{I}-\gamma\pmb{P}_{a_1})^{-1}\pmb{R}\succcurlyeq\pmb{P}_{a}(\pmb{I}-\gamma\pmb{P}_{a_1})^{-1}\pmb{R}\quad\quad\forall a\in A\setminus\{a_1\}\quad\quad\text{...from (5)}$
Rearranging the last line gives exactly condition $(4)$. Hence proved.

Remark: Using a very similar argument, it is easy to show (essentially by replacing all inequalities in the proof above with strict inequalities) that the condition $(\pmb{P}_{a_1}-\pmb{P}_a)(\pmb{I}-\gamma\pmb{P}_{a_1})^{-1}\pmb{R}\succ 0 $ is necessary and sufficient for $\pi\equiv a_1$ to be the unique optimal policy.
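
To see the theorem and the remark in action, here is a small sketch (reusing the same made-up 2-state MDP as above) that checks condition $(4)$ and its strict version directly:

```python
import numpy as np

def check_conditions(P, R, gamma):
    """Check condition (4) of the theorem and the strict version from the remark,
    for the policy pi(s) = a1 in every state.

    P: dict of |S| x |S| transition matrices keyed by action name ("a1" is the policy's action).
    R: length-|S| state-only reward vector.
    Returns (is_optimal, is_unique_optimal).
    """
    n = len(R)
    inv_term = np.linalg.inv(np.eye(n) - gamma * P["a1"])   # (I - gamma * P_a1)^{-1}
    is_opt, is_unique = True, True
    for a, Pa in P.items():
        if a == "a1":
            continue
        lhs = (P["a1"] - Pa) @ inv_term @ R                  # LHS of condition (4)
        is_opt &= bool(np.all(lhs >= 0))                     # theorem: >= 0 componentwise
        is_unique &= bool(np.all(lhs > 0))                   # remark: > 0 componentwise
    return is_opt, is_unique

# Same made-up 2-state MDP as in the earlier sketch.
P = {"a1": np.array([[0.9, 0.1], [0.5, 0.5]]),
     "a2": np.array([[0.6, 0.4], [0.2, 0.8]])}
R = np.array([1.0, 0.0])
print(check_conditions(P, R, gamma=0.9))
```

For this particular toy MDP the check prints `(True, True)`, i.e. $\pi \equiv a_1$ is optimal and in fact uniquely optimal, which matches comparing $Q^\pi(\cdot, a_1)$ with $Q^\pi(\cdot, a_2)$ from the earlier sketch.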

I don't know whether the above text from the paper is relevant to what I want to prove, but I have stated it as background.

I want to prove following:

  1. If we take $R : S \rightarrow \mathbb{R}$, there need not exist an $R$ such that $\pi^*$ is the unique optimal policy for $(S, A, T, R, \gamma)$. (Here $T$ denotes the transition model, i.e. the $\{P_{sa}\}$ above.)
  2. If we take $R : S \times A \rightarrow \mathbb{R}$, show that there must exist an $R$ such that $\pi^*$ is the unique optimal policy for $(S, A, T, R, \gamma)$.

I guess point 1 follows directly from the above theorem, since it says "$\pi$ is optimal iff ..." and not "$\pi$ is the unique optimal policy iff ...". I also feel it follows from the non-strict operator $\succcurlyeq$ in equation $(6)$. In addition, it seems quite intuitive: if every action from a given state yields the same reward, then different policies choosing different actions will collect the same reward from that state, hence resulting in the same value function.

I don't feel point 2 is correct. I guess this follows directly from the remark above, which requires an additional condition to hold for $\pi$ to be the unique optimal policy, and this condition won't hold just because we define $R : S \times A \rightarrow \mathbb{R}$ instead of $R : S \rightarrow \mathbb{R}$. Additionally, I feel this condition would hold iff we had $=$ in equation $(3)$ instead of $\in$ (as this would replace all $\succcurlyeq$ with $\succ$ in the proof). It also seems to follow from point 1 itself: we can still assign the same reward to all actions from a given state even when the reward is defined as $R : S \times A \rightarrow \mathbb{R}$ instead of $R : S \rightarrow \mathbb{R}$, which is exactly the situation in point 1.

Am I correct with the analysis in the last two paragraphs?

Update

After some more thinking, I felt I was going about it all wrong. I also feel that the text I quoted from the paper is not of much help in proving these two points. So let me state a new intuition for the proof of each point:

  1. For $R: S\rightarrow \mathbb{R}$: if some state $S_1$ and a next state $S_2$ have two actions between them, $S_1 \xrightarrow{a_1} S_2$ and $S_1 \xrightarrow{a_2} S_2$, and an optimal policy $\pi_1^*$ chooses $a_1$, then the policy $\pi_2^*$ that instead chooses $a_2$ is also optimal. Thus neither is "uniquely" optimal, since $a_1$ and $a_2$ yield the same reward, the reward being associated with $S_1$ rather than with $(S_1,a_x)$.

  2. For $R: S \times A\rightarrow\mathbb{R}$: we can assign a large reward, say $+\infty$, to every $(s, \pi^*(s))$ pair specified by the given $\pi^*$ and $-\infty$ to all other state-action pairs. This reward assignment will make $\pi^*$ the unique optimal policy.

Is the above logic correct and sufficient to prove the given points? (The two numerical sketches below are my attempt to sanity-check these intuitions.)
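
To sanity-check intuition 1 numerically, here is a sketch of the kind of counterexample I have in mind (a made-up 2-state MDP in which $a_1$ and $a_2$ behave identically out of $S_1$), showing that no state-only reward can break the tie:

```python
import numpy as np

# Sketch for intuition 1 (toy example). Both actions take S1 (index 0) to S2 (index 1)
# with certainty, so their transition rows out of S1 are identical. With R : S -> R,
# Q^pi(S1, a1) = Q^pi(S1, a2) for every reward vector, so pi*(S1) = a1 can never be
# *uniquely* optimal.
P = {"a1": np.array([[0.0, 1.0], [0.5, 0.5]]),
     "a2": np.array([[0.0, 1.0], [0.3, 0.7]])}
gamma = 0.9
rng = np.random.default_rng(0)

for _ in range(5):                                          # try several random state-only rewards
    R = rng.normal(size=2)
    V = np.linalg.solve(np.eye(2) - gamma * P["a1"], R)     # V^pi for pi = a1, via eq (5)
    Q = {a: R + gamma * P[a] @ V for a in P}                # eq (2)
    assert np.isclose(Q["a1"][0], Q["a2"][0])               # tie at S1, whatever R is
print("Q^pi(S1, a1) == Q^pi(S1, a2) for every state-only reward tried.")
```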
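
And for intuition 2, here is a sketch of a bounded variant of the same construction (rewards $1$ and $0$ instead of $\pm\infty$, so that all values stay finite), on a made-up 3-state MDP with random transitions:

```python
import numpy as np

# Sketch for intuition 2: give reward 1 to the action pi* prescribes in each state
# and reward 0 to every other action (a finite stand-in for +/- infinity).
rng = np.random.default_rng(1)
n_states, actions = 3, ["a1", "a2"]
P = {a: rng.dirichlet(np.ones(n_states), size=n_states) for a in actions}   # random transition rows
pi_star = ["a1", "a2", "a1"]                                # some arbitrary target policy
gamma = 0.9

# R(s, a) = 1 if a == pi*(s), else 0
R = {a: np.array([1.0 if a == pi_star[s] else 0.0 for s in range(n_states)])
     for a in actions}

# Evaluate pi*:  V = (I - gamma * P_{pi*})^{-1} R_{pi*}, the analogue of eq (5) with R(s, pi*(s))
P_pi = np.array([P[pi_star[s]][s] for s in range(n_states)])
R_pi = np.array([R[pi_star[s]][s] for s in range(n_states)])
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Q^{pi*}(s, a) = R(s, a) + gamma * sum_{s'} P_{sa}(s') V(s')
Q = {a: R[a] + gamma * P[a] @ V for a in actions}

# Following the paper's remark (extended to R(s, a)), strict dominance of pi*'s own action
# in every state means pi* is the unique optimal policy.
for s in range(n_states):
    others = [Q[a][s] for a in actions if a != pi_star[s]]
    assert Q[pi_star[s]][s] > max(others)
print("pi*'s action strictly dominates in every state, so pi* is the unique optimal policy.")
```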

  • Cross-posted: https://cs.stackexchange.com/q/136665/755, https://datascience.stackexchange.com/q/90656/8560, https://ai.stackexchange.com/q/26821/1794. Please [do not post the same question on multiple sites](https://meta.stackexchange.com/q/64068). – D.W. Mar 15 '21 at 21:57
  • Could you please simplify this post as much as possible, but still leave the relevant details to answer your specific question? This simplifies our life. For example, do we really need to quote all those details from the paper? Moreover, could you also focus on one question per post? If you have more than one question, ask each of them in a separate post. Finally, put your **specific** question in the title, so that we immediately know what your **specific** question is. "providing facts in IRL" is very vague and not a question. – nbro Mar 16 '21 at 10:34
  • The two questions are very related. If I have to ask them in separate posts, then I have to copy-paste the whole background, which would be a lot of redundancy. Also, any future visitor will gain a lot by reading the two problems together, instead of reading one and not realising that there is a separate post for the other, very closely related problem. Even if they realise it, they will have to figure out the difference between the two by comparing them. Btw, I have added an update which may make the theorem, proof and remark irrelevant, though I'm not sure and hence I haven't removed them. – Rnj Mar 16 '21 at 15:14

0 Answers