I was going through the paper titled "Algorithms for Inverse Reinforcement Learning" by Andrew Ng and Stuart Russell.
It states the following basics:
MDP $M$ is a tuple $(S,A,\{P_{sa}\},\gamma,R)$, where
- $S$ is a finite set of $N$ states
- $A=\{a_1,...,a_k\}$ is a set of $k$ actions
- $\{P_{sa}(\cdot)\}$ are the state transition probabilities upon taking action $a$ in state $s$.
- $R:S\rightarrow \mathbb{R}$ is a reinforcement function (I guess this is what is also called the reward function). For simplicity in exposition, the paper writes rewards as $R(s)$ rather than $R(s,a)$; the extension is trivial.
A policy is defined as any map $\pi : S \rightarrow A$
Bellman Equation for Value function $V^\pi(s)=R(s)+\gamma \sum_{s'}P_{s\pi(s)}(s')V^\pi(s')\quad\quad...(1)$
Bellman Equation for Q function $Q^\pi(s,a)=R(s)+\gamma \sum_{s'}P_{sa}(s')V^\pi(s')\quad\quad...(2)$
Bellman Optimality: The policy $\pi$ is optimal iff, for all $s\in S$, $\pi(s)\in \text{argmax}_{a\in A}Q^\pi(s,a)\quad\quad...(3)$
All these can be represented as vectors indexed by state, for which we adopt boldface notation $\pmb{P,R,V}$.
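To make the notation concrete for myself, here is a minimal NumPy sketch (my own, not from the paper; the toy MDP, seed, and variable names are made up) that evaluates the fixed policy $\pi(s)=a_1$ by iterating equation $(1)$, computes $Q^\pi$ via $(2)$, and tests criterion $(3)$. For a random reward the fixed policy will typically not be optimal; the point is just the mechanics:

```python
import numpy as np

np.random.seed(0)
N, k, gamma = 4, 3, 0.9

# P[a] is the N x N transition matrix for action a: P[a][s, s'] = P_{sa}(s')
P = np.random.dirichlet(np.ones(N), size=(k, N))
R = np.random.randn(N)          # state-only reward R(s)
pi = np.zeros(N, dtype=int)     # candidate policy: pi(s) = a_1 (index 0) everywhere

# Policy evaluation: iterate Bellman equation (1) to its fixed point
V = np.zeros(N)
for _ in range(1000):
    V = R + gamma * np.array([P[pi[s], s] @ V for s in range(N)])

# Equation (2): Q^pi(s, a) = R(s) + gamma * sum_{s'} P_{sa}(s') V^pi(s')
Q = np.array([[R[s] + gamma * P[a, s] @ V for a in range(k)] for s in range(N)])

# Optimality criterion (3): pi is optimal iff pi(s) is an argmax of Q^pi(s, .) for all s
print("pi = a_1 optimal for this random R:",
      all(np.isclose(Q[s, pi[s]], Q[s].max()) for s in range(N)))
```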
Inverse Reinforcement Learning is: given an MDP $M=(S,A,\{P_{sa}\},\gamma)$ without a reward function and a policy $\pi$, find $R$ such that $\pi$ is an optimal policy for $M$.
By renaming actions if necessary, we will assume without loss of generality that $\pi(s) = a_1$.
The paper then states the following theorem, its proof, and a related remark:
Theorem: Let a finite state space $S$, a set of actions $A=\{a_1,..., a_k\}$, transition probability matrices $\{\pmb{P}_a\}$, and a discount factor $\gamma \in (0, 1)$ be given. Then the policy $\pi$ given by $\pi(s) \equiv a_1$ is optimal iff, for all $a = a_2, ... , a_k$, the reward $\pmb{R}$ satisfies $$(\pmb{P}_{a_1}-\pmb{P}_a)(\pmb{I}-\gamma\pmb{P}_{a_1})^{-1}\pmb{R}\succcurlyeq 0 \quad\quad ...(4)$$
Proof:
Equation (1) can be rewritten as
$\pmb{V}^\pi=\pmb{R}+\gamma\pmb{P}_{a_1}\pmb{V}^\pi$
$\therefore\pmb{V}^\pi=(\pmb{I}-\gamma\pmb{P}_{a_1})^{-1}\pmb{R}\quad\quad ...(5)$
Putting equation $(2)$ into $(3)$, we see that $\pi$ is optimal iff
$\pi(s)\in \arg\max_{a\in A}\sum_{s'}P_{sa}(s')V^\pi(s')\quad\quad\forall s\in S$
$\iff \sum_{s'}P_{sa_1}(s')V^\pi(s')\geq\sum_{s'}P_{sa}(s')V^\pi(s')\quad\quad\forall s\in S,\ a\in A$
$\iff \pmb{P}_{a_1}\pmb{V}^\pi\succcurlyeq\pmb{P}_{a}\pmb{V}^\pi\quad\quad\forall a\in A\setminus\{a_1\} \quad\quad ...(6)$
$\iff \pmb{P}_{a_1}(\pmb{I}-\gamma\pmb{P}_{a_1})^{-1}\pmb{R}\succcurlyeq\pmb{P}_{a}(\pmb{I}-\gamma\pmb{P}_{a_1})^{-1}\pmb{R}\quad\quad\forall a\in A\setminus\{a_1\} \quad\quad \text{...from (5)}$
$\iff (\pmb{P}_{a_1}-\pmb{P}_{a})(\pmb{I}-\gamma\pmb{P}_{a_1})^{-1}\pmb{R}\succcurlyeq 0\quad\quad\forall a\in A\setminus\{a_1\},$
which is exactly condition $(4)$.
Hence proved.

Remark: Using a very similar argument, it is easy to show (essentially by replacing all inequalities in the proof above with strict inequalities) that the condition $(\pmb{P}_{a_1}-\pmb{P}_a)(\pmb{I}-\gamma\pmb{P}_{a_1})^{-1}\pmb{R}\succ 0$ is necessary and sufficient for $\pi\equiv a_1$ to be the unique optimal policy.
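As a numerical sanity check of the theorem (again my own sketch, not from the paper), one can take a random MDP and reward, find an optimal policy by value iteration, take its action at each state to play the role of $a_1$ (the "renaming" mentioned above), and confirm that condition $(4)$ holds:

```python
import numpy as np

np.random.seed(1)
N, k, gamma = 5, 3, 0.9
P = np.random.dirichlet(np.ones(N), size=(k, N))   # P[a][s, s'] = P_{sa}(s')
R = np.random.randn(N)

# Value iteration to find an optimal policy for this (random) reward
V = np.zeros(N)
for _ in range(2000):
    V = np.max([R + gamma * P[a] @ V for a in range(k)], axis=0)
pi_star = np.argmax([R + gamma * P[a] @ V for a in range(k)], axis=0)

# Row s of P1 is the transition row of the optimal action at s (the "a_1" role)
P1 = np.array([P[pi_star[s], s] for s in range(N)])
V_pi = np.linalg.solve(np.eye(N) - gamma * P1, R)   # closed form (5)

# Condition (4): (P_{a_1} - P_a)(I - gamma P_{a_1})^{-1} R >= 0 elementwise, for every a
cond = np.array([(P1 - P[a]) @ V_pi for a in range(k)])
print("condition (4) holds:", (cond >= -1e-8).all())
```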
I don't know if the above text from the paper is relevant to what I want to prove, but I have stated it as background.
I want to prove the following:
- If we take $R : S \rightarrow \mathbb{R}$, there need not exist an $R$ such that $\pi^*$ is the unique optimal policy for $(S, A, T, R, \gamma)$.
- If we take $R : S \times A \rightarrow \mathbb{R}$, there must exist an $R$ such that $\pi^*$ is the unique optimal policy for $(S, A, T, R, \gamma)$.
I guess point 1 follows directly from the above theorem, as it says "$\pi$ is optimal iff ..." and not "$\pi$ is the unique optimal policy iff ...". I also feel it follows from the operator $\succcurlyeq$ in equation $(6)$. In addition, I feel it is quite intuitive: if we have the same reward for a given state regardless of the action taken, then different policies choosing different actions from that state will yield the same reward, hence resulting in the same value function.
I don't feel point 2 is correct. I guess this follows directly from the remark above, which requires an additional condition to hold for $\pi$ to be the "unique optimal" policy, and this condition won't hold if we simply define $R : S \times A \rightarrow \mathbb{R}$ instead of $R : S \rightarrow \mathbb{R}$. Additionally, I feel this condition would hold iff we had $=$ in equation $(3)$ instead of $\in$ (as this would replace all $\succcurlyeq$ with $\succ$ in the proof). It also seems to follow directly from point 1 itself: we can still have the same reward for all actions from a given state despite defining the reward as $R : S \times A \rightarrow \mathbb{R}$ instead of $R : S \rightarrow \mathbb{R}$, which is the case in point 1.
Am I correct with the analysis in the last two paragraphs?
Update
After some more thinking, I feel I was going about it all wrong. I also feel the text from the paper that I quoted is not of much help in proving these two points. So let me restate new intuitions for the proofs of the two points:
For $R: S\rightarrow \mathbb{R}$: if some state $S_1$ has two actions that both lead to the same next state $S_2$, i.e. $S_1\xrightarrow{a_1} S_2$ and $S_1\xrightarrow{a_2} S_2$, and if the optimal policy $\pi_1^*$ chooses $a_1$, then the policy $\pi_2^*$ choosing $a_2$ will also be optimal, making neither of them "uniquely" optimal, since both $a_1$ and $a_2$ yield the same reward because the reward is associated with $S_1$ rather than with $(S_1,a_x)$.
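Here is a tiny numerical illustration of this intuition (my own construction, not from the paper or the exercise; the two-state MDP below is made up): both actions at $S_1$ lead to $S_2$ with probability 1, so for every state-only reward $R$ the two policies that differ only in the action chosen at $S_1$ have identical value functions, and neither can be uniquely optimal.

```python
import numpy as np

gamma = 0.9
# States: 0 = S1, 1 = S2. Both actions move S1 -> S2; S2 is absorbing under both.
P = np.array([
    [[0.0, 1.0], [0.0, 1.0]],   # action a1: row s is P_{s a1}(.)
    [[0.0, 1.0], [0.0, 1.0]],   # action a2: identical transition rows
])

def V_of_policy(pi, R):
    """Value of a deterministic policy via the closed form (I - gamma P_pi)^{-1} R."""
    P_pi = np.array([P[pi[s], s] for s in range(2)])
    return np.linalg.solve(np.eye(2) - gamma * P_pi, R)

R = np.random.randn(2)      # any state-only reward
pi_a1 = [0, 0]              # chooses a1 at S1
pi_a2 = [1, 0]              # chooses a2 at S1
print(np.allclose(V_of_policy(pi_a1, R), V_of_policy(pi_a2, R)))   # True for every R
```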
For $R: S\times A\rightarrow\mathbb{R}$, we can assign a large reward, say $+\infty$, to all state-action pairs $(s,\pi^*(s))$ specified by the given $\pi^*$, and $-\infty$ to all other state-action pairs. This reward assignment will make $\pi^*$ the unique optimal policy.
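And a numerical version of this construction for point 2 (my own sketch; I use finite values $\pm 1$ in place of $\pm\infty$, which is an assumption on my part, but with $\gamma\in(0,1)$ it already gives a strict gap $Q^*(s,\pi^*(s))-Q^*(s,a)=2$ at every state, so $\pi^*$ comes out as the unique greedy, hence unique optimal, policy):

```python
import numpy as np

np.random.seed(2)
N, k, gamma = 4, 3, 0.9
P = np.random.dirichlet(np.ones(N), size=(k, N))   # arbitrary dynamics
pi_star = np.random.randint(k, size=N)             # the policy we want to be uniquely optimal

# Point-2 construction with finite stand-ins for +inf / -inf:
# R(s, a) = +1 if a == pi_star(s), else -1.
R = -np.ones((N, k))
R[np.arange(N), pi_star] = +1.0

# Value iteration with the state-action reward
V = np.zeros(N)
for _ in range(2000):
    V = np.max([R[:, a] + gamma * P[a] @ V for a in range(k)], axis=0)
Q = np.array([R[:, a] + gamma * P[a] @ V for a in range(k)]).T   # shape (N, k)

greedy = np.argmax(Q, axis=1)
unique = all((Q[s, greedy[s]] - np.delete(Q[s], greedy[s])).min() > 1e-6 for s in range(N))
print("recovered pi_star:", np.array_equal(greedy, pi_star),
      "| unique argmax at every state:", unique)
```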
Is the above logic correct and enough to prove the given points?