
I realize that my question is a bit fuzzy and I am sorry for that. If needed, I will try to make it more rigorous and precise.

Let $\mathcal{M}$ be a Markov Decision Process, with state space $\mathcal{S}$ and action space $\mathcal{A}$. Let $\tau = (s_0, a_0, s_1, a_1, s_2, a_2, \dots)$ and $\tau' = (s_0', a_0', s_1', a_1', s_2', a_2', \dots)$ be two trajectories produced by an agent during two different episodes.

Question: Is there any standard way in the Reinforcement Learning literature to compare $\tau$ and $\tau'$? Ideally I am interested in finding a "distance" (it does not need to be a distance in the mathematical sense) $d(\tau, \tau')$ that reflects the "distance" between the policies that generated $\tau$ and $\tau'$.

For example, it would be nice if $d(\tau, \tau')$ were a good estimator of the KL divergence between $\pi$ and $\pi'$, where $\pi$ is the policy that generated $\tau$ and $\pi'$ is the policy that generated $\tau'$.
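
To make this concrete, here is a minimal sketch (in Python) of the kind of estimator I have in mind, under the purely hypothetical assumption that the action log-probabilities of both policies were queryable; the function names and interfaces are made up for illustration, since in my setting I only have the raw trajectories:

```python
def kl_estimate_from_trajectory(trajectory, log_prob_pi, log_prob_pi_prime):
    """Monte Carlo estimate of KL(pi || pi'), averaged over the (s_t, a_t)
    pairs of a trajectory generated by pi.

    trajectory: iterable of (state, action) pairs
    log_prob_pi(s, a), log_prob_pi_prime(s, a): hypothetical callables
        returning log pi(a|s) and log pi'(a|s) -- exactly the quantities
        that are NOT available in my setting.
    """
    log_ratios = [log_prob_pi(s, a) - log_prob_pi_prime(s, a)
                  for (s, a) in trajectory]
    return sum(log_ratios) / len(log_ratios)
```

What I am looking for is something that plays the role of this quantity but is computable from $\tau$ and $\tau'$ alone.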

Onil90
  • In what regard do you need to compare two trajectories? Maybe you want to compare the "agents" responsible for those two trajectories? In that case you can calculate the probability ratio of actions between the two agents; this ratio is actually used in "importance sampling" when doing off-policy optimization, to measure the difference between the behaviour and target policies. – Alireza Aug 25 '22 at 15:41
  • So, ideally I would like to compare the probabilities (policies) of the two agents, for example using the KL divergence (which is the expectation, w.r.t. one of the two distributions, of the log of their ratio). However, I do not have access to the policies, only to the trajectories, and I would like to measure how different they are. – Onil90 Aug 25 '22 at 16:58
  • Do you only have two trajectories? Do you have access to the MDP model (i.e. transition probabilities)? – Neil Slater Aug 26 '22 at 05:26
  • If the goal is to compare agents, then differences in states should not be part of the compared vectors to begin with. I'd only compare $\pi(a|s)$, since this is what represents the agents; the transition part $p(s'|s,a)$ is an attribute of the environment. You can possibly learn $\pi(a|s)$ from data (not to mention how hard that might be given the number of actions or a continuous action space) and then compare the learned $\pi$ densities, e.g. via KL (a rough sketch of this idea is given after these comments). Still, as I and also @Onil90 mentioned, the straightforward scenario is having access to the transition probabilities; without that, I don't know whether there is a standard solution. – Alireza Aug 26 '22 at 10:39
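
Regarding the last comment's suggestion of learning $\pi(a|s)$ from data and then comparing the learned densities via KL: the following is only a rough sketch of that idea, under strong simplifying assumptions not stated in the question (finite, hashable state and action spaces, states revisited often enough to estimate action frequencies, and Laplace smoothing to keep the KL finite); all function names are illustrative, not a standard recipe.

```python
from collections import Counter, defaultdict
import math

def fit_empirical_policy(trajectory, n_actions, smoothing=1.0):
    """Estimate pi_hat(a|s) from one trajectory by counting actions per
    visited state (finite state/action spaces assumed; Laplace smoothing
    keeps the KL estimate below finite and well defined)."""
    counts = defaultdict(Counter)
    for s, a in trajectory:
        counts[s][a] += 1

    def pi_hat(s):
        c = counts[s]
        total = sum(c.values()) + smoothing * n_actions
        return [(c[a] + smoothing) / total for a in range(n_actions)]

    return pi_hat

def average_kl(trajectory, pi_hat, pi_hat_prime):
    """Average KL(pi_hat(.|s) || pi_hat'(.|s)) over the states visited in
    `trajectory` -- one possible trajectory-based 'distance' d(tau, tau')."""
    kls = []
    for s, _ in trajectory:
        p, q = pi_hat(s), pi_hat_prime(s)
        kls.append(sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q)))
    return sum(kls) / len(kls)

# Hypothetical usage with two trajectories tau and tau_prime, each a list
# of (state, action) pairs from an environment with 4 discrete actions:
# pi_hat       = fit_empirical_policy(tau, n_actions=4)
# pi_hat_prime = fit_empirical_policy(tau_prime, n_actions=4)
# d = average_kl(tau, pi_hat, pi_hat_prime)
```

With only a single trajectory per policy this estimate can be very noisy, which is in line with the caveats raised in the comments above.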

0 Answers