
I am quite new to the Reinforcement Learning domain and I am curious about something. It seems that the majority of current research assumes Markovian environments, that is, that future states of the process depend only upon the present state, not on the sequence of events that preceded it. I am curious about how we can assign rewards when the Markovian property no longer holds. Does state-of-the-art RL theory and research support this?

nbro
thulungair

1 Answer


Dealing with a non-Markovian process is unusual in Reinforcement Learning. Although some explicit attempts have been made, the most common approach when confronted with a non-Markovian environment is to try to make the agent's representation of it Markovian.

After reducing the agent's model of the dynamics to a Markovian process, rewards are assigned by the environment in exactly the same way as before: the environment simply sends the agent a reward signal in response to each action.
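
To make that concrete, here is a minimal sketch of the standard interaction loop, assuming a gym-style `reset`/`step` interface; `make_state` and `choose_action` are hypothetical placeholders, not part of any particular library. Everything we do to cope with non-Markovian dynamics happens inside `make_state`; the reward signal returned by the environment is never touched.

```python
# Minimal sketch: the reward always comes from env.step(), regardless of how
# the agent builds its state. Only make_state() changes when we try to make
# a non-Markovian problem (approximately) Markovian.
def run_episode(env, make_state, choose_action):
    history = []                    # raw observations seen so far
    obs = env.reset()
    history.append(obs)
    done = False
    total_reward = 0.0
    while not done:
        state = make_state(history)                   # agent's (Markovian) state
        action = choose_action(state)
        obs, reward, done, info = env.step(action)    # reward signal is unchanged
        history.append(obs)
        total_reward += reward
    return total_reward
```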

The Markovian assumption is essentially a formalism of the idea that the future can be predicted from the present. It says that if you know the dynamics of a system and you know the state of the system now, you know everything you need to predict the state of the system later; how the system got to its current state does not matter. Formally, we write this as $P(s_t \mid s_{t-1:0}) = P(s_t \mid s_{t-1})$.

That said, the models we use in AI are usually simplifications of the real world, and when we simplify the world we can introduce non-Markovian dynamics. We could always restore the Markov property by modelling more of the world, but if the model grows too complex, and the state space too large, learning will take too long. The goal is therefore to define a state space that is small enough to be learnable, yet not too bad an approximation of the real dynamics of the world. AI researchers have several tools to do this.

As a working example, imagine that the future position of a robot depends mainly on its current position and current velocity, along with the action the robot takes right now. Using these variables to define a state, we get almost Markovian dynamics, but as the robot moves over time its battery drains and its movements become very slightly more imprecise. If we want to remove this error, we can:

  1. Expand the state variables. If we add "current battery level" to the state, then our process becomes Markovian again. This is the most obvious approach, but it only works up to a point: the state space grows exponentially as you add new variables, and will quickly become too large to learn. In many problems, a complex state is therefore decomposed into simpler sub-states, which may not depend on one another at all, or may depend on each other only weakly. The main limiting factor in learning to navigate an exponential state space is that the number of parameters in our model grows exponentially. If the original space had $O(2^n)$ parameters, splitting it in half yields two separate learning problems of total size $O(2^{n/2} + 2^{n/2}) = O(2^{n/2 + 1})$. Splitting the problem in two thus reduces its size by a factor of $O(2^{n/2})$, which is big. This is the technique exploited by Dynamic Bayesian Networks. In the case of our robot, we could make the current $x$ position depend only on the previous $x$ position, the previous $x$ velocity, and the robot's action, rather than on all previous variables, and likewise for the other variables.
  2. Use higher-order models of the environment. This is really a generalization of (1) above. Instead of the state being the current location and velocity of the robot, we could define the state to be the current location/velocity and the location/velocity of the robot 1 step in the past (or 2, or 3, or $k$ steps), as sketched below. This increases the size of the state space enormously, but if we don't know why there is an error in the robot's movements, it can still allow us to model that error. In this case, if we know the size of the change in the robot's position last time, and we observe that it is smaller this time (for the same action), we can estimate the rate at which the change is changing, without understanding its cause.
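
Here is a rough sketch of technique (2); technique (1) works the same way mechanically, except that we would append extra variables such as battery level rather than past observations. The class and the numbers below are made up purely for illustration.

```python
from collections import deque

import numpy as np


class KthOrderState:
    """Illustrative helper: a k-th order state made by stacking the last k observations."""

    def __init__(self, k, obs_dim):
        # Pre-fill with zeros so the stacked state always has a fixed length of k * obs_dim.
        self.buffer = deque([np.zeros(obs_dim)] * k, maxlen=k)

    def update(self, obs):
        self.buffer.append(np.asarray(obs, dtype=float))

    def state(self):
        # Concatenating the last k observations gives an (approximately) Markovian state.
        return np.concatenate(list(self.buffer))


# For the robot, an observation might be (x, y, vx, vy). With k = 2 the agent can
# see how the step size is shrinking (e.g. from battery drain) without modelling it.
tracker = KthOrderState(k=2, obs_dim=4)
tracker.update([0.0, 0.0, 1.0, 0.0])   # observation at time t-1
tracker.update([0.9, 0.0, 0.9, 0.0])   # slightly smaller step at time t
print(tracker.state())                 # 8-dimensional stacked state
```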

As an example, consider the process of setting the price of a good. The agent's reward is non-Markovian, because sales increase or decline gradually in response to price changes. However, they don't depend on all of the history of prices. Imagine that instead they depend on the last 5 prices (or the last $k$). We can use technique (2) above to expand the agent's model of what a state is. Now the agent learns that when prices have been $p_{t-1:t-5}$ over the last 5 steps, and it sets the price at time $t$ to some value, its reward is $x$. Since the reward depends only on the price now and the prices in the last 5 steps, the agent is now learning a Markovian process, even though the original process and the reward function are non-Markovian. No changes are made to the reward function or the environment, only to the agent's model of the environment.
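
Below is a minimal sketch of that pricing example, with a made-up demand function and made-up numbers, just to show where the reward comes from: the last $k$ prices form the augmented state, and the per-step reward is still just revenue, exactly as the environment would have computed it anyway. Credit assignment then proceeds as in any Markovian problem, because each reward is attached to the (state, action) pair that produced it, and the state already carries the relevant history.

```python
import random
from collections import deque


def demand(recent_prices, new_price):
    # Hypothetical demand model: sales react to how the new price compares with
    # recent prices, which is what makes the reward non-Markovian in the raw price alone.
    recent_avg = sum(recent_prices) / len(recent_prices)
    return max(0.0, 100.0 - 2.0 * new_price - 5.0 * (new_price - recent_avg))


k = 5
recent_prices = deque([10.0] * k, maxlen=k)    # the last k prices *are* the state

for t in range(20):
    state = tuple(recent_prices)               # Markovian state a learning agent would condition on
    action = random.choice([-1.0, 0.0, +1.0])  # decrease, keep, or increase the price
    new_price = max(1.0, recent_prices[-1] + action)

    # The reward function is untouched: revenue = demand * price at this step.
    reward = demand(recent_prices, new_price) * new_price

    recent_prices.append(new_price)            # the next state is again the last k prices
    print(f"t={t} action={action:+.0f} reward={reward:.1f}")
```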

John Doucette
  • "The Markovian assumption is essentially a formalism of the idea that the future can be predicted from the past.", maybe you meant "from the present" (rather than "past")? – nbro Nov 01 '19 at 00:09
  • @nbro Sure, good catch. – John Doucette Nov 01 '19 at 00:36
  • Well, apart from that, I think you are providing some imprecise and even misleading sentences/information. For example, "Apart from Quantum-level events, the world is Markovian in nature. The future tends to resemble the past.". I don't think that the world is Markovian, in an intuitive sense, because, often, the present is insufficient to predict the future. The Markov property is related to the concept of probability and how the future is independent of the past given the present. In the context e.g. of RL, it states that $P(s_t \mid s_{t-1:0}, a_{t-1:0}) = P(s_t \mid s_{t-1}, a_{t-1})$. – nbro Nov 01 '19 at 00:55
  • @nbro Hmm. I am bleeding together the assumption of a Stationary process with the assumption of a Markovian process, and unfortunately the two are very tightly linked in my mind. I've separated out the content related to stationary processes. Thanks for reading this over and pointing this out. – John Doucette Nov 01 '19 at 01:12
  • I think that now the introduction is correct. I roughly agree with your points 1 and 3, but the second point is a bit unclear (at least to me, but maybe it is because I am not super familiar with a problem like that one). Anyway, even though now your answer seems ok, I don't think you're really addressing the original question, in the sense that you're not even talking about rewards. – nbro Nov 01 '19 at 01:23
  • Thank you for the extensive answer. As @nbro pointed out, the idea of reward was not touched. If we simplify a non-Markovian environment to a higher-order model of the environment, then how can we assign rewards? I suppose a follow-up question then becomes: how can credit be assigned to each action if we look **k** steps in the past? – thulungair Nov 01 '19 at 09:04
  • @thulungair: The point is that you generally don't touch the reward system at all. The reward system represents your high-level goals for the problem, and that does not change if you discover that some part of it is non-Markovian. E.g. the goal of winning a game of poker is not different whether you know or don't know the distribution of cards. Instead, you change or augment the state representation. I was writing a similar answer yesterday (still in draft), but John's first paragraph here captures the essence of it exactly. – Neil Slater Nov 01 '19 at 09:25
  • First para from my answer: "There are actually several *different* ways that the Markov property may not hold, and different approaches to deal with each one. None of them change the way reward is assigned." Start of second para: "In general the approach is to adjust the state representation, or create some derivation of it, that *does* have the Markov trait." – Neil Slater Nov 01 '19 at 09:26
  • The rest is too incomplete to post, and John's answer here overlaps with my view too much for me to spend more time on it. – Neil Slater Nov 01 '19 at 09:27
  • Ah I see, that gives me some more intuition on how to think about rewards. However, things are still not clear regarding how credit will be assigned. Maybe if I give a concrete example, then perhaps it shows what my confusion is. – thulungair Nov 01 '19 at 09:46
  • Say an agent is exploring individual prices for a product at each time step. It can increase, decrease or keep the same price. The reward at the current state is simply the revenue = demand * price. Since demand depends on the current price and historical prices, one cannot just simplify by saying that the current reward depends only on the previous price. To simplify, assume the reward depends on the **k** summed past rewards, but now there is no clear credit assignment for which action resulted in the highest reward. How would one assign reward in this case? – thulungair Nov 01 '19 at 10:03
  • @thulungair I've added some new text to clarify this, including an explanation of how your example is addressed, at the end. – John Doucette Nov 01 '19 at 15:10
  • Thank you @johndoucette, that makes sense. I guess it helps validate an intuition I had. I will mark your response as the answer. – thulungair Nov 04 '19 at 09:18