
In general, there are two types of transition functions in reinforcement learning. Mathematically, they are as follows:

#1: Stochastic state transition function:

$$T : S \times A \times S \rightarrow [0, 1]$$

#2: Deterministic state transition function:

$$T : S \times A \rightarrow S$$

Is it possible to make the transition function change as the game progresses? Or is it impossible to assume such a transition function, since it would not qualify as a function?

In other words, I may want to introduce something like the following:

#3: Dynamic state transition function:

$$T : S \times A \times X \rightarrow S$$

where $X$ is some continuous set.
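
To make the three signatures concrete, here is a minimal sketch in Python; the `State`, `Action`, and `X` types are toy placeholders I am introducing for illustration, not part of any particular library:

```python
from typing import Callable

State = int    # toy discrete state space (placeholder)
Action = int   # toy discrete action space (placeholder)
X = float      # placeholder standing in for the continuous set X

# 1: stochastic -- gives the probability P(s' | s, a) of landing in s'
T_stochastic: Callable[[State, Action, State], float]

# 2: deterministic -- returns the next state directly
T_deterministic: Callable[[State, Action], State]

# 3: the proposed "dynamic" form -- the next state also depends on some x in X
T_dynamic: Callable[[State, Action, X], State]

def example_dynamic(s: State, a: Action, x: X) -> State:
    """Toy instance of signature 3: x perturbs an otherwise deterministic step."""
    return s + a + (1 if x > 0.5 else 0)
```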

nbro
hanugm
    Maybe you're looking for a non-stationary MDP? In that case, $T$ can change over time, i.e. it's non-stationary. If this answers your question, I will write it below. But maybe you should explain why you're asking this question. Why would $X$ be continuous? – nbro May 26 '22 at 09:33
  • Yeah, @nbro, I think that is an exact answer. Please write it if possible. I formulated #3 just to make $T$ a function, and the continuity assumption is to show some underlying randomness. I mean, the next state may change for a given combination of present state and action if episodes are repeated. – hanugm May 26 '22 at 10:22
  • Are you able to specify what $X$ can depend on? Non-stationary MDPs usually require that the dynamics of $X$ depend solely on time, and, if you expect to have any chance of learning them, also that the progression of $T$ due to changing $X$ is relatively slow (or if radical then infrequent, thus slow when averaged over many episodes). Your generic description of $T$ could also be used to describe POMDPs - when $X$ depends on some unobservable data - and possibly other scenarios too, so it is worth clarifying. – Neil Slater May 26 '22 at 15:42
  • @NeilSlater We can assume $X$ is a continuous random vector obeying a probability distribution. – hanugm May 26 '22 at 16:21
  • @hanugm: That's not dynamic, that's stochastic and no different to your first equation. The only way it could be dynamic is if $X$ was correlated to some data relevant to the environment (e.g. when the experiment was run). – Neil Slater May 26 '22 at 16:30
  • @NeilSlater How is it the same as the first one, when $X$ and $S$ are entirely different? – hanugm May 26 '22 at 16:45
  • @hanugm: Actually I assumed in my last comment that $X$ is sampled once per time step. If it is sampled less frequently there would be different characterisations. But how often it is sampled, and how radical the changes are to $T$ - plus whether $X$ correlates with anything meaningful about the problem are critical pieces of information for any practical approach. In *general*, if you are not willing to define anything about $X$ or $T$, then your problem is not usefully an MDP and very little RL theory could be applied to it. – Neil Slater May 26 '22 at 16:45
  • @NeilSlater So, if $X$ is independent or has an infinitesimal correlation to the environment states then can't I expect any RL algorithm to work on my environment? – hanugm May 26 '22 at 16:48
  • @hanugm: No. It depends on how/when you are sampling $X$. If it is sampled every time step, then equation 1 and equation 3 describe the same thing - every combination of S, A has some probability of transitioning to the next state S'. – Neil Slater May 26 '22 at 16:50
  • If you are doing something else with $X$, well it depends on precisely *what* you are doing. – Neil Slater May 26 '22 at 16:51
  • @NeilSlater Suppose the environment states are for an automatic driving car and $X$ is drawn from a dataset of cat images. – hanugm May 26 '22 at 16:51
  • OK, let's suppose that. When do you sample from $X$ (to get a specific cat picture $x$), and how does $T$ process the combination $(s, a, x)$ to produce the next car state? – Neil Slater May 26 '22 at 16:53
  • @NeilSlater I edited the comment to a continuous example, since chess is discrete. I intend to map some cat image, which does not depend on the state of the actual environment, and define a function. – hanugm May 26 '22 at 16:56
  • I matched your edit :-) I don't think continuous vs discrete makes any difference here. – Neil Slater May 26 '22 at 16:56
  • @NeilSlater ha. Theoretically, can't I define a custom mapping? For example, suppose I pass $(s, a, x)$ to a neural network whose output dimension is the same as the state space, and pick the next $s'$ which has the minimum Euclidean distance to the network's output. – hanugm May 26 '22 at 16:58
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/136605/discussion-between-neil-slater-and-hanugm). – Neil Slater May 26 '22 at 17:43

1 Answer


Is it possible to make the transition function change as the game progress?

Yes, the normal way to do this would be to make the current time or time step $t$ part of the state $s$, so that equation 1 or 2 applies as normal. The same applies if "as the game progresses" refers to some other measure of progression, such as how much the agent has scored so far: make that measure part of the state. The exception is if, for some reason, you do not want to allow the agent to track this extra progression effect that impacts the behaviour - in which case you would typically consider the environment to be a partially observable MDP, or POMDP.
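
As a rough sketch of that first suggestion (my own illustration, assuming a hypothetical environment object with `reset()` and `step()` methods), you can simply append the time step to the state, so the augmented process remains an ordinary stationary MDP:

```python
# Assumes a hypothetical env with reset() -> state and
# step(action) -> (state, reward, done).
class TimeAugmentedEnv:
    def __init__(self, env):
        self.env = env
        self.t = 0

    def reset(self):
        self.t = 0
        s = self.env.reset()
        return (s, self.t)            # the state now includes the time step

    def step(self, action):
        s, r, done = self.env.step(action)
        self.t += 1
        return (s, self.t), r, done   # the agent can condition on progress
```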

Your equation 3 doesn't define a dynamic environment, but a parametric one. You have added the parameter $x$ to the normal MDP transition function. What that means depends critically on how the parametrisation of the transition function $T$ works in detail, and on what you then do with the parameter $x$. Different modes of changing $x$ (plus different sensitivities to $x$ in the transition function $T$) correspond to different classes of environment in the RL literature.

Here are some examples:

  • If $x$ is sampled from set $X$ once, and once only, your equation 3 reduces to equation 2. Sampling $x$ corresponds to selecting an environment to learn (the sketch after this list illustrates this and the per-step sampling case).

  • If $x$ is known/accessible to the agent, then regardless of how it is chosen, your true state is the pair $(s, x)$ and equation 3 reduces to either equation 1 or equation 2, depending on how the $x$ part of the state transitions.

  • If $x$ is sampled independently from any other variable, on each and every time step, then equation 3 reduces to equation 1. Sampling $x$ in this way corresponds to choosing a random transition (from whatever is allowed by the nature of $T$) based on state and action.

  • If $x$ is sampled infrequently (less than once per episode) or evolves in a slow fashion (when combined with how $T$ works, changes to the effective transition rules happen less than once per episode), then you may have a non-stationary environment. This is typically addressed by continuously learning agents - a good example of an RL method written specifically to address this is Dyna-Q+.

  • If $x$ is sampled (or even fixed) at the start of the episode and evolves either independently or depending on state $s$ and action $a$ (plus any random changes), and the agent is not allowed to know $x$, then you have a POMDP. How solvable that is depends on how easy it is for the agent to allow for and learn what it can about $x$ - you may specifically need a POMDP solver. Knowing $X$ and the transition rules for $x$ may make a POMDP solver more feasible in practice.

    • Instead of using $s$ and $x$, the literature on POMDPs would typically use $s$ for the state (equivalent to your $s, x$ concatenated) and $o$ for the agent's observation (equivalent to your $s$).
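
The following rough sketch (my own illustration, with a toy transition function and a stand-in random policy, not code from any library) shows how the sampling schedule for $x$ determines which of the cases above you are in:

```python
import random

def T_dynamic(s, a, x):
    # toy parametric transition: x perturbs an otherwise deterministic step
    return s + a + (1 if x > 0.5 else 0)

def sample_x():
    return random.random()                    # x drawn from a continuous set X

def run_episode(resample_every_step, horizon=10):
    s = 0
    x = sample_x()                            # sampled once per episode; if the
                                              # agent cannot see x, this is the
                                              # POMDP / non-stationary flavour
    for _ in range(horizon):
        if resample_every_step:
            x = sample_x()                    # fresh x every step: equation 3
                                              # collapses to equation 1
        a = random.choice([0, 1])             # stand-in random policy
        s = T_dynamic(s, a, x)
    return s
```

The only thing that differs between the two modes is when `sample_x` is called, which is exactly the distinction the bullets above are drawing.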

So, in summary, there isn't really a class of "dynamic" environments in RL studied under that name. Your idea might fit into one or more of several other well-studied concepts in RL instead.

Neil Slater