To model the problem with RL,
- it is possible to discretize the time horizon into very short time intervals (for example, 5 minutes per stage), such that in each interval at most one customer enters our system.
- On the other hand, stages could be defined as the times at which a customer enters our system.
1) Is the second approach an SMDP, and if I want to solve it with RL, should I use Hierarchical RL?
The second approach as you described it does not sound like an SMDP to me... it sounds more like a regular MDP in which the discrete time steps no longer have any meaningful connection to wall-clock time in the problem being modelled. This is the approach I would recommend if long periods of "inactivity" in between customers are irrelevant and/or if you do not expect your actions to have any influence on the duration of those periods of inactivity.
If you do expect your actions to have a meaningful influence on the duration of inactivity, I would not recommend this approach. For example, your actions may make a customer angry and reduce their likelihood of returning quickly. Or your actions may indirectly affect the likelihood of other customers visiting, due to interactions within your population of customers: they may or may not recommend your bank to their friends based on your actions.
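To make the second approach concrete, here is a minimal tabular Q-learning sketch in which one MDP step corresponds to one customer arrival. The environment interface (`reset()`, `step(action)`, `num_actions`) is hypothetical and stands in for whatever simulator or data pipeline you actually have; states are assumed to be hashable (e.g. tuples of customer features).

```python
import random
from collections import defaultdict

def q_learning_per_customer(env, num_episodes=1000, alpha=0.1,
                            gamma=0.99, epsilon=0.1):
    """Tabular Q-learning where one MDP step = one customer arrival.

    `env` is a hypothetical environment whose step(action) internally
    fast-forwards real time to the next customer arrival, so idle periods
    never show up as separate time steps.
    """
    Q = defaultdict(lambda: [0.0] * env.num_actions)

    for _ in range(num_episodes):
        state = env.reset()              # state of the first arriving customer
        done = False
        while not done:
            # epsilon-greedy action selection for the current customer
            if random.random() < epsilon:
                action = random.randrange(env.num_actions)
            else:
                action = max(range(env.num_actions),
                             key=lambda a: Q[state][a])

            # step() only returns once the *next* customer has arrived
            next_state, reward, done = env.step(action)

            # standard one-step Q-learning update
            target = reward + (0.0 if done else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state

    return Q
```

Note that a fixed $\gamma$ here discounts per customer rather than per unit of real time, which is exactly the sense in which the MDP's time steps lose their connection to wall-clock time.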
2) In the first approach, if a customer enters during a time interval, it is easy to update the $Q$-values. However, what should we do if we are in state $S$ and take action $A$, but no customer enters our system, so we receive no reward for the pair $(S, A)$? There would be no difference whether we took action $A_1$, $A_2$, and so on. This can happen for several consecutive time intervals. I think it becomes even more challenging when we consider eligibility traces.
I would actually argue that your first approach can be treated as an SMDP, and that it may be a good idea to do so (personally I prefer the Options framework, which is a bit more general than SMDPs, a bit more flexible, and a bit closer to "standard" RL terminology; plenty of free versions of the original Options paper by Sutton, Precup, and Singh are available through Google too).
Whenever you encounter a customer, you could select an option (or "macro-action") in the SMDP, which executes the one "real" action for your current customer once, and is subsequently forced to automatically select no-op actions (which do nothing) until the next customer arrives; that is the point in time where your previous option ends, and a new option can be selected.
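A minimal sketch of such an option, assuming a state object with a `customer_present` flag and a designated no-op action index (all hypothetical names, not taken from any particular library):

```python
from dataclasses import dataclass

NO_OP = 0  # hypothetical index of the "do nothing" primitive action

@dataclass(frozen=True)
class ServeCustomerOption:
    """Macro-action: apply one real action to the customer that just
    arrived, then take forced no-ops until the next customer arrives."""
    real_action: int

    def can_start(self, state) -> bool:
        # The option is only available when a customer is present.
        return state.customer_present

    def act(self, state, steps_elapsed: int) -> int:
        # First primitive step: the chosen real action; afterwards, no-ops.
        return self.real_action if steps_elapsed == 0 else NO_OP

    def terminates(self, state, steps_elapsed: int) -> bool:
        # The option ends exactly when the next customer shows up.
        return steps_elapsed > 0 and state.customer_present
```

Making the dataclass frozen keeps it hashable, so an option can later be used directly as part of a $Q$-table key.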
An RL algorithm that operates at the SMDP / hierarchical level, i.e. one that only learns to optimize the selection of options / macro-actions and skips over all the forced no-op actions entirely, would suffice. This is basically the approach described in *Concurrent Reinforcement Learning from Customer Interactions* (Silver et al., 2013).
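Concretely, such an SMDP-level agent executes an option to completion, accumulates the $\gamma$-discounted rewards of all primitive 5-minute steps the option spanned, and then performs a single update. Below is a hedged sketch of that update, reusing the hypothetical interfaces from the earlier snippets (note that `env.step` here advances one primitive 5-minute interval, unlike the event-driven sketch above):

```python
def run_option_and_update(Q, env, state, option, options,
                          alpha=0.1, gamma=0.99):
    """Run one option to completion and apply the SMDP Q-learning update

        Q(s, o) <- Q(s, o) + alpha * (R + gamma^tau * max_o' Q(s', o') - Q(s, o))

    where R is the gamma-discounted sum of all primitive rewards collected
    while the option ran (the one real action plus the forced no-ops), and
    tau is the number of primitive 5-minute steps it lasted.

    Q is a plain dict mapping (state, option) pairs to values; states and
    options are assumed hashable.
    """
    R, discount, tau = 0.0, 1.0, 0
    s, done = state, False

    while True:
        a = option.act(s, tau)            # real action first, then no-ops
        s, r, done = env.step(a)          # one primitive 5-minute step
        R += discount * r                 # accumulate discounted reward
        discount *= gamma
        tau += 1
        if done or option.terminates(s, tau):
            break

    # Bootstrap from the best option in the state where the next customer
    # arrived; gamma^tau equals `discount` at this point.
    best_next = 0.0 if done else max(Q.get((s, o), 0.0) for o in options)
    old = Q.get((state, option), 0.0)
    Q[(state, option)] = old + alpha * (R + discount * best_next - old)
    return s, done
```

Because this update accumulates the discounted rewards of the empty intervals and bootstraps with $\gamma^\tau$, the consecutive customer-free intervals from question 2 never require $Q$-value updates of their own, and no eligibility traces over the forced no-ops are needed.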