
I wrote a Python program for a simple inventory control problem where decision epochs are equally spaced (every morning) and there is no lead time for orders (the time between placing an order and receiving it). I use the Bellman equations and solve them by policy iteration (dynamic programming).
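For illustration, here is a minimal sketch of the kind of setup I mean (the capacity, costs, demand distribution and discount factor below are only placeholder assumptions, not my actual values):

```python
# Minimal sketch of daily inventory control with no lead time, solved by
# policy iteration. Capacity, costs, demand distribution and discount factor
# are placeholder assumptions.
import numpy as np

MAX_STOCK = 10                              # assumed warehouse capacity
GAMMA = 0.95                                # assumed daily discount factor
DEMAND_PROBS = {0: 0.3, 1: 0.4, 2: 0.3}     # assumed daily demand distribution

def reward(stock, order, demand):
    # assumed costs: revenue per sale, per-unit order cost, holding and shortage costs
    available = min(stock + order, MAX_STOCK)
    sold = min(available, demand)
    return 5.0 * sold - 1.0 * order - 0.5 * (available - sold) - 2.0 * max(demand - available, 0)

def next_state(stock, order, demand):
    # orders arrive the same day (no lead time), so they can serve today's demand
    return max(min(stock + order, MAX_STOCK) - demand, 0)

def policy_iteration():
    policy = np.zeros(MAX_STOCK + 1, dtype=int)
    V = np.zeros(MAX_STOCK + 1)
    while True:
        # policy evaluation: sweep until the value function stabilises
        while True:
            delta = 0.0
            for s in range(MAX_STOCK + 1):
                v = sum(p * (reward(s, policy[s], d) + GAMMA * V[next_state(s, policy[s], d)])
                        for d, p in DEMAND_PROBS.items())
                delta, V[s] = max(delta, abs(v - V[s])), v
            if delta < 1e-6:
                break
        # policy improvement: act greedily with respect to V
        new_policy = np.array([
            max(range(MAX_STOCK + 1 - s),
                key=lambda a: sum(p * (reward(s, a, d) + GAMMA * V[next_state(s, a, d)])
                                  for d, p in DEMAND_PROBS.items()))
            for s in range(MAX_STOCK + 1)])
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```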

Now I want to consider a lead time of 1 to 3 days with equal probabilities. As I understand it, this turns the problem into a semi-Markov decision process, since the sojourn time in each state must be taken into account. I am confused about the Bellman equations in this scenario: we don't know exactly when the order will be received, so is it necessary to discount the reward for day two or three?

mnbobo

1 Answer


The core problem here is state representation, not estimating the return under a delayed response to actions: the original state representation is no longer complete for the new problem. If you fix that, then you can solve your problem as a normal MDP and base calculations on single time steps. This allows you to continue using dynamic programming, provided the state space remains small enough.

What needs to change is the state representation and the state transitions. Instead of orders resulting in an immediate change of stock levels, they become pending changes. For each item, the state will contain the amount of current stock plus the amount of pending stock in each lead-time category. State transitions will then update the lead-time category of each amount of pending stock as well as the amount of current stock.

Your lead-time categories will depend on whether the agent knows the lead time immediately after placing an order:

  • If lead times are known, track the remaining time until items arrive: 1, 2 or 3 days. These categories will be assigned by the environment following the order, and the lead time then counts down deterministically each day: a 1-day lead time transitions to in stock, a 2-day lead time transitions to 1 day, etc.

  • If lead times are not known, but their probabilities are, track the time since the order was made. This will be 0, 1 or 2 days. Although you don't know when an order will arrive, you know the probabilities for the state transition - e.g. items at 0 days have a 1 in 3 chance of transitioning to "in stock" and a 2 in 3 chance of transitioning to 1 day (a code sketch of this case follows below).
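Here is a rough sketch of that second case in Python. The bucket layout and helper names are just one possible encoding, not a prescribed one: pending stock is tracked per age bucket, and each bucket arrives with the conditional probability implied by a uniform 1-3 day lead time.

```python
# A possible encoding of pending orders when only the lead-time distribution
# (uniform over 1-3 days) is known. pending[k] is the quantity ordered k days
# ago that has not yet arrived.
import itertools

# Conditional arrival probabilities implied by a uniform 1-3 day lead time:
# an order aged 0 days arrives tomorrow with probability 1/3, an order aged
# 1 day with probability 1/2 (= (1/3) / (2/3)), and an order aged 2 days
# arrives with certainty.
ARRIVAL_PROB = {0: 1/3, 1: 1/2, 2: 1.0}

def pending_transitions(pending):
    """Yield (next_pending, arrived_quantity, probability) for one day,
    treating each non-empty age bucket as one Bernoulli arrival event."""
    outcomes = []
    for arrives in itertools.product([False, True], repeat=3):
        prob = 1.0
        arrived = 0
        next_pending = [0, 0, 0]
        for age, does_arrive in zip((0, 1, 2), arrives):
            qty = pending[age]
            if qty == 0:
                if does_arrive:       # empty bucket: only the "no arrival" branch counts
                    prob = 0.0
                continue
            if does_arrive:
                prob *= ARRIVAL_PROB[age]
                arrived += qty
            else:
                prob *= 1.0 - ARRIVAL_PROB[age]   # zero for age 2, which always arrives
                if age < 2:
                    next_pending[age + 1] = qty
        if prob > 0.0:
            outcomes.append((tuple(next_pending), arrived, prob))
    return outcomes

# Example: 5 units ordered yesterday (age 1), nothing else pending
# -> arrives today with probability 0.5, otherwise moves to age 2.
print(pending_transitions((0, 5, 0)))
```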

This makes the state space larger, but it is less complex than moving to a semi-MDP representation. For instance, doing it this way means that you can still work with single time step transitions and apply dynamic programming in a standard way.

In general, if the environment has a delayed response to actions, then the best way to maintain the Markov property is to add the relevant history of actions taken to the state. The added state variables can either be a direct list of the relevant actions, or something that tracks the logical consequence of those actions.
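To make the "standard dynamic programming" point concrete, here is a sketch of a single-time-step Bellman backup over the augmented state, reusing the pending_transitions helper from the sketch above. The demand distribution, costs and discount factor are again only illustrative assumptions.

```python
# One-step Bellman backup over the augmented state
# (on_hand, pending_age0, pending_age1, pending_age2).
# Requires pending_transitions() from the previous sketch.
GAMMA = 0.95
DEMAND_PROBS = {0: 0.3, 1: 0.4, 2: 0.3}     # assumed daily demand distribution

def step_reward(available, order, demand):
    # assumed: revenue per sale, per-unit order cost, holding and shortage costs
    sold = min(available, demand)
    return 5.0 * sold - 1.0 * order - 0.5 * (available - sold) - 2.0 * max(demand - available, 0)

def q_value(state, order, V):
    """Expected one-step return of placing `order` in `state`, where V is a
    value table (dict) keyed by augmented-state tuples."""
    on_hand, pending = state[0], state[1:]
    q = 0.0
    for next_pending, arrived, p_arrive in pending_transitions(pending):
        for demand, p_demand in DEMAND_PROBS.items():
            available = on_hand + arrived              # today's deliveries can be sold today
            r = step_reward(available, order, demand)
            next_on_hand = max(available - demand, 0)
            # the new order enters the age-0 bucket of the next state
            next_state = (next_on_hand, order, next_pending[1], next_pending[2])
            q += p_arrive * p_demand * (r + GAMMA * V[next_state])
    return q
```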

Neil Slater
  • Dear Neil Slater, thank you for the answer. As I understand it, I must calculate the return for 0 to 2 days of lead time with equal probability (1 in 3 each) and Poisson-distributed demand with parameter lambda, as follows: (probability 0.33 that the items arrive on day 0 * the return value) + (probability 0.33 that the items arrive on day 1 * the return value calculated over the combinations of demand sizes on days 0 and 1, e.g. the probability of 0 demand on day 0 and 1 demand on day 1, and so on) + ... Is that correct? Thank you in advance. – mnbobo Sep 02 '20 at 11:43
  • @mnbobo: No that should not be necessary. You can stick with single time step evaluations. It is the state representation that needs to change, to properly represent the in progress orders. You will still need to model the transition probabilities on a time step by time step basis. – Neil Slater Sep 02 '20 at 12:12
  • @mnbobo I have tried to clarify that in the answer. There is no need to calculate return across multiple time steps. That is a separate approach that you could consider for e.g. TD($\lambda$). But you cannot realistically do that directly based on the current action alone, because the agent is free to take other actions during those other days and those actions will further affect expected return. The core problem here is state representation, not estimating return due to delayed response. – Neil Slater Sep 02 '20 at 12:20
  • Dear @Neil Slater, thanks for your reply. If we have a holding cost and a shortage cost, then we have some mini rewards between time epochs due to the demand arrivals. For example, if the lead time is 2 days, we incur a holding or shortage cost with some probabilities on days 0 and 1. In this case, is it still unnecessary to consider these mini rewards? – mnbobo Sep 02 '20 at 14:37
  • @mnbobo The rewards will be as you define them for the problem. If there's a shortage based on current inventory, then there's still a shortage even if an order is on the way. The default for extending your problem would be to keep the same rewards and costs as before. However, you are in control, and if you want a more sophisticated model of customer behaviour you can add one. Typically in a toy stock model you just treat requests as random and short term, so you apply rewards/costs purely on the ability to immediately service them. In that case there would be no need to model extra "mini" rewards. – Neil Slater Sep 02 '20 at 14:42
  • @mnbobo: Your question did not mention any change to goals (presumably of maintaining sufficient stock to furnish requests but not too much due to costs of holding it). Nor did it mention that you wanted to make handling item requests more sophisticated. So I would suggest not changing rewards at all. Just change state (and state progression rules) to model the order delay process. – Neil Slater Sep 02 '20 at 14:44
  • I really appreciate your help and your quick response. I will do as you said. But just out of curiosity: if the problem has holding and shortage costs, is the expected return then calculated by the following expression, considering the reward and the mini rewards: (probability 0.33 that the items arrive on day 0 * the return value) + (probability 0.33 that the items arrive on day 1 * the return value calculated over the combinations of demand sizes on days 0 and 1, e.g. the probability of 0 demand on day 0 and 1 demand on day 1, and so on) + ... Is that true? – mnbobo Sep 02 '20 at 16:11
  • @mnbobo: That might be reasonable (I would need to understand your problem better to say for sure), but expected return could be more complicated than that. The point of using dynamic programming is that you don't need to figure out an analytical expected return. – Neil Slater Sep 03 '20 at 06:44