I am solving a model-based RL problem. My MDP has four states S1-S4 and four actions A1-A4, where S4 is the terminal state and S1 is the start state. In S1, every available action can be taken with equal probability. The goal is to get from S1 to S4 with the maximum possible reward. I have two questions in this regard -
Will this still be a valid MDP if I add the following rule to my model: if I perform action a1 in S1 and the next state is S2, then the set of valid actions in S2 is only {a2, a3, a4}; if I then apply any of a2, a3, or a4 in S2 and it takes me to S3, I am left with only two valid actions in S3 (all actions except the ones already taken). Can I do this? In my problem, an action that has been taken once never needs to be taken again later. A small sketch of the rule I mean is below.
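To make the rule concrete, here is a tiny Python sketch of the shrinking action set I have in mind. The function name valid_actions and the way I pass in the list of actions already taken are just my own illustration, not anything from a library:

```python
ALL_ACTIONS = {"a1", "a2", "a3", "a4"}

def valid_actions(state, actions_taken):
    """Return the actions still available: all actions minus the ones
    already used on the path so far (my proposed rule)."""
    if state == "S4":              # terminal state, nothing left to do
        return set()
    return ALL_ACTIONS - set(actions_taken)

# Example: after taking a1 in S1 and a3 in S2, only a2 and a4 remain in S3.
print(valid_actions("S3", ["a1", "a3"]))   # {'a2', 'a4'}
```

My concern is whether making the action set depend on the history of actions like this breaks the Markov property, or whether I need to fold the set of used actions into the state itself.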
I am confused about how to find the optimal value function and policy for my MDP; neither is known to me. The objective is to find the optimal policy that gets me from S1 to S4 with maximum reward. Any action taken in a particular state can lead to any other state, i.e. in my case there is a uniform state-transition probability of 25% for all states except S4, since it is terminal. How can I approach this problem? After a lot of googling I vaguely understand that I should start with a random policy (equal probability of taking any valid action), compute the value function for each state, iteratively update V until it converges, and then derive the optimal policy from these value functions. Those solutions mostly use the Bellman equations. Can someone please elaborate on how I can do this, or point me to another method? My rough attempt at coding this up is below.
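To show where I currently stand, here is a minimal value-iteration sketch based on my understanding of the Bellman optimality update. The discount factor and the reward function R are placeholders I made up (here I simply reward reaching S4); the uniform 25% transition probability is the one from my problem. Please correct me if this is not the right way to apply the Bellman equations:

```python
import numpy as np

# States 0..3 correspond to S1..S4; S4 (index 3) is terminal.
# Actions 0..3 correspond to A1..A4.
N_STATES, N_ACTIONS = 4, 4
TERMINAL = 3
GAMMA = 0.9          # placeholder discount factor (my assumption)
THETA = 1e-6         # convergence threshold

# Transition model from my problem: from any non-terminal state,
# every action leads to any of the 4 states with probability 0.25.
P = np.full((N_STATES, N_ACTIONS, N_STATES), 0.25)
P[TERMINAL] = 0.0    # no transitions out of the terminal state

# Placeholder reward R[s, a, s']: I would plug in my actual rewards here;
# for illustration, only reaching S4 gives a reward of 1.
R = np.zeros((N_STATES, N_ACTIONS, N_STATES))
R[:, :, TERMINAL] = 1.0

# Value iteration: repeatedly apply the Bellman optimality update
#   V(s) <- max_a sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V(s')]
V = np.zeros(N_STATES)
while True:
    Q = np.einsum('sat,sat->sa', P, R + GAMMA * V)   # Q[s, a]
    V_new = Q.max(axis=1)
    V_new[TERMINAL] = 0.0
    if np.max(np.abs(V_new - V)) < THETA:
        V = V_new
        break
    V = V_new

# Greedy policy extracted from the converged values
policy = Q.argmax(axis=1)
print("V:", V)
print("policy (action index per state):", policy)
```

Is this the right overall idea, or should I instead be doing full policy iteration (policy evaluation of a random starting policy followed by policy improvement)?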
Thank you in advance