I am solving a model-based RL problem. My MDP has four states S1-S4 and four actions A1-A4, where S4 is the terminal state and S1 is the start state. In S1, every available action can be taken with equal probability. The goal is to get from S1 to S4 with the maximum possible reward. I have two questions in this regard -
Will this still be a valid MDP if I add the following rule to my model: if I perform action a1 in S1 and the next state is S2, then the set of valid actions in S2 is only {a2, a3, a4}; if I then apply any of a2, a3, or a4 in S2 and it takes me to S3, I am left with only two valid actions in S3 (all actions except the ones already taken). Can I do this? In my problem, an action that has been taken once never needs to be taken again later. A small sketch of the rule I mean is below.
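To make the rule concrete, here is a tiny Python sketch of the shrinking action set I have in mind. The function name valid_actions and the way I pass in the list of actions already taken are just my own illustration, not anything from a library:

```python
ALL_ACTIONS = {"a1", "a2", "a3", "a4"}

def valid_actions(state, actions_taken):
    """Return the actions still available: all actions minus the ones
    already used on the path so far (my proposed rule)."""
    if state == "S4":              # terminal state, nothing left to do
        return set()
    return ALL_ACTIONS - set(actions_taken)

# Example: after taking a1 in S1 and a3 in S2, only a2 and a4 remain in S3.
print(valid_actions("S3", ["a1", "a3"]))   # {'a2', 'a4'}
```

My concern is whether making the action set depend on the history of actions like this breaks the Markov property, or whether I need to fold the set of used actions into the state itself.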
I am confused about how to find the optimal value function and policy for my MDP; neither is known to me. The objective is to find the optimal policy that gets me from S1 to S4 with maximum reward. Any action taken in a particular state can lead to any other state, i.e. in my case there is a uniform state-transition probability of 25% for all states except S4, since it is terminal. How can I approach this problem? After a lot of googling I vaguely understand that I should start with a random policy (equal probability of taking any valid action), compute the value function for each state, iteratively update V until it converges, and then derive the optimal policy from these value functions. Those solutions mostly use the Bellman equations. Can someone please elaborate on how I can do this, or point me to another method? My rough attempt at coding this up is below.
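To show where I currently stand, here is a minimal value-iteration sketch based on my understanding of the Bellman optimality update. The discount factor and the reward function R are placeholders I made up (here I simply reward reaching S4); the uniform 25% transition probability is the one from my problem. Please correct me if this is not the right way to apply the Bellman equations:

```python
import numpy as np

# States 0..3 correspond to S1..S4; S4 (index 3) is terminal.
# Actions 0..3 correspond to A1..A4.
N_STATES, N_ACTIONS = 4, 4
TERMINAL = 3
GAMMA = 0.9          # placeholder discount factor (my assumption)
THETA = 1e-6         # convergence threshold

# Transition model from my problem: from any non-terminal state,
# every action leads to any of the 4 states with probability 0.25.
P = np.full((N_STATES, N_ACTIONS, N_STATES), 0.25)
P[TERMINAL] = 0.0    # no transitions out of the terminal state

# Placeholder reward R[s, a, s']: I would plug in my actual rewards here;
# for illustration, only reaching S4 gives a reward of 1.
R = np.zeros((N_STATES, N_ACTIONS, N_STATES))
R[:, :, TERMINAL] = 1.0

# Value iteration: repeatedly apply the Bellman optimality update
#   V(s) <- max_a sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V(s')]
V = np.zeros(N_STATES)
while True:
    Q = np.einsum('sat,sat->sa', P, R + GAMMA * V)   # Q[s, a]
    V_new = Q.max(axis=1)
    V_new[TERMINAL] = 0.0
    if np.max(np.abs(V_new - V)) < THETA:
        V = V_new
        break
    V = V_new

# Greedy policy extracted from the converged values
policy = Q.argmax(axis=1)
print("V:", V)
print("policy (action index per state):", policy)
```

Is this the right overall idea, or should I instead be doing full policy iteration (policy evaluation of a random starting policy followed by policy improvement)?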
Thank you in advance