Suppose we have a finite-horizon sequential decision-making problem. At period $t$ we are in state $s$; we take action $a$, receive reward $r$, and move to state $s-1$ at period $t+1$. However, with positive probability $p>0$ the transition fails: after $\tau$ periods ($\tau$ is a realization of a random variable), we discover that a mistake was made at period $t$, so that the true state at period $t+1$ was still $s$ and no reward was earned, i.e., $r=0$ for the action taken at period $t$. In other words, at period $t+1$ we actually remain in state $s$, not $s-1$, but we only learn this $\tau$ periods later.
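To make the dynamics explicit, I believe the hidden transitions I have in mind would look something like the following (writing $s_t$ for the true state at period $t$):
$$
\Pr(s_{t+1}=s-1 \mid s_t=s,\, a_t=a) = 1-p, \qquad \Pr(s_{t+1}=s \mid s_t=s,\, a_t=a) = p,
$$
with reward $r$ in the first case and reward $0$ in the second, and with the realized outcome only revealed $\tau$ periods later, where $\tau$ is random.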
To model this problem, I was thinking that we should use a POMDP, because we are not sure about the exact state of the system at each period. For example, based on the discussion above, immediately after acting at period $t$ we are in state $s-1$ with probability $1-p$ and still in state $s$ with probability $p$. I am also wondering how to model the reward in this case, because the realized reward is not known exactly either. Should I model it as a POMDP with delayed observations?
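For concreteness, here is a minimal Python sketch of the kind of belief tracking I have in mind. The function names and the assumption that the delayed feedback reveals the true state exactly are just for illustration, not part of the actual problem statement:

```python
def belief_after_action(belief, p):
    """Propagate the belief over states after taking the 'move down' action.

    belief: dict mapping state -> probability of being in that state.
    With probability 1 - p the state decreases by one; with probability p
    the action silently fails and the state stays the same.
    """
    new_belief = {}
    for state, prob in belief.items():
        new_belief[state - 1] = new_belief.get(state - 1, 0.0) + prob * (1 - p)
        new_belief[state] = new_belief.get(state, 0.0) + prob * p
    return new_belief


def belief_after_delayed_feedback(revealed_state):
    """Collapse the belief once the delayed feedback arrives.

    Here the feedback is assumed to identify the current state exactly;
    in general it would only rule out some histories and Bayes' rule
    would be applied instead.
    """
    return {revealed_state: 1.0}


# Example: start in state 5 with certainty and take two actions with p = 0.2.
b = {5: 1.0}
b = belief_after_action(b, p=0.2)
b = belief_after_action(b, p=0.2)
print(b)  # approximately {3: 0.64, 4: 0.32, 5: 0.04}
```

My uncertainty is whether this kind of belief state is enough, given that the reward is also only known after the delay.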
I would be thankful if you could give me your ideas on how to model it and some related references.