Why can the core reinforcement learning algorithms be applied to POMDPs?

Question

Why can an AI such as AlphaStar work in StarCraft even though the environment is only partially observable? As far as I know, there are no theoretical results on RL in POMDPs, yet the core RL techniques appear to be used in partially observable domains.

See this related but more specific question: [Can Q-learning be used in a POMDP?](https://ai.stackexchange.com/q/11612/2444). — nbro, Apr 03 '20 at 20:30
It's not even clear whether SC2 can be represented as any type of Markov model, observable or not, but the system still works. — FourierFlux, Apr 03 '20 at 20:35
I am not familiar enough with AlphaStar or StarCraft to explain AlphaStar's success, but this is not the first time that an assumption probably doesn't hold in a real-world problem, yet the applied method (which assumes that it holds) still performs decently or even well. For example, if I recall correctly, naive Bayes makes some assumptions that are simply unrealistic, but, in practice, it still works in many cases. Why does it work? I don't know, because I am not an expert on naive Bayes or the problems it has been applied to. — nbro, Apr 03 '20 at 20:39
However, I can say that, in many cases, people try to make the environment an MDP (or they try to make the Markov property hold) by doing some tricks. For example, a typical trick in RL is to combine different successive frames of a video (or video game) in order to build a state, rather than using only one frame as the state. — nbro, Apr 03 '20 at 20:40
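To illustrate the frame-combining trick mentioned in the comment above, here is a minimal sketch, assuming observations arrive as NumPy arrays of a fixed shape. The class name `FrameStack` and its methods are hypothetical names chosen for this example and are not taken from AlphaStar or any particular library.

```python
from collections import deque

import numpy as np


class FrameStack:
    """Hypothetical helper: keeps the k most recent frames as one stacked state."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)  # the oldest frame is dropped automatically

    def reset(self, first_frame):
        # At the start of an episode there is no history yet,
        # so the buffer is filled with copies of the first frame.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(np.asarray(first_frame))
        return self.state()

    def step(self, frame):
        # Append the newest frame and return the updated stacked state.
        self.frames.append(np.asarray(frame))
        return self.state()

    def state(self):
        # Shape: (k, *frame_shape), ordered from oldest to newest.
        return np.stack(list(self.frames), axis=0)


stack = FrameStack(k=4)
state = stack.reset(np.zeros((2, 2)))
state = stack.step(np.ones((2, 2)))
print(state.shape)  # (4, 2, 2)
```

The stacked array is then fed to the agent in place of a single frame. As noted elsewhere in this thread, this only approximates the true state; it does not guarantee that the Markov property holds.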
Also, note that POMDPs are not exactly what you are thinking of. In a POMDP, the agent doesn't know the state it is in, so it maintains a "belief" (i.e. a probability distribution) over the possible states. But your question is still very interesting and it is definitely legitimate! — nbro, Apr 03 '20 at 20:52
Practically speaking, the agent doesn't know its state. The true state of the environment would show every piece, but the agent can only see areas within a certain radius of its own units. Based on prior observations, it could develop a belief about possible unit positions. — FourierFlux, Apr 03 '20 at 21:04
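The "belief" discussed in the two comments above can be made concrete for a small discrete POMDP. The following is a minimal sketch of the standard Bayes-filter belief update, assuming the transition and observation probabilities are known. The function name `belief_update` and the array layouts of `T` and `O` are assumptions made for this example. StarCraft is far too large for an exact update like this, so this is only meant to show what maintaining a belief means.

```python
import numpy as np


def belief_update(belief, T, O, action, observation):
    """One step of a discrete Bayes filter (hypothetical helper).

    belief: shape (S,), current probability of each state.
    T: shape (A, S, S), assumed layout T[a, s, s2] = P(s2 | s, a).
    O: shape (A, S, Z), assumed layout O[a, s2, z] = P(z | s2, a).
    """
    # Prediction: push the belief through the transition model.
    predicted = belief @ T[action]
    # Correction: weight each state by the likelihood of the observation.
    unnormalized = O[action][:, observation] * predicted
    total = unnormalized.sum()
    if total == 0:
        raise ValueError("observation has zero probability under the model")
    return unnormalized / total


# Toy model: two states, one action that leaves the state unchanged,
# and a sensor that reports the correct state 80% of the time.
T = np.array([[[1.0, 0.0], [0.0, 1.0]]])
O = np.array([[[0.8, 0.2], [0.2, 0.8]]])
belief = np.array([0.5, 0.5])
print(belief_update(belief, T, O, action=0, observation=0))  # [0.8 0.2]
```

Starting from a uniform belief, a single observation of `0` shifts the belief toward state 0 in proportion to the sensor's reliability.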
Yes, maybe you're right. Even if you combine multiple frames, that may still not represent a state. It could just be an approximation of the true state. — nbro, Apr 03 '20 at 21:07


0 Answers