I am considering using Reinforcement Learning to do optimal control of a complex process that is controlled by two parameters
$(n_O, n_I), \quad n_O = 1, 2, 3, \dots, M_O, \quad n_I = 1, 2, 3, \dots, M_I$
In this sense, the state of the system is represented by $S_t = (n_{O,t}, n_{I,t})$. I say "represented" because there is actually a relatively complex system in the background: the solution of a set of coupled Partial Differential Equations (PDEs).
Is this problem considered a partially observable Markov Decision Process (POMDP) because there is a whole mess of things behind $S_t = (n_{O,t}, n_{I,t})$?
The reward function has two parameters,
$r(s) = (n_{lt}, \epsilon_\infty),$
which are outputs of the environment (the solution of the PDEs).
In a sense, using $S_t = (n_{O,t}, n_{I,t})$ makes this problem similar to Gridworld: the goal is to go from $S_0 = (M_O, M_I)$ to a state with smaller $(n_O, n_I)$, given a reward $r$ that changes from state to state and from episode to episode.
Available action operations are
$inc(n) = n + 1$
$dec(n) = n - 1$
$id(n) = n$
where $n$ can be $n_I$ or $n_O$. This means there are $9$ possible actions
$A=\{(inc(n_O), inc(n_I)),(inc(n_O), dec(n_I)),(inc(n_O), id(n_I)),(dec(n_O), inc(n_I)), \dots\}$
to be taken. However, there is no model for the state transitions, and each transition is extremely costly to evaluate, since it requires solving the PDEs.
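For concreteness, here is a minimal sketch of how I picture wrapping the problem as an environment. The `solve_pdes` function and the way $n_{lt}$ and $\epsilon_\infty$ are combined into a scalar reward are placeholders for the actual solver, not real code:

```python
import itertools

def solve_pdes(n_O, n_I):
    """Placeholder for the expensive coupled-PDE solve; returns (n_lt, eps_inf)."""
    raise NotImplementedError("plug the actual solver in here")

# The 9 joint actions: every combination of {inc, dec, id} applied to (n_O, n_I).
OPS = {"inc": lambda n: n + 1, "dec": lambda n: n - 1, "id": lambda n: n}
ACTIONS = list(itertools.product(OPS, repeat=2))  # [('inc','inc'), ('inc','dec'), ...]

class PDEControlEnv:
    """Gridworld-like wrapper; every step triggers one (costly) PDE solve."""

    def __init__(self, M_O, M_I):
        self.M_O, self.M_I = M_O, M_I
        self.state = (M_O, M_I)

    def reset(self):
        self.state = (self.M_O, self.M_I)  # start each episode at S_0 = (M_O, M_I)
        return self.state

    def step(self, action_idx):
        op_O, op_I = ACTIONS[action_idx]
        n_O = min(max(OPS[op_O](self.state[0]), 1), self.M_O)  # keep 1 <= n_O <= M_O
        n_I = min(max(OPS[op_I](self.state[1]), 1), self.M_I)  # keep 1 <= n_I <= M_I
        self.state = (n_O, n_I)

        n_lt, eps_inf = solve_pdes(n_O, n_I)   # the expensive part
        reward = -abs(n_lt) - abs(eps_inf)     # placeholder scalarization of the reward
        done = False                           # termination criterion left open
        return self.state, reward, done
```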
Intuitively, just as solving a kinematic equation for a point in space has the Markov property, solving the coupled PDEs of fluid dynamics should have it as well (strongly so if the flow is laminar; for turbulence, I have no idea). I have also found a handful of papers where a fluid dynamics problem is parameterized and a policy-gradient method is simply applied.
I was thinking of using REINFORCE as a starting point, but the fact that $(n_O, n_I)$ does not fully describe the state, together with questions like this one on POMDPs and this one about simulations, makes me suspicious. Could REINFORCE be used for such a problem, or is there something that prevents this?
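To make the question concrete, this is roughly the REINFORCE setup I have in mind: a tabular softmax policy over the 9 actions, indexed only by the observable pair $(n_O, n_I)$, which is exactly where my worry about partial observability comes in. It assumes an environment with the `reset`/`step` interface sketched above; the episode length, learning rate, and number of episodes are arbitrary.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce(env, M_O, M_I, n_actions=9, episodes=50, horizon=20,
              alpha=0.1, gamma=0.99):
    """Tabular REINFORCE: one preference vector theta[n_O, n_I, :] per observable state."""
    theta = np.zeros((M_O + 1, M_I + 1, n_actions))   # index 0 unused; states start at 1

    for _ in range(episodes):
        # Sample one episode with the current policy (each step = one PDE solve).
        s = env.reset()
        trajectory = []
        for _ in range(horizon):
            probs = softmax(theta[s[0], s[1]])
            a = np.random.choice(n_actions, p=probs)
            s_next, r, done = env.step(a)
            trajectory.append((s, a, r))
            s = s_next
            if done:
                break

        # Monte-Carlo policy-gradient update over the sampled trajectory.
        G = 0.0
        for (s_t, a_t, r_t) in reversed(trajectory):
            G = r_t + gamma * G                        # return from step t onward
            probs = softmax(theta[s_t[0], s_t[1]])
            grad_log_pi = -probs                       # gradient of log softmax ...
            grad_log_pi[a_t] += 1.0                    # ... w.r.t. the action preferences
            theta[s_t[0], s_t[1]] += alpha * G * grad_log_pi
    return theta
```

The part that worries me is that every call to `env.step` is a full PDE solve, and the policy only ever sees $(n_O, n_I)$, not the underlying field solution.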