I am considering using Reinforcement Learning to do optimal control of a complex process that is controlled by two parameters
$(n_O, n_I), \quad n_O = 1, 2, 3, \dots, M_O, \quad n_I = 1, 2, 3, \dots, M_I$
In this sense, the state of the system is represented by $S_t = (n_{O,t}, n_{I,t})$. I say "represented" because there is actually a relatively complex system in the background: the solution of a set of coupled Partial Differential Equations (PDEs).
Is this problem considered a partially observable Markov Decision Process (POMDP) because there is a whole mess of things behind $S_t = (n_{O,t}, n_{I,t})$?
The reward function has two parameters,
$r(s) = (n_{lt}, \epsilon_\infty),$
which are outputs of the environment (the solution of the PDEs).
In a sense, using $S_t = (n_{O,t}, n_{I,t})$ makes this problem similar to Gridworld: the goal is to go from $S_0 = (M_O, M_I)$ to a state with smaller $(n_O, n_I)$, given a reward $r$ that changes from state to state and from episode to episode.
Available action operations are
$inc(n) = n + 1$
$dec(n) = n - 1$
$id(n) = n$
where $n$ can be $n_I$ or $n_O$. This means there are $9$ possible actions
$A=\{(inc(n_O), inc(n_I)),(inc(n_O), dec(n_I)),(inc(n_O), id(n_I)),(dec(n_O), inc(n_I)), \dots\}$
to be taken. However, there is no model for the state transitions, and each transition is extremely costly to evaluate, since it requires solving the PDEs.
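For concreteness, here is a minimal sketch of how I picture wrapping the problem as an environment. The `solve_pdes` function and the way $n_{lt}$ and $\epsilon_\infty$ are combined into a scalar reward are placeholders for the actual solver, not real code:

```python
import itertools

def solve_pdes(n_O, n_I):
    """Placeholder for the expensive coupled-PDE solve; returns (n_lt, eps_inf)."""
    raise NotImplementedError("plug the actual solver in here")

# The 9 joint actions: every combination of {inc, dec, id} applied to (n_O, n_I).
OPS = {"inc": lambda n: n + 1, "dec": lambda n: n - 1, "id": lambda n: n}
ACTIONS = list(itertools.product(OPS, repeat=2))  # [('inc','inc'), ('inc','dec'), ...]

class PDEControlEnv:
    """Gridworld-like wrapper; every step triggers one (costly) PDE solve."""

    def __init__(self, M_O, M_I):
        self.M_O, self.M_I = M_O, M_I
        self.state = (M_O, M_I)

    def reset(self):
        self.state = (self.M_O, self.M_I)  # start each episode at S_0 = (M_O, M_I)
        return self.state

    def step(self, action_idx):
        op_O, op_I = ACTIONS[action_idx]
        n_O = min(max(OPS[op_O](self.state[0]), 1), self.M_O)  # keep 1 <= n_O <= M_O
        n_I = min(max(OPS[op_I](self.state[1]), 1), self.M_I)  # keep 1 <= n_I <= M_I
        self.state = (n_O, n_I)

        n_lt, eps_inf = solve_pdes(n_O, n_I)   # the expensive part
        reward = -abs(n_lt) - abs(eps_inf)     # placeholder scalarization of the reward
        done = False                           # termination criterion left open
        return self.state, reward, done
```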
Intuitively, just as solving a kinematic equation for a point in space has the Markov property, solving the coupled PDEs of fluid dynamics should have it as well (strongly so if the flow is laminar; for turbulence, I have no idea). I have also found a handful of papers where a fluid dynamics problem is parameterized and a policy-gradient method is simply applied.
I was thinking of using REINFORCE as a starting point, but the fact that $(n_O, n_I)$ does not fully describe the state, together with questions like this one on POMDPs and this one about simulations, makes me suspicious. Could REINFORCE be used for such a problem, or is there something that prevents this?
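To make the question concrete, this is roughly the REINFORCE setup I have in mind: a tabular softmax policy over the 9 actions, indexed only by the observable pair $(n_O, n_I)$, which is exactly where my worry about partial observability comes in. It assumes an environment with the `reset`/`step` interface sketched above; the episode length, learning rate, and number of episodes are arbitrary.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce(env, M_O, M_I, n_actions=9, episodes=50, horizon=20,
              alpha=0.1, gamma=0.99):
    """Tabular REINFORCE: one preference vector theta[n_O, n_I, :] per observable state."""
    theta = np.zeros((M_O + 1, M_I + 1, n_actions))   # index 0 unused; states start at 1

    for _ in range(episodes):
        # Sample one episode with the current policy (each step = one PDE solve).
        s = env.reset()
        trajectory = []
        for _ in range(horizon):
            probs = softmax(theta[s[0], s[1]])
            a = np.random.choice(n_actions, p=probs)
            s_next, r, done = env.step(a)
            trajectory.append((s, a, r))
            s = s_next
            if done:
                break

        # Monte-Carlo policy-gradient update over the sampled trajectory.
        G = 0.0
        for (s_t, a_t, r_t) in reversed(trajectory):
            G = r_t + gamma * G                        # return from step t onward
            probs = softmax(theta[s_t[0], s_t[1]])
            grad_log_pi = -probs                       # gradient of log softmax ...
            grad_log_pi[a_t] += 1.0                    # ... w.r.t. the action preferences
            theta[s_t[0], s_t[1]] += alpha * G * grad_log_pi
    return theta
```

The part that worries me is that every call to `env.step` is a full PDE solve, and the policy only ever sees $(n_O, n_I)$, not the underlying field solution.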