
I want to use Reinforcement Learning to optimize the distribution of energy for a peak-shaving problem given by a thermodynamic simulation. However, I am not sure how to proceed, as the action space is the only thing that really matters, in the following sense:

  • The action space is a $288 \times 66$ matrix of real numbers between $0$ and $1$. The output of the simulation and therefore my reward depend solely on the distribution of this matrix.

  • The state space is therefore absent, as the only thing that matters is the matrix, over which I have total control. At this stage of the simulation, no other variables are taken into consideration.

I am not sure whether this problem falls into tabular RL or requires function approximation. In the latter case, I was thinking of using a policy gradient algorithm to figure out the best distribution of the $288 \times 66$ matrix. However, I do not know how to deal with the "absence" of the state space: instead of a tuple $\langle s,a,r,s' \rangle$, I would just have $\langle a, r \rangle$. Is this even a problem that can be approached with RL? If not, how can I reshape it to make it solvable with RL techniques?


1 Answer


A stateless RL problem can be reduced to a Multi-Armed Bandit (MAB) problem. In such a scenario, taking an action does not change the state of the agent.

So, this is the setting of a conventional MAB problem: at each time step, the agent selects an action, either as an exploration or an exploitation move. It then observes the reward of the taken action and updates its estimate of that action's value. Then it repeats the procedure (selection, observation, update).

To choose between exploration and exploitation moves, MAB agents adopt a strategy. The simplest one is probably $\epsilon$-greedy, in which the agent chooses the currently best-estimated action most of the time (with probability $1-\epsilon$) and selects an action at random otherwise (with probability $\epsilon$).
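As a rough illustration, here is a minimal $\epsilon$-greedy bandit loop. This is only a sketch with an assumed small, discrete set of arms and a made-up placeholder reward function (`simulate_reward`), not the $288 \times 66$ continuous action of your problem:

```python
import numpy as np

n_actions = 10          # assumed number of arms (illustrative only)
epsilon = 0.1           # exploration probability
n_steps = 1000

q_estimates = np.zeros(n_actions)   # running estimate of each arm's reward
counts = np.zeros(n_actions)        # how often each arm was pulled

def simulate_reward(action):
    """Placeholder reward, standing in for the simulation's output."""
    true_means = np.linspace(0.0, 1.0, n_actions)
    return true_means[action] + np.random.normal(scale=0.1)

for t in range(n_steps):
    # epsilon-greedy selection: explore with probability epsilon,
    # otherwise exploit the current best estimate
    if np.random.rand() < epsilon:
        action = np.random.randint(n_actions)
    else:
        action = int(np.argmax(q_estimates))

    reward = simulate_reward(action)

    # incremental update of the action-value estimate
    counts[action] += 1
    q_estimates[action] += (reward - q_estimates[action]) / counts[action]
```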

  • This makes sense. However, in the MAB scenario, you have a discrete action space, and the exploration/exploitation dynamics are much clearer and are normally approached (to my knowledge) with simple tabular methods. In my case, I have a very large matrix to be filled with real values, which would definitely require an approximation (a NN?) somewhere. Plus, considering the exploration dilemma: should I reshape the action space to force the agent to change one entry of the matrix at a time (or one column at a time)? Otherwise, I do not see where and how exploration can happen in my case. – FS93 Jun 14 '19 at 14:56
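To make the continuous case raised in this comment concrete, here is a minimal sketch of a stateless policy gradient (REINFORCE-style), under the following assumptions: each matrix entry is sampled from an independent Gaussian whose mean is a learnable parameter squashed into $(0,1)$, samples are clipped to $[0,1]$ (so the policy is only approximately Gaussian), and `run_simulation` is a hypothetical stand-in for the actual thermodynamic simulator:

```python
import numpy as np

rows, cols = 288, 66
theta = np.zeros((rows, cols))      # unconstrained parameters, one per entry
sigma = 0.05                        # fixed exploration noise (assumed)
lr = 0.01                           # learning rate (assumed)
baseline = 0.0                      # running reward baseline

def run_simulation(matrix):
    """Stand-in reward that prefers a flat 0.5 profile.
    Replace with the real peak-shaving simulation."""
    return -np.mean((matrix - 0.5) ** 2)

for episode in range(1000):
    mu = 1.0 / (1.0 + np.exp(-theta))            # means in (0, 1)
    action = np.clip(mu + sigma * np.random.randn(rows, cols), 0.0, 1.0)

    reward = run_simulation(action)

    # Score-function (REINFORCE) gradient for a Gaussian policy with a
    # sigmoid-parameterised mean: d log pi / d theta
    grad_log_pi = (action - mu) / sigma**2 * mu * (1.0 - mu)
    theta += lr * (reward - baseline) * grad_log_pi

    # Slowly track the average reward as a variance-reducing baseline.
    baseline += 0.1 * (reward - baseline)
```

Here exploration comes from the per-entry Gaussian noise rather than from changing one entry or column at a time, and the whole matrix is updated on every episode.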