
I want to use Reinforcement Learning to optimize the distribution of energy for a peak-shaving problem given by a thermodynamic simulation. However, I am not sure how to proceed, as the action space is the only thing that really matters, in the following sense:

  • The action space is a $288 \times 66$ matrix of real numbers between $0$ and $1$. The output of the simulation and therefore my reward depend solely on the distribution of this matrix.

  • The state space is therefore absent, as the only thing that matters is the matrix, over which I have total control. At this stage of the simulation, no other variables are taken into consideration.

I am not sure whether this problem falls into tabular RL or requires function approximation. In the latter case, I was thinking of using a policy gradient algorithm to figure out the best distribution of the $288 \times 66$ matrix. However, I do not know how to deal with the "absence" of the state space: instead of a tuple $\langle s,a,r,s' \rangle$, I would just have $\langle a, r \rangle$. Is this even a problem that can be approached with RL? If not, how can I reshape it to make it solvable with RL techniques?


1 Answer


A stateless RL problem can be reduced to a Multi-Armed Bandit (MAB) problem. In such a scenario, taking an action does not change the state of the agent.

So, this is the setting of a conventional MAB problem: at each time step, the agent selects an action, either as an exploration or an exploitation move. It then observes the reward of the taken action and updates its estimate of that action's value. Then it repeats the procedure (selection, observation, update).

To choose between exploration and exploitation moves, MAB agents adopt a strategy. The simplest one is probably $\epsilon$-greedy, in which the agent chooses the currently best-estimated action most of the time (with probability $1-\epsilon$) and selects an action at random otherwise (with probability $\epsilon$).
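As a rough illustration, here is a minimal $\epsilon$-greedy bandit loop. This is only a sketch with an assumed small, discrete set of arms and a made-up placeholder reward function (`simulate_reward`), not the $288 \times 66$ continuous action of your problem:

```python
import numpy as np

n_actions = 10          # assumed number of arms (illustrative only)
epsilon = 0.1           # exploration probability
n_steps = 1000

q_estimates = np.zeros(n_actions)   # running estimate of each arm's reward
counts = np.zeros(n_actions)        # how often each arm was pulled

def simulate_reward(action):
    """Placeholder reward, standing in for the simulation's output."""
    true_means = np.linspace(0.0, 1.0, n_actions)
    return true_means[action] + np.random.normal(scale=0.1)

for t in range(n_steps):
    # epsilon-greedy selection: explore with probability epsilon,
    # otherwise exploit the current best estimate
    if np.random.rand() < epsilon:
        action = np.random.randint(n_actions)
    else:
        action = int(np.argmax(q_estimates))

    reward = simulate_reward(action)

    # incremental update of the action-value estimate
    counts[action] += 1
    q_estimates[action] += (reward - q_estimates[action]) / counts[action]
```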

  • This makes sense. However, in the MAB scenario, you have a discrete action space, and the exploration/exploitation dynamics are much clearer and are normally approached (to my knowledge) with simple tabular methods. In my case, I have a very large matrix to be filled with real values, which would definitely require an approximation (a NN?) somewhere. Plus, considering the exploration dilemma: should I reshape the action space to force the agent to change one entry of the matrix at a time (or one column at a time)? Otherwise, I do not see where and how exploration can happen in my case. – FS93 Jun 14 '19 at 14:56
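To make the continuous case raised in this comment concrete, here is a minimal sketch of a stateless policy gradient (REINFORCE-style), under the following assumptions: each matrix entry is sampled from an independent Gaussian whose mean is a learnable parameter squashed into $(0,1)$, samples are clipped to $[0,1]$ (so the policy is only approximately Gaussian), and `run_simulation` is a hypothetical stand-in for the actual thermodynamic simulator:

```python
import numpy as np

rows, cols = 288, 66
theta = np.zeros((rows, cols))      # unconstrained parameters, one per entry
sigma = 0.05                        # fixed exploration noise (assumed)
lr = 0.01                           # learning rate (assumed)
baseline = 0.0                      # running reward baseline

def run_simulation(matrix):
    """Stand-in reward that prefers a flat 0.5 profile.
    Replace with the real peak-shaving simulation."""
    return -np.mean((matrix - 0.5) ** 2)

for episode in range(1000):
    mu = 1.0 / (1.0 + np.exp(-theta))            # means in (0, 1)
    action = np.clip(mu + sigma * np.random.randn(rows, cols), 0.0, 1.0)

    reward = run_simulation(action)

    # Score-function (REINFORCE) gradient for a Gaussian policy with a
    # sigmoid-parameterised mean: d log pi / d theta
    grad_log_pi = (action - mu) / sigma**2 * mu * (1.0 - mu)
    theta += lr * (reward - baseline) * grad_log_pi

    # Slowly track the average reward as a variance-reducing baseline.
    baseline += 0.1 * (reward - baseline)
```

Here exploration comes from the per-entry Gaussian noise rather than from changing one entry or column at a time, and the whole matrix is updated on every episode.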