
I have a certain scheduling problem and I would like to know in general whether I can use reinforcement learning (and, if so, what kind of RL) to solve it. Basically, my problem is a mixed-integer linear optimization problem. I have a building with an electric heating device that converts electricity into heat. The action vector (decision variable) is $x(t)$, which quantifies the electrical power of the heating device. The device has to take one decision for every minute of the day (so in total there are $24$ hours $\times 60$ minutes $= 1440$ variables). Each of those variables is continuous and can take any value between $0$ and $2000$ W.

The state space contains several continuous variables:

  • Externally varying electricity price per minute: between $0$ and $100$ cents per kWh
  • Internal temperature of the building: in principle it can take any value, but there is a constraint to keep it between $20$ °C and $22$ °C
  • Heat demand of the building: any value between $0$ W and $10{,}000$ W
  • Varying "efficiency" of the electric heating device: between $1$ and $4$ (depending on the outside temperature)

The goal is to minimize the electricity costs (under a flexible electricity tariff) while not violating the temperature constraint of the building. As stated before, this problem can be solved by mathematical optimization (a mixed-integer linear program). But I would like to know whether it can also be solved with reinforcement learning. As I am new to reinforcement learning, I would not know how to do this, and I have some concerns about it.

Here I have a very large state space with continuous values, so I can't build a comprehensive $Q$-table as there are too many values. Further, I am not sure whether the problem is a dynamic programming problem (as most, or all, reinforcement learning problems seem to be). From an optimization point of view it is a mixed-integer linear problem.

Can anyone tell me if and how I could solve this by using RL? If it is possible, I would like to know which type of RL method is suitable: maybe Deep Q-Learning, or perhaps Monte Carlo policy iteration or SARSA? Should I use model-free or model-based RL for this?

Reminder: does anybody know whether and how I can use reinforcement learning for this problem? I'd highly appreciate every comment and would be quite thankful for more insights and your help.

PeterBe
  • Is the output you need a fixed *plan* of all decisions for the day, made in advance, or can decisions be made at each minute dynamically in response to state values at each time $t$? – Neil Slater Aug 04 '21 at 08:36
  • Hi Neil, thanks for your comment. Basically both options would be possible. The output could be a fixed plan (schedule) of all decisions, using some predictions of external variables (like the outside temperature). But the output could also be made dynamically at each time slot $t$ based on the current state. – PeterBe Aug 04 '21 at 10:06
  • @NeilSlater: Hi Neil, is it possible with RL to only generate the action for one time slot, or is it also possible to generate a schedule (= actions for multiple time slots in the future)? – PeterBe Aug 05 '21 at 09:33
  • It is possible for an action to be a schedule for multiple time steps, but it is more natural in RL to consider one time step at a time. – Neil Slater Aug 05 '21 at 09:38

1 Answer


Details matter, and it is possible that your problem is best solved using classic control (solving the state equations) or operations research style optimisation. However, RL is also a good choice because it can be made to learn a controller that is not brittle when things go wrong.

One thing you will have to accept with RL is that the constraints will be soft constraints, even if you penalise them heavily. That is, you can expect that the internal temperature could drift outside of bounds. It definitely will during learning. A major design concern when framing the problem for reinforcement learning is how to weight the different rewards that represent your goals. You can weight your strict constraints higher, but at least initially they need to be low enough that the cost saving is not completely swamped.

I would suggest that your worst constraint failure penalty is slightly larger than the highest possible electricity cost for a single time step. That would mean the agent is always incentivised to spend money if it has to, as opposed to break the constraints, whilst still being able to explore what happens when it does break the constraints without having to cope with predicting large numbers.
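
To make that concrete, here is a minimal sketch of a per-step reward along those lines, assuming a one-minute time step and the price/power ranges from the question. All names, the 1.1 factor and the all-or-nothing penalty are illustrative assumptions, not from any particular library; a penalty proportional to the size of the temperature violation would be another reasonable choice.

```python
# Minimal sketch of a per-step reward (names and constants are illustrative).
MAX_POWER_W = 2000.0            # maximum heating power
MAX_PRICE_CT_PER_KWH = 100.0    # maximum electricity price
STEP_HOURS = 1.0 / 60.0         # one-minute time step

# Worst-case electricity cost of a single step, in cents:
# 2 kW * (1/60) h * 100 ct/kWh ~= 3.33 ct
MAX_STEP_COST_CT = (MAX_POWER_W / 1000.0) * STEP_HOURS * MAX_PRICE_CT_PER_KWH

# Constraint penalty slightly larger than that worst-case cost, as suggested
# above (the factor 1.1 is an arbitrary choice).
CONSTRAINT_PENALTY_CT = 1.1 * MAX_STEP_COST_CT


def step_reward(power_w, price_ct_per_kwh, indoor_temp_c,
                t_min_c=20.0, t_max_c=22.0):
    """Negative electricity cost for this minute, minus a soft penalty
    whenever the indoor temperature leaves the comfort band."""
    cost_ct = (power_w / 1000.0) * STEP_HOURS * price_ct_per_kwh
    penalty_ct = 0.0 if t_min_c <= indoor_temp_c <= t_max_c else CONSTRAINT_PENALTY_CT
    return -(cost_ct + penalty_ct)
```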

There are lots of types of RL, and some are better at different kinds of problems. I would characterise your problem, as you have described it, as follows:

  • Episodic - but only for convenience of describing the problem. In fact, an agent with a 24-hour episode will be incentivised to allow the internal temperature to drop at the end of the 24 hours to save money, because it does not care what might happen immediately afterwards. Depending on the price of electricity at that point, it could well be more optimal to spend more. This may only be a small difference from truly optimal behaviour, but you might play to a strong point of RL by re-framing the problem as a continuing one (which would be harder to frame as a mixed-integer linear optimisation).

  • Continuous state space, with low dimensionality.

    • If prices are known in advance, you may want to augment the state space so that the agent knows how long it has at the current price and whether the next price will be higher or lower. Alternatively, if prices always follow the same time schedule, you could add the current time as a state variable. Either way, that allows the agent to take advantage of the temperature bounds. For instance, it could load up on cheap heating before a price hike, or allow the temperature to drop to the minimum acceptable if cheaper electricity is about to become available.
  • Large, possibly continuous action space. You might want to consider approximating this to e.g. 21 actions (0 W, 100 W, ..., 2000 W), as optimising this simpler variant will be easier to code (a DQN could do it), whilst it may not significantly affect the optimality of any solution; a sketch of such a discretisation follows this list.
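
To illustrate the last two points, here is a minimal sketch of an augmented state vector and a 21-level action discretisation. The function names, the normalisation choices and the exact layout of the state vector are assumptions for illustration only.

```python
import numpy as np

# 21 discrete power levels: 0 W, 100 W, ..., 2000 W
POWER_LEVELS_W = np.linspace(0.0, 2000.0, num=21)


def action_to_power(action_index):
    """Map a discrete action index (0..20), e.g. chosen by a DQN agent,
    to the continuous heating power in watts."""
    return float(POWER_LEVELS_W[action_index])


def make_state(minute_of_day, price_ct_per_kwh, indoor_temp_c,
               heat_demand_w, efficiency):
    """Illustrative state vector: the variables from the question plus the
    time of day, all roughly normalised to [0, 1]."""
    return np.array([
        minute_of_day / 1440.0,          # lets the agent anticipate a fixed price schedule
        price_ct_per_kwh / 100.0,
        (indoor_temp_c - 20.0) / 2.0,    # 0..1 inside the 20-22 °C comfort band
        heat_demand_w / 10000.0,
        (efficiency - 1.0) / 3.0,        # "efficiency" range 1..4
    ], dtype=np.float32)
```

With this mapping the agent only ever chooses an integer in 0..20, and the environment converts it back to watts.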

I don't think you could simplify your state space in order to use Q tables. So the DQN agent is probably the simplest that you could use here, provided you are willing to discretise the action space.

If you don't want to discretise the action space, then you will want to use some form of policy gradient approach. This will involve a policy network that takes the current state as input and outputs a distribution over power level choices - e.g. a mean and standard deviation for the power choice. In production use you may be able to set the standard deviation to zero and use the mean as the action choice. A method like A3C can be used to train such an agent.
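
As a rough illustration of that output parameterisation (not a full A3C implementation), a Keras policy network could look something like the sketch below; the layer sizes and the squashing/scaling choices are arbitrary assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

STATE_DIM = 5  # e.g. time of day, price, indoor temperature, heat demand, efficiency

# Policy network: state in, mean and standard deviation of a Gaussian
# over the heating power (in watts) out.
inputs = layers.Input(shape=(STATE_DIM,))
hidden = layers.Dense(64, activation="relu")(inputs)
hidden = layers.Dense(64, activation="relu")(hidden)

# Mean squashed to [0, 1] and scaled to the 0-2000 W range.
mean_w = layers.Dense(1, activation="sigmoid")(hidden)
mean_w = layers.Rescaling(scale=2000.0)(mean_w)

# Standard deviation kept strictly positive via softplus.
std_w = layers.Dense(1, activation="softplus")(hidden)

policy_net = tf.keras.Model(inputs=inputs, outputs=[mean_w, std_w])
```

During training the action would be sampled from that Gaussian (and clipped to the 0-2000 W range); in production you could act on the mean alone, as described above.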

I suggest that you discretise the actions and use a DQN-based agent to learn an approximate optimal policy for your environment. If that returns a promising result, you could either stop there or try to refine it further using a continuous action space and A3C.

Also, you will want to practice using DQN on a simpler problem before diving into your main project. Two reasonable learning problems here might be CartPole-v1 and LunarLander-v2, the latter of which also has a continuous-actions variant. Learning enough about setting up relevant RL methods to solve these toy problems should put you on a good footing to handle your more complex problem.

The Keras documentation includes an example DQN for Atari Breakout, which you may be able to use as the basis for your own code.
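
The Atari example uses a convolutional network on image frames; for a low-dimensional state like this one the Q-network can be much smaller. A minimal sketch, assuming the 5-dimensional state and 21 discretised actions from above (the layer sizes are arbitrary):

```python
import tensorflow as tf
from tensorflow.keras import layers

STATE_DIM = 5     # as in the state sketch above
NUM_ACTIONS = 21  # 0 W, 100 W, ..., 2000 W


def build_q_network():
    """Small fully connected Q-network: state in, one Q-value per discrete
    power level out. Used as the online network (and, via a copy, the
    target network) in a standard DQN training loop."""
    model = tf.keras.Sequential([
        layers.Input(shape=(STATE_DIM,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(NUM_ACTIONS, activation="linear"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss=tf.keras.losses.Huber())
    return model
```

The surrounding DQN machinery (replay buffer, epsilon-greedy exploration, target-network updates) can be carried over from the Breakout example largely unchanged.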

Neil Slater
  • Thanks Neil Slater for your answer. Regarding the "Episodic" point: basically I also have a constraint at the very end of the day that the temperature should be (almost) the same as the initial temperature (which is set to 21 °C). Regarding the action space: I will discretize the action space so that I can use DQN. About the constraint violation: basically the RL controller will be trained and later used with a secondary controller that can overrule the RL controller if a constraint violation is about to happen. – PeterBe Aug 05 '21 at 09:28
  • What framework or library would you advise me to use for DQN? Maybe Keras? Further, do you know a good tutorial for a framework that I can use and then adjust to my problem? I am honestly not such a big fan of these toy examples, as they are quite different from what I intend to do, and I would also have to spend quite a lot of time learning other frameworks (like Gym) that I will not need at all. This is why I would like to start directly with my problem. – PeterBe Aug 05 '21 at 09:32
  • Keras is fine to build a DQN with. I would still advise you to study the toy problems, as per the answer, and I have picked three that are quite similar to your problem - continuous state space and simulations of physical environments. – Neil Slater Aug 05 '21 at 09:35
  • Thanks a lot Neil for your great answer. I really appreciate it. I upvoted and accepted it. So I will then try to understand the Atari example and later try to implement my problem. – PeterBe Aug 05 '21 at 09:47
  • When I have learned the Q-function, would you advise me to also use policy iteration or value iteration after the Q-learning (or in combination)? Or shall I use the SARSA algorithm? – PeterBe Aug 05 '21 at 11:27
  • @PeterBe: DQN also learns an implied policy $\pi(s) = \text{argmax}_a \hat{q}(s,a, \theta)$, so when you have learned an (approximate) optimal action value function, you also have your (approximate) optimal policy. The other algorithms you suggest won't help you. SARSA is an alternative to Q learning that may work, but I suggest you stick to DQN for discretised actions. There are extensions to DQN that may help, although difficult to say much in advance. Use A3C, PPO or DDPG if you want to explore continuous action space. – Neil Slater Aug 05 '21 at 11:41
  • Thanks for the answer Neil and your effort. I really appreciate it. What I am wondering is why DQN needs a discrete action space. As far as I understood, DQN just uses an artificial neural network (e.g. a multi-layer perceptron or an LSTM) to learn the Q-function. Normally these ANNs can also cope with continuous inputs and outputs. – PeterBe Aug 05 '21 at 12:19
  • @PeterBe: It is because to derive the policy from the neural network you have to perform $\text{argmax}_a \hat{q}(s,a,\theta)$ (where $\hat{q}$ is the neural network function and $\theta$ is its learned parameters). The more actions there are, the longer this will take to do precisely. In a continuous action space it is intractable without some kind of approximation of the space or the $\text{argmax}$. – Neil Slater Aug 05 '21 at 18:43
  • Thanks Neil for your answer. Basically I do not really understand your last comment. As far as I know, ANNs are also often used to map continuous inputs to continuous outputs. Or am I wrong on this one? Of course discrete inputs and outputs are easier to handle, but still I would say that it should also be possible with continuous variables. – PeterBe Aug 06 '21 at 06:30
  • @PeterBe I suggest you ask a separate question about your last issue because it is quite different to your original problem and this comment thread has gone on far too long. – Neil Slater Aug 06 '21 at 06:33
  • Hi, it's me again with 2 follow-up questions to your answer. Meanwhile I have worked through one of the examples you suggested with Gym (I do not want to learn the Atari example as it is quite far away from my use case). 1) I have to implement my own environment. Can I use Gym for that, or can I just set up a simulation of a building in Python without any reference to Gym (I would prefer not using Gym)? 2) Eventually my agent has to make 2 non-independent decisions at every time slot (I did not mention this in my initial question). Is it possible to do this with RL? – PeterBe Aug 19 '21 at 12:55
  • @PeterBe 1) You can do either; you will find that you end up writing an environment very similar to Gym ones if you write your own, but it is not a problem. 2) Yes, you can have compound actions. For more/better details please ask a new question on the site. Comments are for clarifying details on the answer, and I cannot give proper details here. – Neil Slater Aug 19 '21 at 13:04
  • Hi Neil, I have a question about the constraint part of your answer: "I would suggest that your worst constraint failure penalty is slightly larger than the highest possible electricity cost for a single time step. That would mean the agent is always incentivised to spend money if it has to, as opposed to break the constraints, whilst still being able to explore what happens when it does break the constraints without having to cope with predicting large numbers." – PeterBe Aug 30 '21 at 13:21
  • Basically I have 3 goals with normalized rewards between 0 and 1 for every time slot, and I have 10 constraints. Should the constraints' rewards also be normalized for all 10 constraints, and should I then choose a higher weight for the most important constraint than for all 3 goals combined? Is it also possible to tell the agent some rules directly, without any constraints? E.g. I have 2 storage systems and the agent is only allowed to heat up 1 in every time slot. Further, the agent should not start and stop heating frequently but should only have e.g. 4 starts of the device per day. – PeterBe Aug 30 '21 at 13:22
  • @PeterBe Sorry, this comment thread has gone on far too long. Please ask follow-ups as a new question. You may also benefit from another point of view than mine. You could link this question and answer from your new question, but do please make it as self-contained as you can. – Neil Slater Aug 30 '21 at 15:01
  • Hi Neil, it's me again, coming back to my follow-up question about your suggested approach regarding the weights of the constraints and the rewards. I asked a separate question as you advised (https://ai.stackexchange.com/questions/30469/how-to-weigt-constraints-in-a-control-problem-with-reinforcement-learning), but unfortunately nobody has answered in more than 3 weeks, although I have set reminders multiple times. This is why I'd highly appreciate any further comment on your suggested approach. – PeterBe Sep 23 '21 at 08:28