5

I am trying to apply RL to a control problem, and I intend to use either Deep Q-Learning or SARSA.

I have two heat storage systems and a single heating device, and the RL agent is only allowed to heat up one of them in each time slot. How can I enforce this?

I have two continuous variables $x(t)$ and $y(t)$, where $x(t)$ quantifies the fraction of maximum power used for heating up storage 1 and $y(t)$ the fraction of maximum power used for heating up storage 2.

Now, if $x(t) > 0$, then $y(t)$ has to be $0$, and vice versa, with $x(t), y(t) \in \{0\} \cup [0.25, 1]$. How can I tell this to the agent?

One way would be to adjust the actions after the RL agent has decided, using a separate control algorithm that overrules the agent's actions. I am wondering if and how this can also be done directly. I'll appreciate every comment.

Update: Of course I could do this with a reward function, but is there not a more direct way? This is actually a so-called hard constraint: the agent is not allowed to violate it at all, as violating it is technically not feasible. So it would be better to tell the agent directly not to do this (if that is possible).

Reminder: Can anyone tell me more about this issue? I'd highly appreciate any further comments and will award a bounty for a good answer.

PeterBe
  • Hello @PeterBe, if I understand correctly, yes, an off-policy algorithm fits your problem, e.g. Q-learning (but maybe DDPG for the continuous case); SARSA is on-policy, so you couldn't do that with it. Could you please tell me how you define your action space and state space? – Pulse9 Oct 20 '21 at 14:20
  • @Pulse9: Thanks for your comment. A simplified basic version of my problem is described here: https://ai.stackexchange.com/questions/28888/reinforcement-learning-applicable-to-a-scheduling-problem . You wrote "SARSA is on-policy, so you couldn't do that with it"? Why can I not use an on-policy learning algorithm for my problem? Or do you just want to say that I can't use an on-policy algorithm to tell the RL agent some rules directly without any reward function (but I can still use one for my general problem)? – PeterBe Oct 20 '21 at 14:33

3 Answers

4

You could just tweak your reward function to include this restriction.

In the simplest case, you could give your agent a reward of $-1$ whenever $x(t) > 0$ and $y(t) \neq 0$.

The scale of the negative reward depends on your general reward scaling, of course.
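For illustration, a minimal sketch of such a shaped reward in Python (the penalty magnitude and the helper name are assumptions, not something from the question):

def shaped_reward(base_reward, x, y, penalty=-1.0):
    # Add a penalty whenever the mutual-exclusion constraint is violated,
    # i.e. both storages are heated in the same time slot.
    if x > 0 and y > 0:
        return base_reward + penalty
    return base_reward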

tnfru
  • Thanks tnfru for your answer. Of course I could do this with a reward function, but is there no direct way of doing this? This is actually a so-called hard constraint: the agent is not allowed to violate it at all, as violating it is technically not feasible. So it would be better to tell the agent directly not to do this (if that is possible). – PeterBe Oct 15 '21 at 12:27
  • Your question might be misleading then. `without any constraints` implies you don't want to do this as a hard constraint. I'd suggest, then, that you overrule the agent and give a negative reward so that the agent stops selecting the action and it is never executed. – tnfru Oct 15 '21 at 18:31
  • Thanks tnfru for your comment and effort. I really appreciate it. My question is whether there is a direct way of telling this to the agent without any reward function. As far as I understand your answer, you tend to say that there is no direct way of doing this? – PeterBe Oct 18 '21 at 12:07
  • To my knowledge none other than the overruling suggested above. – tnfru Oct 26 '21 at 17:31
1

I'm not an expert, but as far as I understand, you should use an off-policy algorithm. The difference between the two is:

On-policy: the agent learns the value function according to the current action, derived from the policy currently being used.
Off-policy: the agent learns the value function according to the action derived from another policy.

This means that you can use another policy to explore. For example, Q-learning (not your case, because of the continuous values in your problem) is an off-policy approach: you can explore with a particular policy to get the actions (selecting only valid actions) and then update your Q-table with the Q-learning equation.

In your case you can use an off-policy deep approach. I suggest DDPG/TD3; you can read briefly about them here.

The idea is to use an exploration policy that you restrict to select only valid values (the hard constraint), and to store the (State, Action, Reward, State') tuples in the replay buffer. The Stable Baselines library doesn't allow that, but you could check the original source code of TD3.
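A rough sketch of that idea, assuming a custom environment object `env` with the usual Gym interface (all names here are placeholders, and the buffer would then be sampled by DDPG/TD3 as usual):

import random
from collections import deque

replay_buffer = deque(maxlen=100_000)

def sample_valid_action():
    # Exploration policy that can only produce feasible actions:
    # at most one storage is heated, with a power fraction of 0 or in [0.25, 1].
    power = random.uniform(0.25, 1.0)
    choice = random.choice(["storage1", "storage2", "idle"])
    if choice == "storage1":
        return (power, 0.0)
    if choice == "storage2":
        return (0.0, power)
    return (0.0, 0.0)

state = env.reset()  # `env` is your custom environment (placeholder)
for _ in range(1000):
    action = sample_valid_action()
    next_state, reward, done, info = env.step(action)
    replay_buffer.append((state, action, reward, next_state, done))
    state = env.reset() if done else next_state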

Edit 1:

If you look at the Q-learning algorithm, $\epsilon$-greedy consists of selecting, with probability $\epsilon$, $a \gets \text{any action}$, and with probability $1-\epsilon$, $a \gets \arg\max_{a}Q(s,a)$. This $\text{any action}$ is the part of the code where you use this "controller" to select only random (but valid) actions. This is because you want to explore, but only with valid actions. Then Q-learning can "exploit" by picking the best action based on the exploration you did before. Now, for your case with continuous actions, you can use DDPG/TD3 to do something similar, but you store these valid actions in a replay buffer, so your neural network can learn from this "data" of only valid actions.
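A small sketch of this $\epsilon$-greedy variant for a discretised action set (the Q-table layout as a dict and the `valid_actions` list are assumptions):

import random

def epsilon_greedy(q_table, state, valid_actions, epsilon=0.1):
    # Explore with probability epsilon, but only over feasible actions;
    # otherwise exploit by picking the feasible action with the highest Q-value.
    if random.random() < epsilon:
        return random.choice(valid_actions)
    return max(valid_actions, key=lambda a: q_table[(state, a)])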

[Figure: Q-learning algorithm pseudocode]

Edit 2:

In your custom environment you can define your action space like this:

self.action_space = gym.spaces.Box(low=-1, high=1, shape=(1,))  # a single continuous action in [-1, 1]

Now, as you said, in the step function of your environment you can establish $x(t)$ and $y(t)$:

maxX = 10  # depends on the maximum value of your x(t); I assigned 10 here
maxY = 10  # depends on the maximum value of your y(t); I assigned 10 here
x = 0
y = 0
if action > 0:
    y = 0
    x = action * maxX
elif action < 0:
    x = 0
    # multiply by -1 because the action is negative
    y = -1 * action * maxY
# do the rest of the code of your controller with x and y

In this way, your RL agent learns which action (between -1 and 1) gets the best reward, while in the step function you map the action in $[-1, 1]$ to your true values.
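Building on that mapping, here is a hedged sketch of a step-function fragment that also respects the $\{0\} \cup [0.25, 1]$ power range from the question (the dead-zone width of 0.1 is an assumption, and maxX/maxY follow the snippet above):

def map_action(action, maxX=10, maxY=10, dead_zone=0.1):
    # Map a single agent action in [-1, 1] to (x, y):
    # positive values heat storage 1, negative values heat storage 2,
    # and a small dead zone around 0 means neither storage is heated.
    magnitude = abs(action)
    if magnitude <= dead_zone:
        return 0.0, 0.0
    # rescale (dead_zone, 1] to (0.25, 1] so the feasible power fraction is respected
    fraction = 0.25 + 0.75 * (magnitude - dead_zone) / (1.0 - dead_zone)
    if action > 0:
        return fraction * maxX, 0.0
    return 0.0, fraction * maxY

Because only one branch is ever taken, x and y can never both be greater than 0, so the hard constraint holds by construction.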

Pulse9
  • Thanks Pulse9 for your answer. I don't understand why I can't use Q-learning for that. I intend to use deep Q-learning to map the inputs (states) to the outputs (actions). Further, you wrote "The idea is to use an exploration policy that you restrict to select only valid values". This is exactly my question: how can I restrict the agent to only select valid values? Of course I can just add another superordinate controller that can overrule the actions of the RL agent, but I would like to know if and how I can tell this to the RL agent directly while learning. – PeterBe Oct 21 '21 at 12:41
  • Moreover, I have some problems understanding your explanations about on- and off-policy learning. You wrote "Off-policy: the agent learns the value function according to the action derived from another policy." What do you mean by another policy in this context? How can I change the policy? – PeterBe Oct 21 '21 at 12:48
  • I just improved my answer – Pulse9 Oct 21 '21 at 13:11
  • Thanks for your answer Pulse9. If I understood correctly, your advice is to implement the Q-learning agent on my own and not use e.g. the DQNAgent from Keras? I have to admit that I would not know how to implement this agent on my own, and I assume that the effort would be way too high. Is it not possible to tell a predefined agent about some rules directly, without implementing the whole agent from scratch? – PeterBe Oct 21 '21 at 13:16
  • I forgot to specify the algorithms that I recommend for your case (DDPG/TD3) with continuous actions. The current libraries (Keras/Stable Baselines...) do not allow you to define these exploration policies, but you can use the authors' ready-to-use code, for example here: https://github.com/sfujim/TD3 – Pulse9 Oct 21 '21 at 13:26
  • Thanks Pulse9 for your answers and effort. I really appreciate it. I have to admit that I am kind of confused at the moment. So I generally have the following questions: 1) Why can I not use deep Q-learning for my problem? I could always discretize the continuous action space, so why is Q-learning still not applicable to my problem? – PeterBe Oct 22 '21 at 09:25
  • 2) Is there a direct way to tell the RL agent some rules, e.g. in the step function of Keras RL? If not, I'd use another superordinate controller that would overrule the actions of the RL agent. Implementing my own RL agent is not in the scope of my work due to the tremendous effort involved. 3) I do not understand your explanations about on- and off-policy learning at all. You wrote "Off-policy: the agent learns the value function according to the action derived from another policy." What do you mean by another policy in this context? How can I change the policy? Why do I have to distinguish – PeterBe Oct 22 '21 at 09:27
  • Any comment to my last 2 comments? I'll highly appreciate every further comment from you. – PeterBe Oct 25 '21 at 09:45
  • Thanks Pulse9 for your answers. Any comments to my last comment? – PeterBe Nov 05 '21 at 08:17
  • Hello PeterBe, I just had a new idea. I think you can use any RL library but with only 1 action, for example from -1 to +1: if the value is 0, then x(t) and y(t) are 0; if the value is 1, you can map it to x(t) = 1*(yourMaxValue) and y(t) = 0; and backwards, if your action is -1 then y(t) = 1*(yourMaxValue) and x(t) = 0. In this way the agent can learn, and you can map the actions to your needs. What do you think =)? – Pulse9 Nov 06 '21 at 20:34
  • You can use PPO, for example, and it's very easy to implement; I suggest using the Stable Baselines 3 library – Pulse9 Nov 06 '21 at 20:37
  • Thanks for your answer. Honestly, I do not understand how your suggested approach can ensure, for example, that x and y are never greater than 0 at the same time. Further, you suggest adjusting the actions, right? Where shall I do this (maybe in the step function of OpenAI Gym), and how is this different from a superordinate additional controller that can overrule the actions of the RL agent? – PeterBe Nov 07 '21 at 11:53
  • I imagine that you are writing a custom environment, right? I will edit my answer (Edit 2). – Pulse9 Nov 07 '21 at 13:12
  • Thanks Pulse9 for your answer and effort. I really appreciate it. I don't understand why you are using negative values. Neither x nor y can be negative; they are either 0 or in [0.25, 1], as written in my question. The second rule is that if x(t) is greater than 0 then y(t) has to be 0, and vice versa. I don't see how you enforce this in your custom environment. Further, the action_space of my gym environment contains multiple variables that have different boundaries, so I can't just use one Box variable for the actions. I need multiple variables with different limits for my actions. – PeterBe Nov 07 '21 at 14:25
  • And where do you use the PPO method that you suggested? As stated before I would like to use Deep-Q-Learning first and then also maybe SARSA. Can I combine the PPO method with them? – PeterBe Nov 07 '21 at 14:31
  • As I said, you map the RL action to your actions; you can see an example here: https://github.com/openai/gym/blob/master/gym/envs/classic_control/acrobot.py They map the action `a` in the step function using the variable `AVAIL_TORQUE = [-1.0, 0.0, +1]`, and you can see that they define the action space as discrete with 3 values: `self.action_space = spaces.Discrete(3)`. The agent learns the best action (0, 1 or 2), but in the step function it is mapped to -1, 0, or +1. – Pulse9 Nov 07 '21 at 15:53
  • Thanks Pulse9 for your comment. Maybe some last follow up questions: 1) Is your suggested approach (mapping discrete actions to other actions in the step-function) the PPO method? 2) Can I use this method in combination with Deep-Q-Learning and SARSA (because I want to use those 2 approaches)? – PeterBe Nov 08 '21 at 08:25
  • 3) Can you tell me from your personal experience whether it is really worth doing such a mapping in order to tell the agent some rules directly, or is it just better to use a superordinate controller that can overrule the actions of the RL agent? I am asking because in my case there are 2 discrete variables x and y that both have 10 discrete values. So when I combine them into 1 variable, it will have 10 x 10 = 100 discrete values (each of them representing 1 tuple with two components (x, y) that it will be mapped to). I think this will make the learning way more difficult. – PeterBe Nov 08 '21 at 10:55
  • I upvoted and accepted your answer. Further, I awarded the bounty to you because of your tremendous help. I still hope that you will answer my 3 follow up questions. – PeterBe Nov 08 '21 at 18:00
  • 1) Yes, I think it will be much easier for you to control the mapping while letting the agent learn its own actions. It's similar to a video game: you don't focus on moving your fingers, you focus on the game, but your brain does the mapping without you realizing it. You can do it with PPO using the Stable Baselines library; you can see here that it accepts discrete and continuous (Box) action spaces: https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html – Pulse9 Nov 09 '21 at 08:32
  • 2) DQN only accepts discrete actions, as you can see here: https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html and SARSA also uses discrete actions, so I don't think you can map them to the range [0.25, 1] – Pulse9 Nov 09 '21 at 08:33
  • 3) I think it is fine; normal deep RL works with continuous actions, which is way more than 100 actions =), think about how many different values you can have between 2 real values ;). Try it, I think it will work great. Good luck! – Pulse9 Nov 09 '21 at 08:37
  • Thanks for your comments Pulse9. I still have several questions because it is still quite unclear to me what to do. 1) Why can I not use DQL or SARSA? You wrote in 2) "DQN only accepts discrete actions" and then in 3) "normal deep RL works with continuous actions". Actually, I can just discretize my continuous action space, so I should be able to use DQN. 2) Why should I not be able to map [0.25, 1]? I can just do any mapping with the method you described, e.g. if the discrete action is 0 then the value is 0; if the discrete action is 1 then the value is 0.25; if the action is 2, the value is 0.3, etc. – PeterBe Nov 09 '21 at 10:57
  • 1) If you can discretize your space, then it's fine; you can use DQL or SARSA. 2) I was talking about the mapping, so your agent can learn that positive values (0, 1] represent x > 0 and y = 0, and negative values [-1, 0) represent y > 0 and x = 0, but these actions are coded inside your environment; the agent only learns the action a in [-1, 1]. – Pulse9 Nov 14 '21 at 20:56
  • Thanks once again for your tremendous effort. I really appreciate it. – PeterBe Nov 15 '21 at 13:08
0

When you take a step in the DQL process, you sample a move based on the estimated qualities of each possible action. During that step, you can restrict your sampling method to have probability 0 of choosing the forbidden action.
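For instance, a hedged sketch of such a masked sampling step, assuming you can compute a boolean mask of the currently feasible discrete actions from the state (the function name and temperature parameter are illustrative assumptions):

import numpy as np

def sample_masked_action(q_values, valid_mask, temperature=1.0):
    # Softmax-sample an action from the Q-values; forbidden actions get
    # probability exactly 0 because their logits are set to -inf.
    logits = np.where(valid_mask, np.asarray(q_values, dtype=float) / temperature, -np.inf)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))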

nnolte
  • Thanks nnolte for your answer. I have to admit that I do not really understand how I can do what you suggest. Would you mind elaborating a little bit more on that? Here are some questions that I have: 1) How can I estimate the quality of each possible action, and where shall I do this? In the step function (of OpenAI Gym)? 2) How can I restrict the undesired actions to probability 0? 3) Is this approach not the same as having a superordinate additional controller that can overrule the actions of the RL agent? – PeterBe Nov 06 '21 at 11:49