
I know that policy gradients used in an environment with a discrete action space are updated with $$ \Delta \theta_{t}=\alpha \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) v_{t}, $$ where $v_t$ could be many things that represent how good the action was. I also know that this can be calculated by performing cross-entropy loss with the target being what the network would have outputted if it were completely confident in its action (zeros, with the index of the chosen action being one). But I don't understand how to apply that to policy gradients that output the mean and variance of a Gaussian distribution for a continuous action space. What is the loss for these types of policy gradients?

I tried keeping the variance constant and updating the output with mean-squared-error loss, with the target being the action the agent took. I thought this would end up pushing the mean towards actions with greater total rewards, but it got nowhere in OpenAI's Pendulum environment.

It would also be very helpful if this were described in terms of a loss function and a target, like how policy gradients with discrete action spaces can be updated with cross-entropy loss. That is how I understand it best, but it is okay if that is not possible.

Edit: for @Philipp. The way I understand it is that the loss function is the same with a continuous action space; the only thing that changes is the distribution from which we get the log-probs. In PyTorch we can use a Normal distribution for a continuous action space and a Categorical distribution for a discrete action space. The answer from David Ireland goes into the math, but in PyTorch it looks like `log_prob = distribution.log_prob(action_taken)` for any type of distribution. It makes sense that for bad actions we would want to decrease the probability of taking the action. Below is working code for both types of action spaces so you can compare them. The continuous action space code should be correct, but the agent will not learn, because it is harder to learn the right actions in a continuous action space and our simple method isn't enough; look into more advanced methods like PPO and DDPG.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions.categorical import Categorical #discrete distribution
import numpy as np
import gym
import math
import matplotlib.pyplot as plt

class Agent(nn.Module):
    def __init__(self,lr):
        super(Agent,self).__init__()
        self.fc1 = nn.Linear(4,64)
        self.fc2 = nn.Linear(64,32)
        self.fc3 = nn.Linear(32,2) #neural network with layers 4,64,32,2

        self.optimizer = optim.Adam(self.parameters(),lr=lr)

    def forward(self,x):
        x = torch.relu(self.fc1(x)) #relu for the hidden layers
        x = torch.relu(self.fc2(x))
        x = torch.sigmoid(self.fc3(x)) #sigmoid outputs; Categorical normalizes these into action probabilities
        return x

env = gym.make('CartPole-v0')
agent = Agent(0.001) #hyperparameters
DISCOUNT = 0.99
total = []

for e in range(500): 
    log_probs, rewards = [], []
    done = False
    state = env.reset()
    while not done:
        #mu = agent.forward(torch.from_numpy(state).float())
        #distribution = Normal(mu, SIGMA)
        distribution = Categorical(agent.forward(torch.from_numpy(state).float()))
        action = distribution.sample()
        log_probs.append(distribution.log_prob(action))
        state, reward, done, info = env.step(action.item())
        rewards.append(reward)
        
    total.append(sum(rewards))

    cumulative = 0
    d_rewards = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))): #get discounted rewards
        cumulative = cumulative * DISCOUNT + rewards[t]
        d_rewards[t] = cumulative
    d_rewards -= np.mean(d_rewards) #normalize
    d_rewards /= np.std(d_rewards)

    loss = 0
    for t in range(len(rewards)):
        loss += -log_probs[t] * d_rewards[t] #loss is - log prob * total reward

    agent.optimizer.zero_grad()
    loss.backward() #update
    agent.optimizer.step()

    if e%10==0:
        print(e,sum(rewards)) 
        plt.plot(total,color='blue') #plot
        plt.pause(0.0001)    


def run(i): #to visualize performance
    for _ in range(i):
        done = False
        state = env.reset()
        while not done:
            env.render()
            distribution = Categorical(agent.forward(torch.from_numpy(state).float()))
            action = distribution.sample()
            state,reward,done,info = env.step(action.item())
        env.close()

Above is the discrete action space code for CartPole, and below is the continuous action space code for Pendulum. Sigma (the standard deviation of the Gaussian, `Normal`'s second argument) is constant here, but learning it is easy: just make the final layer output two values and make sure the sigma output is strictly positive (a sketch of such a head follows this paragraph). Again, the Pendulum code won't learn, because most environments with continuous action spaces are too complicated for such a simple method; making it work would probably require a lot of hyperparameter tuning.
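Before the Pendulum code, here is a minimal sketch of what a learned-sigma head could look like; the layer sizes and the softplus floor are my own choices, not something taken from the code below:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianAgent(nn.Module):
    #sketch of a policy network that outputs both the mean and a strictly positive standard deviation
    def __init__(self):
        super(GaussianAgent, self).__init__()
        self.fc1 = nn.Linear(3, 64)
        self.fc2 = nn.Linear(64, 32)
        self.mu_out = nn.Linear(32, 1)    #mean of the Gaussian
        self.sigma_out = nn.Linear(32, 1) #raw value mapped to a positive standard deviation

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        mu = torch.tanh(self.mu_out(x)) * 2          #keep the mean in Pendulum's [-2, 2] action range
        sigma = F.softplus(self.sigma_out(x)) + 1e-5 #softplus keeps sigma positive; the small floor avoids sigma = 0
        return mu, sigma

You would then build the distribution with `Normal(mu, sigma)` instead of `Normal(mu, SIGMA)` and everything else stays the same.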

import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions.normal import Normal #continuous distribution
import numpy as np
import gym
import math
import matplotlib.pyplot as plt
import keyboard

class Agent(nn.Module):
    def __init__(self,lr):
        super(Agent,self).__init__()
        self.fc1 = nn.Linear(3,64)
        self.fc2 = nn.Linear(64,32)
        self.fc3 = nn.Linear(32,1) #neural network with layers 3,64,32,1

        self.optimizer = optim.Adam(self.parameters(),lr=lr)

    def forward(self,x):
        x = torch.relu(self.fc1(x)) #relu for the hidden layers
        x = torch.relu(self.fc2(x))
        x = torch.tanh(self.fc3(x)) * 2 #tanh scaled to Pendulum's action range of [-2, 2]
        return x

env = gym.make('Pendulum-v0')
agent = Agent(0.01) #hyperparameters
SIGMA = 0.2
DISCOUNT = 0.99
total = []

for e in range(1000): 
    log_probs, rewards = [], []
    done = False
    state = env.reset()
    while not done:
        mu = agent.forward(torch.from_numpy(state).float())
        distribution = Normal(mu, SIGMA)
        action = distribution.sample().clamp(-2.0,2.0)
        log_probs.append(distribution.log_prob(action))
        state, reward, done, info = env.step([action.item()])
        #reward = abs(state[1])
        rewards.append(reward)
        
    total.append(sum(rewards))

    cumulative = 0
    d_rewards = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))): #get discounted rewards
        cumulative = cumulative * DISCOUNT + rewards[t]
        d_rewards[t] = cumulative
    d_rewards -= np.mean(d_rewards) #normalize
    d_rewards /= np.std(d_rewards)

    loss = 0
    for t in range(len(rewards)):
        loss += -log_probs[t] * d_rewards[t] #loss is - log prob * total reward

    agent.optimizer.zero_grad()
    loss.backward() #update
    agent.optimizer.step()

    if e%10==0:
        print(e,sum(rewards)) 
        plt.plot(total,color='blue') #plot
        plt.pause(0.0001)
        if keyboard.is_pressed("space"): #holding space exits training
            raise Exception("Exited")


def run(i): #to visualize performance
    for _ in range(i):
        done = False
        state = env.reset()
        while not done:
            env.render()
            distribution = Normal(agent.forward(torch.from_numpy(state).float()), SIGMA)
            action = distribution.sample()
            state,reward,done,info = env.step([action.item()])
        env.close()

David Ireland also wrote this on a different question I had:

The algorithm doesn't change in this situation. Say your NN outputs the mean parameter of the Gaussian; then $\log \pi_\theta(a_t \mid s_t)$ is just the log of the normal density evaluated at the action you took, where the mean parameter of the density is the output of your NN. You are then able to backpropagate through this to update the weights of your network.

    You might want to first express your policy as a density function over the space of possible actions. If you have one of these for each state, and the states are a continuous space as well, you might want to look at Gaussian processes and the likes. – Robby Goetschalckx Sep 30 '20 at 22:36
  • @Robby I don’t understand exactly what you mean. Are you saying I should make the continuous action space discrete? Like if the range of possible actions is -1 to 1 then have '[-1,-0.5,0,0.5,1]' as the only possible actions? I already know how to do that, I am looking for a way to do it with continuous action spaces for problems where that would not work. – S2673 Oct 01 '20 at 01:34
  • No, you don't want it to be discrete. A policy over a continuous space is necessarily a probability density function. With enough training, this should converge to a Dirac delta function centered on the optimal action. If you parameterize a density, you can use that in the formula you have, and update the parameters according to the gradient. – Robby Goetschalckx Oct 01 '20 at 05:49
  • @RobbyGoetschalckx I think this is worth putting in an answer. Probably the most usual way to do this is to have the NN output the params of a Gaussian - mean and sd (or variance). Showing the gradient terms for the mean and sd outputs when choosing action=x from that distribution should answer the OP's question. – Neil Slater Oct 01 '20 at 06:49
  • @Robby The goal is for it to converge to a Dirac delta function. But what does that update look like with the network outputting the mean and variance of a Gaussian distribution? I am using the Gaussian distribution because it seems like the most common one and after a quick google search I couldn’t find an example of policy gradients with a different distribution function. – S2673 Oct 01 '20 at 11:39
  • @Neil Thanks, that is exactly what I am looking for. – S2673 Oct 01 '20 at 11:39
  • @S2673 Did you eventually manage to implement working code from the answer? I find it very difficult to wrap my head around the accepted answer.. – Philipp Mar 30 '21 at 16:59
  • @Philipp I did get it working and I can't right now but soon I will look at my code to remember how I solved it. – S2673 Mar 31 '21 at 02:37
  • @Philipp Sorry it took so long, but I finally edited my question to try to help you. I had forgotten a lot of this code. I'm not sure exactly what part you were having trouble with, so you can ask more questions if you need. After getting this right, you can code more advanced algorithms like PPO that can actually solve the environments. And I saw the question on your profile; I also thought that I could keep using cross-entropy loss in continuous environments, but I don't know if it works because there is not really a clear target. – S2673 Apr 03 '21 at 21:52
  • Thank you! This is the most helpful resource on this topic! One question: You calculate a single loss value for every episode based on the cumulative reward at each step and the negative log_prob. Then you back-propagate with this loss value. Could one also use the cumulative reward to find good episodes and then calculate the loss for each step in these good episodes? Then the loss would be just the negative log_prob. I have seen this style with the cross entropy approach, where one would find "elite" episodes and then calculate losses for each step in them. Why didn't you do it like this? – Philipp Apr 04 '21 at 13:40
  • @Philipp I didn't do it like that because I have never heard of it but it sounds interesting. I think it would be a good idea to experiment with. I just learned the method I use and it has worked for me but it sounds like there is no reason your strategy wouldn't work with a continuous action space if it worked in a discrete action space with cross-entropy. Do you have a link to an example? – S2673 Apr 04 '21 at 13:52
  • @Philipp Maybe we should also talk about this in the [chat](https://chat.stackexchange.com/rooms/122653/loss-for-continous-action-space-comment-discussion) instead. I've never tried this before so it might not work... – S2673 Apr 04 '21 at 14:01

1 Answer


This update rule can still be applied in the continuous domain.

As pointed out in the comments, suppose we parameterise our policy using a Gaussian distribution: our neural network takes as input the state we are in and outputs the parameters of a Gaussian distribution, the mean and the standard deviation, which we will denote by $\mu(s, \theta)$ and $\sigma(s, \theta)$, where $s$ shows the dependency on the state and $\theta$ are the parameters of our network.

I will assume a one-dimensional case for ease of notation but this can be extended to multi-variate cases. Our policy is now defined as $$\pi(a_t | s_t) = \frac{1}{\sqrt{2\pi \sigma(s_t, \theta)^2}} \exp\left(-\frac{1}{2}\left(\frac{a_t - \mu(s_t, \theta)}{\sigma(s_t, \theta)}\right)^2\right).$$

As you can see, we can easily take the logarithm of this and find its derivative with respect to $\theta$, so nothing changes and the loss you use is the same. You simply evaluate the derivative of the log of your policy with respect to the network parameters, multiply by $v_t$ and $\alpha$, and take a gradient step in this direction.
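For concreteness, here is the log of this density and its gradient terms with respect to the mean and standard deviation (this is just the standard Gaussian score function, written out under the assumption that the network outputs $\mu$ and $\sigma$ directly): $$\log \pi(a_t \mid s_t) = -\frac{\left(a_t - \mu(s_t, \theta)\right)^2}{2\sigma(s_t, \theta)^2} - \log \sigma(s_t, \theta) - \frac{1}{2}\log 2\pi,$$ $$\frac{\partial \log \pi(a_t \mid s_t)}{\partial \mu} = \frac{a_t - \mu(s_t, \theta)}{\sigma(s_t, \theta)^2}, \qquad \frac{\partial \log \pi(a_t \mid s_t)}{\partial \sigma} = \frac{\left(a_t - \mu(s_t, \theta)\right)^2}{\sigma(s_t, \theta)^3} - \frac{1}{\sigma(s_t, \theta)},$$ and the chain rule through $\mu(s_t, \theta)$ and $\sigma(s_t, \theta)$ gives $\nabla_\theta \log \pi(a_t \mid s_t)$.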

To implement this (as I'm assuming you don't want to calculate the NN derivatives by hand), you could do something along the lines of the following in PyTorch.

First you pass your state through your NN to get the mean and standard deviation of the Gaussian distribution. Then you simulate $z \sim N(0,1)$ and calculate $a = \mu(s,\theta) + \sigma(s, \theta) \times z$, so that $a \sim N(\mu(s, \theta), \sigma(s, \theta))$ -- this is the reparameterisation trick, which makes backpropagation through the network easier because it takes the randomness from a source that doesn't depend on the parameters of the network. $a$ is the action you execute in your environment, and you use it to calculate the gradient by simply writing the code `torch.log(normal_pdf(a, mu, sigma)).backward()` -- here `normal_pdf()` is any function in Python that calculates the pdf of a normal distribution for a given point and parameters, and `mu` and `sigma` are the network outputs $\mu(s, \theta)$ and $\sigma(s, \theta)$.
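A minimal, self-contained sketch of that single update step is below. The toy mean-only network, the fixed sigma and the dummy state/return are placeholders of my own, not part of the question's code; the sampled action is detached so that the gradient that flows back is the usual score-function gradient of the log-density.

import math
import torch
import torch.nn as nn

def normal_pdf(a, mu, sigma):
    #density of a N(mu, sigma^2) distribution evaluated at the point a
    return torch.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

#toy mean-only policy and dummy inputs, just to make the update step concrete
mean_net = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1))
optimiser = torch.optim.Adam(mean_net.parameters(), lr=1e-3)
state = torch.randn(3)     #stand-in for an environment observation
sigma = torch.tensor(0.2)  #fixed standard deviation, as in the question's Pendulum code
v_t = 1.0                  #stand-in for the return/advantage of the action taken

mu = mean_net(state)           #mean of the Gaussian policy
z = torch.randn_like(mu)       #z ~ N(0, 1), drawn independently of the network parameters
a = (mu + sigma * z).detach()  #sampled action, treated as a constant when differentiating

loss = -torch.log(normal_pdf(a, mu, sigma)) * v_t  #negative log-prob weighted by how good the action was
optimiser.zero_grad()
loss.backward()  #gradient flows through mu (and through sigma too, if it were learned)
optimiser.step()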

  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackexchange.com/rooms/118263/discussion-on-answer-by-david-ireland-what-is-the-loss-for-policy-gradients-with). – nbro Jan 09 '21 at 22:25
  • A log-normal distribution is not the same as taking the log of the pdf of the normal distribution; it is taking the log of the random variable. – David Mar 31 '21 at 11:08
  • Ah sorry, also just realized that and deleted my comment, but thanks! – Philipp Mar 31 '21 at 11:09
  • @DavidIreland I modified my neural net so that the output now has two values (mu and sigma). I saw others implement them as multiple output layers with different activation functions (tanh for mu and softplus for sigma). Is that necessary, or can I just have an output layer with two neurons for mu and sigma? – Philipp Apr 01 '21 at 10:48
  • My main problem, however, is that I do not really understand why you say to just calculate the gradients of the probability density function (the pdf; the function for the policy, correct?). Usually I calculate some loss from the output of my neural net and a target output, and then call `loss.backward()` and then `optimizer.step()`. How do I include a loss and an optimizer with this probability density function? Could you please explain this? Maybe in an edit to your answer.. – Philipp Apr 01 '21 at 10:48
  • I imagine that when you have seen things like tanh for mu, it is for stability. For sigma it is necessary that sigma is non-zero, so you will need to force it to be non-zero. I am not an expert on the numerical stability of neural networks and such, but I have had problems in the past, in particular when predicting sigma, because it is very unstable. What I have seen people do (as usually in ML you end up taking the log of a density) is to directly output the log of the variance; this way you don't need to worry about it being strictly bigger than 0 (a sketch of this parameterisation appears after these comments). – David Apr 01 '21 at 11:56
  • You can see in the code I attached that I take `backward()` of the log of the pdf. I could similarly have called `loss = torch.log(normal_pdf(a, mu, sigma))` and then called `loss.backward()`, and this would be using the more traditional variable names. Note that when you do `optimiser.step()` gradient _descent_ is performed, so I should technically call `loss = -torch.log(normal_pdf(a, mu, sigma))`, as then minimising this will be equivalent to maximising the RL objective. Please let me know if this makes sense. – David Apr 01 '21 at 12:05
  • @DavidIreland That makes sense. So we need to take the negative of the log_prob to transform the concave curve (n-shaped) into a convex curve (u-shaped), for which we can find the minimum. This assumes that the minimum of this curve represents the best mu, and the optimal sigma there would be zero, right? Also, why do we need to apply the log to the pdf in the first place? Wouldn't it be sufficient to calculate the loss as the negative of the pdf itself (given a mu, sigma and action)? – Philipp Apr 04 '21 at 14:04
  • @Philipp My advice would be to not think about curve fitting when thinking of RL; it is a very different paradigm to supervised/unsupervised learning. My main advice would be to make sure you go through all the theory (Sutton and Barto is always my go-to recommendation) so that you understand why we do what we do. The reason we take the log of the policy is because of how the policy gradient is derived; it is difficult to explain in this little comment section, but if you follow the derivation of the policy gradient and how we end up at the REINFORCE update rule, it'll make sense. – David Apr 04 '21 at 19:55
  • @Philipp Essentially we want to maximise $J(\theta) = v_\pi(s_0)$, where $\theta$ are the policy parameters. We can show that $\nabla_\theta J(\theta) = \mathbb{E}_\mu [ G_t \nabla_\theta \pi_\theta(A|S)]$, where $\mu$ is the state distribution induced by the policy. We can make the expectation also over the action space by writing $\nabla_\theta J(\theta) = \mathbb{E}_{\pi, \mu} [ G_t \frac{\nabla_\theta \pi_\theta(A|S)}{\pi_\theta(A|S)}]$, where the fraction is equal to $\nabla_\theta \log \pi_\theta(A|S)$; hopefully this clears things up for you :-) – David Apr 04 '21 at 20:00
  • @DavidIreland If the policy is a Gaussian density function, the gradients have variance in the denominator. How can the algorithm be numerically stable when the learned variance approaches zero? – Cloudy Jan 12 '22 at 06:44
  • In practice you would either not learn the variance (because in general learning a variance can be unstable) or you would clip it to make sure it doesn't ever reach 0. – David Jan 12 '22 at 09:28
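For completeness, here is a minimal sketch of the log-variance parameterisation mentioned in the comments above; the class and layer names are hypothetical, not from the answer:

import torch
import torch.nn as nn

class LogVarHead(nn.Module):
    #hypothetical head that predicts the log of the variance instead of sigma itself
    def __init__(self, hidden=32):
        super().__init__()
        self.mu_out = nn.Linear(hidden, 1)
        self.log_var_out = nn.Linear(hidden, 1)

    def forward(self, h):
        mu = self.mu_out(h)
        log_var = self.log_var_out(h)     #unconstrained output, so no positivity constraint is needed
        sigma = torch.exp(0.5 * log_var)  #sigma = exp(log_var / 2) is always strictly positive
        return mu, sigma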