
I understand that this is the update for the parameters of a policy in REINFORCE:

$$ \Delta \theta_{t}=\alpha \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) v_{t}, $$ where $v_t$ is usually the discounted future reward and $\pi_{\theta}\left(a_{t} \mid s_{t}\right)$ is the probability of taking the action that the agent took at time $t$. (Tell me if something is wrong here.)

However, I don't understand how to implement this with a neural network.

Let's say that `probs = policy.feedforward(state)` returns the probabilities of taking each action, like `[0.6, 0.4]`. `action = choose_action_from(probs)` returns the index of the action that was sampled. For example, if it sampled the action with probability 0.6, `action` would be 0.

When it is time to update the parameters of the policy network, what should we do? Should we do something like the following?

gradient = policy.backpropagate(total_discounted_reward * log(probs[action]))
policy.weights += gradient

And do I only backpropagate this through one output neuron?

Which loss function should I use in this case? What would the labels be?

If you need more explanation, I have this question on SO.

S2673
  • I use a cross entropy loss when I implement this in PyTorch. I tried doing a forward pass of the network evaluated at the current action and taking the log but it didn't seem to work when I implemented it, so I stuck with cross entropy loss. – David Sep 16 '20 at 16:31
  • Is this question related to a specific library (e.g. PyTorch or TensorFlow)? Or is it a question about general pseudocode? – Hai Nguyen Sep 16 '20 at 16:42
  • @David Ireland Cross entropy with what? What would be the label, y hat, or correct answer in the equation? – S2673 Sep 16 '20 at 16:51
  • @Hai Nguyen It is a lot about the general pseudocode, but if you click on the link you can see I built my own neural network out of just NumPy. – S2673 Sep 16 '20 at 16:53
  • 2
    @S2673 It is cross-entropy loss using as the label the action you actually took during sampling. Multiplying by the value function is really critical addition though, it can even reverse the sign of all the gradients (which makes sense - if you have chosen an action and it was a bad choice, you want to reduce the chance of taking it again) – Neil Slater Sep 16 '20 at 17:05
  • @Neil Slater So if you had the probabilities `[0.7,0.3]` and the policy chose `0.7`, then you would compute the cross-entropy loss between `[0.7,0.3]` and the label `[1,0]`, then multiply the gradient by the total reward? And is that what happens in REINFORCE, or is that something different? – S2673 Sep 16 '20 at 17:17
  • @S2673: Yes, that is essentially what happens in REINFORCE. You can (and often would) choose a different multiplier, with baselines, but basic REINFORCE just uses the measured return. A common choice is some estimate of advantage (action value minus state value). – Neil Slater Sep 16 '20 at 17:23
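
A small numeric sketch of the point above (illustrative numbers and hypothetical variable names): for a softmax policy, the gradient of the cross-entropy loss with respect to the logits is `probs - one_hot(action)`, and REINFORCE weights that by the return, so a negative return flips the direction of the update.

```python
import numpy as np

# Numbers from the comment above: softmax output and the sampled action.
probs = np.array([0.7, 0.3])          # policy output for the current state
action = 0                            # index of the action actually taken
label = np.eye(len(probs))[action]    # one-hot "label" [1., 0.]

# For a softmax output layer, the gradient of the cross-entropy loss
# -log(probs[action]) with respect to the logits is simply:
dlogits_ce = probs - label            # [-0.3, 0.3]

# REINFORCE weights this by the return G (loss = -G * log probs[action]).
# With a negative return the sign flips, so gradient descent then *lowers*
# the probability of the action that led to the bad outcome.
G = -2.0
dlogits = G * dlogits_ce              # [0.6, -0.6]
```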

1 Answer


The loss function you are looking for is cross-entropy loss. The 'label' that you use is the action you took at the time step you are updating for.
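
As a minimal sketch of what that looks like in practice (assuming a PyTorch setup, with a hypothetical tiny network and made-up episode data): `F.cross_entropy` with the sampled actions as labels gives $-\log \pi_\theta(a_t \mid s_t)$, and each term is weighted by the discounted return before averaging.

```python
import torch
import torch.nn.functional as F

# Hypothetical tiny policy network: state (4 features) -> logits for 2 actions.
policy = torch.nn.Sequential(
    torch.nn.Linear(4, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 2),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Assume one finished episode has been collected:
# states: (T, 4), actions: (T,), returns: (T,) holding the discounted returns G_t.
states = torch.randn(5, 4)
actions = torch.tensor([0, 1, 1, 0, 1])
returns = torch.tensor([1.9, 1.2, 0.5, -0.3, -1.0])

logits = policy(states)                                            # (T, 2)
# Cross-entropy with the taken actions as labels = -log pi(a_t | s_t).
neg_log_prob = F.cross_entropy(logits, actions, reduction="none")  # (T,)
loss = (returns * neg_log_prob).mean()                             # REINFORCE loss

optimizer.zero_grad()
loss.backward()   # the softmax couples all outputs, so gradients reach every logit
optimizer.step()
```

Minimizing `returns * (-log pi)` by gradient descent is the same update as the gradient-ascent rule in the question, and because of the softmax the gradient does not flow through only the chosen output neuron.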

David
  • Okay, thank you. It was not working with my code so I had to build a new, simpler one to test it and it worked. That’s why it took so long. There’s just one more thing... How do I know to do this given the equation in my question? Or is this not what the equation meant? – S2673 Sep 18 '20 at 22:06
  • I should have clarified in my answer that I was referring to the [cross entropy loss in pytorch](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html); if you read the documentation, you will see that this loss function is the negative log of the policy for the action you give it. – David Sep 19 '20 at 23:07
  • Thanks. Do you, by any chance, know how this works for a policy gradient with a continuous action space where the neural network outputs a Gaussian distribution? The equation for this update is the same but it is the log of a different output. – S2673 Sep 25 '20 at 01:20
  • 1
    @S2673 The algorithm doesn't change in this situation. Say your NN outputs the mean parameter of the Gaussian, then $\log_\pi(a_t | s_t)$ is just the log of the normal density evaluated at the action you took where the mean parameter in the density is the output of your NN. You are then able to backpropagate through this to update the weights of your network. – David Jan 07 '21 at 11:22
  • Thank you. I had figured this out a bit ago. – S2673 Jan 08 '21 at 03:24
  • @S2673 Apologies, I forgot to reply sooner – David Jan 08 '21 at 10:34
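
A sketch of the continuous-action case discussed above (again assuming PyTorch, with a hypothetical network and made-up episode data): the network outputs the mean of a Gaussian, and $\log \pi_\theta(a_t \mid s_t)$ is the log density of that Gaussian evaluated at the action that was actually taken.

```python
import torch
from torch.distributions import Normal

# Hypothetical policy: state (4 features) -> mean of a 1-D Gaussian action.
mean_net = torch.nn.Sequential(
    torch.nn.Linear(4, 32),
    torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)
log_std = torch.zeros(1, requires_grad=True)   # learnable log standard deviation
optimizer = torch.optim.Adam(list(mean_net.parameters()) + [log_std], lr=1e-3)

# One collected episode: states (T, 4), the actions that were sampled, returns G_t.
states = torch.randn(5, 4)
actions = torch.randn(5, 1)
returns = torch.tensor([[1.9], [1.2], [0.5], [-0.3], [-1.0]])

dist = Normal(mean_net(states), log_std.exp())   # Gaussian policy pi(. | s_t)
log_prob = dist.log_prob(actions)                # log pi(a_t | s_t), shape (T, 1)
loss = -(returns * log_prob).mean()              # REINFORCE loss

optimizer.zero_grad()
loss.backward()    # backpropagates through the mean (and log_std) of the Gaussian
optimizer.step()
```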