
This old question has no definitive answer yet, which is why I'm asking it here again. I also asked this same question here.

If I'm doing policy gradient in Keras, using a loss of the form:

    rewards * cross_entropy(action_pdf, selected_action_one_hot)

How do I manage negative rewards?

I've had success with this form in cases where the reward is always positive, but it does not train with negative rewards. The failure mode is that it drives itself to very confident predictions all the time, which produces very large negative losses whenever exploration forces a deviation from those predictions. I can get it to train by clipping rewards at zero, but that throws away a lot of valuable information (only carrots, no sticks).
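
For concreteness, here's a minimal sketch of the kind of training step I mean (tf.keras; the names like train_step and action_pdf are illustrative, not my actual code):

    import tensorflow as tf

    # Sketch of the reward-weighted cross-entropy update described above
    # (illustrative names; tf.keras assumed).
    def train_step(model, optimizer, states, selected_action_one_hot, rewards):
        with tf.GradientTape() as tape:
            action_pdf = model(states, training=True)        # softmax over actions
            ce = tf.keras.losses.categorical_crossentropy(
                selected_action_one_hot, action_pdf)         # = -log pi(a|s)
            loss = tf.reduce_mean(rewards * ce)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss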

Mastiff
  • If you have a question that has already been asked but no good answer has been given, one way to drive attention to that question is to [start a bounty](https://stackoverflow.com/help/bounty). – nbro Nov 01 '20 at 22:52

1 Answer


You don't need to handle negative rewards separately; if the algorithm is implemented correctly, it will work whether the rewards are negative or not. You also seem to be using the immediate reward in the loss, but you should be using the return, which is the sum of the rewards from that state-action pair until the end of the trajectory.
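
A minimal sketch of how those returns could be computed from the per-step rewards of a finished episode (function and variable names are illustrative, not from this answer; gamma = 1 gives the plain undiscounted sum described above):

    import numpy as np

    # Compute the return G_t for every step t of one finished episode:
    # the (optionally discounted) sum of rewards from t to the end.
    def returns_from_rewards(rewards, gamma=1.0):
        returns = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    # returns_from_rewards([0.0, -1.0, 2.0]) -> [1.0, 1.0, 2.0]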

You also seem to be missing a $-$ sign in the loss. The objective function for the vanilla policy gradient algorithm (REINFORCE), which we want to maximize, is
\begin{equation} J = \sum_a \pi(a|s) q_{\pi}(s, a) \end{equation}
It can be shown that a sample of the gradient of this objective is
\begin{equation} \nabla J = G_t \nabla \log (\pi(A_t|S_t)) \end{equation}
so in TensorFlow you should define your loss as
\begin{equation} J = - G_t \log (\pi(A_t|S_t)) \end{equation}
We need the $-$ because TensorFlow uses minimizers, and minimizing this loss is the same as maximizing the objective function. In conclusion, code similar to what you wrote should be
    -return * cross_entropy(action_pdf, selected_action_one_hot)

EDIT

As pointed out in the comments, we don't actually need the $-$ sign, because it is already included in the cross_entropy function.
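
To make the sign convention concrete, here is a small check (illustrative values, tf.keras assumed): categorical_crossentropy already returns $-\log \pi(A_t|S_t)$, so multiplying it by the return and minimizing is exactly minimizing $-G_t \log (\pi(A_t|S_t))$, with no extra minus sign, and negative returns need no special treatment.

    import tensorflow as tf

    # tf.keras's categorical cross-entropy of a one-hot target against the
    # policy's softmax output is -log pi(a|s) for the selected action.
    action_pdf = tf.constant([[0.7, 0.2, 0.1]])   # policy output for one state
    one_hot    = tf.constant([[1.0, 0.0, 0.0]])   # the action that was taken
    neg_log_prob = tf.keras.losses.categorical_crossentropy(one_hot, action_pdf)
    # neg_log_prob ~= 0.357 == -log(0.7)

    G_t = -2.0                      # a negative return is handled like any other
    loss = G_t * neg_log_prob       # minimizing this == maximizing G_t * log pi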

Brale
  • Thanks. I'm not sure about the sign as you have it. The cross_entropy loss in Keras has an implied negative sign relative to the normal definition of cross-entropy. My models train without the minus sign. See also: https://stackoverflow.com/a/56893454/2364295 – Mastiff May 18 '20 at 18:39
  • But regardless of the above, my issue is with negative rewards (or returns). I implement exploration by deviating from the model's PDF output in some cases. If the model outputs nearly 100% confident actions and the exploration logic deviates, a negative return can lead to a hugely negative loss. Perhaps my exploration approach is invalid. – Mastiff May 18 '20 at 18:42
  • You're right about the $-$, I edited the answer. As for the returns: are you saying that the model's performance drops when you choose an action that's not the one given by the model? How do you choose it, do you sample it from the output distribution? – Brale May 18 '20 at 19:11
  • Also, make sure you clip the gradients, because large returns will cause large parameter updates, and the policy may break if an update deviates too far from the current policy. – Brale May 18 '20 at 19:16
  • When training I allow the model to explore by doing a proportional draw from the model provided softmax PDF. The detail is that I also broaden the PDF a bit from the model version so it doesn't collapse and stop exploring (by adding a small value and renormalizing). If the action I choose happens to have zero probability from the model, and negative return, the loss will be hugely negative. I'm experimenting now with putting the PDF softening into the model itself, controlled by an input, so I can have the selected action always be consistent with the model. – Mastiff May 18 '20 at 19:50
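
For reference, a minimal sketch (not the commenter's actual code; eps is an assumed parameter) of the exploration scheme described in the last comment, where the model's softmax is blended with a small constant and renormalized before sampling:

    import numpy as np

    # Broaden the model's softmax so no action ever has exactly zero
    # probability under the behaviour policy, then sample from it.
    def sample_action(action_pdf, eps=0.01):
        broadened = action_pdf + eps
        broadened = broadened / broadened.sum()
        return np.random.choice(len(broadened), p=broadened)

Note that if the loss is still computed against the un-broadened model output, a sampled action whose model probability is near zero makes $-\log \pi$ very large, which is the blow-up described in the question.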