This old question has no definitive answer yet, which is why I am asking it here again. I have also asked the same question here.
If I'm doing policy gradient in Keras, using a loss of the form:
rewards*cross_entropy(action_pdf, selected_action_one_hot)
How do I manage negative rewards?
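For concreteness, this is roughly how I build that loss; the function and variable names (`policy_gradient_loss`, `rewards`, and how the rewards get into the loss) are just illustrative, not my exact code:

```python
from tensorflow import keras

def policy_gradient_loss(rewards):
    """Reward-weighted categorical cross-entropy (sketch).

    `rewards` is assumed to be a tensor of per-sample returns aligned
    with the batch (e.g. fed in as an extra model input).
    """
    def loss(selected_action_one_hot, action_pdf):
        # Cross-entropy between the one-hot action actually taken and
        # the predicted action distribution.
        ce = keras.losses.categorical_crossentropy(
            selected_action_one_hot, action_pdf)
        # Scale each sample's cross-entropy by its reward; with negative
        # rewards this term goes negative, which is where my trouble starts.
        return rewards * ce
    return loss
```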
I've had success with this form in cases where the reward is always positive, but it does not train with negative rewards. The failure mode is that it drives itself to very confident predictions all the time, which, combined with the deviations induced for exploration, produces very large negative losses. I can get it to train by clipping rewards at zero, but that throws away a lot of valuable information (only carrots, no sticks).
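The zero-clipping workaround I mean is something like this (again just a sketch; `rewards` stands for whatever per-sample return array I compute elsewhere):

```python
import numpy as np

# Example per-episode returns; the negative entries are the problem cases.
rewards = np.array([1.0, -0.5, 2.0, -3.0])

# Keep positive rewards and zero out negative ones before they reach the
# loss above; this trains, but discards all of the "stick" signal.
clipped_rewards = np.clip(rewards, 0.0, None)   # -> [1.0, 0.0, 2.0, 0.0]
```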