
I am attempting to implement an agent that learns to play Pong. The environment was created in PyGame and returns the pixel data and score at each frame. A CNN takes a stack of the last 4 frames as input and predicts the best action to take. At each timestep I also train on a minibatch of experiences sampled from an experience replay buffer.
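For concreteness, the experience-replay part of the setup described above might look something like the following minimal sketch. The `ReplayBuffer` class, its capacity, and the `(state, action, reward, next_state, done)` tuple layout are my assumptions for illustration, not details from the original post:

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-capacity buffer of transitions; old experiences are evicted first."""

    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        # state/next_state would each be a stack of the last 4 frames
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniformly sample a minibatch and unzip it into per-field arrays
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones
```

At each timestep the agent would call `add` with the latest transition and then train the CNN on one `sample` minibatch.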

I have seen an implementation where the game returned a reward of +10 each time the bot returned the ball and -10 each time it missed the ball.

My question is whether it would be better to reward the bot significantly for getting the ball past the opponent, which ends the episode. I was thinking of rewarding +10 for winning the episode, -10 for missing the ball, and +5 for returning the ball.
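The proposed scheme could be sketched as a small reward function like the one below. The event flags (`won_episode`, `missed_ball`, `returned_ball`) are hypothetical names for signals the PyGame environment would need to expose; they are not from the original post:

```python
def shaped_reward(won_episode, missed_ball, returned_ball):
    """Reward scheme proposed above: +10 win, -10 miss, +5 return, else 0."""
    if won_episode:
        return 10.0
    if missed_ball:
        return -10.0
    if returned_ball:
        return 5.0
    return 0.0
```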

Please let me know if my approach is sensible, has any glaring problems, or if I need to provide more information.

Thank you!

  • Could you link the implementation you saw? I think the canonical version of Pong in RL just uses the game points scoring, i.e. +1 if you score, -1 if opponent scores. However, as a researcher you are free to come up with different ways of training an agent. – Neil Slater Apr 13 '19 at 21:10
  • Do you have a specific reason for asking this question abstractly? If you can try both policies and compare them, why not just do this to directly answer the question? – Mathieu Bouville Apr 14 '19 at 06:19
