This is my first question on this forum, so hello everyone. I am trying to implement a DDQN agent that plays Othello (Reversi). I have tried multiple things, but the agent, which appears to be initialized correctly, does not learn against a random opponent: the win rate stays around 50-60% over nearly 500 games. Generally, whatever level it reaches after the first 20-50 episodes, it stays there. I have doubts about the learning process and about how to decide when the agent is trained. The current flow is as follows:
- Initialize the game state.
- Using an epsilon-greedy policy, choose an action from the moves currently legal in that state (see the sketch after this list).
- Let the opponent make its move.
- Compute the reward as the number of flipped discs that remain after the opponent's move.
- Save the transition to the replay buffer.
- If the replay buffer contains at least a batch worth of transitions, perform a training step.
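Here is a simplified sketch of that loop. The `env` and `agent` method names below (`legal_actions`, `predict`, `train_on_batch`, etc.) are placeholders I use for illustration, not the exact identifiers from my repository:

```python
import random
from collections import deque

import numpy as np


def run_training(env, agent, episodes=500, batch_size=64, buffer_size=50_000,
                 eps_start=1.0, eps_end=0.05, eps_decay=0.995):
    """Outline of the per-episode flow described in the list above."""
    replay_buffer = deque(maxlen=buffer_size)
    epsilon = eps_start

    for episode in range(episodes):
        state = env.reset()                      # initial board state
        done = False

        while not done:
            legal_moves = env.legal_actions()    # indices of moves legal right now
            # Epsilon-greedy over *legal* moves only.
            if random.random() < epsilon:
                action = random.choice(legal_moves)
            else:
                q_values = agent.predict(state)          # Q-values for all squares
                masked = np.full_like(q_values, -np.inf)  # illegal moves get -inf
                masked[legal_moves] = q_values[legal_moves]
                action = int(np.argmax(masked))

            # step() applies my move, then the random opponent's reply, and
            # returns the reward (my flipped discs that survived the reply).
            next_state, reward, done = env.step(action)

            replay_buffer.append((state, action, reward, next_state, done))
            state = next_state

            if len(replay_buffer) >= batch_size:
                batch = random.sample(replay_buffer, batch_size)
                agent.train_on_batch(batch)      # one gradient step (DDQN update)

        epsilon = max(eps_end, epsilon * eps_decay)
        agent.maybe_update_target()              # periodic target-network sync
```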
What I do not know is how to decide when to stop training. Previously, this agent trained against a MinMax opponent learned to win 100% of its games, but only because MinMax played exactly the same way every time; I would like the agent to generalize the game instead. Right now I save the network weights after each won game, but I do not think that matters (see the evaluation idea below). I cannot see this agent finding a policy and improving over time. The whole code for the environment, agent and training loop can be found here: https://github.com/MikolajMichalski/RL_othello_mgr I would appreciate any help. I would like to understand how RL works :)
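One thing I am considering instead of saving after every won game is a periodic greedy evaluation (no exploration) against the random opponent, and keeping the best snapshot. This is only a sketch using the same hypothetical interface as above; `evaluate_vs_random` and `env.winner()` are names I made up for illustration:

```python
import random

import numpy as np


def evaluate_vs_random(env, agent, num_games=100):
    """Play num_games greedily (epsilon = 0) and return the win rate."""
    wins = 0
    for _ in range(num_games):
        state = env.reset()
        done = False
        while not done:
            legal_moves = env.legal_actions()
            q_values = agent.predict(state)
            masked = np.full_like(q_values, -np.inf)
            masked[legal_moves] = q_values[legal_moves]
            state, _, done = env.step(int(np.argmax(masked)))
        if env.winner() == env.AGENT:            # hypothetical winner accessor
            wins += 1
    return wins / num_games


# Rough usage idea inside the training loop:
# if episode % 50 == 0:
#     win_rate = evaluate_vs_random(env, agent)
#     if win_rate > best_win_rate:
#         best_win_rate = win_rate
#         agent.save_weights("best_weights")
```

Would tracking this evaluation win rate over time be the right way to tell whether the agent is actually improving and when to stop?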