
This is my first question on this forum and I would like to welcome everyone. I am trying to implement a DDQN agent that plays Othello (Reversi). I have tried multiple things, but the agent, which seems to be properly initialized, does not learn against a random opponent: it wins only about 50-60% of nearly 500 games. Generally, whatever win rate it reaches after the first 20-50 episodes, it stays at that level. I have doubts about the learning process and about how to decide when the agent is trained. The current flow is as follows (a minimal sketch of this loop appears after the list):

  1. Initialize game state.
  2. Using an epsilon-greedy policy, choose an action from the actions currently available in the game state.
  3. Let the opponent make its move.
  4. Compute the reward as the number of flipped pieces that remain after the opponent's move.
  5. Save the transition to the replay buffer.
  6. If the number of elements in the replay buffer is equal to or greater than the batch size, do a training step.
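
For reference, here is a minimal sketch of that loop in Python. The `env`, `agent`, `opponent` and `replay_buffer` objects and their methods (`reset`, `legal_actions`, `step`, `select_action`, `flipped_pieces_remaining`, `push`, `sample`, `train_step`) are assumed interfaces standing in for the corresponding pieces of the linked repository, not its actual API.

    def run_episode(env, agent, opponent, replay_buffer, batch_size):
        """One training episode following the flow above.

        `env`, `agent`, `opponent` and `replay_buffer` are assumed interfaces,
        not the actual classes from the linked repository.
        """
        state = env.reset()                                   # 1. initialize game state
        done = False
        while not done:
            legal = env.legal_actions(state)
            action = agent.select_action(state, legal)        # 2. epsilon-greedy over legal moves
            next_state, done = env.step(action)               # agent's move
            if not done:
                opp_legal = env.legal_actions(next_state)
                opp_action = opponent.select_action(next_state, opp_legal)
                next_state, done = env.step(opp_action)       # 3. opponent's move
            reward = env.flipped_pieces_remaining(next_state)  # 4. reward after the opponent's reply
            replay_buffer.push(state, action, reward, next_state, done)  # 5. store the transition
            if len(replay_buffer) >= batch_size:              # 6. train once enough samples exist
                agent.train_step(replay_buffer.sample(batch_size))
            state = next_state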

What I do not know is how to decide when to stop training. Previously, this agent trained against a MinMax opponent and learned to win 100% of games, because MinMax played exactly the same way every time. I would like the agent to generalize instead. Right now I save the network weights after a won game, but I don't think that matters. I can't see this agent finding a policy and improving over time. The whole code for the environment, agent and training loop can be found here: https://github.com/MikolajMichalski/RL_othello_mgr I would appreciate any help. I would like to understand how RL works :)
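
One common way to decide when to stop is to evaluate periodically with exploration switched off: every few hundred training episodes, play a fixed number of games greedily against a random opponent, track the win rate, and keep the weights with the best evaluation score so far. Below is a minimal sketch of that idea; `play_greedy_game` and `agent.save_weights` are hypothetical helpers, not functions from the repository, and `run_episode` refers to the sketch above.

    def train_with_evaluation(env, agent, opponent, replay_buffer, batch_size,
                              num_episodes, eval_every=500, eval_games=100):
        """Train with run_episode and periodically evaluate the greedy policy.

        `play_greedy_game` is an assumed helper that plays one game with
        epsilon = 0 and returns True if the agent wins; it is not part of
        the linked repository.
        """
        best_win_rate = 0.0
        for episode in range(1, num_episodes + 1):
            run_episode(env, agent, opponent, replay_buffer, batch_size)
            if episode % eval_every == 0:
                wins = sum(play_greedy_game(env, agent, opponent) for _ in range(eval_games))
                win_rate = wins / eval_games
                if win_rate > best_win_rate:              # keep the best weights seen so far
                    best_win_rate = win_rate
                    agent.save_weights("best_weights")    # hypothetical save method
        return best_win_rate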

  • Hi Mikołaj and welcome! The best strategy for the game of Reversi is somewhat counterintuitive: it is actually preferable to have few pieces of your color on the board in the opening and midgame, and only flip the situation in your favor at the very end. The way you currently define your reward function encourages the agent to be greedy in the beginning, which may lead to losing control of the corners (which are important). You want the agent to learn to win by itself, so your reward function should only award points for winning, drawing or losing (a terminal-only reward sketch is included after these comments). – mark Dec 23 '20 at 16:07
  • Hi, thank you for the answer. I changed the reward to be given only after the episode is finished; it is now computed as learning_agent_score - opponent_score. I can see a bit of improvement, but it is not satisfying enough. I may experiment with hyperparameters. I have another question... Is training against a random opponent the proper way to do this? I thought about setting the opponent's weights to the learning agent's weights every time the win percentage gets higher, so that the trained agent plays against a better opponent over time (a sketch of that weight-copying idea also follows the comments). – Mikołaj Michalski Dec 28 '20 at 15:57
  • To the best of my knowledge, self-play would be preferred over playing against a random opponent in terms of learning efficiency, since the state space is huge. Training the network against itself, using some MinMax variation, should let it pick up some strategies. You can then evaluate the performance of the network against a random opponent. Though, it might take a good while before the agent learns anything, as the game is very complex (less so than chess, but comparable). – mark Dec 28 '20 at 17:06
  • I did some experiments with hyperparameters and did not see much progress, so I am coming back with more questions :) 1. What would be the best network architecture in this case? Mine is: 1st layer - input of size 64 with ReLU, 2nd layer - hidden layer of size 32 with ReLU, 3rd layer - size 64, linear, and a 4th layer with softmax. I am not sure if this is the proper approach, but I can't find a good source on building a good network architecture. 2. Currently I build training minibatches from random samples. Is that fine, given that the state sequences within each game are highly dependent? – Mikołaj Michalski Jan 10 '21 at 13:57
  • And another one: 3. Should the training be done after each episode or each step? – Mikołaj Michalski Jan 10 '21 at 14:05
  • Hello again! I worked on a similar problem before: I tried to solve the mancala game with DDQN (which I could not solve with that algorithm, and I am currently trying some others). For the network architecture I only found vague suggestions; this topic does not seem to be well understood within the community. I'm inclined to think that you may need more neurons in each layer (a couple of hundred), because the neurons also need to capture relations between pieces, and that is a huge space (an architecture sketch along those lines follows the comments). – mark Jan 11 '21 at 14:42
  • On the second question: yes, that's fine. DDQN should converge to the optimal policy in theory, regardless of how you feed transitions to it. On the third: training should be done at the end of the episode, provided there are enough samples. – mark Jan 11 '21 at 14:42
  • I would also suggest testing your algorithm on a simpler game as a sanity check. Maybe not as simple as tic-tac-toe, but something with a smaller state/action space, in which it is easier for the agent to find patterns. – mark Jan 11 '21 at 14:45
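
Regarding the reward shaping discussed in the first comments, here is a minimal sketch of a terminal-only reward (+1 for a win, 0 for a draw, -1 for a loss, and 0 on every non-terminal step). The function name and arguments are illustrative, not taken from the repository.

    def terminal_reward(agent_score, opponent_score, done):
        """Reward only at the end of the game: +1 win, 0 draw, -1 loss."""
        if not done:
            return 0.0               # no intermediate shaping
        if agent_score > opponent_score:
            return 1.0
        if agent_score < opponent_score:
            return -1.0
        return 0.0                   # draw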
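
The opponent-update idea from the comments (copy the learner's weights into the opponent whenever the evaluation win rate improves) could look roughly like the sketch below. It assumes PyTorch-style modules with state_dict()/load_state_dict(), which may not match what the repository actually uses.

    def maybe_update_opponent(agent_net, opponent_net, win_rate, best_win_rate):
        """Copy the learner's weights into the opponent when the win rate improves.

        Assumes PyTorch-style modules; adapt to whatever framework the agent uses.
        """
        if win_rate > best_win_rate:
            opponent_net.load_state_dict(agent_net.state_dict())
            return win_rate           # the opponent now tracks the new best policy
        return best_win_rate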
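
As an illustration of the "couple of hundred neurons per layer" suggestion, here is one possible DQN-style network in PyTorch: a flattened 64-cell board as input, two hidden layers of 256 units with ReLU, and 64 linear outputs (one Q-value per board square, left unsquashed rather than passed through a softmax). This is only a sketch of the suggestion, not the architecture used in the repository.

    import torch.nn as nn

    class OthelloQNet(nn.Module):
        """Q-network sketch: 64 board cells in, one Q-value per square out."""

        def __init__(self, board_cells=64, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(board_cells, hidden),
                nn.ReLU(),
                nn.Linear(hidden, hidden),
                nn.ReLU(),
                nn.Linear(hidden, board_cells),   # raw Q-values, no softmax
            )

        def forward(self, x):
            return self.net(x)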

0 Answers