I am wondering how to correctly implement the DQN algorithm for two-player games such as Tic Tac Toe and Connect 4. While my algorithm masters Tic Tac Toe relatively quickly, I cannot get good results for Connect 4. The agent learns to win quickly if it gets the chance, but it only plays in the centre and fails to detect threats in the first and last columns. I am using DDQN with experience replay. "Teacher" and "student" refer to two agents of different strengths; the teacher is periodically replaced by a copy of the current student. A simplified version of my algorithm looks as follows:

for i in range(episodes):
    observation = env.reset()
    done = False
    while not done:
        if env.turn == 1:
            # Student plays as "x" and learns from its own transitions
            action = student.choose_action(observation)
            observation_, reward, done, info = env.step(action)
            loss = student.learn(observation, action, reward, observation_, done)
            observation = observation_
        else:
            # Teacher plays as "o"; the board is negated so it sees itself as "x"
            action = teacher.choose_action(-observation)
            observation_, reward, done, info = env.step(action)
            observation = observation_
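
The teacher replacement itself is not shown in the loop; the idea is roughly the following, assuming the agents expose their networks as Q_eval/Q_target with PyTorch-style state dicts (the name sync_interval and its value are just placeholders):

sync_interval = 1000   # illustrative; how often the teacher is refreshed

for i in range(episodes):
    # ... training loop as above ...
    if i > 0 and i % sync_interval == 0:
        # Promote the current student to be the new teacher by copying weights
        teacher.Q_eval.load_state_dict(student.Q_eval.state_dict())
        teacher.Q_target.load_state_dict(student.Q_target.state_dict())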

The observation is -1 for player "o", 1 for player "x", and 0 for an empty cell. The agent always learns to play as player "x", and through action = teacher.choose_action(-observation) the teacher should find the best move for player "o". I hope that is clear.
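
As a concrete example of the perspective flip (assuming the observation is a 2-D numpy board; the positions below are arbitrary):

import numpy as np

# A 6x7 Connect 4 board from "x"'s perspective: 1 = x, -1 = o, 0 = empty
observation = np.zeros((6, 7), dtype=np.int8)
observation[5, 3] = 1    # "x" played the centre column
observation[5, 0] = -1   # "o" played the first column

# Negating the board swaps the roles, so the teacher always sees itself as 1
flipped = -observation
assert flipped[5, 3] == -1 and flipped[5, 0] == 1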

The update rule looks as follows:

# Predicted Q values for the actions that were actually taken
q_pred = Q_eval.forward(state, action)
# Evaluate the next state from the opponent's perspective by negating it
state_ *= -1.
q_next = Q_target.forward(state_, max_action)  # max_action is selected by Q_eval (DDQN)
# Update rule: a position that is good for the opponent is bad for the agent
q_target = reward_batch - gamma * q_next * terminal
loss = Q_eval.loss(q_pred, q_target)

I am using -gamma * q_next * terminal because the reward is negative if the opponent wins on the next move. Am I missing anything important, or is it just a question of hyperparameter tuning?
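
For reference, a minimal self-contained sketch of the batched update in PyTorch; the QNetwork architecture, the hyperparameter values, and the random batch at the end are placeholders for illustration only, not the actual implementation:

import torch
import torch.nn as nn

gamma = 0.99          # placeholder discount factor
n_actions = 7         # Connect 4: one action per column

class QNetwork(nn.Module):
    """Placeholder network: flattens the 6x7 board, outputs one Q value per column."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(6 * 7, 128),
                                 nn.ReLU(), nn.Linear(128, n_actions))

    def forward(self, state):
        return self.net(state)

Q_eval, Q_target = QNetwork(), QNetwork()
Q_target.load_state_dict(Q_eval.state_dict())
optimizer = torch.optim.Adam(Q_eval.parameters(), lr=1e-3)

def ddqn_update(state, action, reward, state_, terminal):
    # Q values of the actions that were actually taken
    q_pred = Q_eval(state).gather(1, action.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        next_opp = -state_                      # opponent's view of the next state
        # DDQN: Q_eval selects the greedy action, Q_target evaluates it
        max_action = Q_eval(next_opp).argmax(dim=1, keepdim=True)
        q_next = Q_target(next_opp).gather(1, max_action).squeeze(1)
        # Minus sign: a good position for the opponent is bad for the agent;
        # terminal masks out q_next at the end of an episode
        q_target = reward - gamma * q_next * terminal

    loss = nn.functional.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with a replay sample of batch size 32 (random data for illustration)
state    = torch.randn(32, 6, 7)
state_   = torch.randn(32, 6, 7)
action   = torch.randint(0, n_actions, (32,))
reward   = torch.zeros(32)
terminal = torch.ones(32)
ddqn_update(state, action, reward, state_, terminal)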
