
I'm creating an RL Q-learning agent for a two-player, fully observable board game and wondered: if I were to train the Q table using adversarial training, should I let both 'players' use, and update, the same Q table? Or would this lead to issues?

mason7663

1 Answer


should I let both 'players' use, and update, the same Q Table?

Yes, this works well for zero-sum games, where player 1 wants to maximise the result (often just +1 for "player 1 wins") and player 2 wants to minimise it (-1 for "player 2 wins"). That alters algorithms such as Q-learning because the greedy choice switches between min and max over the action values - player 1's TD target becomes $R_{t+1} + \gamma \min_{a'} [Q(S_{t+1}, a')]$ because the greedy choice in the next state is taken by player 2.
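As a minimal sketch of what that update looks like in code (tabular, assuming hashable states, and with hypothetical hooks such as `next_actions` and `next_player` supplied by your game code):

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 1.0        # assumed learning rate and discount
Q = defaultdict(float)         # shared table keyed by (state, action), values from player 1's view

def td_update(state, action, reward, next_state, next_player, next_actions, done):
    """One Q-learning step on the shared table.

    Player 1's greedy choice is the max over action values,
    player 2's greedy choice is the min.
    """
    if done or not next_actions:
        target = reward
    else:
        next_qs = [Q[(next_state, a)] for a in next_actions]
        bootstrap = max(next_qs) if next_player == 1 else min(next_qs)
        target = reward + GAMMA * bootstrap
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])
```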

Alternatively, if the states never overlap between players, then you could learn a Q function that always returns the expected future return from the current player's point of view. This can be forced if necessary by making part of the state record whose turn it is. With this approach you need a way to convert between player 1 and player 2 scores in order to use Q-learning updates. For zero-sum games, the Q value of a player 2 state is the negative of player 1's value for that same state, and vice versa, so the TD target for both players becomes $R_{t+1} - \gamma \max_{a'} [Q(S_{t+1}, a')]$.
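A sketch of that second variant, with the same placeholder names as above; the only structural change is that the bootstrap term is negated rather than switched between min and max:

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 1.0        # assumed hyperparameters
Q = defaultdict(float)         # keyed by (state, action); value is for the player to move in `state`

def td_update_current_player(state, action, reward, next_state, next_actions, done):
    """Q-learning step where Q(s, a) is the expected return for the player to move in s."""
    if done or not next_actions:
        target = reward                               # reward as seen by the acting player
    else:
        best_for_opponent = max(Q[(next_state, a)] for a in next_actions)
        target = reward - GAMMA * best_for_opponent   # the opponent's gain is our loss
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])
```

Note that the reward here should also be expressed from the acting player's perspective.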

The first option can result in slightly less complexity in the learned function, which matters most if you are using function approximation and learning a Q function, e.g. with a neural network instead of a Q table. That may result in faster learning and better generalisation, although it will depend on the details of the game.

Or would this lead to issues?

No major issues. I am performing this kind of training - a single Q function estimating a global score which P1 maximises and P2 minimises - for the Kaggle Connect X competition, and it works well.

I can think of a couple of minor things:

  • You may still want the ability for each player to use a different version of the table or learned Q function. This lets you pit different versions of your agent (e.g. from different stages of learning) against each other for evaluation. To support that, you have to write code that allows for multiple tables or functions in any case.

  • You need to keep track of how both players express and achieve their opposing goals when using the table, as you can already see from the modified TD targets above. This becomes more important when you add look-ahead planning, which is a common addition and can significantly improve an agent's performance - during look-ahead you must switch between player views of what the best action choice is (see the sketch below). It is easy to make a sign or off-by-one error in one part of the code but not another, and have the agent learn inconsistently.
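To illustrate where that switching happens, here is a hypothetical look-ahead sketch for the shared-table setup; `legal_actions` and `apply_action` are stand-ins for your game engine, and rewards/values are from player 1's perspective:

```python
def lookahead(state, player, depth, Q, legal_actions, apply_action, gamma=1.0):
    """Backed-up value of `state` (player 1's perspective) with `player` to move."""
    actions = legal_actions(state)
    if not actions:
        return 0.0                                   # assume no legal moves means a draw
    if depth == 0:
        qs = [Q[(state, a)] for a in actions]
        return max(qs) if player == 1 else min(qs)   # greedy leaf estimate from the table
    values = []
    for a in actions:
        next_state, reward, done, next_player = apply_action(state, a)
        value = reward if done else reward + gamma * lookahead(
            next_state, next_player, depth - 1, Q, legal_actions, apply_action, gamma)
        values.append(value)
    return max(values) if player == 1 else min(values)   # the min/max switch happens at every ply
```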

Neil Slater