
I am trying to find out what some good learning strategies are for a Deep Q-Network (DQN) that plays against an opponent. Let's consider the well-known game Tic-Tac-Toe as an example:

  • How should an opponent be implemented to get good and fast improvements?
  • Is it better to play against a random player or a perfect player or should the opponent be a DQN player as well?
murthy10

1 Answer


In a two-player zero-sum game (if I win, you lose, and vice versa), you can have a simple and efficient solution: learning from self-play.

How should an opponent be implemented to get good and fast improvements?

You don't need to think in terms of agent vs opponent; instead, think in terms of coding both players' goals into a single Q function. Score +1 if player A wins, -1 if player B wins, and zero for a draw. It is then player A's goal to maximise the score and player B's goal to minimise the score.
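
As an illustration, here is a minimal sketch of that shared scoring convention for Tic-Tac-Toe. The board encoding and helper names are my own illustrative choices, not from the answer: the single value is always from player A's perspective, so A maximises it and B minimises it.

```python
# Illustrative helpers (not from the answer). A board is a tuple of 9 cells, each 'A', 'B' or ' '.

def winner(board):
    """Return 'A', 'B', 'draw', or None if the game is still in progress."""
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals
    for i, j, k in lines:
        if board[i] != ' ' and board[i] == board[j] == board[k]:
            return board[i]
    return 'draw' if ' ' not in board else None

def terminal_value(board):
    """Single shared score: +1 if A has won, -1 if B has won, 0 for a draw, None if the game continues."""
    return {'A': 1.0, 'B': -1.0, 'draw': 0.0, None: None}[winner(board)]
```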

You can then implement and learn both player strategies in the same self-play learning session and the same Q function, using minimax. In practice that means that where in Q-learning you generally pick the maximising action on the next state to bootstrap Q values, in a minimax variant you pick the maximising or minimising action depending on whose turn it is. Otherwise the Q-learning algorithm is the same as normal. I have implemented this, but not for DQN, just for tabular Q-learning - feel free to learn from, copy and/or re-use any part of that code.
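
For concreteness, here is a rough sketch of that tabular minimax Q-learning self-play loop. It reuses `terminal_value` from the snippet above; the helper functions and hyperparameters are my own illustrative assumptions, not the answer's code, and a DQN version would replace the table with a network.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 1.0, 0.1
Q = defaultdict(float)            # (board, action) -> value, always from player A's perspective

def legal_actions(board):
    return [i for i, c in enumerate(board) if c == ' ']

def to_move(board):
    return 'A' if board.count(' ') % 2 == 1 else 'B'   # A always moves first in this sketch

def apply_move(board, action):
    return board[:action] + (to_move(board),) + board[action + 1:]

def bootstrap(board):
    """The minimax twist: bootstrap with the max over next actions on A's turn, the min on B's turn."""
    v = terminal_value(board)
    if v is not None:
        return v
    values = [Q[(board, a)] for a in legal_actions(board)]
    return max(values) if to_move(board) == 'A' else min(values)

def self_play_episode():
    board = (' ',) * 9
    while terminal_value(board) is None:
        actions = legal_actions(board)
        if random.random() < EPSILON:                   # epsilon-greedy exploration for both seats
            action = random.choice(actions)
        else:
            values = [Q[(board, a)] for a in actions]
            best = max(values) if to_move(board) == 'A' else min(values)
            action = actions[values.index(best)]
        next_board = apply_move(board, action)
        # Standard Q-learning update, except the bootstrap target follows minimax rather than pure max
        Q[(board, action)] += ALPHA * (GAMMA * bootstrap(next_board) - Q[(board, action)])
        board = next_board

for _ in range(50_000):
    self_play_episode()
```

After training, both seats can act greedily from the same Q table: A picks the action with the highest value, B the one with the lowest.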

Is it reasonable to play against a random player, a perfect player or should the opponent be a DQN player as well?

The Q learner will learn to optimise against whichever player you make it play against. Against a random player, it will not necessarily learn to play well, just well enough to defeat the randomness. It may even make deliberate mistakes - such as not blocking a winning line - knowing it has a better chance to win due to the opponent's randomness.

Playing against a perfect player is possible with tic-tac-toe (because you can construct such a player), although there might be gaps in the trained Q values - game states never seen - which mean that an imperfect opponent could actually defeat the trained agent! In practice this does not scale to more complex unsolved games, because no perfect players exist for them.

Another DQN player should work fine. You would end up with two agents, each specialising in playing one player's turns. This is less efficient than a single minimax-based player, but there are no expected problems. It may be the preferred choice for some games, especially if they are not zero-sum.
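
If you do go the two-agent route, a simplified sketch of how the training loop might be structured is below. This is my own illustrative code, not the answer's: it assumes PyTorch, reuses `terminal_value` from the first snippet, omits a target network, and all names and hyperparameters are assumptions. The key structural point is that each agent only stores and learns from transitions taken on its own turns, with rewards expressed from its own perspective.

```python
import random
from collections import deque
import torch
import torch.nn as nn

class DQNAgent:
    """Minimal DQN for one seat: small MLP over a 9-cell board, replay buffer, no target network."""
    def __init__(self):
        self.net = nn.Sequential(nn.Linear(9, 64), nn.ReLU(), nn.Linear(64, 9))
        self.opt = torch.optim.Adam(self.net.parameters(), lr=1e-3)
        self.buffer = deque(maxlen=10_000)

    def act(self, obs, legal, eps=0.1):
        if random.random() < eps:
            return random.choice(legal)
        with torch.no_grad():
            q = self.net(torch.tensor(obs))
        return max(legal, key=lambda a: q[a].item())

    def learn(self, batch_size=32, gamma=1.0):
        if len(self.buffer) < batch_size:
            return
        losses = []
        for obs, action, reward, next_obs, done, next_legal in random.sample(self.buffer, batch_size):
            q = self.net(torch.tensor(obs))[action]
            if done:
                target = torch.tensor(reward)
            else:
                with torch.no_grad():
                    target = reward + gamma * max(self.net(torch.tensor(next_obs))[a] for a in next_legal)
            losses.append((q - target) ** 2)
        self.opt.zero_grad()
        torch.stack(losses).mean().backward()
        self.opt.step()

def encode(board, player):
    """Board from one player's perspective: +1 for own marks, -1 for the opponent's, 0 for empty."""
    return [1.0 if c == player else (0.0 if c == ' ' else -1.0) for c in board]

def play_episode(agents):
    board, player, pending = [' '] * 9, 'A', {}   # pending: each player's last unresolved (obs, action)
    while True:
        legal = [i for i, c in enumerate(board) if c == ' ']
        obs = encode(board, player)
        action = agents[player].act(obs, legal)
        board[action] = player
        value = terminal_value(tuple(board))      # +1 A wins / -1 B wins / 0 draw / None ongoing
        opponent = 'B' if player == 'A' else 'A'
        if value is not None:                     # game over: give both players their final reward
            pending[player] = (obs, action)
            for p, (o, a) in pending.items():
                reward = value if p == 'A' else -value
                agents[p].buffer.append((o, a, reward, o, True, []))
                agents[p].learn()
            return
        if opponent in pending:                   # the opponent's previous move is now resolved
            o, a = pending.pop(opponent)
            agents[opponent].buffer.append((o, a, 0.0, encode(board, opponent), False,
                                            [i for i, c in enumerate(board) if c == ' ']))
            agents[opponent].learn()
        pending[player] = (obs, action)
        player = opponent

agents = {'A': DQNAgent(), 'B': DQNAgent()}
for _ in range(20_000):
    play_episode(agents)
```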

Neil Slater
    Thank you very much for your fast and detailed answer. I will give a try to those hints. – murthy10 Apr 04 '18 at 14:20
  • Richard Garfield defines luck as "uncertainty in outcomes". Although I don't necessarily love this definition for the term (speaking as a deterministic game designer) his point is well taken when he applies this to non-trivial, non-chance, perfect information games such as Chess. (Essentially it's an extension of the "infinite monkey theorem" applied to unsolved games. Because perfect play cannot be validated, there is always a chance, however minuscule, of being beaten by an opponent making purely random choices.) – DukeZhou Apr 04 '18 at 18:32
  • @Neil Slater and do you have any good advice if there are more than two players or even a player on your team? – murthy10 Apr 06 '18 at 17:22
  • @murthy10: I am less sure about multi-player scenarios, and do not have much advice. If the team objective is to win as a unit, and players can share information freely, then it could make sense to model the team as an agent (as opposed to each player separately). However, if you want to code a game agent that could play as a team member alongside a human player, that would not work. – Neil Slater Apr 06 '18 at 17:29