
I am wondering how I am supposed to train a model using actor-critic algorithms in environments with opponents. I tried the following (using A3C and DDPG):

  1. Play against a random player. I had rather good results, but not as good as expected, since most interesting states cannot be reached with a random opponent.
  2. Play against a list of specific AIs. Results were excellent against those AIs, but very bad against never-before-seen opponents.
  3. Play against itself. This seemed the best option to me, but I could not get any convergence due to the non-stationary environment.

Any thoughts or advice about this would be very welcome.

Loheek
    I've recently had a paper accepted at a conference (COG 2019) which would be very relevant to any answer I'd write to this question. Since it only went through peer review relatively recently, it's not publicly available yet. But I do plan to put it up on arXiv very soon (during the coming week somewhere). In case I forget to come back to this question afterwards... feel free to poke me here in a few days to remind me. – Dennis Soemers May 11 '19 at 19:55
  • I look forward to reading this, thank you! I am surprised by how difficult it was to find any relevant paper on this question. – Loheek May 11 '19 at 20:07
  • @DennisSoemers poking you on the paper! – DukeZhou Nov 08 '19 at 22:12
  • @DukeZhou It's in my answer down below, though that link is still the arXiv version. The PDF link from the conference itself is here: http://www.ieee-cog.org/2019/papers/paper_91.pdf . We barely missed out on Best Paper Award for this paper, got second place! – Dennis Soemers Nov 09 '19 at 09:30
  • Lol, the Nash equilibrium theorem guarantees convergence. You just need to improve stability (e.g. with a greater diversity of starting agents & more compute). – profPlum Jul 10 '23 at 01:11

1 Answer


The specific approaches you mention (A3C, DDPG), and generally also other Actor-Critic methods, are designed for the standard single-agent Reinforcement Learning (RL) setting. When you apply such algorithms to settings that are actually multi-agent, it is indeed common to run into the problems you describe. They can all be viewed as your agent "overfitting" to the opponent it is training against:

  1. Play against a random player. I had rather good results, but not as good as expected, since most interesting states cannot be reached with a random opponent.

In this case, your agent will "overfit" to random opponents. It will likely become capable of beating them relatively easily, and to some extent this may still carry over to relatively weak non-random agents (like simple heuristic approaches), but it is unlikely to generalise to strong opponents.

  2. Play against a list of specific AIs. Results were excellent against those AIs, but very bad against never-before-seen opponents.

Here you give a pretty much exact description of what I mean by "overfitting" to the opponents you're training against.

  3. Play against itself. This seemed the best option to me, but I could not get any convergence due to the non-stationary environment.

When training using self-play like this, there is still a danger of overfitting against "itself". This may turn out to work well in some games, but in other games there is indeed a risk of instability or lack of convergence. This can, for example, be due to the agent "rotating" through a cycle of strategies that beat each other. Intuitively, you can think of Rock-Paper-Scissors. A randomly-initialised agent may, for example, have a slight tendency to play Rock more often than the other actions. Self-play will then lead to a strategy that primarily plays Paper, to beat the current "Rock" strategy. Continued self-play training can then figure out that a tendency to play more Scissors is strong against itself, and so on; these strategies can keep rotating forever.
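To make this cycling concrete, here is a minimal, self-contained sketch (my own toy illustration, not code from any real training setup; the payoff matrix, learning rate and update rule are just illustrative choices): a softmax policy over Rock-Paper-Scissors, updated with an exact policy gradient against its own current strategy.

    # Naive self-play on Rock-Paper-Scissors: a softmax policy trained with an
    # exact policy gradient against its own current strategy. Instead of
    # settling on the uniform Nash equilibrium, the strategy tends to spiral
    # through Rock-heavy, Paper-heavy and Scissors-heavy phases.
    import numpy as np

    # Payoff matrix for the row player (Rock, Paper, Scissors):
    # +1 = win, -1 = loss, 0 = draw.
    PAYOFF = np.array([[ 0, -1,  1],
                       [ 1,  0, -1],
                       [-1,  1,  0]], dtype=float)

    def softmax(logits):
        z = np.exp(logits - logits.max())
        return z / z.sum()

    rng = np.random.default_rng(0)
    logits = rng.normal(scale=0.1, size=3)  # slight random initial bias
    lr = 0.5

    for step in range(300):
        pi = softmax(logits)
        # Expected payoff of each action against the current (self) strategy.
        action_values = PAYOFF @ pi
        baseline = pi @ action_values
        # Exact policy gradient for a softmax policy:
        # d/d logit_a  E[payoff] = pi(a) * (Q(a) - baseline)
        logits += lr * pi * (action_values - baseline)
        if step % 50 == 0:
            print(step, np.round(pi, 2))

If you run something like this, the printed strategy typically drifts away from (1/3, 1/3, 1/3) and keeps rotating, which is exactly the kind of non-stationarity you ran into.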


The standard approach for self-play training in games is the one popularised by AlphaGo Zero and AlphaZero, an approach that was also independently published around the same time with results on the game of Hex. That last paper coined the term "Expert Iteration" for the approach, which is the term I like to use. The basic idea of Expert Iteration is to have:

  1. A policy that we're training (parameterised by some function approximator like a Deep Neural Net).
  2. A tree search algorithm, like Monte Carlo tree search (MCTS).

These two components can then iteratively improve each other. The trained policy can be used to guide the search behaviour of MCTS. MCTS can be used to generate a new distribution over actions which, thanks to the additional search effort, is typically expected to be a little bit better than the trained policy. That slightly better distribution is then used as a training target for the policy, using a cross-entropy loss. This means that the policy is trained to mimic the MCTS search behaviour, and is also used to improve that same MCTS search behaviour.
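As a rough sketch of what that training target looks like in code (my own simplified illustration, not the actual AlphaZero / Expert Iteration implementation; the linear-softmax policy and the random stand-in for the MCTS search are placeholders for the real network and the real search):

    # Sketch of the Expert Iteration policy update: train a policy with a
    # cross-entropy loss towards the distribution given by MCTS visit counts.
    import numpy as np

    rng = np.random.default_rng(0)
    NUM_ACTIONS, NUM_FEATURES = 4, 8
    theta = np.zeros((NUM_ACTIONS, NUM_FEATURES))  # policy parameters

    def policy(features):
        """Softmax policy pi(. | s) from a linear score per action."""
        logits = theta @ features
        z = np.exp(logits - logits.max())
        return z / z.sum()

    def mcts_visit_counts(features):
        """Stand-in for a real MCTS run guided by the current policy: a real
        implementation would return the visit count of each root action
        after the search."""
        return rng.integers(1, 100, size=NUM_ACTIONS)

    lr = 0.05
    for step in range(1000):
        s = rng.normal(size=NUM_FEATURES)   # stand-in for a self-play state
        visits = mcts_visit_counts(s)
        target = visits / visits.sum()      # the "slightly better" distribution
        pi = policy(s)
        # Gradient of the cross-entropy loss -sum_a target[a] * log pi(a|s)
        # w.r.t. the logits is (pi - target); the chain rule through the
        # linear layer gives the outer product below.
        theta -= lr * np.outer(pi - target, s)

In a real implementation, the states would come from self-play games and the search would be guided by the current policy (and usually a value head), but the cross-entropy update towards the visit-count distribution is the essential ingredient.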

This training procedure has been found to lead to strong results. One of the prevailing hypotheses is that the additional search performed by MCTS can help to stabilise training, and avoid overfitting to the "self" opponent; the search algorithm can actually "reason" about what a strong opponent (not necessarily the "self" agent) could do.


The Expert Iteration training procedure I described above often works well, but it doesn't really answer your question about training Actor-Critic approaches in games with opponents... because it's not really an Actor-Critic algorithm!

In Learning Policies from Self-Play with Policy Gradients and MCTS Value Estimates, we (I'm first author on this paper) propose an approach where the policy in self-play is trained using a training target based on the value estimates of MCTS, rather than the visit counts of MCTS (which is what we use in the standard Expert Iteration framework).

I don't know if it should necessarily be called an Actor-Critic algorithm, but you could intuitively view it as an "Actor-Critic" approach in which MCTS is the critic. As mentioned right before Section IV of the paper, the resulting gradient estimator also turns out to be very similar to that of the "Mean Actor-Critic".
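Roughly, and again only as my own illustrative sketch rather than the paper's implementation, the update then weights the policy gradient of every action by the value estimate that MCTS produced for it, instead of using the visit counts as a cross-entropy target:

    # Sketch of a Mean-Actor-Critic-style update with MCTS as the critic:
    # the gradient sums over *all* actions, weighting each by the MCTS value
    # estimate for that action at the root.
    import numpy as np

    rng = np.random.default_rng(0)
    NUM_ACTIONS, NUM_FEATURES = 4, 8
    theta = np.zeros((NUM_ACTIONS, NUM_FEATURES))  # policy parameters

    def policy(features):
        logits = theta @ features
        z = np.exp(logits - logits.max())
        return z / z.sum()

    def mcts_q_estimates(features):
        """Stand-in for the per-action value estimates produced by MCTS at
        the root; a real implementation would run the search here."""
        return rng.normal(size=NUM_ACTIONS)

    lr = 0.05
    for step in range(1000):
        s = rng.normal(size=NUM_FEATURES)   # stand-in for a self-play state
        q = mcts_q_estimates(s)             # critic: MCTS value per action
        pi = policy(s)
        v = pi @ q                          # expected value under pi
        # Gradient of sum_a pi(a|s) * Q(s,a) for a linear-softmax policy:
        # d/d theta[a] = pi(a) * (Q(a) - V(s)) * features
        theta += lr * np.outer(pi * (q - v), s)

With the random stand-ins used here the policy just wanders, but the shape of the update (a sum over all actions, with MCTS providing the action values) is what makes it resemble the Mean Actor-Critic estimator.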


If Expert-Iteration-style solutions are not an option (for example because tree search is infeasible), I would suggest taking a look at some of the following Policy-Gradient-based approaches (and possibly later research that cites these papers):

Dennis Soemers