I've watched this video of the recent AlphaStar vs. professional players StarCraft II exhibition, and during the discussion David Silver of DeepMind said that they train AlphaStar on TPUs.

My question is: how is it possible to utilise a GPU or TPU for reinforcement learning when the agent needs to interact with an environment, in this case the StarCraft game engine?

At the moment I have to train my RL agent on the CPU, but obviously I'd love to utilise a GPU to speed it up. Does anyone know how they did it?

Here's the part where they talk about it, if anyone is interested:

https://www.youtube.com/watch?v=cUTMhmVh1qs&t=7030s

BigBadMe
  • My first guess would be that they run the StarCraft engine on that TPU. – John Dvorak Feb 01 '19 at 10:25
  • As I understand it, a TPU isn't a self-contained computer, so you couldn't load an environment like StarCraft onto it? Maybe you can, I don't know. I thought it was essentially like a GPU architecture but with a higher number of cores. I just can't think how you could put a game environment on it. – BigBadMe Feb 01 '19 at 10:56
  • I do know you can implement a full-blown raytracer in GPU code (specifically, CUDA). As for utilizing parallelization, Blender renders each pixel in a tile at the same time, and DeepMind should be able to use the same trick (each core running one instance of the game) if they use neuroevolution rather than gradient descent, and I have no idea how you would gradient-descend a StarCraft AI. Just guesswork on my side though; I don't know how TPUs actually work. – John Dvorak Feb 01 '19 at 14:03
  • TPUs are hardware accelerators. There is no way to "run" a game on an accelerator, which, as @BigBadMe pointed out, is not a computer. The game would need to be heavily customised to make use of TPUs as a form of accelerator; this is most certainly not the case. – MasterScrat Feb 27 '20 at 14:08
  • Starcraft is not differentiable, so you can't learn to play using gradient descent directly, which is why they use reinforcement learning. – MasterScrat Feb 27 '20 at 14:13

1 Answer

In their blog post, they link to (among many other papers) their IMPALA paper. Now, the blog post only links to that paper with text implying that they're using the "off-policy actor-critic reinforcement learning" described in that paper, but one of the major points of the IMPALA paper is actually an efficient, large-scale, distributed RL setup.

So, until we get more details (for example, in their paper that is currently under review), our best guess is that they are using a distributed RL setup similar to the one described in the IMPALA paper. As depicted in Figures 1 and 2 of that paper, they decouple actors (machines running code to generate experience, e.g. by playing StarCraft) and learners (machines running code to learn/train/update the weights of the neural network(s)).
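
To make that decoupling concrete, here is a minimal single-process sketch in Python. The toy environment, toy policy, and plain in-memory queue are my own stand-ins (nothing here is DeepMind's actual code); the point is only that actors run the environment plus cheap forward passes and push finished trajectories to a learner, which owns the expensive weight updates.

```python
import queue
import threading

import numpy as np


class ToyEnv:
    """Hypothetical environment with a 4-dimensional observation."""
    def reset(self):
        return np.zeros(4)

    def step(self, action):
        obs = np.random.randn(4)
        reward = 1.0 if action == 0 else 0.0
        done = np.random.rand() < 0.1
        return obs, reward, done


class ToyPolicy:
    """Hypothetical linear policy; act() is a cheap forward pass only."""
    def __init__(self, n_actions=3):
        self.weights = np.zeros((4, n_actions))

    def act(self, obs):
        return int(np.argmax(obs @ self.weights))

    def update(self, trajectory):
        # Placeholder for the expensive learning step
        # (e.g. V-trace updates running on accelerators).
        pass


def actor(policy, trajectory_queue, n_episodes=10):
    """Runs the game on CPU and ships finished trajectories to the learner."""
    env = ToyEnv()
    for _ in range(n_episodes):
        obs, done, trajectory = env.reset(), False, []
        while not done:
            action = policy.act(obs)                  # forward pass only
            next_obs, reward, done = env.step(action)
            trajectory.append((obs, action, reward))
            obs = next_obs
        trajectory_queue.put(trajectory)


def learner(policy, trajectory_queue, n_updates=10):
    """Consumes trajectories and updates weights (the accelerator-heavy part)."""
    for _ in range(n_updates):
        policy.update(trajectory_queue.get())


if __name__ == "__main__":
    q = queue.Queue()
    shared_policy = ToyPolicy()
    threads = [threading.Thread(target=actor, args=(shared_policy, q)),
               threading.Thread(target=learner, args=(shared_policy, q))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

In the real setup the queue would be replaced by a distributed mechanism for shipping trajectories and broadcasting updated weights across many machines, but the division of labour is the same.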

I would assume that their TPUs are definitely being used by the learner (or, likely, multiple learners). StarCraft 2 itself won't benefit from running on TPUs (and would probably be impossible to even get to run on them in the first place), because the game logic likely doesn't depend on large-scale, dense matrix operations (the kinds of operations that TPUs are optimized for). So, the StarCraft 2 game itself (which only needs to run for the actors, not for the learners) is almost certainly running on CPUs.

The actors will still have to run forward passes through neural networks in order to select actions. I would assume that their actors are still equipped with either GPUs or TPUs to do this more quickly than a CPU could, but the more expensive backward passes are not necessary there; only the learners need to perform those.
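
As a rough illustration of that asymmetry (PyTorch is just an assumption here, not necessarily what DeepMind used), the actor side can run entirely with gradients disabled, while only the learner side pays for the backward pass and optimiser step:

```python
import torch
import torch.nn as nn

# Hypothetical small policy network, just to contrast the two workloads.
policy = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 4))

# Actor side: inference only, no gradients; cheap enough for a CPU or small GPU.
obs = torch.randn(1, 8)
with torch.no_grad():
    action = torch.distributions.Categorical(logits=policy(obs)).sample()

# Learner side: forward and backward pass plus an optimiser step on a large
# batch of experience; this is the part that benefits most from accelerators.
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)
batch_obs = torch.randn(64, 8)
batch_actions = torch.randint(0, 4, (64,))
batch_returns = torch.randn(64)

log_probs = torch.distributions.Categorical(
    logits=policy(batch_obs)).log_prob(batch_actions)
loss = -(log_probs * batch_returns).mean()   # simple policy-gradient surrogate
optimizer.zero_grad()
loss.backward()
optimizer.step()
```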

Dennis Soemers
  • I've had a read of the blog post, and it reminded me that they employed supervised learning to create their initial agent, which was based on thousands of human games. So I guess because they have the raw game data already to hand, they can train the model on a TPU, and no interaction with an environment would be necessary. – BigBadMe Feb 06 '19 at 22:20
  • @BigBadMe Sure, but that's only for the start of their training. They still ran hundreds of years' worth of actual RL experience afterwards :) – Dennis Soemers Feb 07 '19 at 08:17
  • Yes, true. I suspect they had hundreds of virtual environments actually running the games, which then hooked into the TPUs to present the game state and get the actions. I'm looking forward to the paper being released to get the full scoop. – BigBadMe Feb 07 '19 at 11:15