
In A3C, there are several child processes and one master process. The child processes calculate the loss and backpropagate it, and the master process accumulates their gradients and updates the parameters, if I understand it correctly.

But I wonder how I should decide the number of child processes to use. I think that the more child processes there are, the better the correlation between samples is disentangled, but I'm not sure what the cons of setting a large number of child processes are.

Maybe more child processes increase the variance of the gradient, leading to instability in learning? Or is there some other reason?

And finally, how should I decide on the number of child processes?

nbro
Blaszard

1 Answer


The correct number of child processes will depend on the hardware available to you.

Simplifying a bit, child processes can be in one of two states: waiting for memory or disk access, or running.

If your problem fits nicely in your computer's memory, then processes will spend almost all of their time running. If it's too big for memory, they will periodically need to wait for disk.

You should use approximately one child process per CPU core. If you are training on a GPU, then it depends on whether a single process can make use of the entire GPU at once (in which case, use just one), or whether a "process" is really more like a CUDA thread here (in which case you'd want one per CUDA core).
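As a minimal sketch of the heuristic above (the function name and the 50% I/O-bound multiplier mentioned below are my own choices, not a standard API):

```python
import os

def pick_num_workers(io_bound: bool = False) -> int:
    """Heuristic worker count: one per CPU core, or ~50% more
    if workers are expected to block on disk I/O."""
    cores = os.cpu_count() or 1  # os.cpu_count() can return None
    return int(cores * 1.5) if io_bound else cores
```

You would then adjust this starting point after observing actual CPU utilization.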

If you think your processes will wait for disk, use more than one per core; about 50% more is a good starting point. You can use a program like top to monitor CPU usage and adjust the number of processes accordingly.

To answer your question more explicitly:

  • Having more child processes (up to a point, discussed above) will increase hardware utilization and make your training run faster. With a Core i7 CPU, for instance, you might be able to run 8 or 16 child processes at a time, so you'd train 8 to 16 times faster.
  • Having more child processes than processing units (CPU cores, CUDA cores) will begin to cause frequent context switching, where the processing units have to pause to change between different jobs. Context switches are expensive, and ultimately your program cannot train faster than it would by fully using the available hardware. If you have more processes than processing units, reducing the number should make your program train faster.
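To make the spawning concrete, here is a hedged sketch of running one worker per core with Python's multiprocessing (the `worker` body is a placeholder; in a real A3C setup each process would run its own environment copy and push gradients to shared parameters, e.g. via a shared-memory model):

```python
import multiprocessing as mp

def worker(rank: int, steps: int) -> int:
    # Placeholder for an A3C worker loop: each process would step its
    # own environment, compute gradients locally, and send updates to
    # the master. Here we just simulate some deterministic work.
    total = 0
    for t in range(steps):
        total += t
    return total

def run_workers(num_workers: int, steps: int = 1000):
    # One process per worker; more workers than cores would start to
    # cost context switches rather than add throughput.
    with mp.Pool(processes=num_workers) as pool:
        return pool.starmap(worker, [(r, steps) for r in range(num_workers)])

if __name__ == "__main__":
    print(run_workers(mp.cpu_count()))
```

Note that true A3C implementations also need shared parameter storage (e.g. PyTorch's `share_memory_()`), which this sketch omits.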
John Doucette