5

I'm running some distributed training in TensorFlow with Horovod. Training runs separately on multiple workers, each of which uses the same weights and does a forward pass on unique data. The computed gradients are averaged within the communicator (worker group) before being applied in the weight updates. I'm wondering: why not average the loss function across the workers instead? What's the difference (and the potential benefit) of averaging gradients?
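For context, the gradient-averaging step I'm describing looks roughly like this (a minimal sketch with horovod.tensorflow and TF 2.x; the model, optimizer, and loss here are placeholders, not my actual setup):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per worker, each fed a unique shard of the data

# Placeholder model and optimizer for illustration only.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
opt = tf.optimizers.SGD(0.01)

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        logits = model(x)
        loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(y, logits, from_logits=True))
    # Each worker computes gradients on its own local batch...
    grads = tape.gradient(loss, model.trainable_variables)
    # ...then the gradients are averaged across the communicator before the update.
    avg_grads = [hvd.allreduce(g) for g in grads]
    opt.apply_gradients(zip(avg_grads, model.trainable_variables))
    return loss
```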

pSoLT
  • 161
  • 2
  • 1
  • It may be because averaging the losses can introduce additional numerical error, which then propagates into the computation of the gradient, and that computation involves more operations than simply averaging the gradients. – nbro Dec 31 '19 at 23:39
  • Following on from the other answer, consider that averaging defeats the purpose if the loss is very volatile: if one has a low loss reading on one epoch and a high loss reading on another, the average will not capture that volatility. – Michael Grogan Jan 03 '20 at 17:35

1 Answer

2

The whole idea behind these distributed optimization methods is that the data stays local to each node/worker. Thus, if you only send the loss value to the central node, that node cannot compute the gradients of the loss (it has no access to the data the loss was computed on), and therefore cannot do any training. However, if you don't want to send gradients, there is a family of distributed optimization algorithms called consensus-based optimization: each node only sends its local model weights to its neighbouring nodes, and each node then uses its own local gradient together with the models received from its neighbours to update its local model.
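To give a feel for this, here is a rough sketch of one consensus-based update round (plain NumPy, not any particular library's API; the mixing matrix `W`, the helper `consensus_step`, and the toy quadratic losses are made up for illustration): each node first averages its own and its neighbours' weights, then takes a gradient step using only its local data.

```python
import numpy as np

def consensus_step(i, weights, local_grad_fn, W, lr=0.1):
    """One update for node i: mix neighbours' models, then descend on local gradient.
    W is an illustrative doubly-stochastic mixing matrix (W[i, j] > 0 only if
    j is a neighbour of i or j == i; rows sum to 1)."""
    # 1) Mix: weighted average of own and neighbours' current models.
    mixed = sum(W[i, j] * weights[j] for j in range(len(weights)) if W[i, j] > 0)
    # 2) Descend: step along the gradient computed on node i's *local* data.
    return mixed - lr * local_grad_fn(weights[i])

# Toy usage: 3 fully connected nodes, local losses f_i(x) = 0.5 * (x - c_i)^2.
W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
centers = [0.0, 1.0, 2.0]
weights = [np.array([5.0]), np.array([-3.0]), np.array([7.0])]
for _ in range(200):
    weights = [consensus_step(i, weights, lambda x, c=centers[i]: x - c, W)
               for i in range(3)]
print(weights)  # all nodes end up near the global minimiser, mean(centers) = 1.0
```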

nbro
  • 39,006
  • 12
  • 98
  • 176
hola
  • 381
  • 2
  • 10