Why can't we train neural networks in a peer-to-peer manner?

Question

I have recently been exposed to the concept of decentralized applications, and I know that neural networks require a lot of parallel computing infrastructure for training.

What are the technical difficulties one may face for training neural networks in a p2p manner?

I think the transfer of data will take too long compared to similar operations in existing solutions. — , Jul 18 '20 at 21:56
What would be the purpose of doing so? E.g., data privacy, speed, etc. — benbyford, Jun 30 '21 at 10:12

Brian O'Donnell · Answer 1 · 2020-07-21T11:37:51.087

Data management and bandwidth are key issues for interconnecting multiple GPUs. These are such big issues that it is hard to think about other challenges like neural network architecture, metrics, etc. The key to success for interconnecting multiple GPUs on a single computer is NVIDIA's NVLink:

NVLink is a wire-based communications protocol for near-range semiconductor communications developed by Nvidia that can be used for data and control code transfers in processor systems between CPUs and GPUs and solely between GPUs. NVLink specifies a point-to-point connection with data rates of 20 and 25 Gbit/s (v1.0/v2.0) per differential pair.

Compare 25 Gbit/s to a typical peer-to-peer connection over the internet of 100 Mbit/s. NVLink provides a 250x advantage, assuming everything else is equal, which it is not. Note that 25 Gbit/s is the rate per differential pair, so this comparison is, if anything, conservative. This means that, considering bandwidth only, a neural network that takes one day to train on a computer with two GPUs connected by NVLink could take 250 days over the internet using two computers with the same GPU!
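The arithmetic behind the 250x figure can be checked with a short back-of-the-envelope sketch. It uses only the two bandwidth numbers quoted in this answer; the function names and the example gradient payload (100 million float32 parameters) are hypothetical, chosen purely for illustration, and the model ignores latency, compression, and overlap of compute with communication.

```python
# Back-of-the-envelope comparison using only the figures quoted in the answer.
NVLINK_BITS_PER_S = 25e9      # 25 Gbit/s per differential pair (NVLink v2.0)
INTERNET_BITS_PER_S = 100e6   # 100 Mbit/s, the answer's "typical" peer-to-peer link


def slowdown(fast_bits_per_s: float, slow_bits_per_s: float) -> float:
    """How many times longer a bandwidth-bound job takes on the slower link."""
    return fast_bits_per_s / slow_bits_per_s


def transfer_seconds(payload_bytes: float, bits_per_s: float) -> float:
    """Time to move a payload over a link, counting bandwidth only."""
    return payload_bytes * 8 / bits_per_s


ratio = slowdown(NVLINK_BITS_PER_S, INTERNET_BITS_PER_S)
print(f"bandwidth ratio: {ratio:.0f}x")
print(f"1 day of bandwidth-bound training becomes about {ratio:.0f} days")

# Hypothetical payload (an assumption, not from the answer):
# one gradient exchange for 100 million float32 parameters.
GRADIENT_BYTES = 100_000_000 * 4
print(f"NVLink pair: {transfer_seconds(GRADIENT_BYTES, NVLINK_BITS_PER_S):.3f} s per exchange")
print(f"100 Mbit/s:  {transfer_seconds(GRADIENT_BYTES, INTERNET_BITS_PER_S):.1f} s per exchange")
```

Under these assumptions each gradient exchange costs a fraction of a second over a single NVLink pair but tens of seconds over the internet link, and that cost is paid on every training step.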

Does this mean that neural network training is an iterative process (every tweak made to the network influences the next training step), and thus it is not possible to split the dataset into parts that could be trained independently and then merged? — Kozuch, Sep 25 '20 at 15:42
