Questions tagged [distributed-computing]
7 questions
5 votes · 1 answer
Why do we average gradients and not loss in distributed training?
I'm running distributed training in TensorFlow with Horovod. It runs training separately on multiple workers, each of which uses the same weights and does a forward pass on unique data. The computed gradients are averaged within the communicator…

pSoLT
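A note on the identity behind this setup: because differentiation is linear, the average of per-worker gradients equals the gradient of the average loss, which is why averaging gradients (rather than just the scalar losses) recovers exactly what single-machine training would compute. A minimal NumPy sketch (a made-up two-worker example, not Horovod code) that checks this numerically:

```python
import numpy as np

w = np.array([1.0, -2.0])                 # shared weights on both workers
x1, y1 = np.array([0.5, 1.0]), 0.0        # worker 1's sample
x2, y2 = np.array([-1.0, 2.0]), 1.0       # worker 2's sample

def grad(w, x, y):
    # gradient of the squared error 0.5 * (w.x - y)^2 w.r.t. w
    return (w @ x - y) * x

# What the allreduce computes: the mean of per-worker gradients.
g_avg = (grad(w, x1, y1) + grad(w, x2, y2)) / 2

# Gradient of the mean loss, estimated by central finite differences.
def mean_loss(w):
    return 0.5 * ((w @ x1 - y1) ** 2 + (w @ x2 - y2) ** 2) / 2

eps = 1e-6
g_num = np.array([(mean_loss(w + eps * e) - mean_loss(w - eps * e)) / (2 * eps)
                  for e in np.eye(2)])
assert np.allclose(g_avg, g_num)          # the two views agree
```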
4 votes · 2 answers
How can we use crowdsourcing for deep learning?
Most companies dealing with deep learning (automotive: Comma.ai, Mobileye, various automakers, etc.) collect large amounts of data to learn from and then use lots of computational power to train a neural network (NN) on such big data. I guess…

Kozuch
2 votes · 2 answers
Why do LLMs need massive distributed training across nodes if the models fit on one GPU and a larger batch only decreases the variance of gradients?
Why do large language models (LLMs) need massive distributed training across nodes if the models fit on one GPU and a larger batch only decreases the variance of gradients?
tl;dr: assuming models that don't need sharding across nodes, why do we…

Charlie Parker
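The usual answer turns on wall-clock time rather than gradient variance: even when a model fits on one GPU, a single device cannot get through a web-scale corpus in reasonable time. A back-of-the-envelope sketch, where every number is invented purely for illustration:

```python
# All figures below are assumptions chosen for illustration only.
tokens_to_train = 1.0e12      # assumed corpus size in tokens
tokens_per_sec = 5.0e4        # assumed single-GPU training throughput
seconds_per_day = 86_400

days_single = tokens_to_train / tokens_per_sec / seconds_per_day
print(f"1 GPU:     {days_single:,.0f} days")        # ~231 days

gpus, efficiency = 1024, 0.8  # assumed cluster size and scaling efficiency
days_cluster = days_single / (gpus * efficiency)
print(f"{gpus} GPUs: {days_cluster:.2f} days")      # well under a day
```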
2 votes · 1 answer
In how few updates can a multi-layer neural net be trained?
A single iteration of gradient descent can be parallelised across many worker nodes. We simply split the training set across the worker nodes, pass the parameters to each worker, and each worker computes gradients for its subset of the training set,…

is8ac
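The loop the question describes is synchronous data-parallel SGD. A minimal NumPy sketch of one step (the names shards, grad_fn, and lr are placeholders for illustration; in a real system the per-shard gradients run on separate workers):

```python
import numpy as np

def parallel_sgd_step(w, shards, grad_fn, lr):
    """One synchronous data-parallel step: broadcast w, compute a
    gradient per shard (in parallel on real workers), average the
    gradients, apply a single update."""
    grads = [grad_fn(w, X, y) for X, y in shards]  # serial stand-in
    return w - lr * np.mean(grads, axis=0)

# Toy usage with a least-squares gradient:
rng = np.random.default_rng(0)
shards = [(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(4)]
grad_fn = lambda w, X, y: X.T @ (X @ w - y) / len(y)
w = parallel_sgd_step(np.zeros(3), shards, grad_fn, lr=0.1)
```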
1 vote · 1 answer
Why can't we train neural networks in a peer-to-peer manner?
I have recently been exposed to the concept of decentralized applications. I know that neural networks require a lot of parallel computing infrastructure for training. What are the technical difficulties one may face when training neural networks in a p2p…
ram bharadwaj
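One line of work that does attempt this is decentralized (gossip) SGD, in which peers average parameters with their neighbors instead of with a central server. A toy sketch of the idea, not any particular system's protocol:

```python
import numpy as np

def gossip_sgd_step(params, neighbor_params, local_grad, lr):
    """One step for a single peer: take a local gradient step, then
    average parameters with directly connected neighbors rather than
    with a central parameter server."""
    params = params - lr * local_grad
    return np.mean([params, *neighbor_params], axis=0)
```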
1 vote · 0 answers
Do I need to maintain a separate population in each distributed environment when implementing PBT in a MARL context?
I have questions about how to implement PBT as described in Algorithm 1 (on page 5) of the paper Population Based Training of Neural Networks to train agents in a MARL (multi-agent reinforcement learning) environment.
In a single-agent RL…

Huan
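For readers skimming the tag, the exploit/explore cycle the question refers to looks roughly like the sketch below. This is a loose paraphrase of the PBT idea, not a faithful reproduction of the paper's Algorithm 1, and the 20% cutoff and perturbation factors are assumptions:

```python
import copy
import random

def pbt_exploit_explore(population):
    """population: list of dicts with 'params', 'hypers', 'score'.
    Bottom performers copy (exploit) a top performer's state, then
    perturb (explore) the copied hyperparameters."""
    ranked = sorted(population, key=lambda m: m["score"])
    cutoff = max(1, len(ranked) // 5)            # bottom/top 20% (assumed)
    for loser in ranked[:cutoff]:
        winner = random.choice(ranked[-cutoff:])
        loser["params"] = copy.deepcopy(winner["params"])
        loser["hypers"] = {k: v * random.choice([0.8, 1.2])  # assumed factors
                           for k, v in winner["hypers"].items()}
    return population
```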
0 votes · 0 answers
How to get optimal scaling with raw PyTorch + DDP?
I'm trying to set up a distributed training environment on a compute cluster that I have. I happen to know from previous experience that scaling up the batch size "naively" often isn't very useful; my experience matches the motivation behind AdaSum…

profPlum
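For context, the "naive" baseline the question contrasts with AdaSum is plain DDP plus linear learning-rate scaling. A minimal sketch, assuming a torchrun launch; the toy Linear model and the learning rate are placeholders, not a recommendation:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                 # torchrun sets the env vars
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 1).cuda()           # stand-in for a real network
model = DDP(model, device_ids=[local_rank])     # averages gradients per step

base_lr = 1e-3                                  # assumed single-GPU LR
optimizer = torch.optim.SGD(model.parameters(),
                            lr=base_lr * dist.get_world_size())  # linear scaling
```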