I'm trying to set up a distributed training environment on a compute cluster I have access to. I know from previous experience that naively scaling up the batch size often isn't very useful; my experience matches the motivation behind the AdaSum algorithm.
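For context, here is a small sketch of what I understand the pairwise AdaSum combination rule to be, based on my reading of the paper. This is my own illustration of the idea, not Horovod's actual implementation:

```python
# My (possibly imperfect) understanding of the pairwise AdaSum rule:
# aligned (redundant) gradients get down-weighted, orthogonal ones are summed as-is.
import torch

def adasum_pair(g1: torch.Tensor, g2: torch.Tensor) -> torch.Tensor:
    """Combine two flattened gradient vectors with the pairwise AdaSum rule."""
    dot = torch.dot(g1, g2)
    return (1 - dot / (2 * g1.dot(g1))) * g1 + (1 - dot / (2 * g2.dot(g2))) * g2

# Sanity check of the behaviour I care about:
# identical gradients -> result equals one of them (no "double counting"),
# orthogonal gradients -> plain sum.
g = torch.tensor([1.0, 2.0, 3.0])
print(adasum_pair(g, g))                                        # ~= g
print(adasum_pair(torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])))  # ~= [1, 1]
```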
However, on the new machine I have not been able to get Horovod (and, by extension, AdaSum) to work. The only thing I have working is the PyTorch Lightning DDP strategy. Given that, is there an easy way to get scaling performance comparable to AdaSum by tweaking the PyTorch DDP setup?
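For reference, this is roughly what I have working right now (`MyLightningModule` and `MyDataModule` are placeholders for my actual model and data code):

```python
# Rough sketch of my current working setup; MyLightningModule and MyDataModule
# are placeholders standing in for my real model and data modules.
import lightning.pytorch as pl  # `import pytorch_lightning as pl` on older versions

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,        # GPUs per node (placeholder numbers)
    num_nodes=2,
    strategy="ddp",   # plain DistributedDataParallel: gradients are averaged across ranks
)
trainer.fit(MyLightningModule(), datamodule=MyDataModule())
```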
In particular, it occurred to me that gradient clipping SEEMS similar in spirit to AdaSum (in the sense that redundant gradient contributions pointing in the same direction end up with less weight). Would this simple method give me comparable performance?
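Concretely, the "simple method" I have in mind is just Lightning's built-in clipping options, something like:

```python
# What I was considering instead of AdaSum: global-norm gradient clipping
# via Lightning's built-in Trainer options (same placeholder setup as above).
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    num_nodes=2,
    strategy="ddp",
    gradient_clip_val=1.0,           # clip the global gradient norm to 1.0
    gradient_clip_algorithm="norm",  # "norm" is the default; "value" clips element-wise
)
```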
On the other hand, if there is something obvious about PyTorch's DDP strategy that I'm missing which means I don't need an AdaSum alternative in the first place, please let me know.