I have read about and used gradient accumulation as a method to handle large batch sizes under memory restrictions. It is usually described like this:
for step, eachBatch in enumerate(dataloader):
    ...
    loss = loss_func(ytrue, ypred)
    loss.backward()  # gradients accumulate in the parameters' .grad
    if (step + 1) % 5 == 0:
        # update weights every 5 steps
        optimizer.step()
        optimizer.zero_grad()
However, instead of accumulating the gradients, can we accumulate the loss values, just like in multi-task training? I have never seen anyone recommend the following method, and I wonder why that is and what its potential problems are. If there really is a problem with it, then why does loss accumulation work for multi-task training?
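For intuition, note that differentiation is linear: the gradient of a sum of losses equals the sum of the per-loss gradients, which is why the two schemes produce the same parameter update. A minimal sketch with a toy squared loss on a scalar weight (names and values here are illustrative, not from the question) checks this numerically:

```python
import math

def loss(w, x, y):
    # squared error of a one-parameter linear model
    return (x * w - y) ** 2

def grad(w, x, y):
    # analytic gradient: d/dw (x*w - y)^2 = 2*x*(x*w - y)
    return 2.0 * x * (x * w - y)

w = 0.5
batches = [(1.0, 2.0), (3.0, 1.0), (-2.0, 0.5)]  # toy (x, y) mini-batches

# Gradient accumulation: differentiate each batch's loss, sum the gradients.
g_accum = sum(grad(w, x, y) for x, y in batches)

# Loss accumulation: sum the losses first, then differentiate the total
# (numerically here, via a central difference).
eps = 1e-6
total = lambda w_: sum(loss(w_, x, y) for x, y in batches)
g_total = (total(w + eps) - total(w - eps)) / (2 * eps)

assert math.isclose(g_accum, g_total, rel_tol=1e-4)
```

The results agree, so mathematically the update is identical. The practical difference in an autograd framework is memory: calling `backward()` per batch frees each batch's computation graph immediately, whereas summing live loss tensors keeps every batch's graph alive until the single `backward()` call.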
totalLoss = 0
for step, eachBatch in enumerate(dataloader):
    ...
    loss = loss_func(ytrue, ypred)
    totalLoss += loss  # accumulate the loss (keeps each batch's graph alive)
    if (step + 1) % 5 == 0:
        totalLoss.backward()
        optimizer.step()
        optimizer.zero_grad()
        totalLoss = 0