
I have read about and used gradient accumulation as a method to train with a large effective batch size under memory constraints. It is usually described as follows:

for step, eachBatch in enumerate(dataloader):
   ...
   loss = loss_func(ytrue, ypred)
   loss.backward()  # gradients accumulate in the parameters' .grad
   if (step + 1) % 5 == 0:
      # update weights every 5 steps
      optimizer.step()
      optimizer.zero_grad()

However, instead of accumulating the gradients, can we accumulate the loss values, just like in multi-task training? I have never seen anyone recommend the following method, and I wonder why that is the case and what the potential problems are. If there really is a problem with this method, then why does loss accumulation work for multi-task training?

totalLoss = 0
for step, eachBatch in enumerate(dataloader):
   ...
   loss = loss_func(ytrue, ypred)
   totalLoss += loss  # accumulate the loss
   if (step + 1) % 5 == 0:
      totalLoss.backward()
      optimizer.step()
      optimizer.zero_grad()
      totalLoss = 0
nbro
LSM

1 Answer


Accumulating the loss like that doesn't improve the memory requirements, because the memory consumption depends on the size of your computational graph. In other words, each time you add a term to the loss, the overall function of your loss grows as well, and with it grows the memory consumption.

Specifically, every time you add a term to the loss value, you run a forward pass with a new input/output pair $(x_i, y_i)$. In order to backpropagate, all the hidden activations $h_{i,l}$ must be stored until you call loss.backward(). So with each new loss term added, that cache of latent activation vectors grows. Check out PyTorch's walkthrough of autograd mechanics for what exactly happens there.

In contrast, once the gradient has been computed, the cached activation vectors can be deleted and the only things kept are the gradient values for your model parameters. In comparison, accumulating the loss leads to a memory consumption of $\mathcal{O}(n \cdot I)$, while accumulating gradients is $\mathcal{O}(n + |\theta|) \sim \mathcal{O}(n)$, where $n$ is the single-pass memory requirement, $I$ is the number of accumulation steps, and $|\theta|$ is the number of model parameters. I hope that was somehow clear.
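To see the difference concretely, here is a small PyTorch sketch (toy model, names hypothetical): by default, loss.backward() releases the cached activations, which is why a second backward call on the same graph fails unless you pass retain_graph=True. An unconsumed accumulated loss, by contrast, keeps every microbatch's graph alive.

```python
import torch

# Toy model and data; names are illustrative only
model = torch.nn.Linear(10, 1)
x = torch.randn(4, 10)

loss = model(x).pow(2).mean()
loss.backward()  # computes gradients and frees the activation cache

try:
    loss.backward()  # the graph was already freed above
    freed = False
except RuntimeError:
    freed = True

print(freed)  # the second backward fails: activations were released
```

This is the mechanism behind the memory argument: gradient accumulation pays the activation cost for one microbatch at a time, while loss accumulation must retain all of them until the single big backward call.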

Chillston
  • So, in short, loss accumulation does not reduce the memory requirement. And in the case of multi-task training, what is your opinion if I change loss accumulation to gradient accumulation? – LSM Nov 30 '22 at 09:58
  • 1
    It is mathematically equivalent whether you backpropagate a loss $L(x, y) = l_1(x, y) + l_2(x, y)$, or you accumulate the gradients from separate backpropagation of $l_1$ and $l_2$. So if you can save memory using gradient accumulation I'd say go for it :) – Chillston Nov 30 '22 at 10:46
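The equivalence stated in the comment above can be checked numerically with a short PyTorch sketch (tiny model and random data, names hypothetical): backpropagating the summed loss once yields the same parameter gradients as backpropagating each loss separately and letting the gradients accumulate.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(3, 1)
mse = torch.nn.MSELoss()

# Two microbatches standing in for l_1 and l_2
x1, y1 = torch.randn(2, 3), torch.randn(2, 1)
x2, y2 = torch.randn(2, 3), torch.randn(2, 1)

# Method A: backpropagate the summed loss once
model.zero_grad()
(mse(model(x1), y1) + mse(model(x2), y2)).backward()
grad_sum = model.weight.grad.clone()

# Method B: separate backward calls; .grad accumulates by default
model.zero_grad()
mse(model(x1), y1).backward()
mse(model(x2), y2).backward()
grad_acc = model.weight.grad.clone()

print(torch.allclose(grad_sum, grad_acc))  # the two gradients match
```

Note that Method B never holds more than one microbatch's computational graph at a time, which is exactly the memory saving described in the answer.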