
I am training a neural network using a mini-batch gradient descent algorithm.

Now, consider the following loss function, which is composed of 2 terms.

$$L = L_{\text{MSE}} + L_{\text{regularization}} \label{1}\tag{1}$$

As far as I understand, we usually update the weights of a neural network only once per mini-batch, even if the loss function is composed of 2 or more terms, as in equation \ref{1}. So, in this approach, you calculate the 2 terms, add them, and then update the weights once based on the sum.

My question is: rather than summing the 2 terms of the loss function $L$ in equation \ref{1} and computing a single gradient for $L$, couldn't we separately compute the gradients for $L_{\text{MSE}}$ and $L_{\text{regularization}}$, then update the weights of the neural network twice? So, in this case, we would update the weights twice for each mini-batch. When would this make sense? Of course, my question also applies to the case where $L$ is composed of more than 2 terms.
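To make the question concrete, here is a sketch of the two schemes for a one-weight model with hand-computed gradients (the values of $x$, $y$, the learning rate, and the penalty strength are arbitrary, chosen only for illustration):

```python
# Toy comparison: one update on the summed loss vs. two sequential
# updates, one per loss term. Model: a single weight w on one data point.
# L_MSE = (w*x - y)^2, L_reg = lam * w^2 (an L2 penalty).
# Gradients are computed by hand; all numbers are arbitrary.

def grad_mse(w, x, y):
    return 2 * x * (w * x - y)   # d/dw of (w*x - y)^2

def grad_reg(w, lam):
    return 2 * lam * w           # d/dw of lam * w^2

x, y, lr, lam = 2.0, 3.0, 0.1, 0.01
w0 = 1.0

# Scheme A: a single update on the gradient of the summed loss
w_single = w0 - lr * (grad_mse(w0, x, y) + grad_reg(w0, lam))

# Scheme B: two updates per mini-batch, one per term (the second
# gradient is evaluated at the weight produced by the first update)
w_tmp = w0 - lr * grad_mse(w0, x, y)
w_double = w_tmp - lr * grad_reg(w_tmp, lam)

print(w_single, w_double)
```

The two results are close but not identical, because in scheme B the regularization gradient is evaluated at the already-updated weight.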

nbro
hanugm

2 Answers


I am not sure the process described in the question is meaningful at all. If you mean simply to compute each term's gradient on the same mini-batch and apply the contributions one after another, it makes no difference whether you make one update or several: the contributions are simply added in the update. If, on the other hand, you mean to update the weights after each term, evaluating the next term's gradient at the already-updated weights, the regularization step would most probably go in the wrong direction. Regularization terms (at least the $L_1$ and $L_2$ penalties) do not depend on the input or the output of the model; they do not compare the prediction with the target value, as they are functions of the weights only. There is no reason to expect that minimizing such a term by itself will improve the model; it will actually harm progress by taking a (small or large) step in the wrong direction.
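The key point can be checked numerically: the gradient of an $L_2$ penalty $\lambda w^2$ is $2\lambda w$, which never sees the data, so a step taken on it alone always shrinks the weight toward zero, even when the data-fitting optimum lies elsewhere. A minimal sketch, with arbitrary numbers:

```python
# The gradient of an L2 penalty lam * w^2 is 2 * lam * w: it depends
# only on the weight, never on the data. A gradient step on this term
# alone shrinks w toward 0, moving it AWAY from the data-fitting
# optimum whenever that optimum is farther from 0. Numbers are arbitrary.

lam, lr = 0.1, 0.5
w = 2.0
w_opt = 3.0  # suppose the MSE-optimal weight is 3 (e.g. one point x=1, y=3)

w_after = w - lr * (2 * lam * w)  # a step on the regularizer alone

print(abs(w_opt - w), abs(w_opt - w_after))
```

The distance to the optimum grows after the step, illustrating why updating on the regularizer in isolation can harm progress.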

serali

Technically, nothing prevents you from doing so. When you have multiple losses, you can call .backward() on each term separately.
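One caveat worth noting: because gradients are linear, calling .backward() once per term and then taking a single optimizer step (PyTorch accumulates gradients into .grad across backward calls) gives exactly the same update as backpropagating the summed loss. The schemes only differ if you step between the backward calls. A pure-Python sketch of that linearity, with hand-computed gradients and arbitrary numbers:

```python
# Gradients are linear: the gradient of L_MSE + L_reg equals the sum of
# the two gradients. Hence, accumulating per-term gradients before a
# single update is equivalent to backpropagating the summed loss.
# Hand-computed gradients for a single weight w; numbers are arbitrary.

def grad_mse(w, x, y):
    return 2 * x * (w * x - y)            # d/dw of (w*x - y)^2

def grad_reg(w, lam):
    return 2 * lam * w                     # d/dw of lam * w^2

def grad_sum(w, x, y, lam):
    return 2 * x * (w * x - y) + 2 * lam * w  # d/dw of the summed loss

w, x, y, lam = 1.5, 2.0, 2.0, 0.01
accumulated = grad_mse(w, x, y) + grad_reg(w, lam)

print(accumulated, grad_sum(w, x, y, lam))
```

So the question's scheme only changes anything if the weights are actually updated between the two gradient computations.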

However, I wonder whether it makes sense to optimize each individual term as a separate objective, since, if we have multiple of them, we presumably want to solve several tasks simultaneously.

It could perhaps be beneficial as a kind of regularization: one takes steps away from the exact gradient but, overall, in the right direction, which may make the model less prone to overfitting. However, tuning the batch size and learning rate seems a more straightforward way to achieve this. There is also roughly a 2x computational overhead, since we backpropagate twice.

In some sense, this is done when training GANs. Instead of backpropagating through the discriminator and the generator simultaneously, one computes a separate loss for each model and updates their weights in turn.
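The alternating-update control flow can be sketched with a toy two-parameter system; the quadratic "losses" below are made up purely to show the structure, not actual GAN objectives:

```python
# Toy sketch of the alternating-update pattern used for GANs: each
# "network" (here, a single scalar parameter) has its own loss, and we
# update one while holding the other fixed. The quadratic losses are
# invented for illustration only and are not real GAN objectives.

d, g = 0.0, 5.0   # "discriminator" and "generator" parameters
lr = 0.1

for _ in range(100):
    # discriminator step: minimize (d - g)^2 with respect to d only
    grad_d = 2 * (d - g)
    d = d - lr * grad_d

    # generator step: minimize (g - d)^2 + g^2 with respect to g only
    grad_g = 2 * (g - d) + 2 * g
    g = g - lr * grad_g

print(round(d, 3), round(g, 3))
```

Each parameter is updated on its own loss while the other is held fixed, which is the structural point of the GAN comparison above.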