7

I am building a neural network in which I use the sigmoid function as the activation for the single output neuron at the end. The sigmoid function maps any number to a value between 0 and 1, and this is causing a division-by-zero error in the back-propagation stage because of the derivative of the cross-entropy loss. I have seen it advised on the internet to use a sigmoid activation function together with a cross-entropy loss function.

So, how is this error solved?

  • It's not quite clear what you are asking. How will it cause a division-by-zero error? Can you describe the model in more detail? – thecomplexitytheorist Jun 02 '18 at 05:36
  • It has been suggested to me that adding a small constant to the denominator will prevent the divide-by-zero error. Contrary to the accepted answer, this does cause issues, giving outputs of infinity rather than zero. – rocksNwaves Jun 08 '20 at 21:41

1 Answer

10

Cross entropy loss is given by:

$$L = -\sum_{i}\Big[\,y_i\,\log\big(\sigma(z_i)\big) + (1-y_i)\,\log\big(1-\sigma(z_i)\big)\Big]$$

Now, as we know, the sigmoid function outputs values between 0 and 1, but what you have missed is that it cannot output exactly 0 or exactly 1: for that to happen, $z$ would have to be $-\infty$ or $+\infty$.
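Mathematically that is true, but in finite precision the rounding mentioned in the next paragraph kicks in very quickly. A minimal illustration, assuming NumPy and float64 (this snippet is not part of the original answer):

```python
import numpy as np

z = np.float64(40.0)
p = 1.0 / (1.0 + np.exp(-z))  # exp(-40) ~ 4e-18 is below float64's resolution near 1
print(p == 1.0)               # True: sigmoid(40) already rounds to exactly 1.0
print(np.log(1.0 - p))        # -inf (plus a divide-by-zero warning): the error in question
```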

Although your program raises a divide-by-zero error because very small floating-point numbers are rounded off to 0, it is practically of no importance, as it can only happen in two cases:

  1. sigmoid(z) = 0, in which case, even though the program cannot calculate log(0) (the first term in the equation), that term is ultimately multiplied by $y_i$, which will be 0, so the final answer is 0.
  2. sigmoid(z) = 1, in which case, even though the program cannot calculate log(1-1) (the second term in the equation), that term is ultimately multiplied by $1 - y_i$, which will be 0, so the final answer is 0.

There are a few ways to get past this if you don't want the error at all:

  • Increase the floating-point precision, e.g. to float64, or use arbitrary precision if available.
  • Write the program in such a way that anything multiplied by 0 is treated as 0, without evaluating the other factor.
  • Write the program to handle such cases in a special way (a sketch of the last two options follows below).
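Below is a minimal sketch of how the last two bullets might look in practice, assuming NumPy; the function names are illustrative and not from the original post. The first function never evaluates a log() term whose multiplier is 0; the second computes the loss directly from the logit $z$, which stays finite even when sigmoid(z) rounds to 0 or 1 (this is essentially what the "with logits" losses in libraries such as PyTorch do).

```python
import numpy as np

def bce_skip_zero_terms(p, y):
    """Bullet 2: never evaluate a log() term whose multiplier is 0.
    p = sigmoid(z) predictions, y = binary targets in {0, 1}."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    # Substitute a harmless dummy value wherever the corresponding log term
    # is going to be multiplied by 0 anyway, so np.log never sees 0.
    safe_p = np.where(y == 1, p, 1.0)        # log(safe_p) only matters where y == 1
    safe_q = np.where(y == 0, 1.0 - p, 1.0)  # log(safe_q) only matters where y == 0
    return -(y * np.log(safe_p) + (1.0 - y) * np.log(safe_q))

def bce_from_logits(z, y):
    """Bullet 3: handle the saturating cases by working with the logit z itself,
    using the identity
        -[y*log(sigmoid(z)) + (1 - y)*log(1 - sigmoid(z))]
            = max(z, 0) - z*y + log(1 + exp(-|z|)),
    which never takes the log of 0."""
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))

# sigmoid(40) rounds to exactly 1.0 in float64, so the naive loss would give nan.
z = np.array([40.0, -40.0])
y = np.array([1.0, 0.0])
p = 1.0 / (1.0 + np.exp(-z))
print(bce_skip_zero_terms(p, y))  # [-0. -0.] -- zero loss; log(0) is never evaluated
print(bce_from_logits(z, y))      # [~4e-18 ~4e-18] -- finite, no special-casing needed
```

Note that the masking trick only covers the two cases discussed above (the output saturating towards the correct target); if the output saturates towards the wrong target, only the from-logits form keeps the loss finite, which is also why the NOTE below about initialisation and training matters.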

Implementation side note: in most processors you cannot bypass a divide-by-zero error with a manual exception handler (AFAIK), so you have to make sure the error does not occur at all.
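One way to guarantee it never occurs in the backward pass is to use the standard simplification for this particular loss/activation pair: when the derivative of the cross-entropy is chained through the sigmoid, the factors that cause the division cancel. For a single training example, writing $\tilde y = \sigma(z)$,

$$\frac{\partial L}{\partial \tilde y} = -\frac{y}{\tilde y} + \frac{1-y}{1-\tilde y}, \qquad \frac{\partial \tilde y}{\partial z} = \tilde y\,(1-\tilde y), \qquad \frac{\partial L}{\partial z} = \frac{\partial L}{\partial \tilde y}\,\frac{\partial \tilde y}{\partial z} = \tilde y - y.$$

If the back-propagation code computes $\tilde y - y$ directly, no division by $\tilde y$ or $1-\tilde y$ is ever performed (see also the math.SE link in the comments below).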

NOTE: It is assumed that the random weight initialisation takes care that, at the beginning of training, it does not happen that $\tilde y$ or $1-\tilde y$ is 0 while the target is exactly the opposite. It is also assumed that, thanks to good training, the output gets close to the target, so that the two cases mentioned above hold.

Hope this helps!

  • The loss you mentioned is the logistic loss, not the cross-entropy loss. The logistic loss assumes binary classification, where 0 corresponds to one class and 1 to the other. Cross-entropy is used in the multi-class case, where the inputs should sum to 1; the formula is just the negative sum of each label multiplied by the log of each prediction. – Kyrylo Polezhaiev Feb 11 '20 at 10:50
  • @KyryloPolezhaiev I'm not sure of the terminology either. For example, in PyTorch "cross entropy loss" means the softmax loss, whereas the logistic/cross-entropy loss is called "binary cross entropy loss". –  Feb 11 '20 at 12:50
  • Also, if the sigmoid returns almost zero, it doesn't mean that the label y is equal to zero. The same goes for when the sigmoid returns one. The model can miss; that is what happens almost every time when training is started. – Kyrylo Polezhaiev Feb 17 '20 at 11:09
  • The sigmoid of z is the output of the model; y is the ground-truth label from the dataset that the output is compared with. – Kyrylo Polezhaiev Feb 17 '20 at 11:10
  • I have run into the OP's problem with the exact same set-up (binary cross-entropy loss and logistic sigmoid function). The terms are not simply evaluating to 0 during back-propagation, as your answer suggests; I am getting a lot of infinity values when sigmoid(z) saturates. The problem isn't trivial, but the solution is. I found it in Giang Tran's answer to this question; look at the final two lines of his derivation: https://math.stackexchange.com/questions/2503428/derivative-of-binary-cross-entropy-why-are-my-signs-not-right/2503773 – rocksNwaves Jun 07 '20 at 23:25
  • You are both right and talking about different things. The loss is binary cross-entropy; the activation typically used in that final layer for a binary/logistic problem is the sigmoid. – David Hoelzer Feb 19 '23 at 10:58