
Consider AlexNet, which has 1000 output nodes, each of which classifies an image:

[Image: AlexNet architecture]

The problem I have been having when training a neural network of similar proportions is that it does what any reasonable network would do: it finds the easiest way to reduce the error, which happens to be setting all output nodes to 0, since that is the correct value the vast majority of the time. I don't understand how a network where, 999 times out of 1000, a node's output should be 0 could possibly learn to make that node output 1.

But it's obviously possible, as AlexNet did very well in the 2012 ImageNet challenge. So I wanted to know: how would one train a neural network (specifically a CNN) when, for the majority of the inputs, the desired value of any given output node is 0?

Recessive

1 Answer


It's the loss function.

I was using sum-of-squares error, which I didn't think would have as negative an effect as it does, and I had to work out the explanation in my own time. Here's why:

From the perspective of the loss function, 999 times out of 1000 the desired output is 0, so there is an inherent, massive bias towards 0 for every output node. But this only occurs if the output nodes actually receive a gradient when their desired outputs are 0, which is what happens with the squared sum/mean error. With cross-entropy loss, however (explained excellently here), the only node that receives a gradient from the loss itself is the one that should be trained towards 1. This removes the massive bias towards 0 and punishes confident false positives, making it well suited to a classification problem.
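As a minimal sketch of the difference (assuming the network's outputs are already probabilities, e.g. after a softmax): the gradient of squared error with respect to each output pushes on every node, while the gradient of cross-entropy with respect to the outputs is zero wherever the one-hot target is zero.

```python
import numpy as np

def mse_grad(p, y):
    # dL/dp for squared error L = sum (p - y)^2:
    # every output node receives a gradient, including all the 0-target nodes
    return 2 * (p - y)

def ce_grad(p, y):
    # dL/dp for cross-entropy L = -sum y * log(p):
    # the gradient is zero wherever the one-hot target y is zero
    return -y / p

p = np.array([0.2, 0.5, 0.3])   # hypothetical network outputs (probabilities)
y = np.array([0.0, 1.0, 0.0])   # one-hot target: class 1 is correct

print(mse_grad(p, y))  # [ 0.4 -1.   0.6] -- all three nodes pushed, two towards 0
print(ce_grad(p, y))   # [-0.  -2.  -0. ] -- only the target node gets a gradient
```

(Note that when cross-entropy is combined with softmax, all logits still receive a gradient through the softmax coupling, but the loss itself only "looks at" the target node, so there is no direct pressure driving the other outputs to 0.)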

As for how you would achieve something like this for regression, I do not know, but at least this solves the issue for classification problems.

Recessive