
I wanted to use True Positives (and True Negatives) in my cost function in order to modify the ROC shape of my classifier. I was told, and have also read, that these counts are not differentiable and therefore not usable as a cost function for a neural network.

In the example where 1 is positive and 0 is negative, I deduce the following equation for True Positives ($\hat y = $ prediction, $y = $ label):

$$ TP = \mathbf{\hat{y}}^T \mathbf{y} $$ $$ \frac{\partial TP}{\partial \mathbf{y}} = \mathbf{\hat{y}} $$

The following for True Negative: $$ TN = (\mathbf{\hat{y}}-\mathbf{1})^T(\mathbf{y}-\mathbf{1}) $$ $$ \frac{\partial TN}{\partial \mathbf{y}} = \mathbf{\hat{y}} - \mathbf{1} $$

The False Positive (predicted $1$, label $0$): $$ FP = \mathbf{\hat{y}}^T(\mathbf{1}-\mathbf{y}) $$ $$ \frac{\partial FP}{\partial \mathbf{y}} = -\mathbf{\hat{y}} $$

The False Negative (predicted $0$, label $1$): $$ FN = (\mathbf{1}-\mathbf{\hat{y}})^T \mathbf{y} $$ $$ \frac{\partial FN}{\partial \mathbf{y}} = \mathbf{1}-\mathbf{\hat{y}} $$
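As a quick numerical sanity check, the four expressions can be evaluated as dot products on binary vectors (a NumPy sketch with made-up example vectors):

```python
import numpy as np

# Hypothetical binary labels and hard 0/1 predictions
y     = np.array([1, 1, 0, 0, 1], dtype=float)  # labels
y_hat = np.array([1, 0, 0, 1, 1], dtype=float)  # predictions

TP = y_hat @ y              # predicted 1, label 1
TN = (y_hat - 1) @ (y - 1)  # predicted 0, label 0
FP = y_hat @ (1 - y)        # predicted 1, label 0
FN = (1 - y_hat) @ y        # predicted 0, label 1

print(TP, TN, FP, FN)  # → 2.0 1.0 1.0 1.0 (the four counts sum to len(y))
```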

All equations seem differentiable to me. Can someone explain where I went wrong?

Teymour

1 Answer


The vector functions for true positive, false positive, etc. all make use of the "magic" numbers $0$ and $1$ used to represent Boolean values. They are conveniences that you can use in a numerical library, but you do need to be aware of the fundamentally Boolean nature of the data. The $0$ and $1$ values allow the maths for calculating TP et al., but are not fundamental to it; they are a representation.

Your derivations of gradients for the functions you give seem correct, barring the odd typo. However, the gradient doesn't really apply to the value of $\mathbf{y}$, because all components of $\mathbf{y}$ are either $0$ or $1$. The idea that you could increase $\mathbf{y}$ slightly where $\mathbf{\hat{y}}$ is $1$ in order to increase the value of the TP metric slightly has no basis. Instead, the only valid change that improves the metric is to flip false-negative predictions from $0$ to $1$ exactly.

You could probably still use your derivations as a gradient for optimisation (it would not be the only time in machine learning that something does not quite apply theoretically but you could still use it in practice). However, you then immediately hit the problem of how the values of $\mathbf{y}$ have been discretised to $0$ or $1$ - that function will not be differentiable, and it will prevent you back propagating your gradients to the neural network weights that you want to change. If you fix that follow-on problem using a smoother function (e.g. a sigmoid) then you are likely to end up with something close to either cross-entropy loss or the perceptron update step.
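To make the discretisation problem concrete, here is a small sketch (NumPy only, the function names are mine): a hard threshold has zero gradient almost everywhere, so nothing can propagate back through it, whereas a sigmoid gives a usable gradient.

```python
import numpy as np

def hard(z):
    # Discretise a logit to 0/1 -- the gradient is zero almost everywhere
    return float(z > 0)

def soft(z):
    # Sigmoid: a smooth stand-in for the threshold
    return 1.0 / (1.0 + np.exp(-z))

z, eps = 0.3, 1e-6

# Central finite differences as a stand-in for backprop gradients
grad_hard = (hard(z + eps) - hard(z - eps)) / (2 * eps)  # 0.0
grad_soft = (soft(z + eps) - soft(z - eps)) / (2 * eps)  # ~sigmoid'(0.3) ≈ 0.24
```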

In other words, although what you have been told is an over-simplification, you will not find a way to improve the performance of your classifier by adding cost functions based directly on TP, FP, etc.: that is effectively what binary cross-entropy loss is already doing. There are other, perhaps more fruitful, avenues of investigation: hyperparameter searches, regularisation, ensembling, and, if you have an unbalanced data set, weighting the positive/negative class costs.
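For the unbalanced case, that weighting can go directly into the loss. A minimal sketch of a class-weighted binary cross-entropy (the helper name and example values are mine):

```python
import numpy as np

def weighted_bce(y, p, w_pos=1.0, w_neg=1.0, eps=1e-12):
    """Binary cross-entropy with per-class weights.
    Raising w_pos penalises false negatives more; raising w_neg, false positives."""
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(w_pos * y * np.log(p) + w_neg * (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])  # labels
p = np.array([0.9, 0.2, 0.6])  # predicted probabilities

plain    = weighted_bce(y, p)             # standard BCE
fn_heavy = weighted_bce(y, p, w_pos=5.0)  # missed positives cost 5x more
```

(In PyTorch, the same effect is available via the `pos_weight` argument of `BCEWithLogitsLoss`.)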

Neil Slater
  • If I use precision as a cost function (TP/(TP+FP)) and add a sigmoid function to the inputs, does it have a chance of working (not outperforming a classical function like cross-entropy, but simply getting a better result than a random system)? – Léonard Barras Aug 30 '19 at 08:40
  • @LéonardBarras: That will depend on other details of your design, which I don't know and which would be too complex to analyse in comments or by extending the current question. Intuitively, I think you will at best end up re-inventing cross-entropy, but you might also end up with something that sort of works but is strictly worse than cross-entropy. If you want to know for sure, then build it and test it. – Neil Slater Aug 30 '19 at 08:43
  • 1
    @LéonardBarras look up perceptron Learning algo and RBMs too. They work on binary data and update is also discrete (not for RBMs). –  Aug 30 '19 at 08:43