
What loss function is most appropriate when training a model with target values that are probabilities? For example, I have a 3-output model. I want to train it with a feature vector $x=[x_1, x_2, \dots, x_N]$ and a target $y=[0.2, 0.3, 0.5]$.

It seems like cross-entropy doesn't make sense here, since it assumes that a single target class is the correct label.

Would something like MSE (after applying softmax) make sense, or is there a better loss function?

nbro

1 Answer


Actually, the cross-entropy loss function would be appropriate here, since it measures the "distance" between a predicted distribution $q$ and the "true" distribution $p$.

You are right, though, that naively using a loss function called "cross_entropy" in many APIs would be a mistake, because those implementations, as you said, assume a one-hot label (or an integer class index). What you need is the general cross-entropy function,

$$H(p,q)=-\sum_{x\in X} p(x) \log q(x).$$
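For concreteness, here is a minimal NumPy sketch of this general form applied to the question's example. The particular logits and the softmax step are illustrative assumptions, not part of the formula itself:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(p, q, eps=1e-12):
    # General cross-entropy H(p, q) = -sum_x p(x) log q(x).
    # Clip q away from zero so log() never sees 0.
    q = np.clip(q, eps, 1.0)
    return -np.sum(p * np.log(q))

logits = np.array([1.0, 2.0, 3.0])     # hypothetical model outputs
q = softmax(logits)                    # predicted distribution q
p = np.array([0.2, 0.3, 0.5])          # soft target from the question
loss = cross_entropy(p, q)
```

By Gibbs' inequality, `cross_entropy(p, q)` is minimized (down to the entropy of $p$) exactly when `q` equals `p`, which is what makes it a sensible training loss for probability-valued targets.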

Note that one-hot labels would mean that $$ p(x) = \begin{cases} 1 & \text{if }x \text{ is the true label}\\ 0 & \text{otherwise} \end{cases}$$

which causes the cross-entropy $H(p,q)$ to reduce to the form you're familiar with:

$$H(p,q) = -\log q(x_{\text{label}})$$
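You can check this reduction numerically. In this sketch, `q` is an arbitrary illustrative predicted distribution and class 1 is taken as the true label:

```python
import numpy as np

q = np.array([0.09, 0.24, 0.67])        # hypothetical predicted distribution
p_onehot = np.array([0.0, 1.0, 0.0])    # one-hot target: class 1 is the label

# General cross-entropy: -sum_x p(x) log q(x)
general = -np.sum(p_onehot * np.log(q))

# Familiar one-hot form: -log q(x_label)
familiar = -np.log(q[1])
```

The two quantities agree exactly, since the zero entries of the one-hot vector kill every term except the true label's.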

Philip Raeisghasem