
I have a binary classification problem with 2 classes. A sample is either class 1 or class 2. For simplicity, let's say the classes are mutually exclusive, so a sample is definitely one or the other.

For this reason, in my neural network, I have specified a softmax activation in the last layer with 2 outputs and categorical cross-entropy for the loss. Using TensorFlow:

import tensorflow as tf

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(units=64, input_shape=(100,), activation='relu'))
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.Dense(units=32, activation='relu'))
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.Dense(units=2, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
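
For comparison, here is the alternative I have in mind (a sketch, under the same architecture): a single sigmoid unit with binary_crossentropy, where the one output is read as the probability of class 1 and the targets are plain 0/1 labels rather than one-hot vectors.

model2 = tf.keras.models.Sequential()
model2.add(tf.keras.layers.Dense(units=64, input_shape=(100,), activation='relu'))
model2.add(tf.keras.layers.Dropout(0.4))
model2.add(tf.keras.layers.Dense(units=32, activation='relu'))
model2.add(tf.keras.layers.Dropout(0.4))
# Single unit: the output is P(class 1), and P(class 0) = 1 - output.
model2.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))

model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])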

Here are my questions.

  1. Firstly, if the sigmoid is equivalent to a 2-unit softmax, is it valid to specify 2 units with a softmax and categorical_crossentropy?

  2. Is it the same as using binary_crossentropy (in this particular use case) with 2 classes and a sigmoid activation, and if so, why?

I know that for non-exclusive multi-label problems with more than 2 classes, binary_crossentropy with a sigmoid activation is used. Why does the non-exclusivity of the multi-label case make it different from a binary classification with only 2 classes, where there is 1 output (class 0 or class 1) and a sigmoid with binary_crossentropy loss?
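
To make the multi-label setup I mean concrete, here is a sketch (num_classes is just a placeholder; each output is an independent sigmoid, so several classes can be active at once):

num_classes = 5  # placeholder for the number of non-exclusive labels
multi = tf.keras.models.Sequential()
multi.add(tf.keras.layers.Dense(units=64, input_shape=(100,), activation='relu'))
# One independent sigmoid per label: the outputs do not have to sum to 1.
multi.add(tf.keras.layers.Dense(units=num_classes, activation='sigmoid'))
multi.compile(loss='binary_crossentropy', optimizer='adam')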


1 Answer


Let's first recap the definition of the binary cross-entropy (BCE) and the categorical cross-entropy (CCE).

Here's the BCE (equation 4.90 from this book)

$$-\sum_{n=1}^{N}\left( t_{n} \ln y_{n}+\left(1-t_{n}\right) \ln \left(1-y_{n}\right)\right) \label{1}\tag{1},$$

where

  • $t_{n} \in\{0,1\}$ is the target
  • $y_n \in [0, 1]$ is the prediction (as produced by the sigmoid), so $1 - y_n$ is the probability that $n$ belongs to the other class
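
If it helps to see this as code, here is a minimal NumPy sketch of equation \ref{1} (the arrays are made-up examples):

import numpy as np

t = np.array([0.0, 1.0, 1.0, 0.0])  # targets t_n
y = np.array([0.1, 0.7, 0.8, 0.4])  # sigmoid outputs y_n

bce = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))  # equation (1)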

Here's the CCE (equation 4.108)

$$ -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{n k} \ln y_{n k}\label{2}\tag{2}, $$

where

  • $t_{n k} \in \{0, 1\}$ is the target of input $n$ for class $k$, i.e. it's $1$ when $n$ is labelled as $k$ and $0$ otherwise (so it's $0$ for all $k$ except one of them)
  • $y_{n k}$ is the probability that $n$ belongs to the class $k$, as produced by the softmax function
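
And the corresponding NumPy sketch of equation \ref{2}, here with $K = 3$ made-up classes:

import numpy as np

# One-hot targets t_{nk}: rows are inputs n, columns are classes k.
T = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
# Softmax outputs y_{nk}: each row is a probability vector.
Y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])

cce = -np.sum(T * np.log(Y))  # equation (2)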

Let $K=2$. Then equation \ref{2} becomes

$$ -\sum_{n=1}^{N} \sum_{k=1}^{2} t_{n k} \ln y_{n k} = -\sum_{n=1}^{N} \left( t_{n 1} \ln y_{n 1} + t_{n 2} \ln y_{n 2} \right) \label{3}\tag{3} $$
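
Now, the targets are one-hot, so $t_{n 2} = 1 - t_{n 1}$, and the two softmax outputs sum to one, so $y_{n 2} = 1 - y_{n 1}$. Substituting into equation \ref{3} gives

$$ -\sum_{n=1}^{N} \left( t_{n 1} \ln y_{n 1} + \left(1 - t_{n 1}\right) \ln \left(1 - y_{n 1}\right) \right), $$

which is exactly the BCE of equation \ref{1} with $t_n = t_{n 1}$ and $y_n = y_{n 1}$.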

So, if $[y_{n 1}, y_{n 2}]$ is a probability vector (which is the case if you use the softmax as the activation function of the last layer), then, in theory, the BCE and CCE are equivalent in the case of binary classification. In practice, if you are using TensorFlow, to choose the most suitable loss function for your problem, you could take a look at this answer.
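
If you want to check the equivalence numerically in TensorFlow, here is a minimal sketch (the arrays are made-up examples); the two built-in losses produce the same value when the probabilities are consistent:

import numpy as np
import tensorflow as tf

# Predicted probability of class 1 for four samples, plus binary targets.
p = np.array([[0.1], [0.7], [0.8], [0.4]], dtype=np.float32)
t = np.array([[0.0], [1.0], [1.0], [0.0]], dtype=np.float32)

# BCE on the scalar probabilities (sigmoid-style output).
bce = tf.keras.losses.BinaryCrossentropy()(t, p)

# CCE on the equivalent 2-class probability vectors [1 - p, p]
# (softmax-style output) with one-hot targets.
Y = np.concatenate([1 - p, p], axis=1)
T = np.concatenate([1 - t, t], axis=1)
cce = tf.keras.losses.CategoricalCrossentropy()(T, Y)

print(bce.numpy(), cce.numpy())  # the two values coincide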
