I have a binary classification problem with 2 classes. Each sample belongs to either class 1 or class 2; for simplicity, let's say the classes are mutually exclusive, so a sample is definitely one or the other.
For this reason, my neural network has a last layer with 2 output units and a softmax activation, and I use categorical crossentropy as the loss. Using TensorFlow:
import tensorflow as tf

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(units=64, input_shape=(100,), activation='relu'))
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.Dense(units=32, activation='relu'))
model.add(tf.keras.layers.Dropout(0.4))
# 2 output units, one per class, with softmax
model.add(tf.keras.layers.Dense(units=2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
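For categorical_crossentropy the targets are one-hot encoded into two columns; a minimal sketch of what I mean, assuming integer labels 0/1:

import numpy as np

# assumption: integer class labels 0 / 1
y = np.array([0, 1, 1, 0])

# one-hot targets with 2 columns, as expected by categorical_crossentropy
y_onehot = tf.keras.utils.to_categorical(y, num_classes=2)
# [[1. 0.]
#  [0. 1.]
#  [0. 1.]
#  [1. 0.]]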
Here are my questions.
Firstly, given that a sigmoid is supposed to be equivalent to a 2-class softmax (see the small check at the end of this post for what I mean), is it valid to specify 2 output units with a softmax and categorical_crossentropy?
Is it the same as using binary_crossentropy (in this particular 2-class use case) with a sigmoid activation, and if so, why? In other words, is the model above interchangeable with a single sigmoid output like the sketch below (same hidden layers, just the output layer and loss changed; this is only meant to illustrate the alternative I am asking about)?
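# sketch of the alternative I am asking about: 1 sigmoid unit + binary_crossentropy
model2 = tf.keras.models.Sequential()
model2.add(tf.keras.layers.Dense(units=64, input_shape=(100,), activation='relu'))
model2.add(tf.keras.layers.Dropout(0.4))
model2.add(tf.keras.layers.Dense(units=32, activation='relu'))
model2.add(tf.keras.layers.Dropout(0.4))
# single output: P(class 1); targets are plain 0/1 integers
model2.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])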
I know that for non-exclusive multi-label problems with more than 2 classes, binary_crossentropy with a sigmoid activation is used. Why does the non-exclusivity of the multi-label case make it fundamentally different from a binary classification with only 2 classes, i.e. a single output (class 0 or class 1) with a sigmoid activation and binary_crossentropy loss?
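For reference, this is the identity I have in mind when I say the sigmoid is "equivalent" to a 2-class softmax (a small numerical check with arbitrary example logits z0 and z1, just to show what I mean):

import numpy as np

# arbitrary example logits for the two classes
z0, z1 = 0.3, 1.7

# softmax probability of class 1
p_softmax = np.exp(z1) / (np.exp(z0) + np.exp(z1))

# sigmoid of the logit difference
p_sigmoid = 1.0 / (1.0 + np.exp(-(z1 - z0)))

print(p_softmax, p_sigmoid)  # both ~0.80, i.e. the same probability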