
While working through some examples from GitHub I found this network (it's for FashionMNIST, but that doesn't really matter).

PyTorch forward method (my query is in the upper-case comments, regarding applying softmax on top of ReLU):

def forward(self, x):
    # two conv/relu + pool layers
    x = self.pool(F.relu(self.conv1(x)))
    x = self.pool(F.relu(self.conv2(x)))

    # prep for linear layer
    # flatten the inputs into a vector
    x = x.view(x.size(0), -1)

    # DOES IT MAKE SENSE TO APPLY RELU HERE
    x = F.relu(self.fc1(x))

    # AND THEN Softmax on top of it ?
    x = F.log_softmax(x, dim=1)

    # final output
    return x
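
For context, a minimal module definition under which this forward pass runs end to end (the module name and layer sizes below are my own assumptions for 1x28x28 FashionMNIST inputs, not taken from the original example):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # assumed sizes: 28x28 -> conv1 -> 24x24 -> pool -> 12x12
        #                       -> conv2 ->  8x8  -> pool ->  4x4
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(20 * 4 * 4, 10)  # 320 flattened features -> 10 classes

    def forward(self, x):
        # same forward pass as in the snippet above
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = F.log_softmax(x, dim=1)
        return x

out = Net()(torch.randn(2, 1, 28, 28))
print(out.shape)  # torch.Size([2, 10])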
Jed
  • Yes, absolutely... In fact, I would probably have put another ReLU on top of it... The softmax is for classification and training purposes. – Oct 18 '18 at 10:47

1 Answer


Does it make sense?

In general, yes: it is interpretable, backpropagation will work, and the NN can be optimised.

By using ReLU, the default network has a minimum logit of $0$ for the softmax input, which means that, at least initially, there will be higher minimum probabilities associated with all classes (compared to allowing negative logits, which would happen randomly with the usual weight initialisation). The network will need to learn to produce higher logit values for correct answers, because it has no ability to produce lower logit values for incorrect answers. This is like training a network to produce the highest regression value on one output whilst clipping all values to be $0$ or above, so it does not have the option of making one output e.g. $-1.0$ and the rest $-100.0$.
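
As a rough illustration of that probability floor (my own numbers, not from the original post): with 10 classes and a largest logit of 3, clipping the other logits at 0 instead of letting them sit at -1 roughly doubles the probability each incorrect class keeps.

import torch
import torch.nn.functional as F

# one sample, 10 classes: the "correct" class gets logit 3, the rest -1
logits = torch.tensor([[3.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0]])

probs_raw = F.softmax(logits, dim=1)           # negative logits allowed
probs_relu = F.softmax(F.relu(logits), dim=1)  # negative logits clipped to 0

print(probs_raw[0, 1].item())   # ~0.016: incorrect classes can be pushed down
print(probs_relu[0, 1].item())  # ~0.034: floor of exp(0) / (exp(3) + 9) per class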

It can probably be thought of as a type of regularisation, since it puts constraints on the activation values that the network can use.

Is it needed?

That is less clear. You can try training with and without the line, using cross-validation or a test set to see whether there is a significant difference (a minimal sketch of such a comparison follows below).

If the network has been designed well, then I'd expect to see a slight improvement with the added ReLU.

If it is a mistake, then I'd expect to see no difference, or better performance without the ReLU.
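
A minimal way to set up that comparison (a sketch only; the flag and head module below are my own construction, not code from the post) is to make the extra ReLU optional and train both variants on the same splits:

import torch.nn as nn
import torch.nn.functional as F

class ClassifierHead(nn.Module):
    """Final layer of the network, with the questioned ReLU made optional."""
    def __init__(self, in_features, n_classes, relu_before_softmax=True):
        super().__init__()
        self.fc1 = nn.Linear(in_features, n_classes)
        self.relu_before_softmax = relu_before_softmax

    def forward(self, x):
        x = self.fc1(x)
        if self.relu_before_softmax:
            x = F.relu(x)            # the line under discussion
        return F.log_softmax(x, dim=1)

# Train one model with relu_before_softmax=True and one with False on the same
# train/validation split, then compare validation loss/accuracy.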

Neil Slater
  • The ReLU actually doesn't do anything, I think... The previous inputs are all positive, so they all pass through the ReLU with only a linear transformation... So I think another ReLU on top will add some non-linearity. – DuttaA Oct 18 '18 at 13:45
  • @DuttaA: The weights can be negative between layers, though, so one layer being all 0 or positive does not mean the next layer will be. Otherwise you would only need one ReLU layer for the whole network... Also, ReLU can only add meaningful non-linearity when there are negative inputs. An all-positive input is unchanged by ReLU. You cannot use ReLU to "add non-linearity" to a vector of all positive numbers; it would do nothing. – Neil Slater Oct 18 '18 at 13:59
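
To illustrate that last point (a trivial sketch, not from the comments themselves): ReLU applied to an already non-negative vector returns it unchanged, so it cannot add non-linearity there.

import torch
import torch.nn.functional as F

v = torch.tensor([0.0, 0.2, 1.5, 3.0])  # all non-negative, e.g. outputs of a previous ReLU
print(torch.equal(F.relu(v), v))         # True: ReLU acts as the identity here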