For questions related to the softmax function, which is a function that takes as input a vector of K real numbers and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. The softmax is often used as the activation function of the output layer of a neural network.
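The definition above translates almost directly into code. A minimal NumPy sketch (the max-subtraction is a standard numerical-stability trick; it cancels out mathematically but prevents overflow in the exponential):

```python
import numpy as np

def softmax(z):
    """Map a vector of K real numbers to a probability distribution
    proportional to the exponentials of the inputs.

    Subtracting max(z) leaves the result unchanged (the common factor
    cancels in the ratio) but keeps np.exp from overflowing.
    """
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
p = softmax(z)
print(p)          # K probabilities, each proportional to exp(z_i)
print(p.sum())    # sums to 1
```

Note that softmax is invariant to adding a constant to all inputs, which is exactly why the max-subtraction is safe.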
Questions tagged [softmax]
36 questions
17
votes
2 answers
Are softmax outputs of classifiers true probabilities?
BACKGROUND: The softmax function is the most common choice for an activation function for the last dense layer of a multiclass neural network classifier. The outputs of the softmax function have mathematical properties of probabilities and are--in…

Snehal Patel
- 912
- 1
- 1
- 25
6
votes
2 answers
Why do the TensorFlow docs discourage using softmax as the activation for the last layer?
The beginner Colab example for TensorFlow states:
Note: It is possible to bake this tf.nn.softmax in as the activation function for the last layer of the network. While this can make the model output more directly interpretable, this approach is…
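The note quoted above is about numerical stability: a loss computed from raw logits can use a fused log-softmax, while taking the log of an already-softmaxed output can hit exact zeros. A NumPy sketch of the failure mode (not the TensorFlow code itself; `from_logits=True` losses compute something like the stable form internally):

```python
import numpy as np

z = np.array([1000.0, 0.0, -1000.0])  # extreme logits

# Naive route: softmax first, then log. exp(z - max) underflows to
# exactly 0.0 for the unlikely classes, so the log returns -inf.
with np.errstate(divide="ignore"):
    p = np.exp(z - z.max())
    p /= p.sum()
    naive = np.log(p)

# Fused log-softmax, computed without ever materializing tiny probabilities:
# log p_i = (z_i - max(z)) - log(sum_j exp(z_j - max(z)))
stable = z - z.max() - np.log(np.sum(np.exp(z - z.max())))

print(naive)   # contains -inf for the small entries
print(stable)  # finite: approximately [0., -1000., -2000.]
```

This is why keeping the last layer linear and passing logits to the loss is the recommended pattern; softmax can still be applied at inference time for interpretability.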

galah92
- 163
- 5
5
votes
2 answers
What is the advantage of using cross entropy loss & softmax?
I am trying to do the standard MNIST dataset image-recognition test with a standard feedforward NN, but my network failed pretty badly. Now I have debugged it quite a lot and found & fixed some errors, but I had a few more ideas. For one, I am…

Ben
- 425
- 3
- 10
5
votes
1 answer
Which paper introduced the term "softmax"?
Nowadays, the softmax function is widely used in deep learning and, specifically, classification with neural networks. However, the origins of this term and function are almost never mentioned anywhere. So, which paper introduced this term?

nbro
- 39,006
- 12
- 98
- 176
4
votes
1 answer
Why are policy gradient methods more effective in high-dimensional action spaces?
David Silver argues, in his Reinforcement Learning course, that policy-based reinforcement learning (RL) is more effective than value-based RL in high-dimensional action spaces. He points out that the implicit policy (e.g., $\epsilon$-greedy) in…

Saucy Goat
- 143
- 4
2
votes
1 answer
Why do we use the softmax instead of no activation function?
Why do we use the softmax activation function on the last layer?
Suppose $i$ is the index that has the highest value (in the case when we don't use softmax at all). If we use softmax and take $i$th value, it would be the highest value because $e$ is…
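The point the excerpt is driving at, that exponentiation is strictly increasing and so softmax never changes which index is largest, is easy to verify numerically. A quick sketch (not taken from the question itself):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
for _ in range(1000):
    z = rng.normal(size=10)
    # exp is monotone, so the ranking of softmax(z) matches that of z
    assert np.argmax(softmax(z)) == np.argmax(z)
print("argmax preserved on 1000 random logit vectors")
```

So softmax does not change the predicted class; its value is that it produces a normalized distribution, which is what a cross-entropy loss needs during training.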

dato nefaridze
- 862
- 6
- 20
2
votes
2 answers
What do the authors of this paper mean by the bias term in this picture of a neural network implementation?
I am reading a paper implementing a deep deterministic policy gradient algorithm for portfolio management. My question is about a specific neural network implementation they depict in this picture (paper, picture is on page 14).
The first three…

Mike
- 141
- 4
1
vote
1 answer
Dealing with noise in models with softmax output
I have a device with an accelerometer and gyroscope (6-axis). The device sends live raw telemetry data to the model: 40 samples per input, 6 values per sample (accelerometer xyz, gyroscope xyz). The model predicts between 12 different labels of…

Sterling Duchess
- 113
- 3
1
vote
1 answer
Number of units in the final softmax layer in VGGNet16
I am trying to implement and train the VGGNet neural network model from scratch, on my own data. I am reproducing all the layers of the model. I am confused about the last, fully connected softmax layer.
In the research paper by Simonyan and…

Dawood Ahmad
- 13
- 3
1
vote
2 answers
Backpropagation with CrossEntropy and Softmax, HOW?
Let Zs be the input of the output layer (for example, Z1 is the input of the first neuron in the output layer), Os be the output of the output layer (which are actually the results of applying the softmax activation function to Zs, for example, O1 =…
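In the excerpt's notation (Zs the logits, Os = softmax(Zs), with a one-hot target y), the well-known result is that the combined softmax + cross-entropy gradient collapses to ∂L/∂Z = O − y. A NumPy sketch that checks this against central finite differences (an illustration, not the asker's code):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    # L = -sum_i y_i * log(O_i), where O = softmax(z)
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.5, -1.2, 2.0])       # logits (Zs)
y = np.array([0.0, 0.0, 1.0])        # one-hot target

analytic = softmax(z) - y            # dL/dZ = O - y

# Independent check via central finite differences
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.size):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric[i] = (cross_entropy(z + dz, y) - cross_entropy(z - dz, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # small: finite-difference error only
```

This cancellation is the practical reason softmax and cross-entropy are almost always implemented as a single fused layer during backpropagation.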

qazaq
- 11
- 2
1
vote
1 answer
Why are SVMs / Softmax classifiers considered linear while neural networks are non-linear?
My understanding is that neural networks are definitely not linear classifiers, as the point of functions like ReLU is to introduce non-linearity.
However, here's where my understanding starts to break down. A classifier, like Softmax or SVM is…

Foobar
- 151
- 5
1
vote
1 answer
Trouble writing the backpropagation algorithm in Python with cross-entropy and softmax
So I am writing my own neural network library for a class project, and I got everything working for a simple 2-class test using the distance (L2) cost function. I wanted to get a similar result using softmax and cross-entropy instead.
I did the…

user605734 MBS
- 121
- 5
1
vote
0 answers
Use softmax post-training for a ReLU-trained network?
For a project, I've trained multiple networks for multiclass classification all ending with a ReLU activation at the output.
Now the output logits are not probabilities.
Is it valid to get the probability of each class by applying a softmax function…

user452306
- 21
- 3
1
vote
1 answer
Is it normal that the values of the LogSoftmax function are very large negative numbers?
I have trained a classification network with PyTorch lightning where my training step looks like below:
def training_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    loss = F.cross_entropy(y_hat, y)
    self.log("train_loss",…
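Large negative log-softmax values are expected whenever the model assigns a class a very small probability, since log p → −∞ as p → 0. A NumPy sketch of the behavior (PyTorch's `torch.nn.functional.log_softmax` produces values with the same properties):

```python
import numpy as np

def log_softmax(z):
    # log p_i = (z_i - max(z)) - log(sum_j exp(z_j - max(z)))
    return z - z.max() - np.log(np.sum(np.exp(z - z.max())))

logits = np.array([12.0, 0.5, -9.0])
lp = log_softmax(logits)
print(lp)                 # confident class near 0, unlikely classes very negative
print(np.exp(lp).sum())   # exponentiating recovers probabilities summing to 1
```

All log-softmax outputs are at most 0, and the more confident the model, the closer the winning class is to 0 while the losing classes grow more negative.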

pd109
- 125
- 4
1
vote
1 answer
Is it appropriate to use a softmax activation with a categorical crossentropy loss?
I have a binary classification problem where I have 2 classes. A sample is either class 1 or class 2. For simplicity, let's say they are mutually exclusive, so it is definitely one or the other.
For this reason, in my neural network, I have…

user9317212
- 161
- 2
- 10