
I have a relatively small data set of $3300$ data points, each a $13$-dimensional vector. The first $12$ dimensions encode a "category" as a one-hot vector $[0,\dots,1,\dots,0]$, with the $1$ in the $i$-th position for the $i$-th category, and the last dimension is an observation of a continuous variable, so a typical data point looks like $[1,\dots,0,70.05]$. I'm not aiming for something extremely accurate, so I went with a fully connected network with two hidden layers each comprising two neurons, ReLU activations, and a single output neuron (I'm predicting one value) with no activation function on it. The optimizer is Adam, the loss is the MSE, and the metric is the RMSE.
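For illustration, here is a minimal Keras sketch of this kind of setup (the framework is an assumption on my part; the actual code may differ):

```python
import tensorflow as tf

# Illustrative sketch only: 13 inputs -> 2 ReLU units -> 2 ReLU units
# -> 1 linear output, trained with Adam on MSE and monitored with RMSE.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(2, activation="relu", input_shape=(13,)),
    tf.keras.layers.Dense(2, activation="relu"),
    tf.keras.layers.Dense(1),  # no activation on the output neuron
])
model.compile(
    optimizer="adam",
    loss="mse",
    metrics=[tf.keras.metrics.RootMeanSquaredError()],
)
```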

I get the learning curve below (figure: learning curve). Even though at the beginning the validation loss is lower than the training loss (which I don't understand), by the end it shows no sign of overfitting.

What I don't understand is why my neural network predicts the same value, $0.9747201$, whenever the $13$-th dimension is greater than $5$. If the $13$-th dimension is, say, $4.9$, the prediction is $1.0005863$. I thought it had something to do with the ReLU, but even when I switched to a sigmoid I got this "saturation" effect: the constant value is different, but I still get the same prediction once I pass a certain threshold.
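A sketch of how this can be reproduced, assuming the trained Keras `model` from above and fixing the category to the first one (both assumptions for illustration):

```python
import numpy as np

# Sweep the continuous 13th feature for a fixed category and watch the
# predictions flatten out past a threshold. `model` is the trained network.
category = np.eye(12)[0]                    # one-hot vector for category 1
sweep = np.linspace(0.0, 10.0, 21)          # values of the 13th dimension
probe = np.column_stack([np.tile(category, (len(sweep), 1)), sweep])
preds = model.predict(probe, verbose=0).ravel()
for value, pred in zip(sweep, preds):
    print(f"x13 = {value:5.2f} -> prediction {pred:.7f}")
```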

EDIT: I'd also like to add that I get this issue even after standardizing the 13th dimension (subtracting the mean and dividing by the standard deviation).

I'd also like to add that all the values in my training and validation sets are greater than $50$, in case that helps.

Daviiid

1 Answer


two hidden layers each comprising two neurons

From your description it looks like you only have 6 parameters in your inner layer (a 2x2 weight matrix + 2 biases). The whole network should be easy to interpret: in the first layer you've got two 13-dimensional weight vectors $\vec{w}_1,\vec{w}_2$ that are dot-multiplied with the inputs, plus two biases $b_1, b_2$ and an activation $\sigma$:

$$ l_1 = \sigma\left(\vec{w}_1\cdot\vec{x} + b_1\right)$$ $$ l_2 = \sigma\left(\vec{w}_2\cdot\vec{x} + b_2\right)$$

Then these two values are multiplied by the 2x2 matrix and the biases are added, followed by the activation and, finally, a linear combination at the output neuron.
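In code the whole forward pass would look roughly like this (a NumPy sketch with illustrative names and shapes, not your actual implementation):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Sketch of the forward pass described above. Shapes: x is (13,),
# W1 is (2, 13), W2 is (2, 2), w3 is (2,); b1, b2 are (2,), b3 is a scalar.
def forward(x, W1, b1, W2, b2, w3, b3, sigma=relu):
    l = sigma(W1 @ x + b1)   # first hidden layer: [l_1, l_2]
    s = sigma(W2 @ l + b2)   # second hidden layer
    return w3 @ s + b3       # linear output neuron
```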

I'd look at how $l_1$ and $l_2$ are distributed. The fact that the outputs don't change is most likely due to the first layer getting saturated somehow. Look at the 13th dimension of $\vec{w}_i$ - it is likely to be large compared to the other dimensions.
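Assuming a Keras `Sequential` model like the one sketched in the question, with the training inputs in an array `X` (both assumptions), this can be checked directly:

```python
import numpy as np
import tensorflow as tf

# Weights of the first Dense layer: the kernel has shape (13, 2) in Keras,
# so each column is one of the two 13-dimensional weight vectors.
W1, b1 = model.layers[0].get_weights()
print("|weights| per input dimension:\n", np.abs(W1))
print("first-layer biases:", b1)

# Distribution of the first-layer activations l_1, l_2 over the data.
first_layer = tf.keras.Model(model.inputs, model.layers[0].output)
acts = first_layer.predict(X, verbose=0)      # shape (n_samples, 2)
print("fraction of exactly-zero activations per neuron:", (acts == 0).mean(axis=0))
```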

The first thing I'd try is standardizing the 13th input dimension, so that it is distributed closer to the $[0,1]$ (or $[-1,1]$) range.
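For example (a sketch; `X_train` and `X_val` are assumed to hold the raw 13-dimensional inputs as NumPy arrays):

```python
# Standardize only the continuous 13th column, using the training-set
# statistics for both splits to avoid leaking validation information.
mu = X_train[:, 12].mean()
sigma = X_train[:, 12].std()
X_train[:, 12] = (X_train[:, 12] - mu) / sigma
X_val[:, 12] = (X_val[:, 12] - mu) / sigma
```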

Kostya
  • Thank you for your answer. For $w_{1}$ the largest value is at the 4th dimension, then the 13th. For $w_{2}$ it is the same, but we can't say that the value of the 13th dimension is really greater than the others; sometimes it's just double. I get this issue even though I have standardized my 13th dimension input, sorry for not making that clear in my post, I'll edit it. Do you have any other ideas for solving this issue? I'd like to add that I get null outputs from the beginning. I've taken the initialized weights and computed the outputs of the first layer. I got $0$s after applying the ReLU. – Daviiid Apr 13 '21 at 03:54
  • By the way, can I please ask how the loss is decreasing with each epoch while the outputs of the first layer are null from the beginning of training? Since the biases are initialized to zero too, it means that all outputs are zero. No weights would be updated, am I wrong? – Daviiid Apr 13 '21 at 03:57
  • @Daviiid (For the future: I'd suggest you avoid using the word "null" when talking about zero values.) It seems that you are dealing with the "dying ReLU" problem. Since you have only two neurons, both of them likely end up dead. In this case, the biases in the 2nd layer and the output layer's linear coefficients would still be able to train and converge to some average value, not accounting for your inputs. – Kostya Apr 13 '21 at 12:42
  • You're right, I shouldn't use "null" when talking about zero values. I don't quite understand how the other layers' coefficients will train. From the first layer we get $[l_{1}, l_{2}]$; then, if $W_{2}$ denotes the weight matrix of the second layer and $b_{2}$ the corresponding bias vector, we get $W_{2}[l_{1}, l_{2}] + b_{2}$. But since the biases are initialized to zero and $[l_{1}, l_{2}]=[0, 0]$, we get zero outputs through the second layer as well, am I wrong? – Daviiid Apr 13 '21 at 13:58
  • @Daviiid The outputs of the second layer go into the final layer, which, as I understand, is another linear function. So you'll have two more non-zero-initialized weights there. (If I understood your description correctly.) – Kostya Apr 13 '21 at 14:49
  • (I'd like to apologize first because I don't know how to make your name display after the @.) The second layer contains 2 neurons. If we denote by $W_{3}$ the weight matrix of the final layer, $b_{3}$ its bias, and $s_{1}, s_{2}$ the outputs of the second layer, then the final layer computes $W_{3}[s_{1}, s_{2}]+b_{3}$; but since $s_{1}, s_{2}$ are zero, we still get no update. Well, I think so, but I may be wrong. – Daviiid Apr 13 '21 at 16:00
  • @Daviiid You are right - if all the biases are 0, then you have to have 0 at the output. Are you sure your biases are 0 at the initialization? I'm afraid I'd have to look at your code to figure it out. I suggest you make a question focused on that. – Kostya Apr 13 '21 at 17:29
  • Thank you for the advice, I shall take it and make a question focused on it right now. – Daviiid Apr 14 '21 at 00:58