
I have trained an RNN, a GRU, and an LSTM on the same dataset, and looking at their respective predictions I have observed that they all display an upper limit on the value they can predict. I have attached a graph for each of the models, which shows the upper limit quite clearly. Each dot is a prediction, and the orange line is simply there to illustrate the ground truth (i.e. ground truth on both axes).

[Figures: RNN model predictions, GRU model predictions, LSTM model predictions]

My dataset is split into 60% for training, 20% for test, and 20% for validation, and each of the splits is then shuffled. The split/shuffle is the same for all three models, so each model uses the exact same split/shuffle of data for its predictions too. The models are quite simple (2 layers, nothing fancy going on). I have used grid search to find the optimal hyperparameters for each model. Each model is fed 20 consecutive inputs (a vector of features, e.g. coordinates, waiting time, etc.) and produces a single number as output, which is the expected remaining waiting time. I know this setup strongly favours the LSTM and GRU over the RNN, and the accuracy of the predictions clearly shows this too.
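To make the setup concrete, here is a minimal sketch of the windowing and 60/20/20 split described above. The dummy data, variable names, and the assumption that the target is taken at the last step of each 20-step window are placeholders for illustration only, not the real pipeline:

```python
import numpy as np

# Dummy stand-ins for the real dataset (shapes and names are assumptions).
features = np.random.rand(1000, 5)             # e.g. coordinates, waiting time, ...
remaining_wait = np.random.rand(1000) * 600.0  # remaining waiting time in seconds

WINDOW = 20  # each sample consists of 20 consecutive feature vectors

def make_windows(features, targets, window=WINDOW):
    """Stack `window` consecutive feature vectors per sample; the target is
    the remaining waiting time at the last step of the window (assumption)."""
    X, y = [], []
    for i in range(len(features) - window + 1):
        X.append(features[i:i + window])
        y.append(targets[i + window - 1])
    return np.asarray(X), np.asarray(y)

X, y = make_windows(features, remaining_wait)

# 60/20/20 split into train/test/validation, then shuffle within each split.
n = len(X)
idx_train, idx_test, idx_val = np.split(np.arange(n), [int(0.6 * n), int(0.8 * n)])
rng = np.random.default_rng(42)
for idx in (idx_train, idx_test, idx_val):
    rng.shuffle(idx)

X_train, y_train = X[idx_train], y[idx_train]
X_test,  y_test  = X[idx_test],  y[idx_test]
X_val,   y_val   = X[idx_val],   y[idx_val]
```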

However, my question is: why does each model display an upper limit on its predictions? And why does it seem like such a hard limit?

I cannot wrap my head around what the cause of this is, and so I am not able to determine whether it has anything to do with the models used, how they are trained, or if it is related to the data. Any and all help is very much appreciated!


Hyperparameters for the models are:

RNN: 128 units per layer, batch size of 512, tanh activation function

GRU: 256 units per layer, batch size of 512, sigmoid activation function

LSTM: 256 units per layer, batch size of 256, sigmoid activation function

All models have 2 layers with a dropout in between (with probability rate 0.2), use a learning rate of $10^{-5}$, and are trained over 200 epochs with early stopping with a patience of 10. All models use SGD with a momentum of 0.8, no Nesterov, and 0.0 decay. Everything is implemented using TensorFlow 2.0 and Python 3.7. I am happy to share the code used for each model if relevant.
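For reference, a rough sketch of the LSTM variant with the hyperparameters listed above (the RNN and GRU variants differ only in the recurrent layer, unit count, activation, and batch size). The exact layer arrangement is an assumption; it reuses `X_train`/`X_val` etc. from the windowing sketch earlier:

```python
import tensorflow as tf

n_features = 5  # placeholder; the real number of input features is not listed here

# Two recurrent layers with dropout in between, single regression output.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(256, activation="sigmoid", return_sequences=True,
                         input_shape=(20, n_features)),  # 20 consecutive inputs
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(256, activation="sigmoid"),
    tf.keras.layers.Dense(1),  # single output: expected remaining waiting time
])

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-5, momentum=0.8,
                                      nesterov=False, decay=0.0),
    loss="mse",
)

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          batch_size=256, epochs=200, callbacks=[early_stop])
```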


EDIT 1: I should point out that the graphs are made up of 463,597 individual data points, most of which lie very near the orange line of each graph. In fact, for each of the three models, the number of data points (out of 463,597) within 30 seconds of the orange line is:

RNN: 327,206 data points

LSTM: 346,601 data points

GRU: 336,399 data points

In other words, the upper limit on predictions shown on each graph consists of quite a small number of samples compared to the rest of the graph.
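The counts above can be reproduced with a check along these lines (a sketch reusing `model`, `X_test`, and `y_test` from the earlier sketches):

```python
import numpy as np

# Count how many predictions fall within 30 seconds of the ground truth
# (i.e. within 30 s of the orange identity line in the scatter plots).
preds = model.predict(X_test).ravel()
within_30s = np.abs(preds - y_test) <= 30
print(f"{int(within_30s.sum())} of {len(y_test)} predictions within 30 seconds")
```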

EDIT 2: In response to Sammy's comment I have added a graph showing the distribution of all predictions in 30-second intervals. The y-axis represents the base-10 logarithm of the number of samples that fall into a given 30-second interval (the x-axis). The first interval ([0; 29]) consists of approximately 140,000 predicted values, out of the roughly 460,000 predicted values in total.

[Figure: LSTM distribution of predictions]
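The distribution plot can be reproduced with something like the following sketch (`preds` is the test-set prediction array from the sketch above; the plotting details are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Histogram of predicted values in 30-second bins, log10 counts on the y-axis.
bins = np.arange(0, preds.max() + 30, 30)
counts, edges = np.histogram(preds, bins=bins)
plt.bar(edges[:-1], np.log10(np.maximum(counts, 1)), width=30, align="edge")
plt.xlabel("Predicted remaining waiting time (30-second intervals)")
plt.ylabel("log10(number of predictions)")
plt.show()
```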

  • Hi Kornephoros, a couple of questions: 1. Have you applied any normalization/standardization? 2. Are the plots based on train or validation data? And do both datasets show this pattern? 3. Am I right to assume that you did use MSE as a loss function? – Jonathan Jan 17 '20 at 14:51
  • Hi Sammy. Yes, I have applied Min-Max normalization to each input feature individually. The data used for predictions is the test set, which none of the models has seen before. The plots are based on these predictions. I only have a single dataset, but it is split into training, validation, and test in a 60/20/20 split. And yes, you are right, I have used MSE as the loss function for all three models. – Kornephoros Jan 17 '20 at 19:05
  • And how do these plots look for training data? – Jonathan Jan 17 '20 at 21:38
  • I do not have the graphs for the training data and unfortunately I am not able to produce these graphs at this moment. Can you tell me if you have any suspicion of what the cause may be? – Kornephoros Jan 20 '20 at 07:47
  • My guess would be that it is related to your asymmetric distribution of output values. As one can assume from the fact that you're looking at waiting times and from the graph, your output values come from an asymmetric distribution, more similar to a Poisson than a normal distribution. This could mean your model is biased to predict lower output values. I'd check where you're standing in terms of model capacity and overfitting/underfitting by looking at the learning curves for training and validation data. Moreover, double-check that your 3 sub-datasets have a similar distribution of output values. – Jonathan Jan 20 '20 at 14:15
  • @Sammy I don't think you could be any more spot on w.r.t. the distribution of data points. I have edited my question to include a graph showing the distribution of predictions in 30-second intervals. This clearly shows an asymmetric distribution, heavily biased towards the lower output values. Could this bias also explain why each of the three models seems to display a "hard" upper limit on its predicted values? – Kornephoros Jan 20 '20 at 20:12
  • I am not sure. Imbalanced datasets are known to yield lower performance on the rare data points. Usually that is a topic in classification (e.g. fraud detection), but you can have the same effect in regression. To test this hypothesis I'd check the learning curves and the data split as pointed out above. However, the hard stop could also be an artifact. You mentioned the activation functions that you apply, e.g. sigmoid for the LSTM. However, LSTM cells usually use sigmoid and tanh. Can you elaborate more on your network architecture or post the corresponding code? – Jonathan Jan 21 '20 at 12:44

0 Answers