
I'm training an LSTM network with multiple inputs and several stacked LSTM layers to set up a time series gap-filling procedure. The LSTM layers are bidirectional with "tanh" activation on their outputs, and a single Dense layer with "linear" activation at the end predicts the targets. The following scatterplot of the real outputs vs. the predictions illustrates the problem:

[Scatterplot: real outputs (x-axis) vs. predictions (y-axis), with a black 1:1 reference line]

The network is not performing badly overall, and I'll keep tuning the parameters in upcoming trials, but one issue always reappears: the highest outputs are underestimated and the lowest values are overestimated, in a clearly systematic way.
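For context, here is a minimal sketch of the setup (layer sizes, input shape, and the number of stacked layers are placeholders, not my exact configuration):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense

n_timesteps, n_features = 24, 6  # placeholder dimensions

model = Sequential([
    # Stacked bidirectional LSTM layers with "tanh" on their outputs.
    Bidirectional(LSTM(64, activation="tanh", return_sequences=True),
                  input_shape=(n_timesteps, n_features)),
    Bidirectional(LSTM(32, activation="tanh")),
    # A single Dense layer with "linear" activation predicts the target.
    Dense(1, activation="linear"),
])

# MSE loss with standard Adam, as in my runs.
model.compile(optimizer="adam", loss="mse")
```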

I have tried both min-max scaling and standardization (zero mean, unit variance) of the inputs and outputs; the latter performs slightly better, but the issue persists.

I've searched through existing threads and Q&As, but I haven't found anything similar.

I'm wondering if anyone recognizes this pattern and immediately knows a likely cause (activation function? preprocessing? optimizer? lack of sample weights during training? ...). If not, it would also be good to know whether this is impossible to diagnose without extensive testing.


1 Answer


An RNN is a deeply non-linear function over time, so how was the black linear line in your plot obtained?

Assume for a moment you are doing plain linear regression. If least-squares error is used as the loss function, it has a probabilistic interpretation:

$$y^{(i)}|x^{(i)};\theta \sim \mathcal N(\theta^Tx^{(i)}, \sigma^2)$$

That is, $y$ conditioned on $x$ and parameterized by $\theta$ follows a Gaussian distribution: for every data point $x^{(i)}$, the maximum likelihood estimate of the corresponding $y^{(i)}$ is just the mean of that Gaussian, and the variance $\sigma^2$ expresses the noise. Also note that mini-batch training is not guaranteed to reach a global minimum.
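To make the connection explicit, here is the standard derivation: maximizing the log-likelihood of this model over $m$ training examples is the same as minimizing the squared error,

$$\ell(\theta) = \sum_{i=1}^{m} \log\!\left[\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\left(y^{(i)}-\theta^T x^{(i)}\right)^2}{2\sigma^2}\right)\right] = m\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2}\sum_{i=1}^{m}\left(y^{(i)}-\theta^T x^{(i)}\right)^2,$$

so the $\theta$ that maximizes $\ell(\theta)$ is exactly the $\theta$ that minimizes the least-squares objective $\sum_{i=1}^{m}\left(y^{(i)}-\theta^T x^{(i)}\right)^2$.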

Side note: if you normalize the data with min-max scaling, make sure to compute the scaling parameters on the training set only. If you include the dev/test set, you are doing a form of data snooping, and your generalization-error estimate will be biased.
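A minimal sketch of the right way to do this with scikit-learn (the arrays are synthetic placeholders; the same pattern applies to the target variable):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 6))  # placeholder training features
X_test = rng.normal(size=(20, 6))    # placeholder test features

scaler = MinMaxScaler()
# Fit the scaling parameters on the training set only...
X_train_scaled = scaler.fit_transform(X_train)
# ...then apply those same train-set parameters to dev/test data.
X_test_scaled = scaler.transform(X_test)
```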

  • The black line is a 1:1 reference line. The blue dots are predictions vs. true values. I do use mini-batches, MSE is the loss function, and standard Adam is the optimizer. The implementation is in Keras with a TF backend. I'm not yet clear on what your answer implies for my case. And yes, the scaling parameters were calculated on the training set only. – Kristof May 07 '18 at 07:02
  • I see. Then normalize the inputs to have zero mean and unit variance instead of min-max scaling, and augment the LSTM with mixture density outputs, as proposed in Generating Sequences With Recurrent Neural Networks (Graves, 2013). – Fadi Bakoura May 07 '18 at 07:17
  • Also be careful with the output formulation: if you can define relative outputs instead of absolute real values, it will work much better. – Fadi Bakoura May 07 '18 at 07:25
  • What do you mean by relative as opposed to absolute real values? Min-max scaling or standard scaling was already included in the procedure. – Kristof May 18 '18 at 14:37
  • Use the relative difference between subsequent timesteps instead of their corresponding absolute values, e.g. x2 := x2 - x1, and don't forget to re-normalize the differences to zero mean and unit variance (a minimal sketch follows below). – Fadi Bakoura May 18 '18 at 14:46
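For illustration, a minimal sketch of that differencing-and-rescaling step (the series is a synthetic placeholder, not the asker's data):

```python
import numpy as np

rng = np.random.default_rng(0)
series = rng.normal(size=500).cumsum()  # placeholder time series

# Relative differences between subsequent timesteps: x_t - x_{t-1}.
diffs = np.diff(series)

# Re-normalize the differenced series to zero mean and unit variance.
mu, sd = diffs.mean(), diffs.std()
diffs_scaled = (diffs - mu) / sd

# To map predictions back to absolute values, invert both steps:
# un-scale, then cumulatively sum from the first observed value.
recovered = series[0] + np.cumsum(diffs_scaled * sd + mu)
assert np.allclose(recovered, series[1:])
```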