I am building an LSTM to predict a price chart. MAPE turned out to be the best loss function compared to MSE and MAE. The MAPE function looks like this:
$$\frac{1}{N}\sum^{N}_{i=1}\frac{|P_i - A_i|}{A_i}$$
where $P_i$ is the predicted value and $A_i$ is the corresponding actual value.
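In code the loss looks roughly like this (a minimal NumPy sketch; the `eps` guard is my own addition, not part of the formula above):

```python
import numpy as np

def mape(actual, predicted, eps=1e-8):
    """Mean absolute percentage error, as defined above."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # eps keeps the denominator away from exactly zero
    return float(np.mean(np.abs(predicted - actual) / np.maximum(np.abs(actual), eps)))
```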
For neural networks it is generally advised to scale the data to a small range close to zero, such as [0, 1]. With MAPE, a scaling range of [0.001, 1] is needed to avoid a possible division by zero.
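The scaling I do is equivalent to this (a sketch using scikit-learn's `MinMaxScaler`; the price column is illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Map prices into [0.001, 1] so the MAPE denominator cannot be exactly zero
scaler = MinMaxScaler(feature_range=(0.001, 1.0))

prices = np.array([[90.0], [120.0], [150.0], [200.0]])  # single feature column
scaled = scaler.fit_transform(prices)

# ... train the LSTM on `scaled` ...

# Invert the transform to get back to price units
restored = scaler.inverse_transform(scaled)
```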
Because of the MAPE denominator, the closer the scaled values are to zero, the larger the loss becomes for a given $|P_i - A_i|$. If, on the other hand, the data is de-scaled just before it is fed into the MAPE function, the same $|P_i - A_i|$ gives a much smaller MAPE.
Consider a hypothetical example with a batch size of 1, $|P_i - A_i| = 2$ (assume the same absolute error in both cases) and $A_i = 200$, so the scaled $A_i = 0.04$. The MAPE loss for the scaled version would be $\frac{2}{0.04} = 50$, while for the unscaled version it would be $\frac{2}{200} = 0.01$.
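Plugging these numbers into the `mape` sketch from above reproduces the gap:

```python
# Same absolute error of 2, batch size 1, scaled vs. unscaled denominator
print(mape(actual=[0.04], predicted=[2.04]))    # scaled:   2 / 0.04 = 50.0
print(mape(actual=[200.0], predicted=[202.0]))  # unscaled: 2 / 200  = 0.01
```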
This would mean that the gradient with respect to each weight is also larger in the scaled version, pushing the weights to even smaller values. Is this correct?
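My reasoning: for a single sample, the gradient of the loss with respect to the prediction is
$$\frac{\partial}{\partial P_i}\left(\frac{|P_i - A_i|}{A_i}\right) = \frac{\operatorname{sign}(P_i - A_i)}{A_i},$$
so for the same absolute error its magnitude is $1/A_i$: $1/0.04 = 25$ in the scaled example versus $1/200 = 0.005$ unscaled, and that factor propagates through backpropagation to every weight.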
My conclusion is that scaling the data when using MAPE effectively shrinks the weights more than necessary. Is that a plausible reason why I am seeing significantly better performance with the de-scaled MAPE calculation?
Note: I am not keeping the same hyperparameters for the scaled and de-scaled MAPE runs; a Bayesian optimisation is performed for both. In the de-scaled run a deeper network was preferred, whereas in the scaled MAPE run more regularisation was preferred.
Some expertise on this would be helpful.