I am building an LSTM to predict a price chart. MAPE turned out to be the best loss function compared to MSE and MAE. The MAPE function looks like this:
$$\frac{1}{N}\sum^{N}_{i=1}\frac{|P_i - A_i|}{A_i}$$
where $P_i$ is the predicted value and $A_i$ is the corresponding actual value.
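In code the loss looks roughly like this (a minimal NumPy sketch; the `eps` guard is my own addition, not part of the formula above):

```python
import numpy as np

def mape(actual, predicted, eps=1e-8):
    """Mean absolute percentage error, as defined above."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # eps keeps the denominator away from exactly zero
    return float(np.mean(np.abs(predicted - actual) / np.maximum(np.abs(actual), eps)))
```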
For neural networks it is generally advised to scale the data to a small range close to zero, such as [0, 1]. With MAPE, a scaling range of [0.001, 1] is needed to avoid a possible division by zero.
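The scaling I do is equivalent to this (a sketch using scikit-learn's `MinMaxScaler`; the price column is illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Map prices into [0.001, 1] so the MAPE denominator cannot be exactly zero
scaler = MinMaxScaler(feature_range=(0.001, 1.0))

prices = np.array([[90.0], [120.0], [150.0], [200.0]])  # single feature column
scaled = scaler.fit_transform(prices)

# ... train the LSTM on `scaled` ...

# Invert the transform to get back to price units
restored = scaler.inverse_transform(scaled)
```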
Because of the MAPE denominator, the closer the scaled values are to zero, the larger the loss becomes for a given $|P_i - A_i|$. If, on the other hand, the data is de-scaled just before it is fed into the MAPE function, the same $|P_i - A_i|$ gives a much smaller MAPE.
Consider a hypothetical example with a batch size of 1, $|P_i - A_i| = 2$ (assume the same absolute error in both cases) and $A_i = 200$, so the scaled $A_i = 0.04$. The MAPE loss for the scaled version would be $\frac{2}{0.04} = 50$, while for the unscaled version it would be $\frac{2}{200} = 0.01$.
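Plugging these numbers into the `mape` sketch from above reproduces the gap:

```python
# Same absolute error of 2, batch size 1, scaled vs. unscaled denominator
print(mape(actual=[0.04], predicted=[2.04]))    # scaled:   2 / 0.04 = 50.0
print(mape(actual=[200.0], predicted=[202.0]))  # unscaled: 2 / 200  = 0.01
```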
This would mean that the gradient with respect to each weight is also larger in the scaled version, pushing the weights to even smaller values. Is this correct?
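My reasoning: for a single sample, the gradient of the loss with respect to the prediction is
$$\frac{\partial}{\partial P_i}\left(\frac{|P_i - A_i|}{A_i}\right) = \frac{\operatorname{sign}(P_i - A_i)}{A_i},$$
so for the same absolute error its magnitude is $1/A_i$: $1/0.04 = 25$ in the scaled example versus $1/200 = 0.005$ unscaled, and that factor propagates through backpropagation to every weight.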
My conclusion is that scaling the data when using MAPE effectively shrinks the weights more than necessary. Is that a plausible reason why I am seeing significantly better performance with the de-scaled MAPE calculation?
Note: I am not keeping the same hyperparameters for the scaled and de-scaled MAPE runs; a Bayesian optimisation is performed for both. In the de-scaled run a deeper network was preferred, whereas in the scaled MAPE run more regularisation was preferred.
Some expertise on this would be helpful.