
I am trying to predict pseudo-random numbers from the past numbers using a multilayer perceptron. The error while training is very low. However, as soon as I evaluate it on a test set, the model overfits and returns very bad results: both the correlation coefficient and the error metrics are poor.

What would be some of the ways to solve this issue?

For example, if I train it with 5000 rows of data and test it with 1000, I get:

Correlation coefficient                  0.0742
Mean absolute error                      0.742 
Root mean squared error                  0.9407
Relative absolute error                146.2462 %
Root relative squared error            160.1116 %
Total Number of Instances             1000     

As mentioned, I can train it with as many training samples as I want and the model still overfits. If anyone is interested, I can provide/generate some data and post it online.

nbro
Damir Olejar
  • Did you mention your data is pseudo-random? If that is the case, the network must overfit, as we currently cannot predict pseudo-random numbers using a neural network; otherwise this would break all the cryptography our society relies on – Clement Jan 05 '20 at 22:24
  • Also what is the output you are going to predict? – Clement Jan 05 '20 at 22:26
  • @ClementHui excellent point and you guessed the topic in question too, the output is just a single number from -1 to 1. Regardless, I would like to learn if there are any solutions that might even slightly suggest improved results. – Damir Olejar Jan 05 '20 at 22:32
  • So you are trying to predict pseudo random numbers – Clement Jan 05 '20 at 22:34
  • @ClementHui let's say yes, however, I can filter the noise out... in that case, any suggestion on which filter to apply (or any other method)? – Damir Olejar Jan 05 '20 at 22:35
  • 1
    Please refer to this https://ai.stackexchange.com/q/3850/23713 – Clement Jan 05 '20 at 22:37
  • Added an answer. Hope I can help you @Damir Olejar – Clement Jan 06 '20 at 02:36
  • Btw is this a possible duplicate of https://ai.stackexchange.com/q/3850/23713 @nbro – Clement Jan 06 '20 at 02:41
  • @ClementHui Not a duplicate, but for simplicity's sake I said that it is (pseudo) random, given the noisy data. – Damir Olejar Jan 06 '20 at 04:51

1 Answer


Simply put, predicting pseudo-random numbers is just not possible for now. Modern pseudo-random number generators have high enough "randomness" that their output cannot be predicted. Pseudo-random numbers are the basis of modern cryptography, which is widely used on the World Wide Web and beyond. Prediction may become possible in the future through faster computers and stronger AI, but for now it is not. If you train a model to fit pseudo-random numbers, the model will just overfit, creating exactly the scenario shown in the question: the training loss will be very low while the test loss will be extremely high. The model just "remembers" the training data instead of generalising to all pseudo-random numbers, hence the high test loss.

Also, as a side note, loss is not represented as a percentage; it is just a raw numeric value.
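For context on the percentages in the question's output, here is a hedged sketch of how a Weka-style "relative absolute error" is typically computed (the function name is mine, and I am assuming the usual definition: the model's absolute error divided by that of a constant mean-predicting baseline). Values above 100%, like the 146% in the question, mean the model does worse than that trivial baseline.

```python
def relative_absolute_error(actual, predicted):
    """Assumed Weka-style metric: model MAE relative to a predict-the-mean
    baseline, expressed as a percentage."""
    mean = sum(actual) / len(actual)
    model_err = sum(abs(a - p) for a, p in zip(actual, predicted))
    baseline_err = sum(abs(a - mean) for a in actual)
    return 100.0 * model_err / baseline_err

actual    = [1.0, 2.0, 3.0, 4.0]
predicted = [2.5, 2.5, 2.5, 2.5]  # constant guess at the mean
print(relative_absolute_error(actual, predicted))  # 100.0: exactly the baseline
```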

See this stack exchange answer for details.

codeblooded
Clement
  • >"Also, as a side note, loss is not represented by %, instead it is just a raw numeric value." Thanks, but that came as a default with the software :-) – Damir Olejar Jan 06 '20 at 04:46