
I am trying to deploy a machine learning solution into a client's online application. One thing they requested is that the solution be able to learn online, because the problem may be non-stationary and they want the model to track that non-stationarity. I have thought about this a lot; would the following work?

  1. Set the learning rate (step-size parameter) for the neural network to a small, fixed (non-decaying) value, so that recent training examples keep a meaningful weight without any single update moving the model too far.
  2. Update the model only once per day, in a mini-batch fashion. Each mini-batch will contain data from that day mixed with data from the original data set, to prevent catastrophic interference. By using mini-batch updates, I am less prone to biasing the model toward the latest examples and completely forgetting the training examples from months ago (a rough sketch of such an update follows below).
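
Here is a minimal sketch of what I have in mind for step 2, assuming NumPy arrays for the data and a Keras-style model that exposes `train_on_batch`; the names (`daily_update`, `replay_fraction`, etc.) are placeholders of my own, not anything from the client's system:

```python
import numpy as np

def daily_update(model, new_day_x, new_day_y, orig_x, orig_y,
                 batch_size=64, replay_fraction=0.5, n_batches=100):
    """Update the model once per day on mini-batches that mix the day's
    new examples with a replay sample from the original training set."""
    n_replay = int(batch_size * replay_fraction)   # examples drawn from the original data
    n_new = batch_size - n_replay                  # examples drawn from today's data
    for _ in range(n_batches):
        new_idx = np.random.randint(0, len(new_day_x), size=n_new)
        old_idx = np.random.randint(0, len(orig_x), size=n_replay)
        batch_x = np.concatenate([new_day_x[new_idx], orig_x[old_idx]])
        batch_y = np.concatenate([new_day_y[new_idx], orig_y[old_idx]])
        # One SGD step at the fixed, low learning rate from step 1.
        model.train_on_batch(batch_x, batch_y)
```

The `replay_fraction` would control how much of each mini-batch comes from the original data set versus the new day's data.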

Would this set-up be "stable" for online/incremental machine learning? Also, should I set up the update step so that it samples uniformly across the distribution of my predicted variable, so the model gets an "even" update (i.e., does not overfit to the most probable predicted values)?
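
For that second question, this is roughly what I mean by sampling "evenly" across the predicted variable: a rough sketch, assuming a continuous target that gets binned into quantiles, with the bin count and per-bin sample size purely illustrative:

```python
import numpy as np

def stratified_sample(x, y, n_per_bin=16, n_bins=10):
    """Draw roughly the same number of examples from each quantile bin of y,
    so rare target values are represented as often as common ones."""
    edges = np.quantile(y, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.digitize(y, edges[1:-1])             # bin index in [0, n_bins - 1]
    idx = []
    for b in range(n_bins):
        members = np.where(bins == b)[0]
        if len(members) == 0:
            continue
        # Sample with replacement only if the bin is smaller than n_per_bin.
        idx.append(np.random.choice(members, size=n_per_bin,
                                    replace=len(members) < n_per_bin))
    idx = np.concatenate(idx)
    return x[idx], y[idx]
```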

  • Is your model going to be a neural network? – nbro Apr 16 '19 at 10:02
  • Ideally, I wanted it to be, because the system is quite complex. Is it easier to do for polynomial / linear models? – Rui Nian Apr 16 '19 at 13:36
  • Ideally, the model should be trained only with new examples. This might help - https://datascience.stackexchange.com/questions/12761/should-a-model-be-re-trained-if-new-observations-are-available. – Supratim Haldar Apr 17 '19 at 11:26
  • Hi Supratim, thanks for the reply. However, training with only the new examples causes catastrophic interference, especially for time series processes where states at time t and t + 1 are very similar. Imagine a robot having to learn 3 different tasks, where each task takes 1 week. Because the robot works on one task for a straight week, if it learns only from the new examples, the model will completely overfit to the newest data and forget the other tasks. I am trying to build an algorithm similar to adaptive resonance theory, except with a cheaper maintenance cost. – Rui Nian Apr 17 '19 at 13:43
  • Hi Rui, that's a good point. It's an interesting question, and I'm watching it too and hope to see a well-explained answer soon. – Supratim Haldar Apr 17 '19 at 15:06
  • @Rui Nian what do you mean by catastrophic interference? – naive Apr 20 '19 at 10:08
  • And, *Also, should I set up the update step so it samples data from all distributions of my predicted variable uniformly so it gets an "even" update...* what do you mean by this? – naive Apr 20 '19 at 10:10
  • The gradient descent update is biased towards the most recent experiences because the step size parameter (learning rate) is constant. Because of this, training will learn more recent experiences well and forget past experiences. That is catastrophic interference: https://en.wikipedia.org/wiki/Catastrophic_interference – Rui Nian Apr 20 '19 at 17:23
  • For the second point, when performing gradient descent, if you sample data randomly, most of the data you sample will be close to the mean, given that your distribution is Gaussian. Therefore, experiences that are rare will never get sampled and will be forgotten. By sampling across the whole distribution of your data, you get a truer estimate of your data and do not overfit to the most common data points. It's similar to the relationship between the sample mean and the actual mean: the sample mean will always be somewhat off, but if you sample carefully, you can estimate the mean more accurately. – Rui Nian Apr 20 '19 at 17:25

0 Answers