I have been working with basic sequence-to-sequence models for machine translation and have achieved decent results, but inevitably some translations are suboptimal or just flat-out incorrect. I am wondering whether there is some way of "correcting" the model when it makes mistakes without compromising its behavior on translations where it previously performed well.
As an experiment, I took a model I had previously trained and gathered several examples of translations where it performed poorly. I put those examples into their own small training set, pairing each one with a more desirable translation than the one the model was producing, and then briefly trained the old model on this new set (3-6 training steps was all it took to "learn" the new material). When I tested the updated model, it translated those examples exactly as I had specified. But, as I should have anticipated, the model overcompensated and effectively "memorized" that handful of new examples, and it started to perform poorly on translations it had previously handled very well.
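For concreteness, here is roughly what that correction pass looked like. This is only a sketch: it uses a Hugging Face MarianMT model as a stand-in for my own seq2seq model, and the correction pairs and the chosen language pair are placeholders, not my actual data.

```python
# Sketch of the correction pass: brief fine-tuning on a handful of hand-corrected pairs.
# MarianMT is a stand-in for my own seq2seq model; the pairs below are placeholders.
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"   # placeholder language pair
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

corrections = [
    ("source sentence the model mistranslates", "the translation I actually want"),
    # ... a handful more hand-corrected pairs
]

sources = [s for s, _ in corrections]
targets = [t for _, t in corrections]
batch = tokenizer(sources, text_target=targets, return_tensors="pt", padding=True)
batch["labels"][batch["labels"] == tokenizer.pad_token_id] = -100  # ignore padding in the loss

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for step in range(5):                        # a few steps were enough to "learn" the corrections
    optimizer.zero_grad()
    loss = model(**batch).loss               # cross-entropy against the corrected targets
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.4f}")
```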
Is there some way to avoid this behavior short of simply retraining the model from scratch on an updated data set? I understand intuitively that neural networks do not lend themselves to small, precise corrections (i.e. changing the weights of even a few neurons can change the behavior of the entire model), but maybe there is a way around it, perhaps with some type of hybrid reinforcement learning approach.
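To make the regression I am describing concrete, this is roughly how I checked for it: score a held-out set of sentences the old model previously translated well, before and after the correction pass, and compare. The snippet reuses `model`, `tokenizer`, and `torch` from the sketch above; the held-out pairs and the use of sacreBLEU are just for illustration.

```python
# Quantifying the forgetting: BLEU on previously-good translations,
# measured before and after the correction pass above.
import sacrebleu

held_out = [
    ("a source sentence the old model handled well", "its reference translation"),
    # ... more previously-good examples
]

src = [s for s, _ in held_out]
refs = [r for _, r in held_out]

model.eval()
with torch.no_grad():
    inputs = tokenizer(src, return_tensors="pt", padding=True)
    generated = model.generate(**inputs, max_new_tokens=128)
hyps = tokenizer.batch_decode(generated, skip_special_tokens=True)

# A large drop relative to the pre-correction score is the "forgetting" I observed.
print(sacrebleu.corpus_bleu(hyps, [refs]).score)
```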
Update:
This paper discusses approaches to incrementally improving neural machine translation models.