Assume you have a pretrained transformer language model (M1) that has subsequently undergone reinforcement learning from human feedback (RLHF), yielding M2. I assume it is in principle possible to continue pretraining after RLHF on some additional documents, e.g. high-quality scientific papers (say, all of arXiv), yielding M3.
My question is: would there be a difference in the quality of answers to scientific questions (ones covered by the additional documents) between M3 (pretraining+RLHF+pretraining) and a model M4 for which pretraining on the additional documents is continued directly from M1, with RLHF applied only afterwards (pretraining+pretraining+RLHF)?
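To make the M2 → M3 step concrete, here is a minimal sketch of what I mean by "continuing pretraining" on an RLHF'd checkpoint, assuming the Hugging Face Transformers/Datasets stack; the checkpoint name, data file, and hyperparameters are placeholders, not a recommendation.

```python
# Sketch: continue causal-LM pretraining on an RLHF'd checkpoint (M2 -> M3).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

checkpoint = "my-org/pretrained-plus-rlhf"   # hypothetical M2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Additional documents (e.g. scientific papers) as plain text.
papers = load_dataset("text", data_files={"train": "arxiv_papers.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = papers.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="m3", per_device_train_batch_size=2,
                           num_train_epochs=1, learning_rate=1e-5),
    train_dataset=tokenized,
    # Standard next-token (causal LM) objective, i.e. the same loss as in pretraining.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

For M4 the same loop would simply be run on the original M1 checkpoint, with RLHF applied afterwards.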
The question concerns the interference between pretraining and RLHF. RLHF of course has to be performed on a pretrained model (and does not affect the linguistic and implicit world knowledge of that model too much), but continuing pretraining only after RLHF might cause more trouble: both general world knowledge and the behaviour learned during RLHF might be degraded (catastrophic forgetting), while the scientific knowledge might not advance much.
Is there an argument that makes it plausible that pretraining+RLHF+pretraining should work well, or, on the contrary, that it would not work at all?