How (if it is possible at all) can rewards (from reinforcement learning) be used to generate data for supervised learning? This is a very topical question, because human feedback usually comes in the form of a single-number rating, yet this rating has to be used to update models that were trained with supervised learning (including the masked-data approach).
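To make the question more concrete, here is a minimal sketch of the kind of conversion I have in mind (the prompt, candidates, and reward function are made-up placeholders, not a reference to any particular method): outputs rated by a scalar reward are either filtered or weighted and then treated as ordinary supervised targets.

```python
import numpy as np

# Hypothetical example: `candidates` stands in for several model outputs
# for one prompt, and `reward` for a scalar rating of each output.
prompt = "some prompt"
candidates = ["response A", "response B", "response C"]

def reward(text: str) -> float:
    # Placeholder reward; in practice this would be a human rating
    # or a learned reward model's score.
    return float(len(text))

scores = np.array([reward(c) for c in candidates])

# Option 1: reward filtering ("best-of-n" / rejection sampling):
# keep only the highest-reward output as a supervised target.
best = candidates[int(scores.argmax())]
supervised_pair = {"input": prompt, "target": best}

# Option 2: keep every output, but weight its supervised loss
# by softmax(reward / temperature) (reward-weighted regression).
temperature = 1.0
weights = np.exp(scores / temperature)
weights /= weights.sum()

print(supervised_pair)
print(dict(zip(candidates, weights)))
```

Is something like this (or a more principled version of it) the standard way to turn rewards into supervised data?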
I am just starting to explore this topic, and so far I have found the following entry points into this area:
- the answer to my previous question Can supervised learning be recast as a reinforcement learning problem? pointed towards a negative answer, but that article is certainly outdated;
- the article https://www.assemblyai.com/blog/how-chatgpt-actually-works/ about reinforcement learning from human feedback (RLHF) may give some clues;
- https://arxiv.org/abs/1912.02875 is a relevant article that may contain the exact answer to my question, and I am starting to read it (possibly I will write an answer based on it if no other answers appear);
- https://www.youtube.com/watch?v=fZNyHoXgV7M is a good video that applies reward-based data augmentation to supervised learning; it initially promised to show the connection between supervised and reinforcement learning, but it stopped at reward-based augmentation. Still, the KL divergences (forward and reverse) may be promising for building such a conversion between the two approaches (see the note right after this list).
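To spell out my (possibly incorrect) reading of the KL remark above: if a reward $r(x)$ induces a target distribution $p^*(x) \propto \exp(r(x)/\beta)$, then minimizing the forward KL $\mathrm{KL}(p^* \,\|\, q_\theta)$ reduces to reward-weighted maximum likelihood, i.e. ordinary supervised training with sample weights proportional to $\exp(r(x)/\beta)$, whereas minimizing the reverse KL $\mathrm{KL}(q_\theta \,\|\, p^*)$ gives the usual RL-style objective of expected reward plus an entropy term. The forward direction would be exactly the "rewards → supervised data (or sample weights)" conversion I am asking about.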
I just have a sense that this should be possible and that I am simply not aware of some important line of work.