
In the typical RL/MDP framework, I have offline data of $(s, a, r, s')$ tuples from expert Atari gameplay.

I'm looking to train a CNN to predict $r$ based on $(s, a)$.

The states are represented by a $4 \times 84 \times 84$ image of the Atari screen, where 4 represents 4 sequential frames, and $84 \times 84$ is the size of the image. The action is an integer from 0 to 3.

I'm not sure how best to merge these two inputs $(s, a)$. How should I incorporate the action into the CNN?

Snowball
  • So, are you trying to create some kind of inverse RL algorithm? – nbro Jan 02 '22 at 12:43
  • The conventional way is to have a network with an output of size $1 \times 4$ (one per action), predicting the reward corresponding to each action. Since the action space is small, this would be a good bet. If you really want to merge $s$ and $a$, you might want to encode $a$ as a one-hot vector and concatenate it with the flattened 1D feature vector from the CNN (a sketch of this appears after the comments). – Sooryakiran Pallikulathil Jan 02 '22 at 18:38
  • My question can be taken out of the RL context for simplicity. Imagine I had just images (states in the RL example) and additional information, say integers (actions in the RL example), giving a slight hint as to how "good" these images are (rewards in the RL example). How would you add the "additional information" to the CNN? Currently, I just add it to the last linear layer as an additional feature. Are there any obvious pitfalls in doing that? – Snowball Jan 02 '22 at 21:52
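
To make the concatenation idea from the comments concrete, here is a minimal PyTorch sketch. The `RewardCNN` name, the DQN-style convolutional trunk, and the layer sizes are illustrative assumptions, not anything specified in the question: the action is one-hot encoded and concatenated with the flattened convolutional features before the final linear layers that predict the scalar reward.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardCNN(nn.Module):
    """Predicts r from (s, a): s is a 4x84x84 frame stack, a is an integer in [0, 3]."""

    def __init__(self, num_actions=4):
        super().__init__()
        self.num_actions = num_actions
        # DQN-style convolutional trunk for 4x84x84 Atari frame stacks (an assumed choice).
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        conv_out = 64 * 7 * 7  # flattened feature size for an 84x84 input
        # The one-hot action is concatenated with the flattened image features.
        self.head = nn.Sequential(
            nn.Linear(conv_out + num_actions, 256), nn.ReLU(),
            nn.Linear(256, 1),  # scalar reward prediction
        )

    def forward(self, state, action):
        # state: (B, 4, 84, 84) float tensor; action: (B,) long tensor of action indices
        features = self.conv(state)
        action_onehot = F.one_hot(action, num_classes=self.num_actions).float()
        x = torch.cat([features, action_onehot], dim=1)
        return self.head(x).squeeze(-1)  # (B,) predicted rewards


# Example usage with random data
model = RewardCNN()
s = torch.randn(8, 4, 84, 84)
a = torch.randint(0, 4, (8,))
r_pred = model(s, a)  # shape (8,)
```

The alternative from the first suggestion would be to drop the action input entirely and give the network a 4-dimensional output, one predicted reward per action, indexing into it with the taken action during training.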

0 Answers