I am currently working with visual environments in reinforcement learning (RL) and have noticed differing preprocessing practices for image inputs. Specifically, in the Atari environments, a common approach is to first convert the RGB frames to grayscale and then stack them, so the number of channels in the final observation equals the number of stacked frames.
In the DeepMind Control Suite (DMC), on the other hand, the common practice is to stack the raw RGB frames directly. In this case, the number of channels in the final observation equals the number of stacked frames times three (one per RGB channel).
Here is a snippet of grayscale processing and frame stacking in the Atari environment:
...
observation = rgb2gray(observation)                  # (H, W, 3) -> (H, W)
observation = stack_frames(observation, stack_size)  # -> (H, W, stack_size)
...
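For concreteness, here is a minimal NumPy sketch of what I mean by the Atari-style pipeline. The `rgb2gray` helper and the frame buffer below are simplified stand-ins I wrote for illustration, not any particular library's actual wrapper code:

```python
import numpy as np
from collections import deque

def rgb2gray(frame):
    # Luminance-weighted sum over the RGB channels: (H, W, 3) -> (H, W)
    return frame @ np.array([0.299, 0.587, 0.114])

stack_size = 4
frames = deque(maxlen=stack_size)

# Simulate 4 consecutive 84x84 RGB observations
for _ in range(stack_size):
    rgb = np.random.randint(0, 256, (84, 84, 3)).astype(np.float32)
    frames.append(rgb2gray(rgb))

# Stack the grayscale frames along a new channel axis
observation = np.stack(frames, axis=-1)
print(observation.shape)  # (84, 84, 4): channels == stack_size
```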
And here is a snippet of frame stacking in the DMC environment:
...
observation = stack_frames(observation, stack_size)
...
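And the DMC-style counterpart, where the raw RGB frames are concatenated along the channel axis (again a simplified sketch of my own; actual DMC pixel wrappers may use a channel-first layout such as `(3 * stack_size, H, W)` instead):

```python
import numpy as np
from collections import deque

stack_size = 3
frames = deque(maxlen=stack_size)

# Simulate 3 consecutive 84x84 RGB observations
for _ in range(stack_size):
    frames.append(np.random.randint(0, 256, (84, 84, 3), dtype=np.uint8))

# Concatenate the RGB frames along the existing channel axis
observation = np.concatenate(frames, axis=-1)
print(observation.shape)  # (84, 84, 9): channels == stack_size * 3
```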
I have two main questions about this:
- Why are there different practices for frame stacking in the Atari and DMC environments?
- Are there any guiding principles or criteria for deciding when grayscale processing should be applied to the image inputs in different visual environments?
Any insights or recommendations on this matter would be greatly appreciated.