In most of the RL algorithms I've seen, there is a coefficient that reduces action exploration over time to help convergence.
But in actor-critic methods for continuous action spaces (A3C, DDPG, ...), the implementations I've come across (mostly using an Ornstein-Uhlenbeck process) produce noise that is correlated over time but never decreased.
The action noise is clipped to [-1, 1] and added to policy outputs that also lie in [-1, 1]. So I don't understand how this can work in environments with hard-to-obtain (sparse) rewards.
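For reference, here is a minimal sketch of the kind of exploration scheme I mean (a typical DDPG-style OU noise setup; the class name, the theta/sigma/dt defaults, and the placeholder policy output are just my illustration, not from any specific library):

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated, mean-reverting noise.
    Hypothetical sketch of the usual DDPG-style exploration noise; the
    theta/sigma/dt defaults are common choices, not tied to one implementation."""

    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu = mu * np.ones(action_dim)
        self.theta = theta      # mean-reversion rate
        self.sigma = sigma      # noise scale (kept constant, never annealed)
        self.dt = dt
        self.state = np.copy(self.mu)

    def reset(self):
        # Typically called at the start of each episode
        self.state = np.copy(self.mu)

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, I)
        dx = (self.theta * (self.mu - self.state) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape))
        self.state = self.state + dx
        return self.state


# The noise is clipped to [-1, 1], added to the deterministic policy output
# (also in [-1, 1]), and the result is clipped back to the action range.
noise = OUNoise(action_dim=2)
mu_action = np.array([0.3, -0.7])                 # placeholder policy output
exploration = np.clip(noise.sample(), -1.0, 1.0)
action = np.clip(mu_action + exploration, -1.0, 1.0)
```

Note that sigma stays fixed for the whole training run, which is exactly the part I find surprising.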
Any thoughts on this?