
Is it possible to use a particular exploration strategy (e.g. metaheuristics) in on-policy algorithms (e.g. PPO), or is it only possible to define particular exploration policies in off-policy algorithms (e.g. TD3)?

Pulse9

1 Answer


In part it depends on the on-policy method you are using. In general you are not free to change the policy arbitrarily for on-policy policy gradient methods such as PPO or A3C.

However, if you are willing to consider the added exploration strategy as part of the current target policy, and can express it mathematically, you should be able to add an exploration term to on-policy approaches:

  • For value-based on-policy methods like SARSA, there is no requirement to base the current policy only on the learned value function. However, you will probably want to reduce the influence of the exploration heuristic over time, otherwise the algorithm may not converge. A simple way to do this would be to weight the heuristic for each action and add it to the current value estimates when deciding the greedy action, slowly decaying the weight of the heuristic down to zero over time (a minimal sketch follows this list).

  • For policy gradient methods, the adjustment is harder. Your heuristic needs to be introduced under the control of a parameter of the policy function, and should be differentiable. You might be able to do this simply and directly, but it will depend on the details. For some exploration functions it may not be possible at all.
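
A minimal sketch of the first point (SARSA-style selection), assuming your heuristic supplies a per-action bonus array; the names `q_values` and `heuristic_bonus` and the $\frac{1}{\sqrt{t}}$ envelope are illustrative choices, not part of any library:

```python
import numpy as np

def select_action(q_values, heuristic_bonus, t):
    """Greedy SARSA-style action choice over bonus-adjusted values.

    q_values, heuristic_bonus: 1-D arrays of shape (n_actions,)
    t: current time step, used to decay the heuristic's influence.
    """
    weight = 1.0 / np.sqrt(t + 1)            # envelope decaying from 1 towards 0
    scores = q_values + weight * heuristic_bonus
    return int(np.argmax(scores))            # greedy w.r.t. the adjusted values
```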

Perhaps a more tractable approach for on-policy policy gradient methods would be to pre-train the policy network to approximate the heuristic function. The policy would then start out like the heuristic and evolve towards optimal control. This works provided your heuristic outputs a probability distribution that depends only on the current state, without additional information such as the number of times the same action has already been taken.
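
If that route appeals, here is a hedged sketch of the pre-training step: the policy network is fitted to the heuristic's action distribution with a cross-entropy loss before normal on-policy training (e.g. PPO or A3C) begins. The sizes and the stand-in `heuristic_probs` below are assumptions for illustration.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, batch_size = 8, 4, 64                # example sizes only

def heuristic_probs(states):
    """Stand-in for your heuristic: any action distribution that depends only on state."""
    return torch.softmax(states[:, :n_actions], dim=-1)

policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

for step in range(1000):                                 # pre-training loop
    states = torch.randn(batch_size, obs_dim)            # ideally sampled from the environment
    target = heuristic_probs(states)                     # heuristic's distribution for each state
    log_probs = torch.log_softmax(policy_net(states), dim=-1)
    loss = -(target * log_probs).sum(dim=-1).mean()      # cross-entropy to the heuristic
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```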

If you want to explore using a heuristic without decaying it or otherwise losing the exploration in the longer term, but still target an optimal policy, then you must use an off-policy method. Most off-policy methods are extensions of on-policy versions, adjusted to deal with the split between behaviour and target policies. They are necessarily more complex as a result.

If you want to use a custom exploration function with policy gradients, then you may have some luck adjusting Deep Deterministic Policy Gradient (DDPG), where the exploration function is already a separate component and could be replaced - there are already a couple of variants in use, and a sketch of this structure follows.
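
As an illustration, DDPG-style action selection usually has the shape sketched below, with the noise object as a separate, swappable component (the interfaces here are assumed; the original DDPG paper used Ornstein-Uhlenbeck noise, while TD3 uses uncorrelated Gaussian noise):

```python
import numpy as np

class GaussianNoise:
    """Uncorrelated Gaussian exploration noise (the TD3-style choice)."""
    def __init__(self, action_dim, sigma=0.1):
        self.action_dim, self.sigma = action_dim, sigma
    def sample(self):
        return np.random.normal(0.0, self.sigma, self.action_dim)

class OUNoise:
    """Ornstein-Uhlenbeck process (the choice in the original DDPG paper)."""
    def __init__(self, action_dim, theta=0.15, sigma=0.2):
        self.theta, self.sigma = theta, sigma
        self.state = np.zeros(action_dim)
    def sample(self):
        self.state += -self.theta * self.state + self.sigma * np.random.normal(size=self.state.shape)
        return self.state.copy()

def act(actor, state, noise, low=-1.0, high=1.0):
    """Deterministic actor output plus additive exploration noise (assumed interfaces)."""
    return np.clip(actor(state) + noise.sample(), low, high)
```

Replacing `noise` with your own heuristic object (anything exposing a `sample()` method, possibly state-dependent) leaves the rest of the DDPG update untouched, because the learning updates are already off-policy with respect to the added exploration noise.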

Neil Slater
  • Could you explain a bit more about "reduce the influence of the exploration heuristic over time" (mentioned in the SARSA paragraph)? Is it like decaying epsilon over time in an epsilon-greedy policy? – Cloudy Mar 24 '22 at 12:43
  • @Cloudy: Yes, I am suggesting some kind of decay, or multiplying the heuristic by an "envelope function" that drops from 1 towards 0 over time, e.g. $\frac{1}{\sqrt{t}}$ – Neil Slater Mar 24 '22 at 14:01