
David Silver argues, in his Reinforcement Learning course, that policy-based reinforcement learning (RL) is more effective than value-based RL in high-dimensional action spaces. He points out that the implicit policy (e.g., $\epsilon$-greedy) in Q-learning or Sarsa has to search the action space for a maximum of $Q_\pi$ at each time step, which may be infeasible when there are many actions.

However, from my understanding, policy gradient methods that use a neural network to represent the policy choose their actions from the output of a softmax:

$$ \pi(a|s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_b e^{h(s, b, \theta)}}, $$

where the denominator normalizes between all possible actions $b\in \mathcal{A}$ and $h(s, a, \theta)$ is a numerical preference (notation from the Reinforcement Learning book by Sutton and Barto) that is usually represented by a neural network.

By using a softmax in action preferences, don't we compute $\pi(a|s, \theta)$ for all actions, ending up with the same computational complexity as in value-based methods?
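
For concreteness, here is a minimal sketch of what I mean; the linear `W @ state` preferences and the sizes are arbitrary placeholders for $h(s, a, \theta)$, not taken from any particular implementation:

```python
# Minimal sketch of a softmax policy over discrete actions.
# The linear preferences W @ state stand in for h(s, a, theta).
import numpy as np

rng = np.random.default_rng(0)

n_actions = 1000                          # size of the discrete action space
W = rng.normal(size=(n_actions, 8))       # placeholder preference parameters

def policy(state):
    """Return pi(.|s, theta) as a length-n_actions probability vector."""
    h = W @ state                         # one preference per action
    h -= h.max()                          # for numerical stability
    e = np.exp(h)
    return e / e.sum()                    # softmax: the sum runs over ALL actions

state = rng.normal(size=8)
probs = policy(state)
action = rng.choice(n_actions, p=probs)   # sampling also receives the full vector
```

Both the normalization and the final sampling call touch a vector whose length equals the number of actions, which is exactly the cost I am asking about.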

Saucy Goat
  • Different but interesting related [question](https://ai.stackexchange.com/questions/27849/are-policy-gradient-methods-good-for-large-discrete-action-spaces) – Saucy Goat Dec 16 '22 at 12:58

1 Answer


The above softmax over action preferences is used in policy gradient methods with (large) discrete action spaces. For continuous spaces with infinitely many actions, a Gaussian distribution is often used instead, so you do not need to compute a potentially large number of numerical preferences via a neural network or any other model.
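
As an illustration, here is a minimal sketch of such a Gaussian policy for a one-dimensional continuous action; the two-layer network and its sizes are made up purely for this example:

```python
# Minimal sketch of a Gaussian policy for a 1-D continuous action:
# the network outputs a mean and a log standard deviation, and acting
# is a single draw -- no per-action preference is ever enumerated.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 4)), np.zeros(16)   # made-up hidden layer
W2, b2 = rng.normal(size=(2, 16)), np.zeros(2)    # outputs (mean, log_std)

def gaussian_policy(state):
    hidden = np.tanh(W1 @ state + b1)
    mean, log_std = W2 @ hidden + b2
    return mean, np.exp(log_std)

state = rng.normal(size=4)
mean, std = gaussian_policy(state)
action = rng.normal(mean, std)                    # sample one continuous action
```

Acting costs one forward pass and one draw, independent of how finely you would have had to discretize the action space.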

In the softmax case you can sometimes simply fit the preferences as a linear function of a feature vector $x(s, a)$, which is often much simpler than computing action values via Monte Carlo, TD, or DP, all of which have much higher time complexity (a small sketch of this linear-preference setup follows the quote below). In such cases policy gradient methods are indeed more effective to learn with, because the policy function is in fact much easier to approximate than the action-value function, or because certain types of domain knowledge can be incorporated into the policy function, as explained in the paper Why Most Decisions Are Easy in Tetris:

> Most of the problems are easy in the following sense: One can choose well among the available actions without knowing an evaluation function that scores well in the game. This is a consequence of three conditions that are prevalent in the game: simple dominance, cumulative dominance, and noncompensation. These conditions can be exploited to develop faster and more effective learning algorithms. In addition, they allow certain types of domain knowledge to be incorporated with ease into a learning algorithm.
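
To make the linear-preference case above concrete, here is a minimal sketch; the feature function $x(s, a)$, the sizes, and the parameters $\theta$ are invented for the example:

```python
# Minimal sketch of softmax-in-action-preferences with linear preferences
# h(s, a, theta) = theta . x(s, a), over a small discrete action set.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=6)                # made-up policy parameters

def x(state, action):
    """Hypothetical state-action feature vector."""
    return np.concatenate([state * (action + 1), [action, 1.0]])

def softmax_policy(state, n_actions=4):
    h = np.array([theta @ x(state, a) for a in range(n_actions)])
    e = np.exp(h - h.max())               # numerically stable softmax
    return e / e.sum()

probs = softmax_policy(rng.normal(size=4))
action = rng.choice(4, p=probs)
```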

Finally, if your softmax has to employ action values to compute the preferences, as in most actor-critic methods, then yes, it has the same computational complexity per time step as the corresponding value-based method, and that you cannot avoid. But the softmax additionally allows the agent to explore different actions by sampling from the resulting distribution, rather than always selecting the action with the highest estimated value, so it can find stochastic optimal policies in problems with heavy function approximation, which "deterministic" value-based methods cannot.
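
For example, here is a minimal sketch contrasting the two selection rules over the same (made-up) action-value estimates:

```python
# Minimal sketch: greedy value-based selection vs. softmax sampling
# over the same made-up action-value estimates q for one state.
import numpy as np

rng = np.random.default_rng(0)
q = np.array([1.0, 1.2, 0.9, 1.1])            # action-value estimates

greedy_action = int(np.argmax(q))             # value-based: always the maximizer

tau = 0.5                                     # softmax temperature
z = q / tau
probs = np.exp(z - z.max())
probs /= probs.sum()
sampled_action = rng.choice(len(q), p=probs)  # stochastic: explores by construction
```

Both rules read the same vector of estimates, but only the softmax yields a stochastic policy.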

mohottnad
  • It's not the complexity of the function approximator I'm confused about. Rather, at "inference" time, the softmax output vector has the same dimensionality as the number of actions for a given state, so the complexity of choosing an action ends up being the same whether we're using a policy or value-based method. (Apparently that's not the case and policy-based methods are more efficient, I just don't see why.) – Saucy Goat Dec 17 '22 at 16:05
  • See my latest edit if that helps. – mohottnad Dec 17 '22 at 22:48
  • Thanks for the reply! So if I read correctly, you _don't_ think policy-based methods are more efficient at choosing an action than value-based methods (setting aside the complexity of learning the preferences themselves)? – Saucy Goat Dec 18 '22 at 12:40
  • The complexity of learning the preferences themselves using action values in the softmax, as explained above, is the same as for, say, Q-learning or Sarsa. Setting aside this complexity, once you have the (approximated) softmax at hand at each time step, you just need to *sample one* action from it and proceed to the policy parameter update; there is no need to worry about all the other numerous actions. – mohottnad Dec 18 '22 at 17:34
  • Additionally, though it has the same computational complexity as the value-based approach in your case, the softmax function has the advantage of being able to smoothly interpolate between exploitation (selecting the action with the highest action-value) and exploration (selecting actions randomly with probability proportional to their action-values). This can improve the efficiency of learning by allowing the agent to balance exploration and exploitation. – mohottnad Dec 18 '22 at 22:09
  • What do you mean by _sample one_ action from the softmax? I think that's where my confusion lies. The "output" of the agent (which we can think of as a neural network) is a probability distribution over actions, which we can think of as a vector of probabilities. Then to select an action for a given state, we do a forward pass through the agent (network) and get a vector of probabilities (or preferences) for each action. To then select an action, we need to search the entire vector looking for which action is assigned the greatest preference. Is this correct? – Saucy Goat Dec 19 '22 at 14:41
  • The softmax just acts as a probability mass function over actions, not a vector you have to scan. It's like a binomial distribution histogram: to select an action at each time step, you just sample randomly according to that probability distribution. – mohottnad Dec 19 '22 at 21:42
  • We can think of a probability mass function as a one-to-one mapping from each possible value (in this case, each action) of a random variable to its probability. In that case, we can always know the probability of choosing a particular action, but how can we "sample" (i.e., choose an action) without checking the probabilities of all the actions? – Saucy Goat Dec 29 '22 at 19:29
  • Once you have the (approximate) softmax as a probability mass function, by whatever means, there are many statistical ways to sample according to a specific probability distribution, such as inverse-transform sampling or Metropolis-Hastings (see the sketch after these comments). In practice, e.g., in Python, we can use the random or scipy.stats packages, such as random.gauss(5, 2) or scipy.stats.binom.rvs(n=10, p=0.5), which hide all the algorithmic complexity. – mohottnad Dec 29 '22 at 19:45
  • So the reason that policy gradient methods are more effective for high-dimensional action spaces is that they can use these algorithms for efficient sampling from a PMF, while $\epsilon$-soft policies are stuck with having to look at all the possible actions to choose the one that maximizes value? – Saucy Goat Dec 30 '22 at 13:27
  • Exactly makes sense for me! – mohottnad Dec 30 '22 at 20:03
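
As a minimal sketch of the sampling step discussed in the comments above (inverse-transform sampling from a categorical PMF; the probabilities are made up, and `rng.choice(len(probs), p=probs)` is the one-line equivalent):

```python
# Minimal sketch of inverse-transform sampling from a categorical PMF,
# i.e. drawing one action from pi(.|s, theta) without any argmax search.
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.6, 0.2, 0.1])        # softmax output for one state

def sample_action(pmf):
    u = rng.random()                          # uniform draw in [0, 1)
    return int(np.searchsorted(np.cumsum(pmf), u))  # first index whose cumulative probability reaches u

action = sample_action(probs)
```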