
I've been reading these two papers from Haarnoja et al.:

  1. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
  2. Reinforcement Learning with Deep Energy-Based Policies

As far as I can tell, Soft Q-Learning (SQL) and SAC appear very similar. Why is SQL not considered an Actor-Critic method, even though it has an action value network (critic?) and policy network (actor?)? I also cannot seem to find a consensus on the exact definition of an Actor-Critic method.

1 Answer


Indeed, SQL is very similar to an actor-critic method: it has a soft Q-function critic network with parameters $\theta$ and an actor policy network with parameters $\phi$. In fact, the paper "Equivalence Between Policy Gradients and Soft Q-Learning" by Schulman et al. proves an equivalence between the gradient of soft Q-learning under the maximum entropy RL framework and the policy gradient with entropy regularization, and it further remarks on soft Q-learning:

Haarnoja et al. [2017] work in the same setting of soft Q-learning as the current paper, and they are concerned with tasks with high-dimensional action spaces, where we would like to learn stochastic policies that are multi-modal, and we would like to use Q-functions for which there is no closed-form way of sampling from the Boltzmann distribution $\pi(a|s) \propto \exp(Q(s,a)/\tau)$. Hence, they use a method called Stein Variational Gradient Descent to derive a procedure that jointly updates the Q-function and a policy $\pi$, which approximately samples from the Boltzmann distribution—this resembles variational inference, where one makes use of an approximate posterior distribution.
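
For context, the fixed point that soft Q-learning targets looks roughly like this (in the spirit of Haarnoja et al. 2017, with temperature $\alpha$; notation may differ slightly from the papers). The integral over continuous actions has no closed form for a general Q network, which is why an approximate (SVGD-based) sampler is needed:

$$
V_{\mathrm{soft}}(s) = \alpha \log \int \exp\!\left(\tfrac{1}{\alpha} Q_{\mathrm{soft}}(s,a)\right) da,
\qquad
\pi^*(a \mid s) = \exp\!\left(\tfrac{1}{\alpha}\big(Q_{\mathrm{soft}}(s,a) - V_{\mathrm{soft}}(s)\big)\right),
$$
$$
Q_{\mathrm{soft}}(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s'}\!\left[V_{\mathrm{soft}}(s')\right].
$$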

That said, in Haarnoja et al.'s own words (from your first reference), the subtle difference between SQL and an actor-critic method is that the actions sampled by the SVGD network do not directly affect the soft Q-function update: the critic updates its parameters from minibatches of experience drawn from the replay memory, rather than through the usual advantage-based coupling with the actor. If the sampler does not approximate the target distribution accurately enough, the method may be unstable and fail to converge to the optimal (stochastic) policy.

Although the soft Q-learning algorithm proposed by Haarnoja et al. (2017) has a value function and actor network, it is not a true actor-critic algorithm: the Q-function is estimating the optimal Q-function, and the actor does not directly affect the Q-function except through the data distribution. Hence, Haarnoja et al. (2017) motivates the actor network as an approximate sampler, rather than the actor in an actor-critic algorithm. Crucially, the convergence of this method hinges on how well this sampler approximates the true posterior.
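
To make "the actor does not directly affect the Q-function" concrete, here is a minimal PyTorch sketch, not taken from either paper: the network sizes, the uniform proposal used to estimate the soft value, and the deterministic SAC/DDPG-style actor are simplifications I'm assuming purely for illustration. It contrasts where the actor enters each update:

```
import torch
import torch.nn as nn

obs_dim, act_dim, alpha, gamma, n = 3, 2, 0.2, 0.99, 16

# Toy critic and actor (sizes are arbitrary, for illustration only).
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))

# A fake replay minibatch (random placeholders for s, a, r, s').
s  = torch.randn(32, obs_dim)
a  = torch.randn(32, act_dim)
r  = torch.randn(32, 1)
s2 = torch.randn(32, obs_dim)

# --- SQL-style critic update ----------------------------------------------
# The target is the soft Bellman backup.  The soft value
# V(s') = alpha * log E[exp(Q(s', a')/alpha)] is estimated with actions from
# a fixed proposal (uniform here), NOT by differentiating through the actor:
# the actor influences this update only through what ends up in the buffer.
with torch.no_grad():
    a2 = torch.rand(32, n, act_dim) * 2.0 - 1.0               # uniform proposal in [-1, 1]
    s2_rep = s2.unsqueeze(1).expand(-1, n, -1)                 # (32, n, obs_dim)
    q2 = q_net(torch.cat([s2_rep, a2], dim=-1)).squeeze(-1)    # (32, n)
    v2 = alpha * (torch.logsumexp(q2 / alpha, dim=1, keepdim=True)
                  - torch.log(torch.tensor(float(n))))         # crude soft-value estimate
    target = r + gamma * v2
sql_critic_loss = ((q_net(torch.cat([s, a], dim=-1)) - target) ** 2).mean()

# --- SAC/DDPG-style actor update --------------------------------------------
# Here the actor's gradient flows THROUGH the Q-function, so the critic
# directly shapes the policy.  (A real SAC actor is stochastic and adds a
# log-prob term; this deterministic version only shows the gradient path.)
a_pi = torch.tanh(actor(s))
sac_actor_loss = (-q_net(torch.cat([s, a_pi], dim=-1))).mean()

sql_critic_loss.backward()   # gradients reach only q_net
sac_actor_loss.backward()    # gradients reach the actor *through* q_net
print(float(sql_critic_loss), float(sac_actor_loss))
```

In the SQL branch the actor never appears in the critic's computation graph, whereas in the SAC/DDPG-style branch the actor's loss is literally a function of the critic, which is the "true actor-critic" coupling the quote is pointing at.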

mohottnad
  • Thank you for the detailed answer. If I'm understanding this correctly, is the main point that the sampled actions used for updates to the critic network have to come directly from the actor (on-policy) and not from a replay pool (off-policy) for it to be a true actor-critic method? – frances_farmer Mar 23 '23 at 11:08
  • Not necessarily: off-policy DDPG borrows DQN's replay buffer to uniformly sample minibatch experiences and is still an actor-critic method (a replay buffer only implies off-policy learning, not actor-critic). Here Haarnoja et al. simply classify a true actor as one that *directly* affects the Q-function, as in DDPG via the usual policy-gradient/advantage coupling, rather than *indirectly* via a separate gradient ascent on some performance metric used to approximate the Q-function's distribution, as in SQL. They also note the connection between a MAP variant of SQL and DDPG in the same paper, so in my view the distinction is not all that clear-cut. – mohottnad Mar 23 '23 at 20:58
  • 1
    I see. I think that makes it clearer, thank you! – frances_farmer Mar 28 '23 at 10:44