
I am currently using Proximal Policy Optimization (PPO) to solve my RL task. However, after reading about Soft Actor-Critic (SAC), I am now unsure whether I should stick with PPO or switch to SAC. Moreover, from this post, it seems that much of the performance in the original PPO paper comes from code-level optimizations rather than the novel clipped objective.

The main characteristics of my RL task are the following:

  • The action space is discrete. SAC was originally designed for continuous action spaces but, if I'm not mistaken, it can be adapted to discrete action spaces without much trouble.
  • I am trying to learn a policy for generating synthetic data (i.e., generating novel graphs), so diversity is key. For this reason, I want to learn a policy with as much entropy as possible (while still solving the task). Both PPO (through an entropy bonus in its objective) and SAC (through its maximum-entropy formulation) encourage high policy entropy.
  • Obtaining trajectories to train the policy is very expensive: my algorithm spends much more time collecting trajectories than training the deep neural network of the policy. Here, I think SAC is the clear winner, as it is off-policy whereas PPO is on-policy. Still, PPO is supposed to be quite sample-efficient.

Given my current needs, do you think it is worth it to switch to SAC instead of PPO?

Aeryan
  • Rather than writing a very generic title like "PPO vs SAC for discrete action spaces", please, put your **specific question** in the title! Thanks. – nbro Jun 29 '22 at 22:29
  • Thanks for the comment. Do you think something like *Does SAC perform better than PPO in sample-expensive tasks with discrete action spaces?* would do or is it too long? – Aeryan Jul 01 '22 at 14:19
  • I don't think it's too long, but I don't know if it's accurate because I didn't fully read your post. If it's accurate, it seems good ;) – nbro Jul 01 '22 at 16:15
  • Okay, just changed the title ^^ – Aeryan Jul 03 '22 at 14:54
  • PPO does not solve a maximum entropy objective as far as I know. – Rémy Hosseinkhan Boucher Jan 30 '23 at 20:29

1 Answer


First, both SAC and PPO can be used with continuous and discrete action spaces. However, for discrete action spaces, the SAC cost functions must first be adapted. As explained in this Stable Baselines3 issue, implementing this efficiently is not a trivial task.
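To make this concrete, here is a minimal sketch of the kind of adaptation involved (assuming PyTorch; the function name and tensor shapes are illustrative, not taken from the linked issue): with a discrete policy, the expectation over actions in the actor loss can be computed exactly from the action probabilities, instead of being approximated with reparameterized samples as in continuous SAC.

```python
import torch
import torch.nn.functional as F

def discrete_sac_policy_loss(logits, q1, q2, alpha):
    """Illustrative actor loss for a discrete-action SAC variant.

    logits: (batch, n_actions) raw policy outputs
    q1, q2: (batch, n_actions) critic estimates for every action
    alpha:  entropy temperature (scalar tensor)
    """
    probs = F.softmax(logits, dim=-1)          # pi(a|s) for every action
    log_probs = F.log_softmax(logits, dim=-1)  # log pi(a|s)
    min_q = torch.min(q1, q2)                  # clipped double-Q estimate
    # E_{a ~ pi}[ alpha * log pi(a|s) - Q(s, a) ], summed exactly over
    # the discrete actions rather than estimated from sampled actions.
    return (probs * (alpha * log_probs - min_q)).sum(dim=-1).mean()
```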

Contrary to your hypothesis, off-policy algorithms such as SAC are generally more sample-efficient than on-policy algorithms such as PPO, which discard the collected data after each policy update. However, if you are looking for faster wall-clock training, PPO is usually the better option.
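The data reuse works through a replay buffer: every (expensive) transition is stored and can be sampled for many gradient updates, whereas an on-policy method collects a fresh batch with the current policy, runs a few epochs over it, and then discards it. A minimal sketch, with illustrative names:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal replay buffer: the mechanism that lets off-policy methods
    like SAC reuse each environment transition many times."""

    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size=256):
        # Transitions gathered by older policies remain valid training data,
        # so one environment step can feed many gradient updates.
        return random.sample(self.storage, batch_size)
```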

Moreover, as explained by Haarnoja et al. (2018), SAC provides a stochastic actor while being more sample-efficient than on-policy methods such as A3C or PPO. It is also less sensitive to hyperparameters than those methods.
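For reference, the maximum-entropy objective that SAC optimizes (Haarnoja et al., 2018) can be written as

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[\, r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big],$$

where the temperature $\alpha$ trades off return against policy entropy; a larger $\alpha$ yields a more stochastic policy, which is directly relevant to the diversity requirement in your question.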

Depending on the problem you are dealing with, this Reddit thread may provide further guidance. Finally, here is a post with some additional comparisons of DRL algorithms.

In conclusion: for a generic problem I would recommend SAC, as it has shown better performance on most problems. However, if raw performance matters less to you than training speed and lightness of implementation, I recommend PPO. Another option would be to use both and compare their performance on your specific problem, which could also be quite interesting. Finally, if you are open to more alternatives, don't rule out TD3 (bearing in mind that it was designed for continuous action spaces) if none of the above convinces you!

Antonio
  • Actually, the SAC cost functions do not need to be adapted as per the linked article. You can obtain differentiable samples from a discrete distribution using the Gumbel-Softmax trick (see the sketch at the end of this thread). – David Jul 03 '22 at 16:30
  • @Antonio I have read both the Medium post you linked and the article https://arxiv.org/pdf/1910.07207.pdf, explaining how to adapt SAC to discrete action settings, and it seems quite straightforward. The only modification to the critic, actor, and temperature cost functions is that the expectation over policy actions can now be computed exactly, since we have a discrete probability distribution (whereas before it had to be approximated with samples). So I don't know why the Stable Baselines3 issue says the discrete SAC implementation is not trivial... – Aeryan Jul 07 '22 at 00:06
  • @DavidIreland It's true that the Gumbel-Softmax trick allows gradients to flow across stochastic nodes (i.e., nodes where a discrete probability distribution is sampled) but, related to my previous comment, it seems like there is no point in doing that. Instead, if we simply modify the cost functions (for discrete SAC) we can compute the expectation over policy actions exactly, which reduces the variance in the estimation of the policy objective. This is explained in arxiv.org/pdf/1910.07207.pdf (paragraph above Equation 10). – Aeryan Jul 07 '22 at 00:13
  • There is a point. For instance, if you have both continuous and discrete actions, then it is natural to sample from your discrete actions using the Gumbel-Softmax trick without making any of the aforementioned modifications. If you have only discrete actions, then it is not advisable to use SAC when DQN and its variants perform so well. – David Jul 07 '22 at 10:11
  • @DavidIreland I understand your point. However, even if only discrete actions are used, I think SAC is still a sensible alternative to DQN. The authors of the discrete SAC paper (arxiv.org/pdf/1910.07207.pdf) show that the performance of discrete SAC is very similar to that of Rainbow, even without tuning the SAC hyperparameters. That, along with the fact that implementing discrete SAC seems simpler than implementing Rainbow or other complex DQN variants, seems like a good reason for using SAC over DQN (at least in some scenarios). – Aeryan Jul 07 '22 at 14:06
  • @Aeryan I agree; as you pointed out, it might be the necessary modification of the entropy estimation in the implementation, plus some other details, that make the adaptation not straightforward but still quite intuitive? – Rémy Hosseinkhan Boucher Jan 30 '23 at 20:34
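For completeness, here is a minimal sketch of the Gumbel-Softmax sampling mentioned by David above (assuming PyTorch; the helper name is illustrative). It draws an approximately one-hot, differentiable sample from the policy's categorical distribution, so gradients can flow through the sampled action without modifying the SAC cost functions:

```python
import torch
import torch.nn.functional as F

def sample_discrete_action(logits, tau=1.0):
    """Draw a differentiable (approximately one-hot) action sample from a
    categorical distribution via the Gumbel-Softmax trick."""
    # hard=True returns a one-hot sample in the forward pass while keeping
    # soft gradients in the backward pass (straight-through estimator).
    return F.gumbel_softmax(logits, tau=tau, hard=True)

logits = torch.randn(4, 6, requires_grad=True)  # batch of 4 states, 6 actions
actions = sample_discrete_action(logits)
actions.sum().backward()                        # gradients flow to the logits
print(actions.shape, logits.grad is not None)
```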