Why does Soft Actor-Critic with automatic temperature tuning use only a single dual variable?

Question

In Section 5 of the paper “Soft Actor-Critic Algorithms and Applications”, the authors propose optimizing the policy subject to the constraint that the entropy of the action distribution be at least a specified value $H_0$ at every time step:

$ \text{argmax}_{\pi}{\left[\sum_{t=0}^{T}{r(s_t,a_t)}\right]}\ s.t.\ \mathbb{E}\left[-\log{\pi(a_t|s_t)} \right]\geq H_0\ \forall t $

This is then converted to a dual problem, and the temperature parameter $\alpha$ is essentially the dual variable in the Lagrangian. However, I don’t understand why the authors use only a single dual variable $\alpha$. Since the constraint applies to every $t$, the Lagrangian should be:

$ L = \sum_{t=0}^{T}{r(s_t,a_t)} + \sum_{t=0}^{T}{\alpha_t \left(\mathbb{E}_{a_t\sim \pi,\, s_t\sim p_s}\left[-\log \pi(a_t|s_t)\right]-H_0\right)} $

So there should be multiple $\alpha_t$ to solve for. Mathematically, how do we end up optimizing only a single $\alpha$ in the algorithm?

The temperature doesn’t vary per time step. Look at the original paper, where they don’t introduce entropy tuning: it is a fixed hyperparameter. When they introduce tuning of the temperature, it is still a single scalar value we’d like to learn; it doesn’t vary with time. – David, Feb 06 '23 at 00:03
@DavidIreland The two versions of SAC use two different objective functions. The first version of SAC (https://arxiv.org/abs/1801.01290) maximizes the weighted sum of reward and entropy, and there is only one alpha. The second version (https://arxiv.org/abs/1812.05905) uses another objective to tune alpha: maximize the sum of rewards subject to constraints on the action entropy at each timestep. So in the Lagrangian there should be a different alpha for each timestep. Conceptually I know there should be one alpha, but I just don't know why using a single alpha is mathematically correct. – Cloudy, Feb 06 '23 at 10:29
The optimisation objective that they introduce for $\alpha$ is minimised at each time step. So there should be only one dual variable; we just solve a _new_ objective (depending on $t$) at each time step. To relate this to your question: we actually solve a different constrained optimisation problem at each time step, and each of those problems has a single dual variable. – David, Feb 06 '23 at 10:41
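
For what it's worth, my reading of Section 5 of the second paper is that the derivation does keep one dual variable per step: working backwards from $t=T$, each $\alpha_t^*$ minimises $\mathbb{E}_{a_t\sim\pi_t^*}\left[-\alpha_t\log\pi_t^*(a_t|s_t)-\alpha_t H_0\right]$. The practical algorithm then appears to replace the whole sequence with one shared scalar, updated by stochastic gradient steps on minibatches that mix states from all time steps. Below is a minimal NumPy sketch of that shared update, not the authors' code; the function names, the learning rate and the clipping at zero are my own assumptions.

```python
import numpy as np


def alpha_loss_and_grad(alpha, log_probs, target_entropy):
    """Return J(alpha) = mean(-alpha * log_pi - alpha * H0) and dJ/dalpha.

    log_probs: log pi(a|s) for sampled actions, pooled over all time steps.
    target_entropy: the entropy lower bound H0 from the constraint.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    # dJ/dalpha = (sample estimate of the policy entropy) - H0
    grad = float(np.mean(-log_probs - target_entropy))
    loss = alpha * grad  # J is linear in alpha
    return loss, grad


def update_alpha(alpha, log_probs, target_entropy, lr=1e-2):
    """One projected gradient step on J(alpha) (hypothetical helper).

    Clipping at zero keeps the dual variable non-negative; this is an
    assumption of the sketch, not something taken from the paper.
    """
    _, grad = alpha_loss_and_grad(alpha, log_probs, target_entropy)
    return max(alpha - lr * grad, 0.0)


# Entropy estimate is 1.0 here, below the target of 2.0, so alpha increases.
alpha = update_alpha(0.5, log_probs=[-1.0, -1.0], target_entropy=2.0)
```

If the sampled actions have less entropy than $H_0$, the gradient is negative and $\alpha$ grows, which pushes the policy towards more entropy; otherwise $\alpha$ shrinks. Many implementations optimise $\log\alpha$ instead of clipping, to keep $\alpha$ positive.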
