
In the paper "Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask", they learn a mask for the network by setting up the mask parameters as $M_i = \mathrm{Bern}(\sigma(v_i))$, where $M$ is the parameter mask ($f(x; \theta, M) = f(x; M \odot \theta)$), $\mathrm{Bern}$ is a Bernoulli sampler, $\sigma$ is the sigmoid function, and $v_i$ is a trainable parameter.

In the paper, they learn $v_i$ using SGD. I was wondering how they managed to do that, since the Bernoulli distribution has no reparameterization trick like the ones available for some other distributions trained on in the literature (for example, the normal distribution).
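
To make the setup concrete, here is a minimal sketch of the forward pass I have in mind (PyTorch; `MaskedLinear` and the frozen-weight choice are my own illustration, not code from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Toy layer: the weights are frozen and only the mask logits v are trained."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features),
                                   requires_grad=False)
        self.v = nn.Parameter(torch.zeros(out_features, in_features))  # one v_i per weight

    def forward(self, x):
        p = torch.sigmoid(self.v)   # sigma(v_i)
        m = torch.bernoulli(p)      # M_i ~ Bern(sigma(v_i)); this sample is non-differentiable
        return F.linear(x, m * self.weight)
```

Calling backward through this forward pass gives no useful gradient for $v_i$, which is exactly what I don't understand.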

mshlis

1 Answer


I heard back from the authors of the paper.

As expected, the Bernoulli sampler is non-differentiable, so as an approximation they replace its gradient with the gradient of its expectation.

$$
\begin{align*}
\frac{dL}{dv_i} &= \frac{dL}{d\,\mathrm{Bern}(\sigma(v_i))} \cdot \frac{d\,\mathrm{Bern}(\sigma(v_i))}{d\sigma(v_i)} \cdot \frac{d\sigma(v_i)}{dv_i} \\
&\approx \frac{dL}{d\,\mathrm{Bern}(\sigma(v_i))} \cdot \frac{d\,E[\mathrm{Bern}(\sigma(v_i))]}{d\sigma(v_i)} \cdot \frac{d\sigma(v_i)}{dv_i} \\
&= \frac{dL}{d\,\mathrm{Bern}(\sigma(v_i))} \cdot \frac{d\sigma(v_i)}{d\sigma(v_i)} \cdot \frac{d\sigma(v_i)}{dv_i} \\
&= \frac{dL}{d\,\mathrm{Bern}(\sigma(v_i))} \cdot 1 \cdot \frac{d\sigma(v_i)}{dv_i}
\end{align*}
$$

So the answer ended up being as simple as that.
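
To see what that looks like in practice, here is a minimal sketch, assuming PyTorch and a toy loss (the `BernoulliST` class is my own naming, not the authors' code): it samples in the forward pass and, in the backward pass, uses $\frac{d\,E[\mathrm{Bern}(p)]}{dp} = 1$, i.e. passes the incoming gradient straight through.

```python
import torch

class BernoulliST(torch.autograd.Function):
    """Bernoulli sample in the forward pass; in the backward pass, approximate
    dBern(p)/dp by dE[Bern(p)]/dp = 1, so the incoming gradient passes through."""
    @staticmethod
    def forward(ctx, p):
        return torch.bernoulli(p)

    @staticmethod
    def backward(ctx, grad_output):
        # gradient of the expectation E[Bern(p)] = p w.r.t. p is 1
        return grad_output

# Usage: gradients now reach the mask logits v through sigma(v)
v = torch.zeros(5, requires_grad=True)
m = BernoulliST.apply(torch.sigmoid(v))   # M_i = Bern(sigma(v_i))
loss = m.sum()                            # stand-in for the real training loss
loss.backward()
print(v.grad)                             # dL/dm * 1 * sigma(v)(1 - sigma(v))
```

This is essentially the straight-through estimator: sample on the forward pass, differentiate the expectation on the backward pass.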

mshlis