In the paper "Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask", they learn a mask for the network by setting up the mask parameters as $M_i = \mathrm{Bern}(\sigma(v_i))$, where $M$ is the parameter mask (i.e. $f(x; \theta, M) = f(x; M \odot \theta)$), $\mathrm{Bern}$ is a Bernoulli sampler, $\sigma$ is the sigmoid function, and $v_i$ is a trainable parameter.
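To make the setup concrete, here is a minimal PyTorch sketch of the parameterization as I understand it (not the paper's actual code; the tensor shapes and names are just for illustration):

```python
import torch

theta = torch.randn(100)                  # frozen weights of one layer
v = torch.zeros(100, requires_grad=True)  # trainable mask logits v_i

p = torch.sigmoid(v)        # Bernoulli probabilities sigma(v_i)
M = torch.bernoulli(p)      # sampled binary mask M_i ~ Bern(sigma(v_i))
masked_theta = M * theta    # effective parameters M ⊙ theta

# Problem: torch.bernoulli is a discrete sampling step, so the sampled mask
# carries no gradient back to v — a loss computed from masked_theta gives
# v.grad = None if we just call backward() naively.
```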
In the paper, they learn $v_i$ using SGD. I was wondering how they manage to do that, because sampling from a Bernoulli is not differentiable and there is no reparameterization trick for it, as there is for some other distributions I see trained on in the literature (for example, the normal distribution).
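For contrast, this is what I mean by the reparameterization trick for the normal case: a sample $z \sim \mathcal{N}(\mu, \sigma^2)$ is rewritten as $z = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, so the sample is a deterministic, differentiable function of the parameters:

```python
import torch

mu = torch.zeros(10, requires_grad=True)
log_sigma = torch.zeros(10, requires_grad=True)

eps = torch.randn(10)                    # noise drawn independently of the parameters
z = mu + torch.exp(log_sigma) * eps      # differentiable in mu and log_sigma

z.sum().backward()                       # gradients reach mu and log_sigma
print(mu.grad, log_sigma.grad)
```

I don't see an analogous rewrite for a Bernoulli, since the sample is discrete, so I'm not sure how the gradient with respect to $v_i$ is obtained.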