
In a Restricted Boltzmann Machine (RBM), the likelihood function is:

$$p(\mathbf{v};\mathbf{\theta}) = \frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}$$

Where $E$ is the energy function and $Z$ is the partition function:

$$Z = \sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}$$
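To make this concrete, here is a minimal sketch of the energy and of a brute-force computation of $Z$ (assuming a Bernoulli–Bernoulli RBM with energy $E(\mathbf{v},\mathbf{h};\mathbf{\theta}) = -\mathbf{b}^\top\mathbf{v} - \mathbf{c}^\top\mathbf{h} - \mathbf{v}^\top W\mathbf{h}$; this particular energy, and the names `W`, `b`, `c`, are assumptions for illustration only):

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 4, 3                             # small enough to enumerate every state
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))  # weights
b = np.zeros(n_visible)                                # visible biases
c = np.zeros(n_hidden)                                 # hidden biases

def energy(v, h):
    """Assumed Bernoulli RBM energy: E(v, h) = -b'v - c'h - v'Wh."""
    return -(b @ v + c @ h + v @ W @ h)

# Brute-force partition function Z = sum_{v,h} exp(-E(v, h)).
# Feasible only for a toy model; in general this sum is intractable.
Z = sum(
    np.exp(-energy(np.array(v, dtype=float), np.array(h, dtype=float)))
    for v in itertools.product([0, 1], repeat=n_visible)
    for h in itertools.product([0, 1], repeat=n_hidden)
)
```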

The log-likelihood function is therefore:

$$\ln(p(\mathbf{v};\mathbf{\theta})) = \ln\left(\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}\right) - \ln\left(\sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}\right)$$

Since the log-likelihood is intractable to compute exactly (the partition function $Z$ sums over exponentially many configurations), its gradient is used instead with gradient ascent to find the optimal parameters $\mathbf{\theta}$. Applying $\frac{\partial}{\partial \mathbf{\theta}} \ln f(\mathbf{\theta}) = \frac{1}{f(\mathbf{\theta})} \frac{\partial f(\mathbf{\theta})}{\partial \mathbf{\theta}}$ to each term gives:

$$\frac{\partial \ln(p(\mathbf{v};\mathbf{\theta}))}{\partial \mathbf{\theta}} = -\frac{1}{\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}} \sum_{\mathbf{h}} \left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \cdot e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}\right] + \frac{1}{\sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}} \sum_{\mathbf{v},\mathbf{h}} \left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \cdot e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}\right]$$

Since:

$$p(\mathbf{h}|\mathbf{v}) = \frac{p(\mathbf{v},\mathbf{h})}{p(\mathbf{v})} = \frac{\frac{1}{Z} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}}{\frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}} = \frac{e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}}{\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}}$$

Then:

$$\frac{\partial \ln(p(\mathbf{v};\mathbf{\theta}))}{\partial \mathbf{\theta}} = -\sum_{\mathbf{h}} \left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \cdot p(\mathbf{h}|\mathbf{v}) \right] + \frac{1}{\sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}} \sum_{\mathbf{v},\mathbf{h}} \left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \cdot e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}\right]$$
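For the assumed Bernoulli energy above, this conditional factorizes over the hidden units as $p(h_j = 1 \mid \mathbf{v}) = \sigma(c_j + \mathbf{v}^\top W_{:,j})$, so it is cheap to compute and sample. A sketch, continuing the illustrative setup from the first code block:

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v):
    """Per-unit probabilities p(h_j = 1 | v) = sigmoid(c_j + v' W[:, j])."""
    return sigmoid(c + v @ W)

def sample_h_given_v(v):
    """Draw a binary hidden vector from the factorized conditional p(h | v)."""
    p = p_h_given_v(v)
    return (rng.random(p.shape) < p).astype(float)
```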

Also, since:

$$ \frac{e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}}{Z} = \frac{e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}}{\sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}} = p(\mathbf{v},\mathbf{h})$$

Then:

$$\begin{align} \frac{\partial \ln(p(\mathbf{v};\mathbf{\theta}))}{\partial \mathbf{\theta}} &= -\sum_{\mathbf{h}} \left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \cdot p(\mathbf{h}|\mathbf{v}) \right] + \sum_{\mathbf{v},\mathbf{h}} \left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \cdot p(\mathbf{v},\mathbf{h})\right] \\ &= -\mathbb{E}_{p(\mathbf{h}|\mathbf{v})}\left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \right] + \mathbb{E}_{p(\mathbf{v},\mathbf{h})}\left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \right] \end{align}$$
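As a sanity check on this expression (still under the assumed Bernoulli energy, and taking the weight matrix $W$ as the parameter, for which $\partial E/\partial W_{ij} = -v_i h_j$), both expectations can be evaluated exactly by enumeration in the toy model:

```python
def exact_grad_W(v_data):
    """Exact d ln p(v_data)/dW, by enumerating all states of the toy model above."""
    all_h = [np.array(h, dtype=float) for h in itertools.product([0, 1], repeat=n_hidden)]
    all_v = [np.array(v, dtype=float) for v in itertools.product([0, 1], repeat=n_visible)]

    # First expectation: E_{p(h|v)}[dE/dW], where dE/dW_{ij} = -v_i h_j.
    w_cond = np.array([np.exp(-energy(v_data, h)) for h in all_h])
    w_cond /= w_cond.sum()                    # this is exactly p(h | v_data)
    data_term = sum(p * -np.outer(v_data, h) for p, h in zip(w_cond, all_h))

    # Second expectation: E_{p(v,h)}[dE/dW], using the brute-force Z from above.
    model_term = sum(
        (np.exp(-energy(v, h)) / Z) * -np.outer(v, h) for v in all_v for h in all_h
    )

    # d ln p(v)/dW = -E_{p(h|v)}[dE/dW] + E_{p(v,h)}[dE/dW]
    return -data_term + model_term
```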

Since both of these are expectations, they can be approximated using Monte Carlo integration:

$$ \frac{\partial \ln(p(\mathbf{v};\mathbf{\theta}))}{\partial \mathbf{\theta}} \approx -\frac{1}{N} \sum_{i = 1}^{N} \left[\frac{\partial E(\mathbf{v},\mathbf{h}_i;\mathbf{\theta})}{\partial \mathbf{\theta}} \right] + \frac{1}{M} \sum_{j=1}^{M} \left[\frac{\partial E(\mathbf{v}_j,\mathbf{h}_j;\mathbf{\theta})}{\partial \mathbf{\theta}} \right] $$
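A hedged sketch of this estimator for the weight matrix alone (again under the assumed Bernoulli energy, where $\partial E/\partial W_{ij} = -v_i h_j$, so the two terms become the familiar data and model correlations; `sample_h_given_v` is from the sketch above, and the joint samples are assumed to come from the Gibbs chain sketched below):

```python
def grad_W_estimate(v_data, joint_samples, n_cond_samples=10):
    """Monte Carlo estimate of d ln p(v_data)/dW for one data vector.

    joint_samples: list of (v_j, h_j) pairs approximately drawn from p(v, h).
    """
    # First term: -E_{p(h|v)}[dE/dW] = E_{p(h|v)}[v h']  (the minus signs cancel).
    first = np.mean(
        [np.outer(v_data, sample_h_given_v(v_data)) for _ in range(n_cond_samples)],
        axis=0,
    )
    # Second term: E_{p(v,h)}[dE/dW] = -E_{p(v,h)}[v h'].
    second = np.mean([np.outer(v, h) for v, h in joint_samples], axis=0)
    return first - second
```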

The first term can be computed because it is easy to sample from $p(\mathbf{h}|\mathbf{v})$. However, it is difficult to sample from $p(\mathbf{v},\mathbf{h})$ directly. Since it is also easy to sample from $p(\mathbf{v}|\mathbf{h})$, Gibbs sampling is used, alternating between $p(\mathbf{h}|\mathbf{v})$ and $p(\mathbf{v}|\mathbf{h})$, to approximate a sample from $p(\mathbf{v},\mathbf{h})$.
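A sketch of that Gibbs chain (the conditional $p(v_i = 1 \mid \mathbf{h}) = \sigma(b_i + W_{i,:}\mathbf{h})$ is the symmetric counterpart of $p(\mathbf{h}|\mathbf{v})$ under the assumed Bernoulli energy; after a burn-in, the final pair $(\mathbf{v},\mathbf{h})$ is treated as an approximate sample from $p(\mathbf{v},\mathbf{h})$):

```python
def sample_v_given_h(h):
    """Draw a binary visible vector from p(v | h) = prod_i sigmoid(b_i + W[i, :] h)."""
    p = sigmoid(b + W @ h)
    return (rng.random(p.shape) < p).astype(float)

def gibbs_sample_joint(v_init, n_steps=1000):
    """Alternate h ~ p(h|v) and v ~ p(v|h); return the last (v, h) pair."""
    v = v_init
    for _ in range(n_steps):
        h = sample_h_given_v(v)
        v = sample_v_given_h(h)
    return v, h
```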

My questions are:

  1. Is my understanding and math correct so far?
  2. In the expression for the gradient of the log-likelihood, can expectations be interchanged with partial derivatives such that:

$$\begin{align} \frac{\partial \ln(p(\mathbf{v};\mathbf{\theta}))}{\partial \mathbf{\theta}} &= -\mathbb{E}_{p(\mathbf{h}|\mathbf{v})}\left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \right] + \mathbb{E}_{p(\mathbf{v},\mathbf{h})}\left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \right] \\ &= - \frac{\partial}{\partial \mathbf{\theta}} \mathbb{E}_{p(\mathbf{h}|\mathbf{v})}\left[E(\mathbf{v},\mathbf{h};\mathbf{\theta}) \right] + \frac{\partial}{\partial \mathbf{\theta}} \mathbb{E}_{p(\mathbf{v},\mathbf{h})}\left[E(\mathbf{v},\mathbf{h};\mathbf{\theta}) \right] \\ &= \frac{\partial}{\partial \mathbf{\theta}} \left(\mathbb{E}_{p(\mathbf{v},\mathbf{h})}\left[E(\mathbf{v},\mathbf{h};\mathbf{\theta}) \right] - \mathbb{E}_{p(\mathbf{h}|\mathbf{v})}\left[E(\mathbf{v},\mathbf{h};\mathbf{\theta}) \right] \right) \\ &\approx \frac{\partial}{\partial \mathbf{\theta}} \left(\frac{1}{M} \sum_{j=1}^{M} \left[E(\mathbf{v}_j,\mathbf{h}_j;\mathbf{\theta}) \right] - \frac{1}{N} \sum_{i = 1}^{N} \left[E(\mathbf{v},\mathbf{h}_i;\mathbf{\theta}) \right] \right) \end{align}$$

  3. After approximating the gradient of the log-likelihood, the update rule for the parameter vector $\mathbf{\theta}$ is:

$$\mathbf{\theta}_{t+1} = \mathbf{\theta}_{t} + \epsilon \frac{\partial \ln(p(\mathbf{v};\mathbf{\theta}))}{\partial \mathbf{\theta}}$$

Where $\epsilon$ is the learning rate. Is this update rule correct?
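For completeness, here is how I picture one such update for the weight matrix, using the illustrative helpers sketched earlier (the data vector and all hyperparameter values are made up):

```python
epsilon = 0.01                                  # learning rate
v_example = np.array([1.0, 0.0, 1.0, 1.0])      # illustrative binary data vector

# Negative-phase samples from the Gibbs chain sketched above.
joint_samples = [gibbs_sample_joint(v_example, n_steps=100) for _ in range(5)]

# Gradient-ascent step on the weights: W_{t+1} = W_t + epsilon * d ln p(v)/dW.
W = W + epsilon * grad_W_estimate(v_example, joint_samples)
```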

  • Hey @nbro, this is the second question I've asked that contains technical details and has received no answers. I think this is because this question is not what is usually asked on the ai.stackexchange website. After browsing the questions a bit, it seems most of the questions are practical, like "How do I choose the optimal batch size?". Which site is more appropriate for theoretical questions like this one? I tried to ask on stat.stackexchange, but I still got no answers because that site is more oriented towards pure probability and statistics and not machine learning. Any suggestions? – mhdadk Oct 31 '20 at 12:03
  • Hello. There is the site Cross Validated SE, which also accepts questions related to machine learning. Even if you ask your question there (and cross-posting is often discouraged), in this case, I suggest that you do **not** delete this post from our site, given that it is on-topic here too, and this is the type of question that I think we should get more of here (i.e. technical questions); that's also why I upvoted it. Moreover, note that you may also not receive an answer on CV SE. Unfortunately, right now, I am not familiar with RBMs, otherwise I would give answering it a try. – nbro Oct 31 '20 at 14:51
  • One way to attract more visitors and potentially someone that is able to answer your question is to open a bounty (but I think you still don't have enough reputation to do that) or to edit your post so that it pops up in the active questions of the site and people can regularly see it. Of course, none of these "tricks" will guarantee that you will get an answer. – nbro Oct 31 '20 at 14:56

0 Answers