In a Restricted Boltzmann Machine (RBM), the likelihood function is:
$$p(\mathbf{v};\mathbf{\theta}) = \frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}$$
Where $E$ is the energy function and $Z$ is the partition function:
$$Z = \sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}$$
The log-likelihood function is therefore:
$$ln(p(\mathbf{v};\mathbf{\theta})) = ln\left(\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}\right) - ln\left(\sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}\right)$$
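To check my understanding on a toy example, here is a brute-force sketch of this quantity for a binary RBM. I have not fixed an energy function above, so the code assumes the usual binary RBM energy $E(\mathbf{v},\mathbf{h};\mathbf{\theta}) = -\mathbf{b}^\top\mathbf{v} - \mathbf{c}^\top\mathbf{h} - \mathbf{v}^\top\mathbf{W}\mathbf{h}$ with $\mathbf{\theta} = (\mathbf{W},\mathbf{b},\mathbf{c})$; all names in the code are just illustrative. Enumerating every configuration is only feasible for tiny models, since $Z$ contains $2^{n_v + n_h}$ terms.

```python
import itertools
import numpy as np

# Assumed binary RBM energy (not fixed above, used only for illustration):
# E(v, h) = -b.v - c.h - v.W.h
def energy(v, h, W, b, c):
    return -(b @ v) - (c @ h) - (v @ W @ h)

def log_likelihood(v, W, b, c):
    """Brute-force ln p(v) = ln sum_h e^{-E(v,h)} - ln Z.

    Only feasible for tiny models: Z sums over 2**(n_v + n_h) states,
    which is why the log-likelihood is intractable in practice.
    """
    n_v, n_h = W.shape
    hs = [np.array(h) for h in itertools.product([0, 1], repeat=n_h)]
    vs = [np.array(x) for x in itertools.product([0, 1], repeat=n_v)]

    log_numer = np.log(sum(np.exp(-energy(v, h, W, b, c)) for h in hs))
    log_Z = np.log(sum(np.exp(-energy(x, h, W, b, c)) for x in vs for h in hs))
    return log_numer - log_Z

rng = np.random.default_rng(0)
n_v, n_h = 4, 3
W, b, c = rng.normal(size=(n_v, n_h)), rng.normal(size=n_v), rng.normal(size=n_h)
v = rng.integers(0, 2, size=n_v)
print(log_likelihood(v, W, b, c))
```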
Since the partition function $Z$ sums over exponentially many configurations, the log-likelihood cannot be computed directly for models of realistic size. Instead, its gradient is approximated and used with gradient ascent to find the optimal parameters $\mathbf{\theta}$:
$$\frac{\partial ln(p(\mathbf{v};\mathbf{\theta}))}{\partial \mathbf{\theta}} = -\frac{1}{\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}} \sum_{\mathbf{h}} \left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \cdot e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}\right] + \frac{1}{\sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}} \sum_{\mathbf{v},\mathbf{h}} \left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \cdot e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}\right]$$
Since:
$$p(\mathbf{h}|\mathbf{v}) = \frac{p(\mathbf{v},\mathbf{h})}{p(\mathbf{v})} = \frac{\frac{1}{Z} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}}{\frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}} = \frac{e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}}{\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}}$$
Then:
$$\frac{\partial ln(p(\mathbf{v};\mathbf{\theta}))}{\partial \mathbf{\theta}} = -\sum_{\mathbf{h}} \left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \cdot p(\mathbf{h}|\mathbf{v}) \right] + \frac{1}{\sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}} \sum_{\mathbf{v},\mathbf{h}} \left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \cdot e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}\right]$$
Also, since:
$$ \frac{e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}}{Z} = \frac{e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}}{\sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}} = p(\mathbf{v},\mathbf{h})$$
Then:
$$\begin{align} \frac{\partial ln(p(\mathbf{v};\mathbf{\theta}))}{\partial \mathbf{\theta}} &= -\sum_{\mathbf{h}} \left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \cdot p(\mathbf{h}|\mathbf{v}) \right] + \sum_{\mathbf{v},\mathbf{h}} \left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \cdot p(\mathbf{v},\mathbf{h})\right] \\ &= -\mathbb{E}_{p(\mathbf{h}|\mathbf{v})}\left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \right] + \mathbb{E}_{p(\mathbf{v},\mathbf{h})}\left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \right] \end{align}$$
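To make the two expectations concrete for one parameter: with the standard binary RBM energy assumed in the code sketch above (again an assumption on my part, since I have not fixed a specific $E$ here), $\frac{\partial E}{\partial w_{ij}} = -v_i h_j$, so the gradient for a single weight becomes the familiar difference between a data-driven and a model-driven correlation:
$$\frac{\partial ln(p(\mathbf{v};\mathbf{\theta}))}{\partial w_{ij}} = \mathbb{E}_{p(\mathbf{h}|\mathbf{v})}\left[v_i h_j\right] - \mathbb{E}_{p(\mathbf{v},\mathbf{h})}\left[v_i h_j\right]$$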
Since both of these are expectations, they can be approximated using Monte Carlo integration, with $\mathbf{h}_i \sim p(\mathbf{h}|\mathbf{v})$ and $(\mathbf{v}_j,\mathbf{h}_j) \sim p(\mathbf{v},\mathbf{h})$:
$$ \frac{\partial ln(p(\mathbf{v};\mathbf{\theta}))}{\partial \mathbf{\theta}} \approx -\frac{1}{N} \sum_{i = 1}^{N} \left[\frac{\partial E(\mathbf{v},\mathbf{h}_i;\mathbf{\theta})}{\partial \mathbf{\theta}} \right] + \frac{1}{M} \sum_{j=1}^{M} \left[\frac{\partial E(\mathbf{v}_j,\mathbf{h}_j;\mathbf{\theta})}{\partial \mathbf{\theta}} \right] $$
The first term can be computed because it is easy to sample from $p(\mathbf{h}|\mathbf{v})$. Sampling directly from $p(\mathbf{v},\mathbf{h})$ is difficult, but since it is also easy to sample from $p(\mathbf{v}|\mathbf{h})$, Gibbs sampling alternates between $p(\mathbf{h}|\mathbf{v})$ and $p(\mathbf{v}|\mathbf{h})$ to approximate samples from $p(\mathbf{v},\mathbf{h})$, as in the sketch below.
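Here is a minimal sketch of that sampling scheme, reusing numpy and the parameters `W`, `b`, `c`, `v`, `rng` from the sketch above and again assuming the same binary RBM energy; the sigmoid forms of the conditionals follow from that assumed energy, and the function names are purely illustrative. The positive term draws hidden samples directly from $p(\mathbf{h}|\mathbf{v})$; the negative term runs short Gibbs chains, contrastive-divergence style, to approximate samples from $p(\mathbf{v},\mathbf{h})$.

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, W, c, rng):
    # For the assumed energy, p(h_j = 1 | v) = sigmoid(c_j + sum_i v_i W_ij)
    p = sigmoid(c + v @ W)
    return (rng.random(p.shape) < p).astype(float)

def sample_v_given_h(h, W, b, rng):
    # Symmetrically, p(v_i = 1 | h) = sigmoid(b_i + sum_j W_ij h_j)
    p = sigmoid(b + W @ h)
    return (rng.random(p.shape) < p).astype(float)

def grad_W_estimate(v, W, b, c, rng, N=10, M=10, k=5):
    # Positive phase: Monte Carlo estimate of E_{p(h|v)}[v h^T]
    pos = np.mean([np.outer(v, sample_h_given_v(v, W, c, rng))
                   for _ in range(N)], axis=0)

    # Negative phase: Gibbs chains alternating h|v and v|h to approximate p(v, h)
    neg = np.zeros_like(W)
    for _ in range(M):
        v_k = v.astype(float)
        for _ in range(k):
            h_k = sample_h_given_v(v_k, W, c, rng)
            v_k = sample_v_given_h(h_k, W, b, rng)
        neg += np.outer(v_k, h_k) / M

    # d ln p(v) / dW  ~  positive phase - negative phase (since dE/dW_ij = -v_i h_j)
    return pos - neg
```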
My questions are:
- Is my understanding and math correct so far?
- In the expression for the gradient of the log-likelihood, can expectations be interchanged with partial derivatives such that:
$$\begin{align} \frac{\partial ln(p(\mathbf{v};\mathbf{\theta}))}{\partial \mathbf{\theta}} &= -\mathbb{E}_{p(\mathbf{h}|\mathbf{v})}\left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \right] + \mathbb{E}_{p(\mathbf{v},\mathbf{h})}\left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \right] \\ &= - \frac{\partial}{\partial \mathbf{\theta}} \mathbb{E}_{p(\mathbf{h}|\mathbf{v})}\left[E(\mathbf{v},\mathbf{h};\mathbf{\theta}) \right] + \frac{\partial}{\partial \mathbf{\theta}} \mathbb{E}_{p(\mathbf{v},\mathbf{h})}\left[E(\mathbf{v},\mathbf{h};\mathbf{\theta}) \right] \\ &= \frac{\partial}{\partial \mathbf{\theta}} \left(\mathbb{E}_{p(\mathbf{v},\mathbf{h})}\left[E(\mathbf{v},\mathbf{h};\mathbf{\theta}) \right] - \mathbb{E}_{p(\mathbf{h}|\mathbf{v})}\left[E(\mathbf{v},\mathbf{h};\mathbf{\theta}) \right] \right) \\ &\approx \frac{\partial}{\partial \mathbf{\theta}} \left(\frac{1}{M} \sum_{j=1}^{M} \left[E(\mathbf{v}_j,\mathbf{h}_j;\mathbf{\theta}) \right] - \frac{1}{N} \sum_{i = 1}^{N} \left[E(\mathbf{v},\mathbf{h}_i;\mathbf{\theta}) \right] \right) \end{align}$$
- After approximating the gradient of the log-likelihood, the update rule for the parameter vector $\mathbf{\theta}$ is:
$$\mathbf{\theta}_{t+1} = \mathbf{\theta}_{t} + \epsilon \frac{\partial ln(p(\mathbf{v};\mathbf{\theta}))}{\partial \mathbf{\theta}}$$
Where $\epsilon$ is the learning rate. Is this update rule correct? (A small sketch of this step, using the code above, is included below.)
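For reference, this is how I would write that step in code with the gradient sketch above, just for the weight matrix (purely illustrative; `epsilon` and the number of steps are arbitrary):

```python
# theta_{t+1} = theta_t + epsilon * d ln(p(v; theta)) / d theta,
# shown here only for the weight matrix W from the sketches above.
epsilon = 0.01
for _ in range(100):
    W = W + epsilon * grad_W_estimate(v, W, b, c, rng)
```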