
Recently I encountered a variant of the usual linear layer architecture: instead of $Z = XW + B$, we have $Z = (X-A)W + B$. So there is a 'pre-bias' $A$ that is subtracted from the previous layer's activations before they are multiplied by the weights. I don't understand the backpropagation equations for $dA$ and $dB$ ($dW$ is as expected).

Here is the original paper in which it appeared (although the paper itself isn't actually that relevant): https://papers.nips.cc/paper/4830-learning-invariant-representations-of-molecules-for-atomization-energy-prediction.pdf

Here is the link to the full code of the neural network: http://www.quantum-machine.org/code/nn-qm7.tar.gz

import numpy

class Linear(Module):

    def __init__(self, m, n):
        # tr rescales the backpropagated signal, lr rescales this layer's updates
        self.tr = m ** .5 / n ** .5
        self.lr = 1 / m ** .5

        self.W = numpy.random.normal(0, 1 / m ** .5, [m, n]).astype('float32')
        self.A = numpy.zeros([m]).astype('float32')  # the 'pre-bias', subtracted from the input
        self.B = numpy.zeros([n]).astype('float32')  # the ordinary bias

    def forward(self, X):
        self.X = X
        Y = numpy.dot(X - self.A, self.W) + self.B
        return Y

    def backward(self, DY):
        self.DW = numpy.dot((self.X - self.A).T, DY)
        self.DA = -(self.X - self.A).sum(axis=0)               # the update I don't understand
        self.DB = DY.sum(axis=0) + numpy.dot(self.DA, self.W)  # likewise the extra term here
        DX = self.tr * numpy.dot(DY, self.W.T)
        return DX

    def update(self, lr):
        self.W -= lr * self.lr * self.DW
        self.B -= lr * self.lr * self.DB
        self.A -= lr * self.lr * self.DA

    def average(self, nn, a):
        # blend this layer's parameters with those of another network nn
        self.W = a * nn.W + (1 - a) * self.W
        self.B = a * nn.B + (1 - a) * self.B
        self.A = a * nn.A + (1 - a) * self.A
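To make the issue concrete, here is a small finite-difference check (my own sketch, not part of the linked code; the toy squared-error loss and all variable names are mine) that recomputes the snippet's DA and DB formulas and compares them with numerical gradients:

import numpy

# Toy setup: one such layer with random parameters and inputs.
numpy.random.seed(0)
m, n, batch = 4, 3, 5
W = numpy.random.normal(0, 1 / m ** .5, [m, n])
A = 0.1 * numpy.random.randn(m)
B = 0.1 * numpy.random.randn(n)
X = numpy.random.randn(batch, m)

# Loss L = 0.5 * sum(Y**2), so the upstream gradient is simply DY = Y.
def loss(A_, B_):
    Y = numpy.dot(X - A_, W) + B_
    return 0.5 * (Y ** 2).sum()

# Gradients exactly as written in the snippet above.
Y = numpy.dot(X - A, W) + B
DY = Y
DA_code = -(X - A).sum(axis=0)
DB_code = DY.sum(axis=0) + numpy.dot(DA_code, W)

# Central finite differences of the same loss w.r.t. A and B.
eps = 1e-5
DA_num = numpy.array([(loss(A + eps * numpy.eye(m)[i], B)
                       - loss(A - eps * numpy.eye(m)[i], B)) / (2 * eps)
                      for i in range(m)])
DB_num = numpy.array([(loss(A, B + eps * numpy.eye(n)[j])
                       - loss(A, B - eps * numpy.eye(n)[j])) / (2 * eps)
                      for j in range(n)])

print('DA code   :', DA_code)
print('DA numeric:', DA_num)
print('DB code   :', DB_code)
print('DB numeric:', DB_num)
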
Laksh
  • Never post downloadable links; I am saying this because it could be a virus or anything. Post the link to the actual code. –  Aug 04 '19 at 10:57
  • My bad, I will post the link to the webpage that contains the code. – Laksh Aug 04 '19 at 13:53
  • The pre-bias A doesn't help with anything, as the layer before already has a bias and has already done that work. This method just repeats another bias, which doesn't help. The only things it does are increase the chance of overfitting and increase training time. – Clement Dec 07 '19 at 09:49

1 Answer


The forward prop equation is:

$$ Z = (X-A)W + B = XW - AW + B $$

So the derivatives of $Z$ w.r.t. $W$, $A$, $B$ and $X$ should be:

$$ \frac{\partial Z}{\partial W} = X-A \\ \frac{\partial Z}{\partial A} = -W \\ \frac{\partial Z}{\partial B} = 1 \\ \frac{\partial Z}{\partial X} = W $$

I don't know why the last one is needed, though. The first is, like you said, as expected. The other two don't match the implementation's DA and DB; I don't know why those expressions were used.
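
Chaining these element-wise derivatives through an upstream gradient $DY = \partial L/\partial Z$ (the code's DY) and summing over the batch, the loss gradients would be (my own working, using the same shapes as the snippet):

$$ \frac{\partial L}{\partial W} = (X-A)^\top DY, \qquad \frac{\partial L}{\partial A} = -\sum_{\text{batch}} DY\,W^\top, \qquad \frac{\partial L}{\partial B} = \sum_{\text{batch}} DY, \qquad \frac{\partial L}{\partial X} = DY\,W^\top $$

Compared with the snippet, DW and the first term of DB match these, whereas DA (which never touches DY at all) and the extra numpy.dot(self.DA, self.W) term in DB do not.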

Djib2011
  • This is exactly what I thought - yet, I have run their code and it works (very well). So there must be something going on here... – Laksh Aug 04 '19 at 09:32
  • DB seems fine except for the last term at the end. Is this some sort of more advanced gradient method? – Laksh Aug 04 '19 at 09:34
  • I've seen some "gradient estimation" techniques for cases where the gradient isn't computable, but that doesn't seem to be the case here. I also think they would have stated it; it looks like a mistake to me. Now, why does it work if it is wrong? Well, to be honest, the most important parameters to get right are the weights $W$, so technically it could work even without the biases... If I were you I'd change the code to use the correct gradients (a sketch follows these comments) and see if maybe it works better. – Djib2011 Aug 04 '19 at 12:16
  • I have tried running it without the A bias, and it works very well too. Will try the 'correct' backprop equations. Also, I have emailed the author of the paper with this link so hopefully he should respond. – Laksh Aug 04 '19 at 13:52
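
For anyone who wants to try it, here is a sketch of a backward that uses the textbook gradients above, as a drop-in replacement for Linear.backward in the question's snippet (my own sketch; whether it trains as well as the original is exactly what is being tested here):

    def backward(self, DY):
        # dL/dW = (X - A)^T DY, same as the original
        self.DW = numpy.dot((self.X - self.A).T, DY)
        # dL/dA = -sum over the batch of DY W^T, since A only enters Z through -A W
        self.DA = -numpy.dot(DY, self.W.T).sum(axis=0)
        # dL/dB = sum over the batch of DY
        self.DB = DY.sum(axis=0)
        # dL/dX = DY W^T; the tr factor is kept only to match the original scaling of DX
        DX = self.tr * numpy.dot(DY, self.W.T)
        return DX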