Why can't the XOR linear inseparability problem be solved with one perceptron like this?

Question

Consider a perceptron where $w_0=1$ and $w_1=1$:

Now, suppose that we use the following activation function

\begin{align} f(x)= \begin{cases} 1, \text{ if }x =1\\ 0, \text{ otherwise} \end{cases} \end{align}

The output is then summarised as:

\begin{array}{|c|c|c|c|} \hline x_0 & x_1 & w_0x_0 + w_1x_1 & f( \cdot )\\ \hline 0 & 0 & 0 & 0 \\ \hline 0 & 1 & 1 & 1 \\ \hline 1 & 0 & 1 & 1 \\ \hline 1 & 1 & 2 & 0 \\ \hline \end{array}

Is there something wrong with the way I've defined the activation function?

score 4 · Accepted Answer · answered Dec 15 '18 at 18:14

It can be done.

The activation function of a neuron does not have to be monotonic. The activation that Rahul suggested can be implemented via a continuously differentiable function, for example $ f(s) = exp(-k(1-s)^2) $ which has a nice derivative $f'(s) = 2k~(1-s)f(s)$. Here, $s=w_0~x_0+w_1~x_1$. Therefore, standard gradient-based learning algorithms are applicable.

The neuron's error is $ E = \frac{1}{2}(v-v_d)^2$, where $v_d$ - desired output, $v$ - actual output. The weights $w_i, ~i=0,1$ are initialized randomly and then updated during training as follows $$w_i \to w_i - \alpha\frac{\partial E}{\partial w_i}$$ where $\alpha$ is a learning rate. We have $$\frac{\partial E}{\partial w_i} = (v-v_d)\frac{\partial v}{\partial w_i}=(f(s)-v_d)~\frac{\partial f}{\partial s}\frac{\partial s}{\partial w_i}=2k~(f(s)-v_d)(1-s)f(s)~x_i$$ Let's test it in Python.

import numpy as np 
import matplotlib.pyplot as plt

For training, I take a few points randomly scattered around $[0, 0]$, $[0, 1]$, $[1, 0]$, and $[1, 1]$.

n = 10
sd = [0.05, 0.05]

x00 = np.random.normal(loc=[0, 0], scale=sd, size=(n,2))
x01 = np.random.normal(loc=[0, 1], scale=sd, size=(n,2))
x10 = np.random.normal(loc=[1, 0], scale=sd, size=(n,2))
x11 = np.random.normal(loc=[1, 1], scale=sd, size=(n,2))

x = np.vstack((x00,x01,x10,x11))
y = np.vstack((np.zeros((x00.shape[0],1)), 
               np.ones((x01.shape[0],1)), 
               np.ones((x10.shape[0],1)), 
               np.zeros((x11.shape[0],1)))).ravel()

ind = np.arange(len(y))
np.random.shuffle(ind)

x = x[ind]
y = y[ind]
N = len(y)

plt.scatter(*x00.T, label='00')
plt.scatter(*x01.T, label='01')
plt.scatter(*x10.T, label='10')
plt.scatter(*x11.T, label='11')
plt.legend()
plt.show()

Activation function:

k = 10

def f(s):
    return np.exp(-k*(s-1)**2)

Initialize the weights, and train the network:

w = np.random.uniform(low=0.25, high=1.75, size=(2))

print("Initial w:", w)

rate = 0.01
n_epochs = 20

error = []
for _ in range(n_epochs):
    err = 0
    for i in range(N):
        s = np.dot(x[i],w)
        w -= rate * 2 * k * (f(s) - y[i]) * (1-s) * f(s) * x[i]
        err += 0.5*(f(s) - y[i])**2
    err /= N
    error.append(err)

print('Final w:', w)

The weights have indeed converged to $w_0=1,~w_1=1$:

Initial w: [1.5915165  0.27594833]
Final w: [1.03561356 0.96695205]

The training error is decreasing:

plt.scatter(np.arange(n_epochs), error)
plt.grid()
plt.xticks(np.arange(0, n_epochs, step=1))
plt.show()

Let's test it. I create a testing set in the same way as the training set. My test data are different from my training data because I didn't fix the seed.

x00 = np.random.normal(loc=[0, 0], scale=sd, size=(n,2))
x01 = np.random.normal(loc=[0, 1], scale=sd, size=(n,2))
x10 = np.random.normal(loc=[1, 0], scale=sd, size=(n,2))
x11 = np.random.normal(loc=[1, 1], scale=sd, size=(n,2))

x_test = np.vstack((x00,x01,x10,x11))
y_test = np.vstack((np.zeros((x00.shape[0],1)), 
               np.ones((x01.shape[0],1)), 
               np.ones((x10.shape[0],1)), 
               np.zeros((x11.shape[0],1)))).ravel()

I calculate the root mean squared error, and the coefficient of determination (R^2 score):

def fwd(x,w):
    return f(np.dot(x,w))

RMSE = 0

for i in range(N):
    RMSE += (fwd(x_test[i],w) - y_test[i])**2

RMSE = np.sqrt(RMSE/N)

print("RMSE", RMSE)

ybar = np.mean(y)

S = 0
D = 0
for i in range(N):
    S += (fwd(x_test[i],w) - y_test[i])**2
    D += (fwd(x_test[i],w) - ybar)**2

r2_score = 1 - S/D
print("r2_score", r2_score)

Result:

RMSE 0.09199468888373698
r2_score 0.9613632278609362

... or I am doing something wrong? Please tell me.

Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackexchange.com/rooms/118603/discussion-on-answer-by-vladislav-gladkikh-why-cant-the-xor-linear-inseparabili). — nbro, Jan 18 '21 at 16:24

score 3 · Answer 2 · answered Dec 09 '18 at 13:02

The main problems are that your activation function is not monotonic (as pointed out by csrev), and that it is not continuously differentiable. These make it very difficult / impossible to use standard gradient-based learning algorithms.

So yes, there may exist a good solution of weight values, but it is very difficult to find or approximate those weight values automatically through a learning algorithm. Also note that it completely breaks down as soon as you have a tiny error in one of the weights, even if it is approximate very closely; if one of the weights has a value of $0.999$ rather than $1.0$, the solution breaks down completely.

score 2 · Answer 3 · answered Dec 09 '18 at 01:44

2

Indeed I think the problem is with the way you've defined the activation function. By selecting it arbitrarily, you could solve many specific problems. In practice, activation functions used are monotonic. It keeps the error function convex at a per-layer level. In theory though I'm not sure exactly what Rosenblatt has claimed so it might be worth calling him

answered Dec 09 '18 at 01:44

csrev

31
4

If activation functions are not compulsorily monotonic (even if it may be the case in practice), then doesn't this activation function hold? – rahs Dec 09 '18 at 20:19
1

I guess it does hold then. But it ruins the whole purpose of the perceptron. As Dennis said, you did not specify what are the value between each integer, if you would start training and changing your weight you would go somewhere random. So you might as well skip the perceptron alltogether haha. Now I get that you're testing if there is a theoretical flaw. Only I don't know which theorem of Rosenblatt you are referring to, which would give the conditions. I would be curious to know though (as there is indeed one in that gist). – csrev Dec 10 '18 at 00:24

score 0 · Answer 4 · answered Dec 15 '18 at 20:29

0

I'm also working on a perceptron that is able to solve the XOR Problem, and I get interesting results using sine as an activation function, and using the derivative to make the perceptron learn. But you will need bias to make sure the perceptron is able to solve the problem.

answered Dec 15 '18 at 20:29

app_idea54

31
2

I think these links are relevant here: [this](https://stackoverflow.com/questions/30412427/solving-xor-with-single-layer-perceptron), [this](https://ai.stackexchange.com/questions/9534/non-monotonic-activation-function-and-xor-problem-with-perceptron) and [this](https://stackoverflow.com/questions/53785922/is-the-use-of-a-non-monotonic-activation-function-is-a-correct-and-viable-solut?noredirect=1#comment94423971_53785922) – Vladislav Gladkikh Dec 16 '18 at 02:37

score 0 · Answer 5 · answered Dec 16 '18 at 11:34

Another activation function that could be used for this problem: $$f(x) = \underset{i}{max}(x_i) - \underset{i}{min}(x_i)$$ It's not continuous, no backpropagation, sorry. Some other learning algorithm is required.

However, this answers the question, if an XOR can be solved with one neuron. Maybe this function is a solution of some learning problem with weights. Something like $$f(x) = max(w_0x_0,w_1x_1) - min(w_0x_0,w_1x_1)$$ I don't know how this creature is generalizable to other tasks, and how much learning can be done by just manipulating maxima and minima of weighted inputs. Any ideas?

Why can't the XOR linear inseparability problem be solved with one perceptron like this?

5 Answers5