I am wondering how to adjust the Adaline algorithm to classify the classes 0 and 1 instead of -1 and 1.
I found a section in Neural Networks and Statistical Learning by Ke-Lin Du and M. N. S. Swamy that confused me a little bit. Here is a link to the relevant paragraph on Google Books, and there is a screenshot below:
The original Adaline paper by Widrow can be found here: Adaptive "Adaline" neuron using chemical "memistors"
What I find particularly confusing is that it reads as if the two label schemes, {0, 1} and {-1, +1}, can be trained in exactly the same way.
Similarly, I found the same thing on Wikipedia for the Perceptron algorithm:
Let's start with the {-1, 1} case. For simplicity, let's say our net input is $\mathbf{z} = \mathbf{w}^T\mathbf{x}$, and the activation function $g(\mathbf{z})$ is the identity function, $g(\mathbf{z}) = \mathbf{z}$. The class label is then obtained by applying a unit step (quantizer) function to the activation:
$$\begin{equation}
\hat{y} =\begin{cases}
1 & \text{if $g(\mathbf{z}) > 0$}\\
-1 & \text{otherwise},
\end{cases}
\end{equation}$$
and
$$\mathbf{z} = w_0x_{0} + w_1x_{1} + \dots + w_mx_{m} = \sum_{j=0}^{m} x_{j}w_{j} = \mathbf{w}^T\mathbf{x}, \quad \text{with } x_0 = 1.$$
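As a quick sanity check, the same net input can be computed for a whole batch of samples with one matrix product (the weights and inputs below are made-up values, purely for illustration):

import numpy as np

w = np.array([0.5, -0.2, 0.1])   # w[0] plays the role of the bias w_0
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
z = X.dot(w[1:]) + w[0]          # same as w^T x with the x_0 = 1 convention
print(z)                         # [0.5 0.3]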
And the learning rule is
$\Delta w_0 = \eta(\text{target}^{(i)} - \text{output}^{(i)})$
$\Delta w_1 = \eta(\text{target}^{(i)} - \text{output}^{(i)})\;x^{(i)}_{1}$
$\Delta w_2 = \eta(\text{target}^{(i)} - \text{output}^{(i)})\;x^{(i)}_{2}$
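In general, this is just the Widrow-Hoff (LMS) delta rule applied to every weight $j = 0, \dots, m$, with $x^{(i)}_{0} = 1$ for the bias:

$$\Delta w_j = \eta\,(\text{target}^{(i)} - \text{output}^{(i)})\;x^{(i)}_{j}.$$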
Based on my understanding, this results in a linear activation function that passes through the origin when plotted against the net input (since $g(\mathbf{z}) = \mathbf{z}$):
And we are "squashing" the output via the unit step:
To make sure that it works, let me implement it in simple Python code:
import numpy as np

class Adaline(object):

    def __init__(self, eta=0.01, epochs=50):
        self.eta = eta          # learning rate
        self.epochs = epochs    # number of passes over the training set

    def train(self, X, y):
        # one weight per feature, plus the bias weight in w_[0]
        self.w_ = np.zeros(1 + X.shape[1])
        for _ in range(self.epochs):
            for xi, target in zip(X, y):
                output = self.net_input(xi)
                error = target - output
                # LMS/delta rule: update proportional to error and input
                self.w_[1:] += self.eta * error * xi
                self.w_[0] += self.eta * error
        return self

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def activation(self, X):
        # identity (linear) activation
        return self.net_input(X)

    def predict(self, X):
        # unit step: quantize the linear activation into class labels
        return np.where(self.activation(X) > 0.0, 1, -1)

X = np.array([[1.1, 1.2], [1.4, 1.8], [3.2, 4.2], [5.5, 5.9]])
y = np.array([-1, -1, 1, 1])

ada = Adaline()
ada.train(X, y)
print(ada.predict(X))
print(ada.w_)
This prints:
[-1 -1 1 1]
[-0.59518362 0.08374251 0.19489769]
However, this doesn't work if I just change it to
y = np.array([0, 0, 1, 1])
and
def predict(self, X):
    return np.where(self.activation(X) > 0.0, 1, 0)
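For completeness, here is a minimal way to reproduce that failure, assuming a hypothetical subclass AdalineZeroOne that reuses the Adaline class above and only overrides predict:

class AdalineZeroOne(Adaline):

    def predict(self, X):
        # same unit step as before, but emitting 0 instead of -1
        return np.where(self.activation(X) > 0.0, 1, 0)

ada01 = AdalineZeroOne()
ada01.train(X, np.array([0, 0, 1, 1]))
print(ada01.predict(X))  # per the above, no longer matches [0, 0, 1, 1]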
This is because the unit step now looks as follows in the 0-1 class scenario:
And $g(\mathbf{z})$ becomes $g(\mathbf{z}) = \mathbf{z} - w_0$,
So that
def net_input(self, X):
    return np.dot(X, self.w_[1:]) + self.w_[0]

def activation(self, X):
    return self.net_input(X)

def predict(self, X):
    # threshold the activation with the bias subtracted back out
    return np.where(self.activation(X) - self.w_[0] >= 0.0, 1, 0)
Does this make any sense?
Best Answer
Your code is correct; the problem lies elsewhere.

The weight update is proportional to the input, so the weights are pushed much harder by class 1 than by class 0 (because your inputs for class 1 contain bigger numbers than those for class 0). In the first epoch, the first two class-0 samples give you zero updates because they are already perfectly correct (the weights are zero by default, so the first outputs are zero). The third sample, from class 1, gives a bad result, so you update your weights, and the same happens with the next class-1 input. At the end of the epoch you are left with weights greater than zero, so the product with your inputs gives results greater than $0$. As I said before, the weights update much faster for class 1. In epoch 2, the class-0 samples try to undo the previous epoch's class-1 updates, but their update 'power' is not enough to push the result below zero. Every following epoch shows the same picture: class 1 makes a greater contribution to the weight updates than the class-0 samples.

I can see a few solutions for you:
1) Your weights start at zero, so you always use a bad starting point for your training. Solution: initialize the weights with small random values drawn from a standard normal distribution.
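A minimal sketch of that idea, reusing the Adaline class from the question (the seed and the scale 0.01 are my own illustrative choices, not part of this answer):

class AdalineRandomInit(Adaline):

    def train(self, X, y):
        # small random initial weights instead of zeros;
        # seed and scale are arbitrary choices for this sketch
        rgen = np.random.RandomState(1)
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
        for _ in range(self.epochs):
            for xi, target in zip(X, y):
                error = target - self.net_input(xi)
                self.w_[1:] += self.eta * error * xi
                self.w_[0] += self.eta * error
        return self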
2) You can set up your step function so that the threshold is 0.5 (halfway between 0 and 1). But I'm not really sure that your learning process will be stable for a very large number of epochs.
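A minimal sketch of the second idea, again subclassing the question's Adaline (the class name is hypothetical):

class AdalineHalfStep(Adaline):

    def predict(self, X):
        # step threshold at 0.5, halfway between the 0 and 1 targets
        return np.where(self.activation(X) >= 0.5, 1, 0)

ada = AdalineHalfStep()
ada.train(X, np.array([0, 0, 1, 1]))
print(ada.predict(X))  # should now recover [0, 0, 1, 1] on this toy data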