MATLAB: How to further debug this code

Tags: debugging, overfitting, tutorial

This is probably easy for most people, but not for me. I ran each of these "P" sections (P = 1 to P = 4, for offline learning with neural networks) separately in its own loop and each one quickly reached an error (E) of less than 0.001.
But the combined error for P = 1, 2, 3, and 4 together doesn't decrease below 0.004 as I would like; instead it gets stuck at a summed error of around 0.64. Even combining only P = 1 and P = 2 (as in the copy of my code below, which is why the P = 3 and P = 4 sections are commented out) gets stuck at around 0.32. That's the extent of my ability to debug this code.
Can anyone see what obvious mistake I'm making (or otherwise optimize my crude coding)? Because I can't.
Thanks in advance!
clc;
clear all;
close all;
eta = -1;
x1 = 0.1;
x2 = 0.1;
x3 = 1;
w1(1)=1*(rand(1)-0.5);
w2(1)=-1*(rand(1)-0.5);
w3(1)=2.3*(rand(1)-0.5);
w4(1)=2.1*(rand(1) - 0.5);
w5(1)=-2*(rand(1) - 0.5);
w6(1)=-2*(rand(1)-0.5);
w7(1)=1*(rand(1) - 0.5);
w8(1)=2*(rand(1) - 0.5);
w9(1)=2*(rand(1) - 0.5);
i=1;
for icount=1:10000
%P = 1
x1 = 0.1;
x2 = 0.1;
alpha1 = w1(i)*x1 + w3(i)*x2 + w5(i)*x3;
z1 = 1./(1 + exp(-alpha1));
alpha2 = w2(i)*x1 + w4(i)*x2 + w6(i)*x3;
z2 = 1./(1 + exp(-alpha2));
alpha3 = w7(i)*z1 + w8(i)*z2 + w9(i)*x3;
y1 = 1./(1 + exp(-alpha3));
%Hidden layer gate z1
changew11 = eta*x1*z1*(1-z1)*w7(i)*y1*(1-y1)*(y1-0.1);
changew31 = eta*x2*z1*(1-z1)*w8(i)*y1*(1-y1)*(y1-0.1);
changew51 = eta*x3*z1*(1-z1)*w9(i)*y1*(1-y1)*(y1-0.1);
%Hidden layer gate z2
changew21 = eta*x1*z2*(1-z2)*w7(i)*y1*(1-y1)*(y1-0.1);
changew41 = eta*x2*z2*(1-z2)*w8(i)*y1*(1-y1)*(y1-0.1);
changew61 = eta*x3*z2*(1-z2)*w9(i)*y1*(1-y1)*(y1-0.1);
%Output layer
changew71 = eta*z1*y1*(1-y1)*(y1-0.1);
changew81 = eta*z2*y1*(1-y1)*(y1-0.1);
changew91 = eta*x3*y1*(1-y1)*(y1-0.1);
E1(i) = (y1-0.1)^2;
%P = 2
x1 = 0.1; x2 = 0.9;
alpha1 = w1(i)*x1 + w3(i)*x2 + w5(i)*x3;
z1 = 1./(1 + exp(-alpha1));
alpha2 = w2(i)*x1 + w4(i)*x2 + w6(i)*x3;
z2 = 1./(1 + exp(-alpha2));
alpha3 = w7(i)*z1 + w8(i)*z2 + w9(i)*x3;
y2 = 1./(1 + exp(-alpha3));
%Hidden layer gate z1
changew12 = eta*x1*z1*(1-z1)*w7(i)*y2*(1-y2)*(y2-0.9);
changew32 = eta*x2*z1*(1-z1)*w8(i)*y2*(1-y2)*(y2-0.9);
changew52 = eta*x3*z1*(1-z1)*w9(i)*y2*(1-y2)*(y2-0.9);
%Hidden layer gate z2
changew22 = eta*x1*z2*(1-z2)*w7(i)*y2*(1-y2)*(y2-0.9);
changew42 = eta*x2*z2*(1-z2)*w8(i)*y2*(1-y2)*(y2-0.9);
changew62 = eta*x3*z2*(1-z2)*w9(i)*y2*(1-y2)*(y2-0.9);
%Output layer
changew72 = eta*z1*y2*(1-y2)*(y2-0.9);
changew82 = eta*z2*y2*(1-y2)*(y2-0.9);
changew92 = eta*x3*y2*(1-y2)*(y2-0.9);
E2(i) = (y2-0.9)^2;
% %P = 3
% x1 = 0.9;
% x2 = 0.1;
% alpha1 = w1(i)*x1 + w3(i)*x2 + w5(i)*x3;
% z1 = 1./(1 + exp(-alpha1));
% alpha2 = w2(i)*x1 + w4(i)*x2 + w6(i)*x3;
% z2 = 1./(1 + exp(-alpha2));
% alpha3 = w7(i)*z1 + w8(i)*z2 + w9(i)*x3;
% y3 = 1./(1 + exp(-alpha3));
% %Hidden layer gate z1
% changew13 = eta*x1*z1*(1-z1)*w7(i)*y3*(1-y3)*(y3-0.9);
% changew33 = eta*x2*z1*(1-z1)*w8(i)*y3*(1-y3)*(y3-0.9);
% changew53 = eta*x3*z1*(1-z1)*w9(i)*y3*(1-y3)*(y3-0.9);
% %Hidden layer gate z2
% changew23 = eta*x1*z2*(1-z2)*w7(i)*y3*(1-y3)*(y3-0.9);
% changew43 = eta*x2*z2*(1-z2)*w8(i)*y3*(1-y3)*(y3-0.9);
% changew63 = eta*x3*z2*(1-z2)*w9(i)*y3*(1-y3)*(y3-0.9);
% %Output layer
% changew73 = eta*z1*y3*(1-y3)*(y3-0.9);
% changew83 = eta*z2*y3*(1-y3)*(y3-0.9);
% changew93 = eta*x3*y3*(1-y3)*(y3-0.9);
% E3(i) = (y3-0.9)^2;
% %P = 4
% x1 = 0.9;
% x2 = 0.9;
% x3 = 1;
% alpha1 = w1(i)*x1 + w3(i)*x2 + w5(i)*x3;
% z1 = 1./(1 + exp(-alpha1));
% alpha2 = w2(i)*x1 + w4(i)*x2 + w6(i)*x3;
% z2 = 1./(1 + exp(-alpha2));
% alpha3 = w7(i)*z1 + w8(i)*z2 + w9(i)*x3;
% y4 = 1./(1 + exp(-alpha3));
% %Hidden layer gate z1
% changew14 = eta*x1*z1*(1-z1)*w7(i)*y4*(1-y4)*(y4-0.1);
% changew34 = eta*x2*z1*(1-z1)*w8(i)*y4*(1-y4)*(y4-0.1);
% changew54 = eta*x3*z1*(1-z1)*w9(i)*y4*(1-y4)*(y4-0.1);
% %Hidden layer gate z2
% changew24 = eta*x1*z2*(1-z2)*w7(i)*y4*(1-y4)*(y4-0.1);
% changew44 = eta*x2*z2*(1-z2)*w8(i)*y4*(1-y4)*(y4-0.1);
% changew64 = eta*x3*z2*(1-z2)*w9(i)*y4*(1-y4)*(y4-0.1);
% %Output layer
% changew74 = eta*z1*y4*(1-y4)*(y4-0.1);
% changew84 = eta*z2*y4*(1-y4)*(y4-0.1);
% changew94 = eta*x3*y4*(1-y4)*(y4-0.1);
% E4(i) = (y4-0.1)^2;
sumE(i) = E1(i) + E2(i); %+ E3(i) + E4(i);
if sumE(i)<=0.004
break
end
i=i+1;
w1(i) = w1(i-1) + changew11+changew12;%+changew13+changew14;
w2(i) = w2(i-1) + changew21+changew22;%+changew23+changew24;
w3(i) = w3(i-1) + changew31+changew32;%+changew33+changew34;
w4(i) = w4(i-1) + changew41+changew42;%+changew43+changew44;
w5(i) = w5(i-1) + changew51+changew52;%+changew53+changew54;
w6(i) = w6(i-1) + changew61+changew62;%+changew63+changew64;
w7(i) = w7(i-1) + changew71+changew72;%+changew73+changew74;
w8(i) = w8(i-1) + changew81+changew82;%+changew83+changew84;
w9(i) = w9(i-1) + changew91+changew92;%+changew93+changew94;
end
figure(1);
grid on;
title('W values 1-9 Vs Iteration Number');
hold on;
plot(w1,'red');
plot(w2,'green');
plot(w3,'blue');
plot(w4,'cyan');
plot(w5,'magenta');
plot(w6,'yellow');
plot(w7,'black');
plot(w8,':red');
plot(w9,'-.green');
legend('w1','w2','w3','w4','w5','w6','w7','w8','w9','Location','Best');
figure(2);
grid on;
title('Error Vs Iteration Number');
hold on;
plot(sumE);
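
For reference, a minimal vectorized sketch of the same 2-2-1 sigmoid net trained in batch mode on all four patterns is given below. It is not the posted code: the names X, T, W1, W2 and lr are illustrative, and it assumes textbook backpropagation deltas (each hidden unit's error term uses that unit's own output-layer weight), which is not exactly what the posted loop computes.
% Minimal vectorized sketch (illustrative, not the posted code)
X  = [0.1 0.1 0.9 0.9;        % x1 for P = 1..4
      0.1 0.9 0.1 0.9;        % x2 for P = 1..4
      1   1   1   1  ];       % x3 (bias input)
T  = [0.1 0.9 0.9 0.1];       % targets for P = 1..4
W1 = rand(2,3) - 0.5;         % hidden weights: 2 units x [x1 x2 x3]
W2 = rand(1,3) - 0.5;         % output weights: 1 unit x [z1 z2 x3]
lr = 1;                       % learning rate (plays the role of -eta)
for it = 1:10000
    Z = 1./(1 + exp(-W1*X));             % hidden activations, 2 x 4
    Y = 1./(1 + exp(-W2*[Z; X(3,:)]));   % outputs, 1 x 4
    if sum((Y - T).^2) <= 0.004, break, end
    dY = (Y - T).*Y.*(1 - Y);            % output deltas, 1 x 4
    dZ = (W2(1:2)'*dY).*Z.*(1 - Z);      % hidden deltas, 2 x 4
    W2 = W2 - lr*dY*[Z; X(3,:)]';        % batch gradient step, output layer
    W1 = W1 - lr*dZ*X';                  % batch gradient step, hidden layer
end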

Best Answer

1. If the number of unknowns is greater than the number of equations, then a solution is not unique (How many solutions to x1 + x2 = 1?).
2. A net with more unknown weights than the number of training equations is said to be OVERFIT. The nonuniqueness of exact solutions is typically mitigated by various techniques mentioned below.
3. If an overfit net is trained with data consisting of signal + random contamination (noise, measurement error, round-off and/or truncation error), a LMSE (least-mean-square-error) solution obtained from the signal with one particular set of contamination may yield a large MSE for the same signal with a different set of contamination.
4. A net that performs well on nontraining data that can be assumed to be drawn from the same source as the training data is said to have good generalization, i.e., it generalizes well to nontraining data.
5. If a net is overfit but the signal-to-contamination power ratio is sufficiently high, iterative solutions tend to pass through regions of good generalization on the way to minimizing the training MSE; nets trained past those regions are said to be OVERTRAINED.
6. There are several methods to mitigate overtraining of an overfit net. See the comp.ai.neural-nets FAQ and search for overfit, overfitting and/or generalization.
7. For a single hidden layer MLP with H hidden nodes and an I-H-O node topology trained by Ntrn pairs of I-dimensional inputs and O-dimensional outputs:
Ntrneq = Ntrn*O % No. of training equations
Nw = (I+1)*H+(H+1)*O % No. of unknown weights.
Typically, Ntrn, I and O are given and a choice of H has to be made. To avoid overfitting, choose H to be less than the upper bound
Hub = -1 + ceil( (Ntrneq - O) / (I+O+1) ).
Sometimes Nw can also be reduced by lowering I, O, and/or H, e.g., by pruning connections.
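Plugging in the numbers for the posted net (I = 2, H = 2, O = 1 and Ntrn = 4, read from the question rather than stated above) gives a quick check:
% Worked numbers for the posted net, as read from the question
Ntrn = 4; I = 2; H = 2; O = 1;
Ntrneq = Ntrn*O                            % = 4 training equations
Nw     = (I+1)*H + (H+1)*O                 % = 9 unknown weights
Hub    = -1 + ceil((Ntrneq - O)/(I+O+1))   % = 0, i.e. with only 4 patterns any hidden layer overfits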
8. If avoiding overfitting does not yield an acceptable solution, then there are other mitigation techniques for avoiding overtraining of an overfit net (see the comp.ai.neural-nets FAQ; a minimal weight-decay sketch follows this list):
a. Validation set stopping
b. Regularization of the minimization objective
1. Weight decay
2. Weight elimination
3. Bayesian regularization
c. Jittering (training with added noise)
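As a rough illustration of item 8.b.1, a weight-decay step on a generic weight vector looks like the sketch below; lambda, eta and the placeholder gradient g are illustrative values, not part of the original answer.
% Minimal weight-decay sketch: minimize E(w) + lambda*sum(w.^2) instead of E(w) alone
lambda = 1e-3;                 % regularization strength (tune, e.g. by validation)
eta    = 1;                    % learning rate
w = rand(9,1) - 0.5;           % e.g. the nine weights w1..w9 stacked into a vector
g = randn(9,1);                % placeholder for dE/dw from backpropagation
w = w - eta*(g + 2*lambda*w);  % regularized step: usual gradient step plus weight decay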
Bottom Line:
If you have 9 unknown weights, you might want at least 45 to 90 training equations (roughly 5 to 10 times Nw), or else use a mitigation technique.
Hope this helps.
Thank you for formally accepting my answer.
Greg