Solved – Understanding probabilistic neural networks

machine learning, neural networks

I would like to understand the basic concepts of probabilistic neural networks better. Unfortunately so far I have not found a resource which answers all the questions I have. So far my understanding and my questions are as follows:

The first layer ("input layer") represents each feature as a node
The next layers are the hidden layers: Here we calculate the distance from the data sample (vector) we want to classify, to the average data vector of each class
"Summation layer": ?? What exactly happens here? Do the hidden layers calculate the distance of the new data vector to each of the training vectors, and the summation layer sums up all the distances for each of the classes..? Or..?
If I understand it correctly, the data from each class is modeled by a Gaussian distribution, and the parameters of the Gaussians are fit during training. Is it not enough, then, to calculate the probability of a new vector coming from each Gaussian? Why is the distance calculation important here?

Many thanks

Best Answer

PNNs are easy to understand with an example. Say I want to use a PNN to classify points in 2D, and my training points are the blue and red dots in the figure:

As the base function I can take a Gaussian with width (standard deviation) $\sigma$, say $\sigma = 0.1$. Once $\sigma$ is fixed, there is no training in a PNN, so we'll now get to the core of your questions with this fixed $\sigma$ (you could of course try to find an optimal $\sigma$...).

So I want to classify this green cross ($x=1.2$, $y=0.8$). What the PNN does is the following:

1. The input layer is the feature vector ($x=1.2$, $y=0.8$).
2. The hidden layer is composed of six nodes (corresponding to the six training dots). Each node evaluates the Gaussian centered at its training point, e.g. if the first node is the blue one at ($x_0=-1.2$, $y_0=-1.1$), then this node evaluates $G_0(x,y) \propto \exp\left(-\frac{(x-x_0)^2 + (y-y_0)^2}{2\sigma^2}\right)$. In the end, each node outputs the value of its Gaussian at the green cross (often you threshold when the value is too low).
3. The summation layer is composed of $\#(labels)$ nodes. Each real-valued output of step 2 is sent to the corresponding node (the three red nodes send their values to the red summation node and the three blue nodes send their values to the blue summation node). Each label node sums the Gaussian values it receives.
4. The last node is just a max node that takes all the outputs of the summation nodes and outputs the max, i.e. the label whose node had the highest score.

Here you can see that the Gaussian of every blue point (with $\sigma=0.1$) is essentially 0 at the green cross, whereas the red ones have fairly high values. The sum of all the blue Gaussians is therefore 0 (or almost), and the sum of the red ones is high, so the max is the red label: the green cross is classified as red.
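The steps above can be sketched in a few lines of Python. This is only an illustration, not a reference implementation. Apart from the blue point at (-1.2, -1.1) and the green cross at (1.2, 0.8), the training coordinates are invented, since the figure's actual values are not given in the text. The names `gaussian_kernel` and `pnn_classify` are hypothetical too.

```python
import math

# Hypothetical training set: only the first blue point and the query point
# come from the text; the other five coordinates are invented for illustration.
TRAIN = [
    ((-1.2, -1.1), "blue"),
    ((-1.0, -0.8), "blue"),
    ((-0.9, -1.3), "blue"),
    ((1.0, 0.9), "red"),
    ((1.3, 0.7), "red"),
    ((1.1, 1.0), "red"),
]


def gaussian_kernel(point, center, sigma):
    """Unnormalized Gaussian centered at `center`, evaluated at `point`."""
    sq_dist = sum((p - c) ** 2 for p, c in zip(point, center))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def pnn_classify(point, train, sigma):
    """Return (predicted label, per-label summed kernel values)."""
    scores = {}
    for center, label in train:
        # Hidden layer: one node per training point.
        value = gaussian_kernel(point, center, sigma)
        # Summation layer: one node per label.
        scores[label] = scores.get(label, 0.0) + value
    # Output layer: pick the label with the largest sum.
    return max(scores, key=scores.get), scores


label, scores = pnn_classify((1.2, 0.8), TRAIN, sigma=0.1)
print(label, scores)
```

With these made-up points the blue sum is vanishingly small and the red sum dominates, which mirrors the reasoning in the answer.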

As you pointed out, the main task is to find this $\sigma$. There are many techniques, and plenty of training strategies are described online. $\sigma$ has to be small enough to capture locality, but not too small, otherwise you overfit. One option is to cross-validate over a grid of candidate values and keep the best one. (Note also that you could, for example, assign a different $\sigma_i$ to each label.)
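One possible way to pick $\sigma$ from a grid is leave-one-out cross-validation, sketched below. Everything here is an assumption made for illustration: the training coordinates, the candidate grid `SIGMA_GRID`, and the helper names `classify` and `loo_accuracy`. On a tiny, well-separated set like this one, several values of $\sigma$ tie, so the sketch mainly shows the mechanics.

```python
import math

# Hypothetical 2D training data (coordinates invented for illustration).
TRAIN = [
    ((-1.2, -1.1), "blue"), ((-1.0, -0.8), "blue"), ((-0.9, -1.3), "blue"),
    ((1.0, 0.9), "red"), ((1.3, 0.7), "red"), ((1.1, 1.0), "red"),
]


def classify(point, train, sigma):
    """PNN decision: sum Gaussian kernel values per label, return the argmax."""
    scores = {}
    for center, label in train:
        sq_dist = sum((p - c) ** 2 for p, c in zip(point, center))
        kernel = math.exp(-sq_dist / (2 * sigma ** 2))
        scores[label] = scores.get(label, 0.0) + kernel
    return max(scores, key=scores.get)


def loo_accuracy(train, sigma):
    """Leave-one-out accuracy: classify each point using all the others."""
    hits = 0
    for i, (point, label) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        if classify(point, rest, sigma) == label:
            hits += 1
    return hits / len(train)


SIGMA_GRID = [0.01, 0.05, 0.1, 0.5, 1.0, 5.0]  # arbitrary candidate values
results = {sigma: loo_accuracy(TRAIN, sigma) for sigma in SIGMA_GRID}
# Ties are broken in favor of the first value in the grid.
best_sigma = max(results, key=results.get)
print(results, best_sigma)
```

A very large $\sigma$ makes every kernel value close to 1, so the decision degenerates into counting points per label. Very small values can underflow to 0 in floating point. Both are reasons to keep the grid within a sensible range for the data.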

Here's a video which is well done!

Related Solutions

Solved – Incremental training of Neural Networks

I would suggest using transfer learning techniques. Basically, they transfer the knowledge from your big, old dataset to your fresh, small dataset.

Try reading: A Survey on Transfer Learning and the algorithm TrAdaBoost.

Solved – How to calculate output of a Neural Network

Contrary to what others have said, inputs do not have to be binary. They should lie within a certain range (0 to 1 for sigmoid, -1 to 1 for tanh).

On the first part you are exactly right if you don't account for bias.

// activationFunction is the logistic sigmoid, 1 / (1 + exp(-x)), which matches the values below
// Completely right: each hidden node gets input from 2 input nodes
activationFunction((1 * .25) + (1 * .10)) // 0.5866175789173301
activationFunction((0 * .40) + (1 * .60)) // 0.6456563062257954
activationFunction((1 * .20) + (0 * .80)) // 0.549833997312478

// However, all the hidden nodes are connected to the output node
// (hidden outputs rounded to two decimals)
output = activationFunction((0.59 * weight1) + (0.65 * weight2) + (0.55 * weight3))

Always keep in mind that nodes can only be connected to other nodes by connections, which always have a weight.
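The forward pass described above could be written out roughly as follows. The hidden-layer inputs and weights are the ones from the example. The hidden-to-output weights in `output_weights` are hypothetical, because the original leaves `weight1` to `weight3` unspecified. The activation is assumed to be the logistic sigmoid, which reproduces the quoted hidden-node values.

```python
import math


def sigmoid(x):
    """Logistic activation; reproduces the hidden-node values quoted above."""
    return 1.0 / (1.0 + math.exp(-x))


# (inputs, weights) for each hidden node, taken from the example above.
hidden_inputs = [
    ((1, 1), (0.25, 0.10)),
    ((0, 1), (0.40, 0.60)),
    ((1, 0), (0.20, 0.80)),
]
hidden = [
    sigmoid(sum(i * w for i, w in zip(ins, ws)))
    for ins, ws in hidden_inputs
]

# Hypothetical hidden-to-output weights (not given in the original).
output_weights = (0.5, -0.3, 0.8)
output = sigmoid(sum(h * w for h, w in zip(hidden, output_weights)))
print(hidden, output)
```

Adding a bias would just mean adding one more term inside each `sigmoid(...)` call.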

You asked: "if you're feeding in two scaled numbers to predict grades, 89 & 6.5 (grade / hours of sleep)".

First you scale the inputs (read more here):

89 -> 89 / 100 = 0.89
6.5 -> 6.5 / 24 = 0.27

So if the new grade you got was 100 and your output was 0.8559, then the error on your output node is 1.00 - 0.8559 = 0.1441. You then backpropagate this error through the network, but I'm not the right person to explain that to you.
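The scaling and the output error above amount to the following arithmetic. This is a small sketch assuming grades are on a 0 to 100 scale and sleep is measured in hours out of 24; the constant names are made up for illustration.

```python
MAX_GRADE = 100.0      # assumed grading scale
HOURS_PER_DAY = 24.0   # sleep is scaled by the length of a day

grade_scaled = 89 / MAX_GRADE        # 0.89
sleep_scaled = 6.5 / HOURS_PER_DAY   # about 0.27

target = 100 / MAX_GRADE             # the new grade, scaled the same way
network_output = 0.8559              # output value quoted in the text
error = target - network_output      # about 0.1441
print(grade_scaled, round(sleep_scaled, 2), round(error, 4))
```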
