Understanding probabilistic neural networks

machine learning · neural networks

I would like to better understand the basic concepts of probabilistic neural networks. Unfortunately, so far I have not found a resource that answers all the questions I have. My current understanding and my questions are as follows:

  • The first layer ("input layer") represents each feature as a node
  • The next layers are the hidden layers: Here we calculate the distance from the data sample (vector) we want to classify, to the average data vector of each class
  • "Summation layer": ?? What exactly happens here? Do the hidden layers calculate the distance of the new data vector to each of the training vectors, and the summation layer sums up all the distances for each of the classes..? Or..?
  • If I understand it correctly, the data from each class is modeled by a gaussian distribution, and the parameters of the gaussians are fit during training. Is it not enough then to calculate the probability of a new vector as coming from either gaussian? How does the distance calculation is important here?

Many thanks

Best Answer

PNNs are easy to understand with an example. So let's say I want to classify points in 2D with a PNN, and my training points are the blue and red dots in the figure below.

[Figure: labeled inputs (blue and red dots) and a test point (green cross)]

I can take as basis function a Gaussian with standard deviation, say, $\sigma = 0.1$. There is no training in a PNN once the width $\sigma$ of the Gaussian is fixed, so we'll now get to the core of your questions with this fixed $\sigma$ (you could of course try to find an optimal $\sigma$; more on that at the end).

So I want to classify the green cross at ($x=1.2$, $y=0.8$). What the PNN does is the following (a code sketch follows the list):

  1. the input layer is the feature vector ($x=1.2$, $y=0.8$);
  2. the hidden layer is composed of six nodes (corresponding to the six training dots): each node evaluates the Gaussian centered at its point, e.g. if the first node is the blue one at ($x_0=-1.2$, $y_0=-1.1$), then this node evaluates $G_0(x,y)\approx \exp\!\left(-\frac{(x-x_0)^2 + (y-y_0)^2}{2\sigma^2}\right)$. In the end, each node outputs the value of its Gaussian at the green cross (often the output is thresholded to zero when the value is too low).
  3. the summation layer is composed of $\#(\text{labels})$ nodes. Each real-valued output of step 2 is sent to the corresponding node (the three red nodes send their values to the red summation node, and the three blue nodes send their values to the blue summation node). Each label node sums the Gaussian values it received.
  4. the last node is just a max node: it takes all the outputs of the summation nodes and outputs the label of the summation node with the highest score.
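Here is a minimal NumPy sketch of those four steps. The training coordinates are hypothetical stand-ins for the figure's dots (only the blue node at $(-1.2, -1.1)$ and the green cross are given above), and `pnn_classify` is just an illustrative name:

```python
import numpy as np

def pnn_classify(x, train_X, train_y, sigma=0.1):
    """Classify point x by summing per-class Gaussian kernel activations."""
    # Hidden layer: one Gaussian per training point, centered at that point.
    sq_dists = np.sum((train_X - x) ** 2, axis=1)
    activations = np.exp(-sq_dists / (2 * sigma ** 2))
    # Summation layer: one node per label, summing its class's activations.
    labels = np.unique(train_y)
    scores = {label: activations[train_y == label].sum() for label in labels}
    # Output layer: the label whose summation node has the maximal score.
    return max(scores, key=scores.get), scores

# Hypothetical training data: three blue points, three red points.
train_X = np.array([[-1.2, -1.1], [-1.0, -0.8], [-0.9, -1.3],
                    [1.0, 0.9], [1.3, 0.7], [1.1, 1.1]])
train_y = np.array(["blue", "blue", "blue", "red", "red", "red"])

label, scores = pnn_classify(np.array([1.2, 0.8]), train_X, train_y)
print(label, scores)  # -> red (the blue scores are essentially 0)
```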

Here you can see that every blue point's Gaussian (with $\sigma=0.1$) evaluates to essentially 0 at the green cross, whereas the red ones take quite high values. The summation of all the blue Gaussians is therefore 0 (or almost), the red one is high, and so the max is the red label: the green cross is categorized as red.
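To make this concrete: the blue node given above sits at ($x_0=-1.2$, $y_0=-1.1$), so its squared distance to the green cross is $(1.2+1.2)^2 + (0.8+1.1)^2 = 5.76 + 3.61 = 9.37$, and its output is

$$G_0 = \exp\!\left(-\frac{9.37}{2 \cdot 0.1^2}\right) = \exp(-468.5) \approx 0,$$

whereas a red point at a distance of roughly $0.2$ (an illustrative value, since the figure's exact coordinates are not given) would contribute $\exp(-0.04/0.02) = \exp(-2) \approx 0.14$.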

As you pointed out, the main task is to find this $\sigma$. There are many techniques, and you can find plenty of training strategies online. You have to take a $\sigma$ small enough to capture the locality of the data, but not too small, otherwise you overfit. You can imagine cross-validating over a grid to pick the optimal one (sketched below). (Note also that you could, e.g., assign a different $\sigma_i$ to each label.)
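For instance, here is a sketch of such a grid search using leave-one-out cross-validation, reusing `pnn_classify` and the hypothetical data from the sketch above (the grid values are arbitrary):

```python
def loo_accuracy(train_X, train_y, sigma):
    """Leave-one-out accuracy of the PNN for a given sigma."""
    hits = 0
    for i in range(len(train_X)):
        mask = np.arange(len(train_X)) != i  # hold out point i
        label, _ = pnn_classify(train_X[i], train_X[mask], train_y[mask], sigma)
        hits += label == train_y[i]
    return hits / len(train_X)

sigmas = [0.01, 0.05, 0.1, 0.5, 1.0]
best = max(sigmas, key=lambda s: loo_accuracy(train_X, train_y, s))
print("best sigma on this grid:", best)
```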

Here's a video which is well done!
