Solved – How and why would MLPs for classification differ from MLPs for regression? Different backpropagation and transfer functions

backpropagation, machine learning, regression

I'm using two 3-layer feedforward multi-layer perceptrons (MLPs). With the same input data (14 input neurons), one network does the classification (true/false) and the other does the regression (if true, "how much")¹.
Until now, I've lazily used Matlab's patternnet and fitnet, respectively. Lazily, because I haven't taken the time to really understand what's going on, and I should. Moreover, I need to make the transition to an OSS library (probably FANN), which will likely require more manual setup than the Matlab NN Toolbox. Therefore, I'm trying to understand more precisely what's going on.

The networks created by patternnet and fitnet are nearly identical: 14 input neurons, 11 hidden neurons, and 1 target neuron (2 for the patternnet, but carrying only 1 piece of information). But they're not completely identical: by default, they differ in the training (backpropagation) function and in the transfer functions used.

Should those differences be there?

What kind of backpropagation functions are optimal for classification, and what kind for regression, and why?

What kind of transfer functions are optimal for classification, and what kind for regression, and why?


¹The classification is for "cloudy" or "cloud-free" (2 complementary targets), the regression is for quantifying "how much cloud" (1 target).
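A quick way to see exactly where the two default configurations differ is to create both networks and inspect the relevant properties. A minimal sketch (property names as in the Matlab NN Toolbox; the exact default values depend on the toolbox version, so check on your own installation):

```matlab
% Create the two networks with 11 hidden neurons each.
netc = patternnet(11);   % classification
netr = fitnet(11);       % regression / function fitting

% Training (backpropagation) function and performance criterion.
{netc.trainFcn, netr.trainFcn}
{netc.performFcn, netr.performFcn}

% Transfer functions of the hidden and output layers.
{netc.layers{1}.transferFcn, netr.layers{1}.transferFcn}
{netc.layers{2}.transferFcn, netr.layers{2}.transferFcn}
```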

Best Answer

The key difference is in the training criterion. A least-squares training criterion is often used for regression, as this gives (penalised) maximum likelihood estimation of the model parameters assuming Gaussian noise corrupting the response (target) variable. For classification problems it is common to use a cross-entropy training criterion, which gives maximum likelihood estimation assuming a Bernoulli or multinomial distribution for the targets. Either way, the model outputs can be interpreted as estimates of the probability of class membership, but it is common to use logistic or softmax activation functions in the output layer so that the outputs are constrained to lie between 0 and 1 and (for softmax) to sum to 1. If you use the tanh function, you can remap its outputs onto probabilities by adding one and dividing by two (it is otherwise equivalent). tanh is a good choice for the hidden-layer activation functions.
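Concretely, for network outputs $y_n$ and targets $t_n$ (single output, penalty term omitted), the two criteria being compared are

$$E_{\text{ls}} = \tfrac{1}{2}\sum_n (y_n - t_n)^2, \qquad E_{\text{ce}} = -\sum_n \left[ t_n \ln y_n + (1 - t_n)\ln(1 - y_n) \right],$$

and the tanh-to-probability remapping mentioned above is simply $p = (\tanh(a) + 1)/2$.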

The difference between scaled conjugate gradients and Levenberg-Marquardt is likely to be fairly minor in terms of generalisation performance.

I would strongly recommend the NETLAB toolbox for MATLAB over MATLAB's own neural network toolbox. It is probably a good idea to investigate Bayesian regularisation to avoid over-fitting (Chris Bishop's book is well worth reading, and most of it is covered in the NETLAB toolbox).
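If you do move to NETLAB, note that (as far as I remember) the output activation you choose also selects the matching error function: 'linear' pairs with sum-of-squares, and 'logistic'/'softmax' with cross-entropy, which mirrors the pairing described above. A minimal sketch, assuming NETLAB is on the path and using placeholder variable names for the data:

```matlab
% Assumes x is an N-by-14 input matrix, tc an N-by-1 0/1 class target,
% and tr an N-by-1 real-valued target (placeholder names for illustration).
nin = 14; nhid = 11;

% Regression network: linear output, paired with a sum-of-squares error.
netreg = mlp(nin, nhid, 1, 'linear');

% Classification network: logistic output, paired with a cross-entropy
% error, so the output can be read as P(class = 1 | x).
netcls = mlp(nin, nhid, 1, 'logistic');

% Train both with scaled conjugate gradients.
options = zeros(1, 18);
options(1)  = 1;     % display error values during training
options(14) = 100;   % maximum number of training cycles
netreg = netopt(netreg, options, x, tr, 'scg');
netcls = netopt(netcls, options, x, tc, 'scg');

% Forward pass: yreg is the predicted quantity, ycls the class probability.
yreg = mlpfwd(netreg, x);
ycls = mlpfwd(netcls, x);
```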
