Solved – What are the internal differences between classification and regression neural networks

classification, regression, tensorflow

I am quite new to neural networks, so please forgive any stupidity I express. This question asks why the rounded output of a regression model is not similar to the output of a classification model and also asks how a measure of confidence in outputs can be extracted from a classification model.

I am using skflow and TensorFlow to test regression and classification models on a simple dataset provided by sklearn (the iris dataset). I am creating the models by specifying a similar architecture for each of them. When I train the models on the dataset, I note that the classifier seems to massively outperform the regressor in the way that I have specified things.

Why is the performance so different? How wrong am I to try to convert the regressor to a classifier by rounding its output? How could a classifier express its individual outputs together with a measure of its confidence in these individual outputs?

My code is as follows:

#!/usr/bin/env python

from __future__ import division

import random
import numpy
import sklearn
import sklearn.datasets
import sklearn.metrics
import skflow

def main():

    epochs = 10000

    random.seed(42)

    dataset = sklearn.datasets.load_iris()

    # Create regression and classification models.

    model_regression = skflow.TensorFlowDNNRegressor(
        hidden_units  = [200, 300, 300, 300, 200],
        n_classes     = 0,
        learning_rate = 0.1,
        steps         = epochs
    )

    model_classification = skflow.TensorFlowDNNClassifier(
        hidden_units  = [200, 300, 300, 300, 200],
        n_classes     = 3,
        learning_rate = 0.1,
        steps         = epochs
    )

    # Train.

    model_regression.fit(dataset.data, dataset.target)
    model_classification.fit(dataset.data, dataset.target)

    # Print a listing of the target, the regression prediction rounded and the
    # classification prediction.

    print(
        "target",
        "regression prediction rounded",
        "classification prediction"
    )
    for target, prediction_regression, prediction_classification in zip(
        dataset.target,
        list(model_regression.predict(dataset.data)),
        list(model_classification.predict(dataset.data))
    ):
        print(
            target,
            int(round(float(prediction_regression[0]))),
            prediction_classification
        )

    # Calculate the prediction accuracies for the regression rounded and the
    # classification.

    # round() needs a scalar, so take the single value out of each prediction,
    # as in the listing above.
    regression_predictions_rounded = numpy.array([
        int(round(float(value[0])))
        for value in model_regression.predict(dataset.data)
    ])
    # accuracy_score expects (y_true, y_pred).
    score_regression = sklearn.metrics.accuracy_score(
        dataset.target,
        regression_predictions_rounded
    )

    score_classification = sklearn.metrics.accuracy_score(
        dataset.target,
        model_classification.predict(dataset.data)
    )

    print("regression prediction accuracy on training dataset: {percentage}".format(
        percentage = 100 * score_regression
    ))

    print("classification prediction accuracy on training dataset: {percentage}".format(
        percentage = 100 * score_classification
    ))

if __name__ == "__main__":
    main()

Best Answer

My answer might not be the full answer/truth you are looking for, but 8 months after the question was asked I feel confident that it's better than nothing ^^

General difference Regression/Classification:

Classification means predicting a discrete-valued output (e.g. type 1 or 2 or 3).
Regression means predicting a continuous-valued output (e.g. housing prices as a function of house size).
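In neural-network terms, this difference usually shows up in the loss that training minimizes. The sketch below is plain Python and assumes a squared-error loss for regression and a cross-entropy loss for classification; these are common defaults, not something confirmed from the skflow source, and the function names and numbers are purely illustrative.

```python
import math

def squared_error(prediction, target):
    """Regression loss: one real-valued output compared with a numeric target."""
    return (prediction - target) ** 2

def cross_entropy(probabilities, target_class):
    """Classification loss: negative log of the probability given to the true class."""
    return -math.log(probabilities[target_class])

# True class is 2. The regression loss treats the labels as numbers, so
# predicting 1.0 counts as "closer" than predicting 0.0.
print(squared_error(1.0, 2))  # 1.0
print(squared_error(0.0, 2))  # 4.0

# The classification loss only looks at the probability of the true class,
# so it does not matter which wrong class received the remaining mass.
print(cross_entropy([0.1, 0.8, 0.1], target_class=2))
print(cross_entropy([0.8, 0.1, 0.1], target_class=2))
```

With the regression loss, the numeric coding of the classes changes what counts as a small mistake; with the classification loss, the class indices are just names.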

If you take a look at the sklearn examples here, you will see that the iris data is used for the Linear Classifier, Neural Network, and Custom Model (also a DNN) examples. What they all have in common is that they are classifiers. The Linear Regression example on that page uses boston.data instead.

What I want to say is that some algorithms fit your data better than others. The iris data is meant to be classified (it describes flowers; to cite Wikipedia, "this data set became a typical test case for many statistical classification"), while boston.data has a continuous-valued target suited to regression (Boston house prices, a classic regression problem).

I can't tell you to what degree it is wrong to convert the regressor to a classifier by rounding its output, but it uses the wrong math for the problem: a regressor trained on the labels 0, 1 and 2 treats them as points on a number line, so it assumes the classes are ordered and evenly spaced. It might work, but it surely has downsides; for example, you may need a lot more training data. I'm not that deep into the math, sorry. What I can say is that even if one algorithm outperforms the other on your machine, that doesn't mean it would still be faster if you ran it on a GPU instead of a CPU, or in the cloud.
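To make the contrast concrete, here is a minimal sketch in plain Python of the two ways of arriving at a class label. It assumes the classifier ends in a softmax over one raw score per class, which is the usual design but is not verified against skflow here; all names and numbers are hypothetical.

```python
import math

def round_to_class(value, n_classes):
    """Turn a single regression output into a class index by rounding and clipping."""
    return min(max(int(round(value)), 0), n_classes - 1)

def softmax(scores):
    """Turn one raw score per class into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_with_confidence(scores):
    """Return the most probable class and the probability assigned to it."""
    probabilities = softmax(scores)
    predicted = max(range(len(probabilities)), key=probabilities.__getitem__)
    return predicted, probabilities[predicted]

# Rounding yields a label but says nothing about how certain the model is,
# and out-of-range outputs have to be clipped by hand.
print(round_to_class(1.4, n_classes=3))
print(round_to_class(3.7, n_classes=3))

# The softmax output gives a label together with a probability for it.
label, confidence = classify_with_confidence([0.5, 2.0, 1.8])
print(label, round(confidence, 3))
```

The probability of the predicted class is one simple measure of confidence. Many classifier APIs expose these per-class probabilities directly, for example through scikit-learn's `predict_proba` method on estimators that support it.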

Either you have a performance issue because of massive datasets and/or hardware limitations and need an algorithm that might not take care of everything, or you take what is reputedly the best algorithm and test and compare it with alternatives. With large datasets you can often just drop a random subset of the samples and solve your performance issue that way.
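Dropping random samples can be sketched as follows. This is an illustrative helper in plain Python, not part of skflow or sklearn; the name `subsample` and its parameters are made up for the example.

```python
import random

def subsample(features, targets, fraction, seed=0):
    """Keep a random fraction of the rows, preserving the feature/target pairing."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # local generator, so the global seed is untouched
    n_keep = max(1, int(len(features) * fraction))
    kept = sorted(rng.sample(range(len(features)), n_keep))
    return [features[i] for i in kept], [targets[i] for i in kept]

# Toy data: 10 rows with 2 features each, labels cycling through 0, 1, 2.
X = [[i, i * 2] for i in range(10)]
y = [i % 3 for i in range(10)]

X_small, y_small = subsample(X, y, fraction=0.5)
print(len(X_small), len(y_small))  # 5 5
```

On a dataset as small as iris (150 rows) this is unlikely to be worthwhile; it only pays off when training time is the actual bottleneck.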

Related Solutions

Solved – Improving the SVM classification of diabetes

I have 4 suggestions:

1. How are you choosing the variables to include in your model? Maybe you are missing some of the key indicators from the larger dataset.
2. Almost all of the indicators you are using (such as sex, smoker, etc.) should be treated as factors. Treating these variables as numeric is wrong, and is probably contributing to the error in your model.
3. Why are you using an SVM? Did you try any simpler methods, such as linear discriminant analysis or even linear regression? Maybe a simple approach on a larger dataset will yield a better result.
4. Try the caret package. It will help you cross-validate model accuracy, it is parallelized, which will let you work faster, and it makes it easy to explore different types of models.

Here is some example code for caret:

library(caret)

# Parallelize
library(doSMP)
w <- startWorkers()
registerDoSMP(w)

# Build model: the first column is the outcome, the rest are predictors
X <- train.set[,-1]
Y <- factor(train.set[,1], levels=c('N','Y'))
model <- train(X, Y, method='lda')

# Evaluate model on test set (the reference must be a factor with the same levels)
print(model)
predY <- predict(model, test.set[,-1])
confusionMatrix(predY, factor(test.set[,1], levels=c('N','Y')))
stopWorkers(w)

This LDA model beats your SVM, and I didn't even fix your factors. I'm sure if you recode Sex, Smoker, etc. as factors, you will get better results.
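For readers working in Python, where there is no built-in factor type, a commonly used counterpart to recoding a variable as a factor is one-hot (dummy) encoding. The sketch below is plain Python; the `smoker` values are hypothetical and not taken from the diabetes dataset.

```python
def one_hot_encode(values):
    """Map each distinct category to its own 0/1 indicator column."""
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    encoded = []
    for value in values:
        row = [0] * len(levels)
        row[index[value]] = 1
        encoded.append(row)
    return levels, encoded

# Hypothetical categorical column.
smoker = ["never", "current", "former", "never"]
levels, encoded = one_hot_encode(smoker)
print(levels)      # ['current', 'former', 'never']
print(encoded[0])  # [0, 0, 1]
```

Unlike coding the categories as 0, 1, 2, this does not impose an order or a spacing between them.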

Solved – Neural Networks Regression Model

These are different weight (coefficient) regularization methods. Weight regularization modifies the objective function that we minimize by adding additional terms that penalize large weights. The suitability of the specific weight regularization you want to apply depends on your data and use case.

Lasso (L1) regularization is widely used in domains with massive datasets where efficient and fast algorithms are essential. The lasso is not robust to high correlations among predictors: it will arbitrarily choose one of them and ignore the others, and it breaks down when all predictors are identical. The lasso penalty expects many coefficients to be close to zero and only a small subset to be larger (and nonzero).

Ridge (L2) regularization is ideal if there are many predictors all with non-zero coefficients and drawn from a normal distribution. It shrinks the coefficients of correlated predictors equally towards zero, and is therefore suitable for cases with many predictors, each having small effects on the outcome. L2 regularization prevents coefficients of linear regression models with many correlated variables from being poorly determined and exhibiting high variance.

Elastic net is an extension of the lasso that is robust to extreme correlations among the predictors. It uses a mixture of the L1 (lasso) and L2 (ridge) penalties, and was first proposed for analyzing high dimensional data.
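The three penalties can be written down in a few lines. This is a sketch in plain Python under one common parametrization, in which a mixing weight `l1_ratio` blends the L1 and squared L2 terms; libraries differ in the exact scaling constants, so treat the names and the formula as assumptions rather than any particular library's definition.

```python
def l1_penalty(weights, strength):
    """Lasso term: strength times the sum of absolute weights."""
    return strength * sum(abs(w) for w in weights)

def l2_penalty(weights, strength):
    """Ridge term: strength times the sum of squared weights."""
    return strength * sum(w * w for w in weights)

def elastic_net_penalty(weights, strength, l1_ratio):
    """Blend of both terms; l1_ratio=1 is pure lasso, l1_ratio=0 is pure ridge."""
    return (l1_ratio * l1_penalty(weights, strength)
            + (1.0 - l1_ratio) * l2_penalty(weights, strength))

weights = [0.5, -2.0, 0.0]
print(l1_penalty(weights, strength=1.0))                        # 2.5
print(l2_penalty(weights, strength=1.0))                        # 4.25
print(elastic_net_penalty(weights, strength=1.0, l1_ratio=0.5))  # 3.375
```

During training, the chosen penalty is added to the data-fit loss, so the optimizer trades prediction error against weight size.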

For further details on these three, see [1].

A final weight regularization method, specific to (deep) neural networks with multiple hidden layers, is the max norm constraint.

Max norm regularization has a similar goal: restricting the weights from becoming too large. Max norm constraints enforce an absolute upper bound on the magnitude of the incoming weight vector of every neuron and use projected gradient descent to enforce the constraint. In other words, any time a gradient descent step moves the incoming weight vector such that $\|w\|_2 > c$, we project the vector back onto the sphere (centered at the origin) with radius $c$.
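The projection step described above can be sketched in plain Python as follows. The function name `apply_max_norm` is illustrative; in practice it would be applied to each neuron's incoming weight vector after every gradient update.

```python
import math

def apply_max_norm(weights, c):
    """Rescale the vector onto the sphere of radius c if its L2 norm exceeds c."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm > c:
        scale = c / norm
        return [w * scale for w in weights]
    return list(weights)  # already inside the ball: leave unchanged

# The norm of [3, 4] is 5, which exceeds c = 2, so the vector is scaled down.
print(apply_max_norm([3.0, 4.0], c=2.0))

# The norm of [0.3, 0.4] is 0.5, so nothing happens.
print(apply_max_norm([0.3, 0.4], c=2.0))
```

The projection only changes the length of the vector, not its direction.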

Alternative ways to prevent over-fitting

Note that you can regularize a neural net with other methods as well, such as:

i. Modifying the number of hidden units. A higher number of hidden units corresponds to a more complex model, and you can carry out model comparison by looking at the goodness of fit (e.g. the maximized likelihood) together with a complexity penalty, as in AIC or BIC.

ii. Early stopping: you stop training your model once you observe that generalization is deteriorating, i.e. the error on held-out data starts growing.

iii. Applying dropout to prevent co-adaptation of neurons. This has a lengthier explanation, so I refer you to the original paper by Srivastava, Hinton et al. instead.
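As an illustration of the early stopping rule in (ii), here is a minimal sketch in plain Python. It assumes a `patience` parameter (how many non-improving epochs to tolerate), which is a common convention rather than something stated above, and it works on a precomputed list of held-out losses instead of a real training loop.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the index of the epoch whose weights would be kept."""
    best_epoch = 0
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # generalization has stopped improving
    return best_epoch

# The held-out loss improves until epoch 2, then gets worse twice in a row,
# so training stops before the later value 0.50 is ever seen.
losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.50]
print(early_stopping_epoch(losses, patience=2))  # 2
```

A larger `patience` tolerates longer plateaus at the cost of more training time; with `patience=3` the same sequence would run on to epoch 5.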
