Solved – Is the interpretation of Neural Network results correct

classificationmeasurement errormodel-evaluationneural networkstext mining

I use a neural network with a topology of 17-30-1 (sigmoid, atan error function, mse as cost function, 5-fold cv) for text classification. (It's closely related to a previous question of mine.)

The input data is quite noisy thus I could live with not a "perfect" classification score, but the results I get are probably too bad (or even just random) and thus I ask for your opinion.

The training error is around 0.06-0.09 (MSE), i.e. in average each classification differs approx. 0.25-0.3 from the predict label; in this binary case with a class threshold of 0.5 this might be acceptable. What do you think?
The test error (MSE) is unfortunately around 0.20 sometimes even 0.25; i.e. the effective error for a test sample is around 0.5, which to me means that the network a) suffers from high variance and b) is just as good as random guessing.

I don't need a perfect classification, but the network should however represent the patterns of the input data. But with this results I think the neural network is more or less useless or rather the input features are crap.

Best Answer

First, I would advise you to not use squared error but the cross entropy error. Squared error results from the assumption that your labels are subject to Gaussian noise, which will probably not be the case.

First, the output of your network should be a softmax:

$z_k = \frac{\exp{y_k}}{\sum_i\exp{y_i}}$

This is basically a logistic regression layer on top of the neural network, and gives you a proper probability. You can train that with the cross entropy error function (see here for an explanation)--the derivatives stay the same as for squared error and linear outputs.

Regarding the interpretation of the results: this is data set specific. If it is a hard task, that looks good. However, you should look at the acutal number of correct classifications in the end and see if that is good enough for your application. Anyway I think you will get better results if you use cross entropy.

Related Solutions

Solved – Verifying neural network model performance

Just to make sure we are on the same page: You have a sequence of 1000 samples with 7 features each. There is a sequential pattern in there, which is why you process them with an RNN. At each timestep:

It depends. It might get better if you use different normalizations, hard to tell.
To me it just sounds like classification. I am not sure what you mean by ranking exactly.
No reason to be skeptical. Normally, training error drops like that--extremly quick for few iterations, very slow afterwards.
No, absolutely not. For some tasks, less than 100 iterations (= passes over the training set) suffice.
You are the one who has to say whether the error is small enough. :) We can't tell you without knowing what you are using the network for.
Hard to tell. You should use early stopping instead. Train the network until the error on some held out validation set rises--that's the moment from which on you only overfit. Use the weights found then to evaluate on a test set. (That makes it three sets: training, validation, test set).

Here are some tips that I can give:

make sure to clamp your maximal updates to some fixed value. E.g. when you do a learning step, don't apply updates bigger than 0.1 (RPROP can already do this),
try Long Short-Term Memory,
try Hessian free optimization (Ilya Sutskever has code on his webpage).

Solved – How to incorporate the biases in the feed-forward neural network

Anyone new to NN may feel confused when first read NN tutorials with different notations. Some tutorials use 'biases', while others use 'bias units'. The ideas about the role of bias are just the same, which is well illustrated in this question, but the two notations are based on a slight implementation difference I think. The following two are for the same network with the same input layer and first hidden layer.

Implementation for 'biases':
The input layer with $m$ units is represented by a $1\times m$ matrix, $v$ here; the hidden layer with $n$ units is represented by a $1\times n$ matrix, $h$; the weights from the input to the hidden layer is represented by a $m\times n$ weight matrix, $w$; the bias to the hidden layer is represented by an another $1\times n$ matrix, $b$. A forward pass is carried out by $h = v * w + b$ and then apply activation function to $h$.

Implementation for 'bias units':
The input layer with $1\times (m+1)$ units is represented by a $1\times (m+1)$ matrix $v$, and the first unit is a bias unit with constant value $1$; the weight matrix from the input to the hidden layer is of size $(m+1) \times n$, and the first row's values are weights corresponding to the bias; the hidden layer has $n+1$ units in which the first unit is a bias unit with constant value $1$ not affected by forward passes. A forward pass is carried out by $h=v*w$ and then apply activation function to $h$.

The following image quoted from holehouse.org is an illustration of the second implementation.

Both of the two implementations are common, so deal with the question based on the notation. According to the given conditions, your question follows the first implementation. Suppose your v is a one unit vector [2.8], the following is an R implementation of the forward pass.

logistic <- function(vec){
  size = length(vec);
  for(i in 1:size){
    vec[i] = 1 / (1 + exp(-vec[i]));
  }
  return (vec);
}

v = c(2.8)
w = c(0.12,0.86,0.20,0.5)
b = c(7.12,-6.20,0.90,-3.6)
result = logistic(v%*%t(w) + b)
result
      [,1]       [,2]      [,3]       [,4]
[1,] 0.9994224 0.02205315 0.8115327 0.09975049

Besides, if it is the second implementation, the input layer becomes [1, 2.8], the biases are merged to the weight matrix, which becomes [7.12,−6.20,0.90,−3.6; 0.12,0.86,0.20,0.5], and the hidden layer has a bias unit.

v = c(1,2.8)
w = matrix (nrow = 2, ncol = 4)
w[1, ] = c(7.12,-6.20,0.90,-3.6);
w[2, ] = c(0.12,0.86,0.20,0.5);
result = logistic(v%*%w)
result
      [,1]       [,2]      [,3]       [,4]
[1,] 0.9994224 0.02205315 0.8115327 0.09975049
h = c(1, result);
h
[1] 1.00000000 0.99942237 0.02205315 0.81153267 0.09975049

Best Answer

Related Solutions

Solved – Verifying neural network model performance

Solved – How to incorporate the biases in the feed-forward neural network

Related Question