Anyone new to neural networks may feel confused when first reading NN tutorials that use different notations. Some tutorials use 'biases', while others use 'bias units'. The idea behind the role of the bias is the same in both cases, which is well illustrated in this question, but I think the two notations reflect a slight implementation difference. The following two descriptions are for the same network, with the same input layer and first hidden layer.
Implementation for 'biases':
The input layer with $m$ units is represented by a $1\times m$ matrix, $v$ here; the hidden layer with $n$ units is represented by a $1\times n$ matrix, $h$; the weights from the input layer to the hidden layer are represented by an $m\times n$ weight matrix, $w$; the bias to the hidden layer is represented by another $1\times n$ matrix, $b$. A forward pass is carried out by computing $h = v * w + b$ and then applying the activation function to $h$.
Implementation for 'bias units':
The input layer with $m+1$ units is represented by a $1\times (m+1)$ matrix $v$, whose first unit is a bias unit with constant value $1$; the weight matrix from the input layer to the hidden layer is of size $(m+1) \times n$, and the values in its first row are the weights corresponding to the bias; the hidden layer has $n+1$ units, of which the first is a bias unit with constant value $1$ that is not affected by forward passes. A forward pass is carried out by computing $h = v * w$ and then applying the activation function to $h$.
The following image, taken from holehouse.org, illustrates the second implementation.
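To see concretely that the two notations compute the same pre-activation, here is a minimal sketch with arbitrary layer sizes and random values (the names m, n, h1, h2 are purely illustrative): prepending a constant $1$ to $v$ and merging $b$ in as the first row of $w$ reproduces $v * w + b$ exactly.
set.seed(1)                       # arbitrary seed, for reproducibility only
m = 3; n = 4                      # illustrative layer sizes
v = matrix(runif(m), 1, m)        # 1 x m input
w = matrix(runif(m * n), m, n)    # m x n weights
b = matrix(runif(n), 1, n)        # 1 x n biases
h1 = v %*% w + b                  # first notation:  h = v * w + b
h2 = cbind(1, v) %*% rbind(b, w)  # second notation: bias row merged into w
all.equal(h1, h2)                 # TRUE: same pre-activation either way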
Both implementations are common, so interpret a question according to the notation it uses. Given the stated conditions, your question follows the first implementation. Suppose your $v$ is a one-unit vector $[2.8]$; the following is an R implementation of the forward pass.
# Element-wise logistic (sigmoid) activation; exp() is vectorized in R,
# so no explicit loop is needed
logistic <- function(vec) {
  1 / (1 + exp(-vec))
}
v = c(2.8)                      # input layer: one unit
w = c(0.12, 0.86, 0.20, 0.5)    # weights from the input unit to the 4 hidden units
b = c(7.12, -6.20, 0.90, -3.6)  # biases of the 4 hidden units
result = logistic(v %*% t(w) + b)   # forward pass: h = v * w + b
result
[,1] [,2] [,3] [,4]
[1,] 0.9994224 0.02205315 0.8115327 0.09975049
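As a sanity check, the first entry can be verified by hand: the pre-activation of the first hidden unit is $2.8 \times 0.12 + 7.12 = 7.456$, and applying the logistic function recovers the value above.
logistic(2.8 * 0.12 + 7.12)   # 0.9994224, matches result[1,1]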
Besides, if it is the second implementation, the input layer becomes $[1, 2.8]$, the biases are merged into the weight matrix, which becomes $\begin{bmatrix} 7.12 & -6.20 & 0.90 & -3.6 \\ 0.12 & 0.86 & 0.20 & 0.5 \end{bmatrix}$, and the hidden layer gains a bias unit.
v = c(1, 2.8)                  # input layer with a leading bias unit
w = matrix(nrow = 2, ncol = 4)
w[1, ] = c(7.12, -6.20, 0.90, -3.6)  # first row: weights from the bias unit
w[2, ] = c(0.12, 0.86, 0.20, 0.5)    # second row: weights from the input unit
result = logistic(v %*% w)     # forward pass: h = v * w
result
[,1] [,2] [,3] [,4]
[1,] 0.9994224 0.02205315 0.8115327 0.09975049
h = c(1, result)   # prepend the hidden layer's own bias unit
h
[1] 1.00000000 0.99942237 0.02205315 0.81153267 0.09975049
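As a quick check (reusing logistic from above; result1 is a hypothetical name for the first notation's output), dropping the bias unit from h recovers the first implementation's result exactly:
result1 = logistic(c(2.8) %*% t(c(0.12, 0.86, 0.20, 0.5)) + c(7.12, -6.20, 0.90, -3.6))
all.equal(as.numeric(h[-1]), as.numeric(result1))   # TRUE: the two notations agree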
Best Answer
Well, a table is definitely not the right way to look at this. Consider the function $f(x)=x^2$. I can create an infinite table of input-output pairs that is represented by this function; however, this function represents exactly one such table. Now consider this function: $g(x)=c\cdot x^2$. For different values of $c$, this function can represent an infinite number of tables (even an uncountable number).
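As a tiny illustration of this counting argument (the grid 1:5 and the values of $c$ are arbitrary choices), each value of $c$ in $g$ produces a different input-output table, while $f$ fixes exactly one:
f = function(x) x^2
g = function(x, c) c * x^2
x = 1:5
rbind(f = f(x), g_c2 = g(x, 2), g_c3 = g(x, 3))   # three distinct tables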
The best way that I can come up with to describe the information storage capacity of a neural network is to quote the universal approximation theorem: https://en.wikipedia.org/wiki/Universal_approximation_theorem. To summarize it: say we have an arbitrary continuous function whose output we want to approximate, and say that for every input, the output of our approximation must not deviate by more than some given $\epsilon>0$. Then we can create a neural network with a single hidden layer that satisfies this constraint, no matter the continuous function and no matter how small the error tolerance. The only caveat is that the number of nodes in the hidden layer may have to grow arbitrarily large as we choose the error tolerance smaller and smaller.
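To make the theorem tangible, here is a minimal sketch using the nnet package, which fits single-hidden-layer networks (the target function sin, the hidden-layer size of 10, and the training grid are arbitrary choices for illustration, not something the theorem prescribes):
library(nnet)                        # ships with R as a recommended package
set.seed(42)                         # arbitrary seed
x = seq(0, 2 * pi, length.out = 200)
y = sin(x)                           # the continuous function to approximate
# one hidden layer with 10 logistic units and a linear output unit
fit = nnet(x = matrix(x), y = y, size = 10, linout = TRUE,
           maxit = 1000, trace = FALSE)
max(abs(predict(fit, matrix(x)) - y))   # worst-case error on the training grid
Increasing size (the number of hidden nodes) lets the worst-case error be driven down further, which is exactly the trade-off the theorem describes.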