Solved – Correlation matrix and redundant information

correlation, neural networks

I am using a neural network model for a classification task with 13 inputs.
I examine the connection weights to identify the most relevant variables. I have also computed a correlation matrix to check the relationships between the inputs:
[Correlation matrix of the 13 input variables]

Some groups of variables show strong positive or negative relationships. My concern is that I may have to remove some of them because they are redundant. On the other hand, I am tempted to keep them all and let the network decide by itself which ones to use.
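For reference, here is a minimal sketch of how such a check could be done with pandas; the DataFrame and the 0.8 threshold are placeholders, not part of my actual setup:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 13 input variables; replace with your data.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 13)),
                 columns=[f"x{i}" for i in range(13)])

corr = X.corr()  # Pearson correlation matrix

# Keep only the upper triangle, then list pairs above a chosen threshold.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strong = upper.stack()
print(strong[strong.abs() > 0.8].sort_values(ascending=False))
```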

Is it generally advised to remove redundant information (if highly correlated) when training neural networks?

My study aims at identifying the best variables to use (for similar future classification tasks) so that we obtain the best prediction performance in the end. To this end, I removed some of the highly correlated variables, but I got lower prediction accuracy.

Best Answer

Is it generally advised to remove redundant information (if highly correlated) when training neural networks?

It depends.

Is it necessary? No.

Since a neural network with an appropriate architecture can model any (!) function, you can safely assume that it also could first model the PCA and then do whatever it also should do -- e.g. classification, regression, etc. (source)

and

In principle, the linear transformation performed by PCA can be performed just as well by the input layer weights of the neural network, so it isn't strictly speaking necessary (source)

This is because neural nets can themselves be used as a non-linear dimensionality reduction tool:

High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors (source)

In this context, it is also worth mentioning auto-encoders.
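As an illustration, here is a minimal auto-encoder sketch in Keras: the layer sizes, the bottleneck of 4 units, and the random placeholder data are arbitrary choices, not tuned to your problem.

```python
import numpy as np
from tensorflow import keras

n_inputs = 13   # matches the number of input variables in the question
code_size = 4   # size of the small central layer (arbitrary choice)

inputs = keras.Input(shape=(n_inputs,))
h = keras.layers.Dense(8, activation="relu")(inputs)
code = keras.layers.Dense(code_size, activation="relu")(h)   # bottleneck
h = keras.layers.Dense(8, activation="relu")(code)
outputs = keras.layers.Dense(n_inputs)(h)                    # linear reconstruction

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, code)   # maps inputs to their low-dimensional codes
autoencoder.compile(optimizer="adam", loss="mse")

# Placeholder data; in practice use your standardised 13 input variables.
X = np.random.default_rng(0).normal(size=(500, n_inputs))
autoencoder.fit(X, X, epochs=20, batch_size=32, verbose=0)

codes = encoder.predict(X, verbose=0)
print(codes.shape)   # (500, 4)
```

The network is trained to reconstruct its own inputs through the small central layer, so the activations of that layer are the learned non-linear low-dimensional codes.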

Can it help? Yes, it can speed things up.

However, as the number of weights in the network increases, the amount of data needed to be able to reliably determine the weights of the network also increases (often quite rapidly), and over-fitting becomes more of an issue (using regularisation is also a good idea). The benefit of dimensionality reduction is that it reduces the size of the network, and hence the amount of data needed to train it (source)
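For illustration, one common way to set this up is PCA as a preprocessing step before a small network; this scikit-learn sketch uses synthetic data and an arbitrary choice of 5 components, purely to show the wiring:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 13 correlated inputs, binary target.
X, y = make_classification(n_samples=600, n_features=13, n_informative=5,
                           n_redundant=4, random_state=0)

# Standardise, project onto a few principal components, then fit a small MLP.
# Fewer inputs -> fewer first-layer weights -> less data needed to fit them.
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=5),
                     MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                                   random_state=0))
print(cross_val_score(pipe, X, y, cv=5).mean())
```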

Speed comes at a cost and bears a risk

The disadvantage of using PCA is that the discriminative information that distinguishes one class from another might be in the low variance components, so using PCA can make performance worse (source)

This may well be what you experienced in your experiment.
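A minimal synthetic illustration of that failure mode (toy data, and a logistic regression instead of a neural net, purely for brevity): the class label lives entirely in a low-variance feature, so projecting onto the top principal component throws the signal away.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, n)
X = np.column_stack([
    rng.normal(0.0, 10.0, n),      # high variance, no class information
    y + rng.normal(0.0, 0.1, n),   # low variance, fully determines the class
])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

print("all features :", LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te))

pca = PCA(n_components=1).fit(X_tr)               # keeps the noisy direction
Z_tr, Z_te = pca.transform(X_tr), pca.transform(X_te)
print("top PCA comp :", LogisticRegression().fit(Z_tr, y_tr).score(Z_te, y_te))
```

On the full data the classifier is nearly perfect; on the single retained component it is at chance level.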