Solved – Sum of variances from regression coefficients larger than total variance – why?

regression, variance

For a linear regression fit on a problem with p variables X_i, each ranging between 0 and 1, where p > 20 (I don't know if that is relevant or not) and with about 1000 samples, I wanted to estimate the variance contribution of each variable using the regression coefficients. If I understood correctly, var(A*X) = A^2*var(X), so I thought that squaring each regression coefficient and multiplying it by the variance of the corresponding variable should give a vector containing the variance contributions of the different variables.
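The scaling rule itself is easy to check numerically. A minimal NumPy sketch (NumPy standing in for MATLAB here; the coefficient value 3.7 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 1000)   # one predictor ranging between 0 and 1
a = 3.7                        # an arbitrary regression coefficient

# var(a*X) should equal a^2 * var(X)
print(np.var(a * x), a**2 * np.var(x))
```

The two printed values agree, so the per-variable scaling rule is not where the problem lies.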

The problem is that I expected the sum of those variance contributions to equal the total variance of the regression model, but it doesn't. The sum of the variances is sometimes up to 50% larger than the total variance of the regression model.

Here is some MATLAB-like pseudocode to explain the problem in more detail.

 X    %sample matrix
 Y    %output sample vector

 Linmodel = polyfitn(X,Y);   %fit a linear model
 for ii = 1:nr_colsX
      VARCONT(ii) = var(Linmodel.COEF(ii)*X(:,ii));       %variance of each contribution
      VARRC(ii)   = (Linmodel.COEF(ii))^2 * var(X(:,ii)); %variance based on reg. coef.
 end

 SVC    = sum(VARCONT);               %sum of the variance contributions
 SVSRC  = sum(VARRC);
 VY     = var(Y);                     %variance of the output samples
 VYmod  = var(polyvaln(Linmodel,X));  %variance of the model on the samples
 XR     = rand(100000,nr_colsX);      %large set of independent random samples
 VYmodR = var(polyvaln(Linmodel,XR)); %variance of the model on those samples

For one of the models that is supposed to be almost linear, VY is almost equal to VYmod, but SVC is about 50% larger than that, and VYmodR comes out closer to SVC.

1) Could somebody please explain to me why the sum of the variance contributions from the regression coefficients can be quite a bit larger than the variance of the regression model?

2) If this is indeed the case, should there not then be some sort of upper bound on the sum of the squared regression coefficients, such that their sum, multiplied by the variances of the inputs, cannot be larger than the total variance of the output? It seems strange to me that an interpolation model could produce a larger variance than the variance of the output data points used for the interpolation.

Any help is highly appreciated, but help written in a way that even a silly engineer like me can understand is appreciated even more.

Best Answer

Let $X_1$ and $X_2$ denote two random variables with variances $\sigma^2_{X1}$ and $\sigma^2_{X2}$. Let $Y = X_1 + X_2$. The variance of $Y$ is then equal to $\sigma^2_{X1} + \sigma^2_{X2} + 2\rho \sigma_{X1} \sigma_{X2}$, where $\rho$ is the correlation between $X_1$ and $X_2$. So, unless the two variables are uncorrelated, the sum of the variances will not be equal to the variance of $Y$. Depending on the sign of $\rho$, the actual variance of $Y$ may be larger or smaller than the sum of $\sigma^2_{X1}$ and $\sigma^2_{X2}$.
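The identity above is easy to verify numerically; a minimal NumPy sketch with two deliberately correlated variables (the mixing weights 0.6 and 0.8 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(0, 1, 100000)
x2 = 0.6 * x1 + 0.8 * rng.normal(0, 1, 100000)  # correlated with x1
y = x1 + x2

s1, s2 = x1.std(), x2.std()
rho = np.corrcoef(x1, x2)[0, 1]
print(y.var(), s1**2 + s2**2 + 2 * rho * s1 * s2)
```

The two printed values agree, while the plain sum $s_1^2 + s_2^2$ falls short because $\rho > 0$ here; with a negative $\rho$ the plain sum would overshoot, which is exactly the questioner's situation.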
