Regression – Minimum Number of Observations for Multiple Linear Regression

multiple-regression, regression, t-test

I am doing multiple linear regression. I have 21 observations and 5 variables. My aim is simply to find the relationship between the variables.

  1. Is my data set large enough for multiple regression?
  2. The t-test results revealed that 3 of my variables are not significant. Do I need to run my regression again with only the significant variables, or is my first regression enough to draw conclusions?

My correlation matrix is as follows:

           var 1   var 2    var 3   var 4   var 5     Y
    var 1   1.0     0.0       0.0   -0.1    -0.3    -0.2
    var 2   0.0     1.0       0.4    0.3    -0.4    -0.4
    var 3   0.0     0.4       1.0    0.7    -0.7    -0.6
    var 4  -0.1     0.3       0.7    1.0    -0.7    -0.9
    var 5  -0.3    -0.4      -0.7   -0.7    1.0      0.8
    Y      -0.2    -0.4      -0.6   -0.9    0.8      1.0
    

var 1 and var 2 are continuous variables, var 3 to var 5 are categorical variables, and Y is my dependent variable.

It should be mentioned that the variable which the literature considers the most influential factor on my dependent variable is not among my regression variables, due to my data limitations. Does it still make sense to do the regression without this important variable?

Here are my confidence intervals:

    Variables   Regression Coefficient   Lower 95% C.L.   Upper 95% C.L.
    Intercept            53.61               38.46             68.76
    var 1                -0.39               -0.97              0.19
    var 2                -0.01               -0.03              0.01
    var 3                 5.28               -2.28             12.84
    var 4               -27.65              -37.04            -18.26
    var 5                11.52                0.90             22.15

Best Answer

The general rule of thumb (based on Frank Harrell's book, Regression Modeling Strategies) is that if you expect to be able to detect reasonable-size effects with reasonable power, you need 10-20 observations per parameter (covariate) estimated. Harrell discusses a lot of options for "dimension reduction" (getting your number of covariates down to a more reasonable size), such as PCA, but the most important thing is that, in order to have any confidence in the results, dimension reduction must be done without looking at the response variable. Doing the regression again with just the significant variables, as you suggest above, is in almost every case a bad idea.
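A minimal sketch of what that response-blind dimension reduction could look like, assuming the predictors sit in a NumPy array `X` and the response in `y` (the data below are random stand-ins, not your data set):

    import numpy as np
    import statsmodels.api as sm
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(21, 5))   # stand-in for the 21 x 5 predictor matrix
    y = rng.normal(size=21)        # stand-in for the response

    # PCA is fit on the predictors alone -- y is never consulted here,
    # which is what keeps the later inference honest.
    pca = PCA(n_components=2)      # 21 obs at ~10 per parameter -> ~2 components
    Z = pca.fit_transform(X)

    # Regress the response on the retained components only.
    fit = sm.OLS(y, sm.add_constant(Z)).fit()
    print(fit.summary())

The point of the ordering is that the components are chosen before the response ever enters the analysis, so the p-values and intervals from the final fit remain meaningful.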

However, since you're stuck with a data set and a set of covariates you're interested in, I don't think that running the multiple regression this way is inherently wrong. I think the best thing would be to accept the results as they are, from the full model (don't forget to look at the point estimates and confidence intervals to see whether the significant effects are estimated to be "large" in some real-world sense, and whether the non-significant effects are actually estimated to be smaller than the significant effects or not).
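A minimal sketch of reading effect sizes alongside significance from the full model, assuming a pandas DataFrame with columns named var1 through var5 and Y (the file name and column names here are hypothetical):

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("mydata.csv")  # placeholder for the actual 21-row data set

    # Wrap categorical predictors in C(...) if they have more than two levels.
    fit = smf.ols("Y ~ var1 + var2 + var3 + var4 + var5", data=df).fit()

    # Point estimates and 95% confidence intervals side by side: check whether
    # the significant effects are large in real-world units, and whether the
    # non-significant ones are actually estimated to be smaller.
    ci = fit.conf_int().rename(columns={0: "lower 95%", 1: "upper 95%"})
    print(pd.concat([fit.params.rename("estimate"), ci], axis=1))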

As to whether it makes any sense to do an analysis without the predictor that your field considers important: I don't know. It depends on what kind of inferences you want to make based on the model. In the narrow sense, the regression model is still well-defined ("what are the marginal effects of these predictors on this response?"), but someone in your field might quite rightly say that the analysis just doesn't make sense. It would help a little bit if you knew that the predictors you have are uncorrelated with the well-known predictor (whatever it is), or that the well-known predictor is constant or nearly constant for your data: then at least you could say that something other than the well-known predictor does have an effect on the response.
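A minimal sketch of that check, assuming values of the well-known predictor (or a rough proxy for it) can be obtained for the same 21 observations; every file and column name below is hypothetical:

    import numpy as np
    import pandas as pd

    df = pd.read_csv("mydata.csv")                       # placeholder data set
    known = pd.read_csv("known_predictor.csv")["value"]  # placeholder proxy

    # Omitted-variable bias in each coefficient scales with that predictor's
    # correlation with the omitted one, so small correlations are reassuring.
    for col in ["var1", "var2", "var3", "var4", "var5"]:
        r = np.corrcoef(df[col], known)[0, 1]
        print(f"{col}: correlation with omitted predictor = {r:+.2f}")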
