Solved – lm() producing many NAs for coefficients

r, regression

I am trying to run a regression using about 80 independent variables. The problem is that the last 20+ coefficients return NA. If I condense the range of data to within 60, I get coefficients for everything just fine. Why am I getting these NAs and how do I resolve it? I need to reproduce this code using all of these variables.

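composite <- read.csv("file.csv", header = TRUE, stringsAsFactors = FALSE)
# Drop the last column, then the Date column
composite <- subset(composite, select = -ncol(composite))
composite <- subset(composite, select = -Date)
# Regress the response (named indepvariable here) on every remaining column
model1 <- lm(indepvariable ~ ., data = composite, na.action = na.exclude)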

composite is a data frame with 82 variables.

UPDATE:

I have found a way to build an object containing only the strongly correlated variables, to narrow down the number of independent variables.

I now have a variable, sigvars, holding the names of the variables whose correlation with the first column is greater than 0.5 or less than -0.5, picked out of a sorted column of the correlation matrix. Here is the code:

sortedcor <- sort(cor(composite)[,1])
regvar <- NULL

k <- 1
for (i in seq_along(sortedcor)) {
  if (sortedcor[i] > 0.5 | sortedcor[i] < -0.5) {
    regvar[k] <- i
    k <- k + 1
  }
}
regvar

sigvars <- names(sortedcor[regvar])

However, it does not work in my lm() call:

model1 <- lm(data.matrix(composite[1]) ~ sigvars, data = composite)

Error: Error in model.frame.default(formula = data.matrix(composite[1]) ~ sigvars, :
variable lengths differ (found for 'sigvars')
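
The error above arises because sigvars is a character vector of column names, so lm() treats it as a single predictor whose length (the number of selected names) differs from the number of rows. One possible way around it is to build the formula from the names with reformulate(). The following is a minimal sketch on simulated data; the toy data frame and its columns y, a, b and c are assumptions for illustration, not the original file.

```r
# Toy stand-in for the question's `composite`: the response is in column 1.
set.seed(42)
composite <- data.frame(y = rnorm(30))
composite$a <- composite$y + rnorm(30, sd = 0.3)    # strongly positively correlated
composite$b <- -composite$y + rnorm(30, sd = 0.3)   # strongly negatively correlated
composite$c <- rnorm(30)                            # unrelated noise

# Correlation of every other column with the response (row 1 is the
# response's correlation with itself, so it is dropped).
cors <- cor(composite)[-1, 1]
sigvars <- names(cors)[abs(cors) > 0.5]

# Turn the character vector into a formula such as y ~ a + b, then fit.
f <- reformulate(sigvars, response = names(composite)[1])
model1 <- lm(f, data = composite)
coef(model1)
```

Excluding the first row matters: the response always has correlation 1 with itself, so without that step it would be selected as its own predictor.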

Best Answer

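You're likely running into a degrees of freedom problem due to your high number of independent variables. lm() reports NA for any coefficient it cannot estimate, which happens when there are more coefficients than usable observations or when some predictors are exact linear combinations of others.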
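
To make the degrees of freedom point concrete, here is a small sketch on simulated data (the data frames toy and toy2 are made up for illustration). It shows lm() returning NA once the number of coefficients exceeds the number of rows, and again when one predictor is an exact multiple of another. Note that with na.action = na.exclude, rows containing any missing value are dropped before fitting, so the usable number of observations can be smaller than nrow() of the data.

```r
# Simulated data only: 10 observations, 12 candidate predictors.
set.seed(1)
n <- 10
p <- 12
toy <- as.data.frame(matrix(rnorm(n * p), nrow = n))
toy$y <- rnorm(n)

fit <- lm(y ~ ., data = toy)

# Only as many coefficients as there are rows can be estimated;
# lm() reports the remaining ones as NA.
n_na <- sum(is.na(coef(fit)))
n_na               # number of coefficients that could not be estimated
fit$rank           # rank of the design matrix
df.residual(fit)   # residual degrees of freedom
nobs(fit)          # rows actually used in the fit

# Exact collinearity also yields NA, even with plenty of rows.
toy2 <- data.frame(x1 = rnorm(50))
toy2$x2 <- 2 * toy2$x1             # exact linear function of x1
toy2$y  <- toy2$x1 + rnorm(50)
fit2 <- lm(y ~ x1 + x2, data = toy2)
coef(fit2)
alias(fit2)        # lists terms that are linear combinations of others
```

Comparing nobs() with the number of coefficients, and running alias() on the fitted model, are two quick ways to tell which of the two situations applies.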

However, you can easily check for other problems with your regression using the gvlma package. It implements a highly lauded set of tests, collectively called Global Validation of Linear Model Assumptions, that was published in the Journal of the American Statistical Association.
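
A minimal usage sketch, assuming the gvlma package is installed from CRAN; the model on the built-in mtcars data is only a stand-in for the regression in the question.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # built-in example data

if (requireNamespace("gvlma", quietly = TRUE)) {
  # Global test plus skewness, kurtosis, link function
  # and heteroscedasticity checks
  print(gvlma::gvlma(fit))
} else {
  message("Package 'gvlma' is not installed; try install.packages('gvlma').")
}
```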

Aside from degrees of freedom, you should inspect the underlying data for errors, look at multicollinearity in particular, and run other routine diagnostics such as examining the residuals (you can get them via residuals(), summary(), or the model object itself) and running a misspecification test (e.g. Ramsey's RESET test, available as resettest in the lmtest package).
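
These routine checks can be sketched in base R as follows. The model on mtcars is a stand-in for the question's regression, the variance inflation factors are computed by hand, and the last step is a hand-rolled RESET-style test (adding powers of the fitted values) rather than a call to lmtest, so treat it as an illustration under those assumptions.

```r
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)   # built-in example data

# 1. Residuals: via the extractor or straight from the model object
res <- residuals(fit)
head(fit$residuals)
summary(res)

# 2. Multicollinearity: pairwise correlations among the predictors, and
#    variance inflation factors computed by hand as 1 / (1 - R^2_j),
#    where R^2_j comes from regressing predictor j on the others.
X <- mtcars[, c("wt", "hp", "disp")]
round(cor(X), 2)
vif <- sapply(names(X), function(v) {
  f <- reformulate(setdiff(names(X), v), response = v)
  1 / (1 - summary(lm(f, data = X))$r.squared)
})
vif

# 3. RESET-style misspecification check: add the squared and cubed
#    fitted values and test whether they improve the fit.
d <- transform(mtcars, yhat2 = fitted(fit)^2, yhat3 = fitted(fit)^3)
aug <- lm(mpg ~ wt + hp + disp + yhat2 + yhat3, data = d)
reset <- anova(fit, aug)
reset
```

A small p-value in the anova() comparison suggests the linear specification is missing some nonlinear structure; large VIF values point to predictors that are nearly linear combinations of the others.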
