Solved – Error with lm in R

linear modellmr

I'm taking a class on R and I cannot get the professors code to work. I am trying to do a simple linear model and I run this code:

ozone<-read.table("http://www.ats.ucla.edu/stat/r/faq/ozone.csv", sep=",", header=T)

fit = lm(ozone~.,data=ozone)

summary(fit)

Which keeps giving me the following error:

Error in model.frame.default(formula = ozone ~ ., data = ozone, drop.unused.levels = TRUE) :
invalid type (list) for variable 'ozone

It's really depressing as they are the first two lines of code in his lecture notes. I have also found several other forum posts on this topic (it's even listed as a common r mistake), but I am too…special to figure out how to change it.

I tried reading it as.numeric, and as a data.frame, which is what most other threads suggested, but neither worked.

Best Answer

If we look on the first rows of the data frame ozone, we see that variable names (columns) are Station, Av8top, Lat and Lon.

> head(ozone)
  Station   Av8top      Lat       Lon
1      60 7.225806 34.13583 -117.9236
2      69 5.899194 34.17611 -118.3153
3      72 4.052885 33.82361 -118.1875
4      74 7.181452 34.19944 -118.5347
5      75 6.076613 34.06694 -117.7514
6      84 3.157258 33.92917 -118.2097

When you fit model with code lm(ozone~.,data=ozone), you are telling R that there is variable ozone (column) in your data that should be dependent variable. But there are no column named ozone. So either other column should be used as dependent variable or the file you are downloading is wrong.

Related Solutions

Solved – R: Calculating mean and standard error of mean for factors with lm() vs. direct calculation -edited

The difference in standard errors are because in the regression you compute a combined estimate of the variance, while in the other calculation you compute separate estimates of the variance.

Solved – How to use a linear model with two factors and repeated measures

You're right that not incorporating subject somehow will violate the assumptions of the model. In general, the point estimates will be unbiased, but your interval estimates (confidence intervals, p-values, etc.) will not have the right width. Typically, they would be too narrow, although they can be too wide sometimes. There are several ways to deal with this.

First, you could include subject as a factor in your model. That is, just as you estimate an effect of male vs. female, you estimate the effect of being subject A vs. being subject B, etc. This is a fixed effect. In your case, this really won't work because you don't have enough data on each subject and a model that tries to fit time, sex and subject will have used all its degrees of freedom and not be identifiable.

Another way to include subject is to use a random effect. What this means is that you consider the subjects in your study to be a random sample from a population of interest and instead of estimating an effect associated with each subject, you estimate the mean and variance (SD) of the population. With your four subjects, the fixed effect approach will consume 3 degrees of freedom, using random effects consumes (in some sense) 2 degrees of freedom. (Actually, that isn't quite true: the effect of the random effects on the model's degrees of freedom is a difficult topic, but that statement is good enough for our purposes.) Many models with repeated measures would be saturated or worse if the subjects were modeled as fixed effects and random effects are a common strategy for dealing with such situations.

A final strategy, since you have only two measures, is to subtract them and use the differences as your response. Then your test is a one-sample test of whether the differences are equal to zero. If they are significantly greater than 0, you know that afternoons are associated with more mistakes than mornings, and if the differences are <0, then vise versa. Taking differences uses each subject as their own control, so subject doesn't have to be controlled for with fixed or random effects. In your case, this approach will be a little more complicated because you also want to test for the effect of sex, but this version could still be used.

Some other notes: You will have great difficulty doing anything with those data (if they are actually your real data) because they are so few. In addition, ordinary linear models (i.e., lm) would not be appropriate for those data because they are counts. You should use a Poisson type model instead (i.e., glm, or in this case, glmer).

Here is a quick analysis in R, using random effects and a Poisson GLMM:

##### entering your data into a data frame:  
my.data = "Subject / Sex / morning / afternoon
A / M / 2 / 5
B / F / 1 / 4
C / M / 3 / 5
D / F / 1 / 5"
my.data = gsub(my.data, pattern="\n", replacement=" / ")
my.data = unlist(strsplit(my.data, split=" / "))
my.data = matrix(my.data, ncol=4, byrow=T)
colnames(my.data) = my.data[1,]
my.data = as.data.frame(my.data[-1,])
my.data$morning   = as.numeric(as.character(my.data$morning))
my.data$afternoon = as.numeric(as.character(my.data$afternoon))
my.data
#   Subject Sex morning afternoon
# 1       A   M       2         5
# 2       B   F       1         4
# 3       C   M       3         5
# 4       D   F       1         5

##### creating a 'long form' dataset:
library(reshape2)
my.data.l = melt(my.data, id.vars=c("Subject", "Sex"), 
                 measure.vars=c("morning", "afternoon"),
                 variable.name="time", value.name="mistakes")
my.data.l
#   Subject Sex      time mistakes
# 1       A   M   morning        2
# 2       B   F   morning        1
# 3       C   M   morning        3
# 4       D   F   morning        1
# 5       A   M afternoon        5
# 6       B   F afternoon        4
# 7       C   M afternoon        5
# 8       D   F afternoon        5
detach(package:reshape2, unload=T)

##### this is the analysis of your data:
library(lme4)
model = glmer(mistakes~time+Sex+(1|Subject), my.data.l, family=poisson)
summary(model)
# Generalized linear mixed model fit by maximum likelihood ['glmerMod']
#  Family: poisson ( log )
# Formula: mistakes ~ time + Sex + (1 | Subject) 
#    Data: my.data.l 
# 
#      AIC      BIC   logLik deviance 
#  32.3889  32.7067 -12.1945  24.3889 
# 
# Random effects:
#   Groups  Name        Variance  Std.Dev. 
#   Subject (Intercept) 5.704e-13 7.552e-07
# Number of obs: 8, groups: Subject, 4
# 
# Fixed effects:
#               Estimate Std. Error z value Pr(>|z|)  
# (Intercept)     0.3926     0.4419   0.888   0.3744  
# timeafternoon   0.9985     0.4421   2.258   0.0239 *
# SexM            0.3102     0.3970   0.781   0.4346  
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Correlation of Fixed Effects:
#             (Intr) tmftrn
# timeafternn -0.731       
# SexM        -0.518  0.000

Best Answer

Related Solutions

Solved – R: Calculating mean and standard error of mean for factors with lm() vs. direct calculation -edited

Solved – How to use a linear model with two factors and repeated measures

Related Question