You're right that not incorporating subject
somehow will violate the assumptions of the model. In general, the point estimates will be unbiased, but your interval estimates (confidence intervals, p-values, etc.) will not have the right width. Typically, they would be too narrow, although they can be too wide sometimes. There are several ways to deal with this.
First, you could include subject
as a factor in your model. That is, just as you estimate an effect of male
vs. female
, you estimate the effect of being subject A
vs. being subject B
, etc. This is a fixed effect. In your case, this really won't work because you don't have enough data on each subject and a model that tries to fit time
, sex
and subject
will have used all its degrees of freedom and not be identifiable.
Another way to include subject
is to use a random effect. What this means is that you consider the subjects in your study to be a random sample from a population of interest and instead of estimating an effect associated with each subject, you estimate the mean and variance (SD) of the population. With your four subjects, the fixed effect approach will consume 3 degrees of freedom, using random effects consumes (in some sense) 2 degrees of freedom. (Actually, that isn't quite true: the effect of the random effects on the model's degrees of freedom is a difficult topic, but that statement is good enough for our purposes.) Many models with repeated measures would be saturated or worse if the subjects were modeled as fixed effects and random effects are a common strategy for dealing with such situations.
A final strategy, since you have only two measures, is to subtract them and use the differences as your response. Then your test is a one-sample test of whether the differences are equal to zero. If they are significantly greater than 0, you know that afternoons are associated with more mistakes than mornings, and if the differences are <0, then vise versa. Taking differences uses each subject as their own control, so subject
doesn't have to be controlled for with fixed or random effects. In your case, this approach will be a little more complicated because you also want to test for the effect of sex
, but this version could still be used.
Some other notes: You will have great difficulty doing anything with those data (if they are actually your real data) because they are so few. In addition, ordinary linear models (i.e., lm
) would not be appropriate for those data because they are counts. You should use a Poisson type model instead (i.e., glm
, or in this case, glmer
).
Here is a quick analysis in R, using random effects and a Poisson GLMM:
##### entering your data into a data frame:
my.data = "Subject / Sex / morning / afternoon
A / M / 2 / 5
B / F / 1 / 4
C / M / 3 / 5
D / F / 1 / 5"
my.data = gsub(my.data, pattern="\n", replacement=" / ")
my.data = unlist(strsplit(my.data, split=" / "))
my.data = matrix(my.data, ncol=4, byrow=T)
colnames(my.data) = my.data[1,]
my.data = as.data.frame(my.data[-1,])
my.data$morning = as.numeric(as.character(my.data$morning))
my.data$afternoon = as.numeric(as.character(my.data$afternoon))
my.data
# Subject Sex morning afternoon
# 1 A M 2 5
# 2 B F 1 4
# 3 C M 3 5
# 4 D F 1 5
##### creating a 'long form' dataset:
library(reshape2)
my.data.l = melt(my.data, id.vars=c("Subject", "Sex"),
measure.vars=c("morning", "afternoon"),
variable.name="time", value.name="mistakes")
my.data.l
# Subject Sex time mistakes
# 1 A M morning 2
# 2 B F morning 1
# 3 C M morning 3
# 4 D F morning 1
# 5 A M afternoon 5
# 6 B F afternoon 4
# 7 C M afternoon 5
# 8 D F afternoon 5
detach(package:reshape2, unload=T)
##### this is the analysis of your data:
library(lme4)
model = glmer(mistakes~time+Sex+(1|Subject), my.data.l, family=poisson)
summary(model)
# Generalized linear mixed model fit by maximum likelihood ['glmerMod']
# Family: poisson ( log )
# Formula: mistakes ~ time + Sex + (1 | Subject)
# Data: my.data.l
#
# AIC BIC logLik deviance
# 32.3889 32.7067 -12.1945 24.3889
#
# Random effects:
# Groups Name Variance Std.Dev.
# Subject (Intercept) 5.704e-13 7.552e-07
# Number of obs: 8, groups: Subject, 4
#
# Fixed effects:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 0.3926 0.4419 0.888 0.3744
# timeafternoon 0.9985 0.4421 2.258 0.0239 *
# SexM 0.3102 0.3970 0.781 0.4346
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Correlation of Fixed Effects:
# (Intr) tmftrn
# timeafternn -0.731
# SexM -0.518 0.000
Best Answer
If we look on the first rows of the data frame
ozone
, we see that variable names (columns) areStation
,Av8top
,Lat
andLon
.When you fit model with code
lm(ozone~.,data=ozone)
, you are telling R that there is variableozone
(column) in your data that should be dependent variable. But there are no column namedozone
. So either other column should be used as dependent variable or the file you are downloading is wrong.