Solved – use date and time in a linear model in R

r, regression

I'm trying to make a model of copepod counts made once a day, at varying times every day, over a 1 year period and under seasonally varying oxygen concentrations. I'm basically trying to see if count values are best predicted by time of day, time of year, or oxygen. As oxygen and time of year are correlated, I may end up dropping one of these variables.

I'm trying to run a regression in R. It works fine if only oxygen is included, but I think both date and time are being treated as factors instead of as numbers. The summary gives me a p-value for every day in the year, but there is only one observation per day, so I don't think that makes sense. The overall p-value at the end of the summary is also suspiciously high (0.75) when I model oxygen with date as the only predictor, as I know for certain that they co-vary.

Is it even a good idea to run a regression with dates and times?

Is this type of output (p values for every day and every time) to be expected?

Is there a certain format that would work? I currently have dates as "2010-Oct-18" and times as "13:37:17", for example.

Best Answer

I do not have enough reputation to comment, so I'll post this as an answer. I suggest you convert the date and time to a single numeric timestamp (seconds since Jan 1, 1970, for example). This will allow you to investigate relationships that are linear in time.

For periodic relations (time of day or time of year), use the timestamp minus the timestamp of midnight on the same day (for time of day), or minus the timestamp of midnight on Jan 1 of the same year (for time of year).
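As a rough sketch of the conversion described above, in base R: the data frame `copepods`, its column names, and every value in it are made up for illustration; only the date and time formats come from the question. Note that `%b` matches abbreviated month names in the current locale, so the sketch sets `LC_TIME` to `"C"` to get English names, and it assumes all times are in UTC.

```r
# Hypothetical data in the formats given in the question
copepods <- data.frame(
  date   = c("2010-Oct-18", "2010-Dec-02", "2011-Jan-20", "2011-Mar-15",
             "2011-May-09", "2011-Jul-04", "2011-Aug-22", "2011-Sep-30"),
  time   = c("13:37:17", "09:05:00", "15:20:45", "11:00:30",
             "08:15:10", "17:45:00", "10:30:20", "14:10:05"),
  oxygen = c(6.1, 7.4, 8.0, 7.2, 6.0, 4.3, 3.9, 4.8),
  count  = c(42, 55, 61, 50, 47, 23, 19, 30),
  stringsAsFactors = FALSE
)

# Make %b parse English month abbreviations regardless of system locale
invisible(Sys.setlocale("LC_TIME", "C"))

# Combine date and time into one date-time value (UTC is an assumption)
stamp <- as.POSIXct(paste(copepods$date, copepods$time),
                    format = "%Y-%b-%d %H:%M:%S", tz = "UTC")

# Seconds since 1970-01-01 00:00:00 UTC
copepods$timestamp <- as.numeric(stamp)

# Seconds since midnight of the same day
midnight <- as.POSIXct(format(stamp, "%Y-%m-%d"),
                       format = "%Y-%m-%d", tz = "UTC")
copepods$sec_of_day <- as.numeric(difftime(stamp, midnight, units = "secs"))

# Seconds since midnight on Jan 1 of the same year
jan1 <- as.POSIXct(paste0(format(stamp, "%Y"), "-01-01"),
                   format = "%Y-%m-%d", tz = "UTC")
copepods$sec_of_year <- as.numeric(difftime(stamp, jan1, units = "secs"))

# All predictors are numeric now, so lm() estimates one slope per variable
fit <- lm(count ~ oxygen + sec_of_day + sec_of_year, data = copepods)
coef(fit)
```

Because the predictors are numeric rather than character or factor columns, the summary should contain a single coefficient for each of them instead of one per distinct day or time.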

Related Solutions

Solved – Seasonally adjusted month-to-month growth with underlying weekly seasonality

I model this kind of data all the time. You need to incorporate

day-of-the-week
holiday effects (lead, contemporaneous and lag effects)
special days-of-the-month
perhaps Friday before a holiday or a Monday after a holiday
weekly effects
monthly effects
ARIMA structure to render the errors white noise
etc.
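To make a few of the items in this list concrete, here is a minimal sketch using only base R. It is a plain regression with AR(1) errors fitted by `stats::arima`, not the full transfer function modelling with intervention detection; the simulated series, the holiday dates, and all variable names are assumptions invented for the example.

```r
set.seed(1)

# Simulated daily series covering 22 full weeks (hypothetical data)
n     <- 154
dates <- seq(as.Date("2020-01-06"), by = "day", length.out = n)

# Day-of-the-week factor (wday: 0 = Sunday, ..., 6 = Saturday)
dow <- factor(as.POSIXlt(dates)$wday, levels = 0:6,
              labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))

# Holiday indicator plus lead (day before) and lag (day after) indicators
holiday      <- as.numeric(dates %in% as.Date(c("2020-02-17", "2020-04-10",
                                                "2020-05-25")))
holiday_lead <- c(holiday[-1], 0)
holiday_lag  <- c(0, holiday[-n])

# Regressor matrix: six day-of-week dummies (Sunday is the baseline)
# and the three holiday columns
X <- cbind(model.matrix(~ dow)[, -1],
           holiday = holiday,
           holiday_lead = holiday_lead,
           holiday_lag = holiday_lag)

# Simulated response: Saturday bump, holiday dip, AR(1) noise
y <- 100 + 10 * (dow == "Sat") - 20 * holiday +
  as.numeric(arima.sim(list(ar = 0.5), n = n))

# Regression on X with AR(1) errors
fit <- arima(y, order = c(1, 0, 0), xreg = X)
fit

# Ljung-Box test: are the residuals plausibly white noise?
Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 1)
```

In practice the ARIMA order and the set of regressors would be chosen from the data rather than fixed in advance as they are here.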

The statistical approach is called Transfer Function Modelling with Intervention Detection. If you want to share your data, either privately via dave@autobox.com or preferably via SE, I would be more than glad to show you the specifics of a final model and to further your ability to do it yourself, or at least to help you and others understand what needs to be done and what can be done. In either case you come out smarter without spending any treasure, be it coin or time. You might read some of my other responses to time series questions to learn more.

Solved – Building a time series that includes multiple observations for each date

Depending on what exactly you mean by "3 reps per quarter", a panel data (Wikipedia) model may make sense. This would mean that you're taking three measurements every quarter, one from each of three distinct sources that stay the same over time. Your data would look something like:

obs quarter value
  A       1   2.2 
  A       2   2.3 
  A       3   2.4 
  B       1   1.8 
  B       2   1.7 
  B       3   1.9 
  C       1   3.3 
  C       2   3.4 
  C       3   3.5

If this is what you're looking at, there are a number of models for working with panel data. Here's a decent presentation that covers some of the basic R that you would use to look at panel data. This document goes into a little more depth, albeit from an econometrics standpoint.

However, if your data doesn't quite fit with panel data methodologies, there are other tools available for "pooled data". A definition from this paper (pdf):

Pooling of data means statistical analysis using multiple data sources relating to multiple populations. It encompasses averaging, comparisons and common interpretations of the information. Different scenarios and issues also arise depending on whether the data sources and populations involved are same/similar or different.

As that definition suggests, the techniques you use will depend on what exactly you expect to learn from your data.

If I were to suggest a place for you to start, assuming that your three draws for each quarter are consistent over time, I would say start by using a fixed effects estimator (also known as the within estimator) with a panel data model of your data.

For my example above, the code would look something like:

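# Example panel: three sources (A, B, C), each measured in three quarters
Panel <- data.frame(value = c(2.2, 2.3, 2.4, 1.8, 1.7, 1.9, 3.3, 3.4, 3.5),
                    quarter = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                    obs = c("A", "A", "A", "B", "B", "B", "C", "C", "C"))
# Fixed effects via one dummy variable per source (A is the baseline)
fixed.dum <- lm(value ~ quarter + factor(obs), data = Panel)
summary(fixed.dum)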

Which gives us the following output:

Call:
lm(formula = value ~ quarter + factor(obs), data = Panel)

Residuals:
         1          2          3          4          5          6          7 
-1.667e-02 -8.940e-17  1.667e-02  8.333e-02 -1.000e-01  1.667e-02 -1.667e-02 
         8          9 
 1.162e-16  1.667e-02 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.13333    0.06055  35.231 3.47e-07 ***
quarter       0.08333    0.02472   3.371 0.019868 *  
factor(obs)B -0.50000    0.04944 -10.113 0.000162 ***
factor(obs)C  1.10000    0.04944  22.249 3.41e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.06055 on 5 degrees of freedom
Multiple R-squared: 0.9955, Adjusted R-squared: 0.9928 
F-statistic: 369.2 on 3 and 5 DF,  p-value: 2.753e-06

Here we can clearly see the effect of time in the coefficient on the quarter variable, as well as the effect of being in group B, or group C (as opposed to group A).
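The name "within estimator" can be illustrated with a small sketch: subtracting each group's mean from both `value` and `quarter` and regressing the demeaned variables on each other should reproduce the `quarter` coefficient from the dummy-variable model. The names `value_dm`, `quarter_dm`, `dummy_fit` and `within_fit` are introduced here only for illustration.

```r
# Same example panel as in the dummy-variable regression
Panel <- data.frame(
  value   = c(2.2, 2.3, 2.4, 1.8, 1.7, 1.9, 3.3, 3.4, 3.5),
  quarter = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  obs     = c("A", "A", "A", "B", "B", "B", "C", "C", "C")
)

# Dummy-variable version of the fixed effects model
dummy_fit <- lm(value ~ quarter + factor(obs), data = Panel)

# Within transformation: subtract each group's mean (ave() defaults to mean)
Panel$value_dm   <- Panel$value   - ave(Panel$value,   Panel$obs)
Panel$quarter_dm <- Panel$quarter - ave(Panel$quarter, Panel$obs)

# Regress demeaned value on demeaned quarter, without an intercept
within_fit <- lm(value_dm ~ quarter_dm - 1, data = Panel)

# The two slope estimates agree
coef(dummy_fit)["quarter"]
coef(within_fit)
```

The standard errors reported by `summary(within_fit)` will differ from those of the dummy-variable model, because the demeaned regression does not account for the degrees of freedom used up by the group means.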

Hope this points you somewhere in the right direction.
