R – How to Analyze Longitudinal Data in R

panel data, r, regression, repeated measures

I am trying to examine how an athlete’s performance influences their articulations on Twitter on specific dimensions of research interest (e.g., use of ‘we’ personal pronoun).

I have all the tweets of over 100 athletes, along with their timestamps. I have a frequency count of ‘we’ for every tweet. For every athlete, I also have a performance track record, i.e., the contests they participated in, the date of each contest, and the result (win/loss/draw). I would like to statistically analyze the effect of performance on the use of ‘we’ and test hypotheses such as the following:

Hypothesis. Win (loss) decreases (increases) the likelihood of using ‘we’ in tweets.

I would like to understand how to analyze this kind of time series data. How should I structure the data for such an analysis in R or Python? What regression models are most appropriate for such an analysis in R?

Best Answer

Model Formulation

One way forward may be to aggregate tweets by week and count the number of occurrences of "we", adjusting for wins and losses and using an offset to account for the number of tweets made.

If your hypothesis is that "wins and losses affect how frequently an athlete refers to the team collectively as 'we'", then it might be sensible to structure your data as follows:

Week   # We   Tweets   Wins   Losses   ID
0      ...    ...      ...    ...      ...
1      ...    ...      ...    ...      ...
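Since the question also asks how to structure the data in Python, here is a minimal sketch of building that athlete-week table from per-tweet and per-contest records, using only the standard library. The record layouts, the example values, and the `START` date are assumptions made up for illustration, not part of the original data.

```python
from collections import defaultdict
from datetime import date

# Hypothetical per-tweet records: (athlete_id, tweet_date, count_of_we)
tweets = [
    (1, date(2020, 1, 6), 2),
    (1, date(2020, 1, 8), 0),
    (1, date(2020, 1, 14), 1),
    (2, date(2020, 1, 7), 0),
]
# Hypothetical contest records: (athlete_id, contest_date, result)
contests = [
    (1, date(2020, 1, 7), "win"),
    (1, date(2020, 1, 15), "loss"),
    (2, date(2020, 1, 9), "win"),
]

START = date(2020, 1, 6)  # assumed first day of week 0


def week_index(d):
    """Whole weeks elapsed since START."""
    return (d - START).days // 7


def new_row():
    return {"n_we": 0, "tweets": 0, "wins": 0, "losses": 0}


# One row per (week, athlete) pair
rows = defaultdict(new_row)

for athlete, day, n_we in tweets:
    row = rows[(week_index(day), athlete)]
    row["n_we"] += n_we   # outcome: total uses of "we" that week
    row["tweets"] += 1    # exposure: number of tweets that week

for athlete, day, result in contests:
    row = rows[(week_index(day), athlete)]
    if result == "win":
        row["wins"] += 1
    elif result == "loss":
        row["losses"] += 1
    # draws are ignored in this sketch

table = [
    {"week": week, "id": athlete, **counts}
    for (week, athlete), counts in sorted(rows.items())
]
for r in table:
    print(r)
```

Note that an athlete-week with contests but no tweets ends up with `tweets == 0`; such rows would need to be dropped or handled separately before fitting, because the offset uses the logarithm of the tweet count.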

Here, "# We" is the outcome (which I will refer to as $y$). Even under the null hypothesis (wins and losses do not affect the frequency of $y$), the count can nonetheless increase or decrease simply because the athlete tweets more or less. Thus, we need to account for the number of tweets somehow.

A typical model for count data is Poisson regression. We could model the frequency of $y$ as follows, where $i$ indexes athletes and $j$ indexes weeks:

$$ \log(E(y_{i, j})) = \beta_{0, i} + \beta_1 \mbox{week}_{i,j} + \beta_2\mbox{Wins}_{i,j} + \beta_3 \mbox{Losses}_{i,j} + \log(\mbox{Tweets}_{i,j}) $$

There are a few important things to note here:

  1. Each athlete has their own intercept in this model $\beta_{0,i}$. This sort of model is known as a mixed effects model and can account for the longitudinal nature of the data.

  2. $\log(\mbox{Tweets})$ does not have a coefficient. This is known as an offset and accounts for an increase in the frequency of $y$ simply by increasing the number of tweets.
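To see why the offset does this, move $\log(\mbox{Tweets}_{i,j})$ to the left-hand side of the model equation. This is only a rearrangement of the formula above, not an additional assumption:

$$ \log\left(\frac{E(y_{i,j})}{\mbox{Tweets}_{i,j}}\right) = \beta_{0, i} + \beta_1 \mbox{week}_{i,j} + \beta_2\mbox{Wins}_{i,j} + \beta_3 \mbox{Losses}_{i,j} $$

In other words, the model describes the expected number of uses of "we" per tweet. Under this reading, $\exp(\beta_2)$ is the multiplicative change in that per-tweet rate associated with one additional win, and $\exp(\beta_3)$ is the same for one additional loss.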

You could run this regression and examine the coefficients $\beta_2$ and $\beta_3$ to evaluate your hypotheses. However, there are additional considerations before moving forward.

  1. A random intercept (one for each athlete) is the minimum you can do to account for the longitudinal nature of the data. It may be the case that different athletes are affected by wins and losses differently, in which case a random slope for these covariates may be more appropriate.

  2. The effect of time here is, in my opinion, something which can't be ignored, but a linear effect may be too limiting, depending on the size of your data. A spline or generalized additive model may be more appropriate.
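As a sketch of the first point, one possible random-slope version of the model lets the win and loss coefficients vary by athlete. The distributional assumption below is the usual one for mixed models, stated here as an assumption rather than something derived from your data:

$$ \log(E(y_{i, j})) = \beta_{0, i} + \beta_1 \mbox{week}_{i,j} + \beta_{2, i}\mbox{Wins}_{i,j} + \beta_{3, i} \mbox{Losses}_{i,j} + \log(\mbox{Tweets}_{i,j}) $$

$$ (\beta_{0, i}, \beta_{2, i}, \beta_{3, i}) \sim N(\mu, \Sigma) $$

Here $\mu$ holds the population-average intercept and slopes, and $\Sigma$ describes how much athletes vary around them. In lme4 formula syntax, this corresponds to a random-effects term of the form (1 + wins + losses | ids).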

These are just some criticisms the model may suffer from. You would be able to come up with more since you have more domain expertise than any of us.

Example

Here is an example of how you might structure your data. Let's assume this exists in a data frame called d.

# A tibble: 6 x 6
   week   ids Ngames  wins losses Ntweets
  <int> <int>  <dbl> <dbl>  <dbl>   <dbl>
1     1     1      5     4      1      40
2     1     2      2     1      1      23
3     1     3      3     2      1      46
4     1     4      5     1      4      45
5     1     5      4     2      2      30
6     1     6      2     1      1      40

To fit the mixed effects model, we need the lme4 package:

library(lme4)

# y (weekly count of "we") must be a column of d; log(Ntweets) is the offset
model <- glmer(y ~ wins + losses + week + (1 | ids),
               offset = log(Ntweets), data = d, family = poisson())

Here, the (1|ids) ensures each athlete gets their own intercept.
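Once such a model is fitted, the coefficients live on the log scale, so it can help to translate them into expected counts. The following is a small sketch of that arithmetic; the coefficient values are hypothetical and chosen purely for illustration, whereas real values would come from the fitted model.

```python
import math

# Hypothetical coefficient values, for illustration only.
beta0 = -2.0        # one athlete's intercept: log rate of "we" per tweet
beta_week = 0.0     # linear time effect
beta_wins = -0.10   # effect of one additional win
beta_losses = 0.20  # effect of one additional loss


def expected_we(week, wins, losses, n_tweets):
    """Expected count of 'we' for one athlete-week under the Poisson model."""
    log_mu = (
        beta0
        + beta_week * week
        + beta_wins * wins
        + beta_losses * losses
        + math.log(n_tweets)  # offset: enters with a fixed coefficient of 1
    )
    return math.exp(log_mu)


base = expected_we(week=1, wins=0, losses=0, n_tweets=40)
one_loss = expected_we(week=1, wins=0, losses=1, n_tweets=40)
double_tweets = expected_we(week=1, wins=0, losses=0, n_tweets=80)

# One extra loss multiplies the expected count by exp(beta_losses).
print(one_loss / base, math.exp(beta_losses))
# Doubling the number of tweets doubles the expected count (offset at work).
print(double_tweets / base)
```

The two ratios illustrate the interpretation: covariate effects are multiplicative on the expected count, and the offset makes the expected count scale in proportion to the number of tweets.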

I would strongly encourage you to make more formal assumptions about how variables like time affect the frequency of $y$, and to consider whether you are willing to posit that some athletes are more strongly affected by winning or losing than others.