Solved – Finding correlations in longitudinal data analysis

We are doing research on video lecture watching. We offer a course that lasts 5 weeks, with 9 or 10 videos each week. We organize group-watching activities. A group is usually composed of 4-5 people, and we have 3 groups. In each group, each participant watched all videos of that week on his/her own device with our software (so we could collect data). We collected data on how they navigated the videos. Right after each week's video-watching session, we asked each person to rate the difficulty of the whole week's lectures on a 5-point Likert scale.

After 5 weeks, we have a data set that includes the number of pauses per user per week and the difficulty rating per user per week. The data set is organised as follows:

For the Likert-scale questionnaire, the data look like this:

  group, person, video.difficultyOfTheWeek, week
  apricot,   A,     5,                       1
  apricot,   B,     3,                       1
  apricot,   C,     4,                       1
  apple,     A,     3,                       1
  apple,     B,     2,                       1
  apple,     C,     2,                       1
  orange,    A,     4,                       4
  orange,    B,     3,                       4
  orange,    C,     4,                       4

We also have pause data, structured as follows:

  group, person, numOfPauses, week, totalLengthOfVideoInTheWeekInMinute
  apricot,   A,    15,         1,             125
  apricot,   B,    23,         1,             125
  apricot,   C,    24,         1,             125
  apple,     A,    13,         1,             125
  apple,     B,    12,         1,             125
  apple,     C,     8,         1,             125
  orange,    A,    11,         4,             156
  orange,    B,     4,         4,             156
  orange,    C,     9,         4,             156

What I want to do for the next step is to answer the following question.

Does the number of pauses correlate with the perceived difficulty?

I tried to normalize the number of pauses. I composed a new data set by introducing a new variable "pauseFrequency", computed by dividing the number of pauses by the total length of the videos of that week (which is the same for all users in that week).

  group, person,  difficulty, week,      pauseFrequency
  apricot,   A,     5,         1,             15/125
  apricot,   B,     3,         1,             23/125
  apricot,   C,     4,         1,             24/125
  apple,     A,     3,         1,             13/125
  apple,     B,     2,         1,             12/125
  apple,     C,     2,         1,             8/125
  orange,    A,     4,         4,             11/156
  orange,    B,     3,         4,             4/156
  orange,    C,     4,         4,             9/156
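A minimal sketch of how the combined table could be built in R, using only the example rows shown above. The data frame names `ratings` and `pauses` and the `id` column are assumptions for illustration; the rating column is called `difficulty` as in the combined table. Because the person labels (A, B, C) repeat across groups, the sketch also builds a unique participant identifier.

```r
# Example rows copied from the two tables above (not the full data set).
ratings <- data.frame(
  group      = rep(c("apricot", "apple", "orange"), each = 3),
  person     = rep(c("A", "B", "C"), times = 3),
  difficulty = c(5, 3, 4, 3, 2, 2, 4, 3, 4),
  week       = rep(c(1, 1, 4), each = 3)
)
pauses <- data.frame(
  group       = rep(c("apricot", "apple", "orange"), each = 3),
  person      = rep(c("A", "B", "C"), times = 3),
  numOfPauses = c(15, 23, 24, 13, 12, 8, 11, 4, 9),
  week        = rep(c(1, 1, 4), each = 3),
  totalLengthOfVideoInTheWeekInMinute = rep(c(125, 125, 156), each = 3)
)

# Join on the three keys that together identify one observation.
d <- merge(ratings, pauses, by = c("group", "person", "week"))

# Pauses per minute of video offered in that week.
d$pauseFrequency <- d$numOfPauses / d$totalLengthOfVideoInTheWeekInMinute

# Person labels repeat across groups, so build a unique participant id
# (hypothetical column name, used only in this sketch).
d$id <- interaction(d$group, d$person, drop = TRUE)

print(d[order(d$week, d$group, d$person), ])
```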

Then the problem seems easy: I just need to run a correlation test on the difficulty column and the pauseFrequency column. I did that, treating all difficulty/frequency pairs the same way, regardless of whether they come from the same group or the same week. I treated each pair as an individual observation.

I am using R for the analysis, and I ran a Kendall's rank correlation test as follows:

cor.test(pauseFrequency,video.difficulty,method="kendall")

This of course generated a result, but I am not confident that it is correct. I have two concerns:

1. My measures are longitudinal and repeated: each user in each group is measured 5 times.
2. The "task" for each week is different. Although they watched videos of the same course for 5 weeks, each week they watched a different series of videos (with different content and lengths).

The observations are actually not independent. How can I compensate for this non-independence in my data analysis?

Best Answer

There are two main options available (there are more, but for simplicity let us mention only two):

First, you can use standard regression techniques (e.g. linear regression or methods for discrete choice), controlling, for example, for the length and content of the videos watched (your second concern). Of course there will be intra-group and intra-individual correlation. You can adjust your standard errors for intra-group correlation by making the variance matrix estimate robust to heteroscedasticity and arbitrary intra-group correlation (this also covers intra-individual correlation if participants do not switch between groups). More information is available here:

http://www.nber.org/WNE/lect_8_cluster.pdf

and here about the estimation of cluster-robust standard error in R:

http://diffuseprior.wordpress.com/2012/06/15/standard-robust-and-clustered-standard-errors-computed-in-r/
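As a rough illustration of this first approach, the sketch below fits an ordinary linear regression with week dummies and then computes cluster-robust standard errors by hand in base R. Everything about the data is an assumption: the values are simulated stand-ins (3 groups of 4 people over 5 weeks), the rating is treated as numeric, and clustering is done on a hypothetical participant id `id` rather than on the group. The estimator shown is the basic sandwich form without any small-sample correction.

```r
set.seed(1)

# Simulated stand-in data (NOT the study's data): 3 groups x 4 people x 5 weeks.
d <- expand.grid(week = 1:5, person = c("A", "B", "C", "D"),
                 group = c("apricot", "apple", "orange"))
d$id <- interaction(d$group, d$person, drop = TRUE)  # unique participant id
person_effect <- rnorm(nlevels(d$id), sd = 0.7)[as.integer(d$id)]
d$pauseFrequency <- runif(nrow(d), 0.03, 0.20)
latent <- 3 + 5 * (d$pauseFrequency - 0.1) + person_effect + rnorm(nrow(d), sd = 0.5)
d$difficulty <- pmin(5, pmax(1, round(latent)))      # 1..5 rating

# Week dummies absorb differences between the weekly "tasks".
fit <- lm(difficulty ~ pauseFrequency + factor(week), data = d)

# Cluster-robust variance: bread %*% meat %*% bread.
X      <- model.matrix(fit)
u      <- residuals(fit)
bread  <- solve(crossprod(X))          # (X'X)^-1
scores <- rowsum(X * u, group = d$id)  # per-cluster sums of x_i * u_i
meat   <- crossprod(scores)            # sum over clusters of s_g s_g'
vcov_cl <- bread %*% meat %*% bread
se_cl   <- sqrt(diag(vcov_cl))

# Compare naive and cluster-robust standard errors side by side.
print(cbind(estimate = coef(fit),
            se_ols = sqrt(diag(vcov(fit))),
            se_cluster = se_cl))
```

In practice the `vcovCL()` function from the sandwich package offers the same kind of estimator together with small-sample corrections. Cluster-robust standard errors rely on having many clusters, so with only 3 groups, clustering at the group level should be treated with caution.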

The second approach is to use mixed models (mentioned by gung). A good introduction is Gelman, Andrew, and Jennifer Hill. "Data Analysis Using Regression and Multilevel/Hierarchical Models." (2007). These methods are more efficient than the first approach but, of course, you have to make more assumptions about the form of the intra-group and intra-individual correlation (the variance matrix) to gain that efficiency.
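A minimal sketch of the mixed-model approach, assuming a random intercept per participant and week entered as a fixed effect. The data are simulated stand-ins, the column `id` is a hypothetical unique participant identifier, and the rating is treated as numeric for simplicity (an ordinal model may suit a 5-point scale better). The sketch uses `lme()` from the nlme package, which ships with standard R installations.

```r
library(nlme)  # recommended package bundled with standard R installations

set.seed(1)

# Simulated stand-in data (NOT the study's data): 3 groups x 4 people x 5 weeks.
d <- expand.grid(week = 1:5, person = c("A", "B", "C", "D"),
                 group = c("apricot", "apple", "orange"))
d$id <- interaction(d$group, d$person, drop = TRUE)  # unique participant id
person_effect <- rnorm(nlevels(d$id), sd = 0.7)[as.integer(d$id)]
d$pauseFrequency <- runif(nrow(d), 0.03, 0.20)
latent <- 3 + 5 * (d$pauseFrequency - 0.1) + person_effect + rnorm(nrow(d), sd = 0.5)
d$difficulty <- pmin(5, pmax(1, round(latent)))      # 1..5 rating

# Random intercept per participant models the repeated measures;
# factor(week) accounts for each week being a different "task".
m <- lme(difficulty ~ pauseFrequency + factor(week),
         random = ~ 1 | id, data = d, method = "REML")

# Fixed-effect estimate, standard error and test for pauseFrequency.
print(summary(m)$tTable["pauseFrequency", ])

# Estimated between-participant and residual standard deviations.
print(VarCorr(m))
```

A group-level random intercept could be added with `random = ~ 1 | group/person`, although with only 3 groups its variance would be estimated very imprecisely.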

With both approaches you can easily calculate the (partial) correlation, using a linear (additive) functional form, or other measures of association.
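One way to read "partial correlation" here, sketched under assumptions: remove the week effects from both variables with linear regressions and correlate the residuals. The data are simulated stand-ins and the variable names are illustrative. The sketch gives only the point estimate and ignores the clustering of observations within participants, so any inference on it would still need one of the two approaches above.

```r
set.seed(1)

# Simulated stand-in data (NOT the study's data): 3 groups x 4 people x 5 weeks.
d <- expand.grid(week = 1:5, person = c("A", "B", "C", "D"),
                 group = c("apricot", "apple", "orange"))
d$id <- interaction(d$group, d$person, drop = TRUE)
person_effect <- rnorm(nlevels(d$id), sd = 0.7)[as.integer(d$id)]
d$pauseFrequency <- runif(nrow(d), 0.03, 0.20)
latent <- 3 + 5 * (d$pauseFrequency - 0.1) + person_effect + rnorm(nrow(d), sd = 0.5)
d$difficulty <- pmin(5, pmax(1, round(latent)))

# Partial out the week effects from both variables, then correlate residuals.
r_diff  <- resid(lm(difficulty ~ factor(week), data = d))
r_pause <- resid(lm(pauseFrequency ~ factor(week), data = d))
partial_r <- cor(r_diff, r_pause)

# Cross-check: the same quantity follows from the t statistic of
# pauseFrequency in the full regression, r = t / sqrt(t^2 + residual df).
fit   <- lm(difficulty ~ pauseFrequency + factor(week), data = d)
t_val <- coef(summary(fit))["pauseFrequency", "t value"]
partial_r_check <- t_val / sqrt(t_val^2 + df.residual(fit))

print(c(partial_r = partial_r, partial_r_check = partial_r_check))
```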

As far as I understand, each participant acts on his/her own, so there should be no peer effects, which would require more complex models (and stronger assumptions about the kind of peer interactions). You should test whether there is actually intra-group correlation; it might be sufficient to control for the longitudinal dimension.
