Why test for both correlation and group differences using the same variable?

Tags: correlation, group-differences, post-hoc

A paper asks four research questions about the measurement of a latent trait x (a continuous variable acquired through learning or practice) and the speed with which that knowledge or skill can be accessed and used, in a large group of participants ($n=500$). The first two research questions are:

  1. Is there a difference in response times between groups with different trait x?
  2. What is the correlation between response times and trait x?

Subjects are assigned to the groups referenced in RQ1 post hoc based on their performance on the trait x test. It seems to me that if there is a correlation between response times and trait x, then one could always find a grouping such that there will be a statistically significant difference between the groups.
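
A minimal simulation sketch of this intuition, assuming bivariate normal data, a median split, and a population correlation of -0.4 (an arbitrary illustrative value, not taken from any paper). The function names `pearson_r`, `pooled_t`, and `simulate` are hypothetical helpers written for this example:

```python
import math
import random
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def pooled_t(a, b):
    """Student's two-sample t statistic with a pooled variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.fmean(a) - statistics.fmean(b)) / se


def simulate(n=500, rho=-0.4, seed=1):
    """Draw (trait, response time) pairs with population correlation rho."""
    rng = random.Random(seed)
    trait = [rng.gauss(0, 1) for _ in range(n)]
    # Response time = rho * trait + independent noise, scaled to unit variance.
    rt = [rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for x in trait]
    return trait, rt


trait, rt = simulate()

# Post hoc grouping: split participants at the median trait score.
cut = statistics.median(trait)
high = [y for x, y in zip(trait, rt) if x > cut]
low = [y for x, y in zip(trait, rt) if x <= cut]

r = pearson_r(trait, rt)
t = pooled_t(high, low)
print(f"r = {r:.2f}, t for median split = {t:.2f}")
```

With a correlation of this size and $n=500$, the t statistic for the median split should land far beyond the usual critical value of about 2, which is the sense in which the group difference looks like a foregone conclusion.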

So, my questions are:

  1. Can anyone explain to me why both questions need to be asked?
  2. Is there some fundamental statistical fallacy going on here or am I being overly simplistic in my thinking?

edit

The trait could be any psychometrically measured latent trait, but I have often seen these two questions appearing simultaneously in a variety of papers where something has been acquired through learning or practice such as in education or applied linguistics. In this example, it could be something trivial such as the ability to correctly pick the name of a famous person in a picture from a list of possible names.

From what I understand, one of the main problems with the first question is the arbitrary discretisation of a continuous variable (trait x) which results in information loss and possible bias through the arbitrary bins/thresholds which are created. Is this correct? Are there other problems happening here?
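
A rough sketch of the information-loss point, again assuming bivariate normal data with an illustrative population correlation of 0.5 (my choice, not from any study). It compares the correlation using the continuous trait with the point-biserial correlation obtained after cutting the trait at three arbitrary thresholds:

```python
import math
import random
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(7)
n, rho = 5000, 0.5  # illustrative values only
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rho * xi + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for xi in x]

r_continuous = pearson_r(x, y)

sorted_x = sorted(x)
r_binned = {}
for q in (0.5, 0.75, 0.9):
    cut = sorted_x[int(q * n)]  # empirical quantile used as the threshold
    group = [1.0 if xi > cut else 0.0 for xi in x]
    r_binned[q] = pearson_r(group, y)  # point-biserial correlation

print(f"continuous x: r = {r_continuous:.2f}")
for q, r in r_binned.items():
    print(f"cut at quantile {q}: r = {r:.2f}")
```

Under these assumptions the binned correlations should all come out smaller than the continuous one, and they should shrink further as the cut moves away from the median, so the apparent strength of the relationship depends on where the threshold happens to be placed.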

real life example

I've been hesitant to give an actual example because then replies usually focus on the specifics whereas I was trying to generalise, but here is one. There are two research questions:

  1. What is the size of the vocabulary of university juniors and how good is their reading comprehension?

  2. Do university students’ vocabulary knowledge and content knowledge influence their reading comprehension?

The researcher administers tests of reading comprehension, vocabulary size, and content knowledge. Ignoring the double-barrelled RQ1, so far, so good. But here is the part I can't understand (which is similar to my more abstract question above): The analysis includes:

a correlation analysis

Variables               Vocabulary knowledge Content knowledge    
---------------------   -------------------- -----------------
Reading comprehension   .70**                .41**                
Vocabulary knowledge                         .22**                

**p < .01 (two-tailed)  

a comparison of groups based on vocabulary size test performance

                   Reading comprehension      Content knowledge
                   Mean   SD    t        df   Mean   SD    t     df
-----------------  -------------------------  ----------------------
Above (n = 83)     20.34  6.18  12.25**  244  30.10  4.31  2.06  192
Below (n = 163)    11.23  5.15                28.82  5.12

**p ≦ .01                            
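
The t statistics in this table can be recomputed from the reported means, SDs, and group sizes alone. The sketch below assumes a pooled-variance (Student) test for reading comprehension and a Welch test for content knowledge; the second choice is my inference from the reported df of 192, not something stated in the paper. The helper names `pooled_t` and `welch_t` are hypothetical:

```python
import math


def pooled_t(m1, s1, n1, m2, s2, n2):
    """Student's t and df from summary statistics (equal variances assumed)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df


def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t and Satterthwaite df from summary statistics."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return (m1 - m2) / math.sqrt(v1 + v2), df


# Values copied from the table: Above (n = 83) vs Below (n = 163).
reading_t, reading_df = pooled_t(20.34, 6.18, 83, 11.23, 5.15, 163)
content_t, content_df = welch_t(30.10, 4.31, 83, 28.82, 5.12, 163)

print(f"reading comprehension: t = {reading_t:.2f}, df = {reading_df}")
print(f"content knowledge:     t = {content_t:.2f}, df = {content_df:.0f}")
```

Both results agree with the table to within rounding of the reported means and SDs, so the group comparison adds no data beyond what the split on vocabulary score already implies.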

and a multiple regression

Model        Sum of squares  df    Mean square  F       
----------   --------------  ----  -----------  --------
Regression   6719.39         2     154.91       154.91**
Residual     5270.35         243   21.67        
Total        11989.74        245 

Note. $R^2$ = .56
**p < .01    
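
As a side check, the derived quantities in this table can be recomputed from the sums of squares and degrees of freedom. This is only an arithmetic sketch, and the variable names are mine:

```python
# Sums of squares and degrees of freedom copied from the ANOVA table.
ss_regression, df_regression = 6719.39, 2
ss_residual, df_residual = 5270.35, 243

ms_regression = ss_regression / df_regression  # mean square = SS / df
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual
r_squared = ss_regression / (ss_regression + ss_residual)

print(f"MS regression = {ms_regression:.2f}")
print(f"MS residual   = {ms_residual:.2f}")
print(f"F             = {f_stat:.2f}")
print(f"R^2           = {r_squared:.2f}")
```

The recomputed F and $R^2$ match the reported 154.91 and .56, while the regression mean square works out to roughly 3359.70, which suggests that the 154.91 printed in the mean square column is a copy of the F value.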

I simply cannot figure out how the second analysis (a t test) is even justified, because the groups are created from a variable which has already been shown to correlate with the DVs in the t tests. Isn't that a foregone conclusion?

note: I realise that the study is not particularly good and that, given a multiple regression, neither the t test nor the correlation is really needed here, but this is simply an example of the phenomenon I'm asking about. That is, if a correlation has already been established between two variables, does it make sense to then arbitrarily bin one of the variables into groups and test for a difference between the groups? I see variations on this theme quite often and I can't figure out why it's justified.
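
One way to make the overlap concrete: the pooled-variance two-sample t test is algebraically equivalent to a test of the point-biserial correlation $r_{pb}$ between the binary group indicator and the DV. A sketch of the identity, applied to the reading comprehension row of the group comparison ($t = 12.25$, $df = n - 2 = 244$):

$$
t = r_{pb}\sqrt{\frac{n-2}{1-r_{pb}^2}}
\quad\Longleftrightarrow\quad
r_{pb} = \frac{t}{\sqrt{t^2 + (n-2)}}
= \frac{12.25}{\sqrt{12.25^2 + 244}} \approx .62
$$

Read this way, the t test is reporting a correlation of about .62 between the dichotomised vocabulary score and reading comprehension, next to the .70 already reported for the continuous score.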

Best Answer

If the groups are really post-hoc and based on the level of x, then the first test seems unnecessary and also seems to violate the assumptions.

But the second test seems fine; I've seen this sort of thing done lots of times and it seems inherently reasonable.

It would help if you provided context: What is x?
