Solved – On the utility of the intercept-slope correlation in multilevel models

Tags: mixed-model, multilevel-analysis, stata

In their book "Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling" (1999), Snijders & Bosker (ch. 8, section 8.2, p. 119) state that the intercept-slope correlation, calculated as the intercept-slope covariance divided by the square root of the product of the intercept variance and the slope variance (i.e., $\tau_{01}/\sqrt{\tau_0^2\,\tau_1^2}$), is not bounded between -1 and +1 and can even be infinite.

Given this, I didn't think I should trust it, and I have an example to illustrate why. In one of my analyses, with race (a dichotomy), age, and age×race as fixed effects, cohort as the grouping (random) factor, and the race dichotomy as a random slope, a series of scatterplots shows that the slope varies little across the values of my cluster (i.e., cohort) variable; I don't see the slope becoming steeper or shallower across cohorts. A likelihood ratio test also shows that the difference in fit between the random-intercept and random-slope models is not significant, despite my total sample size (N = 22,156). And yet the intercept-slope correlation was near -0.80 (which would suggest a strong convergence of the group difference in the Y variable over time, i.e., across cohorts).

I think it's a good illustration of why I don't trust the intercept-slope correlation, on top of what Snijders & Bosker (1999) already said.

Should we really trust and report the intercept-slope correlation in multilevel studies? Specifically, what is the usefulness of such a correlation?

EDIT 1: I don't think it will answer my question, but gung asked me to provide more information. See below, if it helps.

The data come from the General Social Survey. I used Stata 12; the syntax reads:

xtmixed wordsum bw1 aged1 aged2 aged3 aged4 aged6 aged7 aged8 aged9 bw1aged1 bw1aged2 bw1aged3 bw1aged4 bw1aged6 bw1aged7 bw1aged8 bw1aged9 || cohort21: bw1, reml cov(un) var
  • wordsum is a vocabulary test score (0-10),
  • bw1 is the ethnicity variable (black = 0, white = 1),
  • aged1-aged9 are age dummy variables (aged5 is the omitted reference category),
  • bw1aged1-bw1aged9 are the ethnicity × age interactions,
  • cohort21 is my cohort variable (21 categories, coded 0 to 20).
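
In model form, this command corresponds to (with $i$ indexing respondents and $j$ indexing cohorts):

$$\text{wordsum}_{ij} = \beta_0 + \beta_1\,\text{bw1}_{ij} + \sum_{k \neq 5}\gamma_k\,\text{aged}k_{ij} + \sum_{k \neq 5}\delta_k\,\text{bw1aged}k_{ij} + u_{0j} + u_{1j}\,\text{bw1}_{ij} + e_{ij},$$

$$\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim N\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \tau_0^2 & \tau_{01} \\ \tau_{01} & \tau_1^2 \end{pmatrix}\right), \qquad e_{ij} \sim N(0, \sigma^2),$$

estimated by REML with the level-two covariance matrix left unstructured. The intercept-slope correlation at issue is $\tau_{01}/\sqrt{\tau_0^2\,\tau_1^2}$.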

The output reads:

    . xtmixed wordsum bw1 aged1 aged2 aged3 aged4 aged6 aged7 aged8 aged9 bw1aged1 bw1aged2 bw1aged3 bw1aged4 bw1aged6 bw1aged7 bw1aged8 bw1aged9 || cohort21: bw1, reml 
> cov(un) var

Performing EM optimization: 

Performing gradient-based optimization: 

Iteration 0:   log restricted-likelihood = -46809.738  
Iteration 1:   log restricted-likelihood = -46809.673  
Iteration 2:   log restricted-likelihood = -46809.673  

Computing standard errors:

Mixed-effects REML regression                   Number of obs      =     22156
Group variable: cohort21                        Number of groups   =        21

                                                Obs per group: min =       307
                                                               avg =    1055.0
                                                               max =      1728


                                                Wald chi2(17)      =   1563.31
Log restricted-likelihood = -46809.673          Prob > chi2        =    0.0000

------------------------------------------------------------------------------
     wordsum |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         bw1 |   1.295614   .1030182    12.58   0.000     1.093702    1.497526
       aged1 |  -.7546665    .139246    -5.42   0.000    -1.027584   -.4817494
       aged2 |  -.3792977   .1315739    -2.88   0.004    -.6371779   -.1214175
       aged3 |  -.1504477   .1286839    -1.17   0.242    -.4026635     .101768
       aged4 |  -.1160748   .1339034    -0.87   0.386    -.3785207    .1463711
       aged6 |  -.1653243   .1365332    -1.21   0.226    -.4329245     .102276
       aged7 |  -.2355365    .143577    -1.64   0.101    -.5169423    .0458693
       aged8 |  -.2810572   .1575993    -1.78   0.075    -.5899461    .0278318
       aged9 |  -.6922531   .1690787    -4.09   0.000    -1.023641   -.3608649
    bw1aged1 |  -.2634496   .1506558    -1.75   0.080    -.5587297    .0318304
    bw1aged2 |  -.1059969   .1427813    -0.74   0.458    -.3858431    .1738493
    bw1aged3 |  -.1189573   .1410978    -0.84   0.399     -.395504    .1575893
    bw1aged4 |    .058361   .1457749     0.40   0.689    -.2273525    .3440746
    bw1aged6 |   .1909798   .1484818     1.29   0.198    -.1000393    .4819988
    bw1aged7 |   .2117798    .154987     1.37   0.172    -.0919891    .5155486
    bw1aged8 |   .3350124    .167292     2.00   0.045     .0071262    .6628987
    bw1aged9 |   .7307429   .1758304     4.16   0.000     .3861217    1.075364
       _cons |   5.208518   .1060306    49.12   0.000     5.000702    5.416334
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
cohort21: Unstructured       |
                    var(bw1) |   .0049087    .010795      .0000659    .3655149
                  var(_cons) |   .0480407   .0271812      .0158491     .145618
              cov(bw1,_cons) |  -.0119882    .015875     -.0431026    .0191262
-----------------------------+------------------------------------------------
               var(Residual) |   3.988915   .0379483      3.915227     4.06399
------------------------------------------------------------------------------
LR test vs. linear regression:       chi2(3) =    85.83   Prob > chi2 = 0.0000

Note: LR test is conservative and provided only for reference.

The scatterplots I produced are shown below: nine panels, one for each category of my age variable.

[image: nine scatterplots of wordsum across cohorts, one per age category]

EDIT 2:

. estat recovariance

Random-effects covariance matrix for level cohort21

             |       bw1      _cons 
-------------+----------------------
         bw1 |  .0049087            
       _cons | -.0119882   .0480407
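
From these components, the correlation I mentioned can be computed directly; a quick check in Stata:

. display -.0119882 / sqrt(.0049087 * .0480407)

This gives about -0.78, the "near -0.80" value quoted above.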

There is another thing I want to add. What bothers me is that, with regard to the intercept-slope covariance/correlation, Joop J. Hox (2010, p. 90), in his book "Multilevel Analysis: Techniques and Applications, Second Edition", says:

It is easier to interpret this covariance if it is presented as a correlation between the intercept and slope residuals. … In a model without other predictors except the time variable, this correlation can be interpreted as an ordinary correlation, but in models 5 and 6 it is a partial correlation, conditional on the predictors in the model.

So it seems that not everyone agrees with Snijders & Bosker (1999, p. 119), who hold that "the idea of a correlation does not make sense here" because it is not bounded within [-1, 1].

Best Answer

Several weeks ago I emailed several scholars (almost 30 people). Few of them replied (always in collective email threads). Eugene Demidenko was the first to answer:

$\operatorname{cov}/\sqrt{\operatorname{var}_1 \cdot \operatorname{var}_2}$ is always within $[-1, 1]$ regardless of the interpretation: it may be estimates of intercept and slope, two slopes, etc. The fact that $-1 \le \operatorname{cov}/\sqrt{\operatorname{var}_1 \cdot \operatorname{var}_2} \le 1$ follows from the Cauchy inequality and it is always true. Thus I dismiss the Snijders & Bosker statement. Maybe some other piece of information is missing?
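
For genuine random variables, the bound Demidenko invokes is the Cauchy-Schwarz inequality,

$$\lvert \operatorname{Cov}(u_0, u_1) \rvert \le \sqrt{\operatorname{Var}(u_0)\,\operatorname{Var}(u_1)},$$

which can only fail if $\tau_0^2$, $\tau_1^2$, and $\tau_{01}$ are treated as free parameters of a variance function rather than as moments of actual random effects. That is exactly the distinction Snijders draws next.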

This was followed by an email from Thomas Snijders:

The information that is missing is what was actually written about this on pages 122-124 and 129 of Snijders & Bosker (2nd edition, 2012). This is not about two competing claims of which no more than one can be true; it is about two different interpretations.

On p. 123 a quadratic variance function is introduced, $\sigma_0^2 + 2\sigma_{01}x + \sigma_1^2 x^2$, and the following remark is made: "This formula can be used without the interpretation that $\sigma_0^2$ and $\sigma_1^2$ are variances and $\sigma_{01}$ a covariance; these parameters might be any numbers. The formula only implies that the residual variance is a quadratic function of $x$."

Let me quote a full paragraph of p. 129, about a quadratic variance function at level two; note that ONE MIGHT INTERPRET $\tau_0^2$ and $\tau_1^2$ as the level-two variances of the random intercept and random slope, and $\tau_{01}$ as their covariance, but this is explicitly put behind the horizon:

"The parameters \tau_0^2, \tau_1^2, and \tau_{01} are, as in the preceding section, not to be interpreted themselves as variances and a corresponding covariance. The interpretation is by means of the variance function (8.7) [note t.s.: in the book this is mistakenly reported as 8.8]. Therefore it is not required that \tau_{01}^2 <= \tau_0^2 * \tau_1^2. To put it another way, 'correlations' defined formally by \tau_{01}/(\tau_0 * \tau_1) may be larger than 1 or smaller than -1, even infinite, because the idea of a correlation does not make sense here. An example of this is provided by the linear variance function for which \tau_1^2 = 0 and only the parameters \tau_0^2 and \tau_{01} are used."

The variance function is a quadratic function of $x$ (the variable "with the random slope"), and the variance of the outcome is this plus the level-1 variance. As long as this is positive for all $x$, the modelled variance is positive. (An extra requirement is that the corresponding covariance matrix is positive definite.)

Some further background on this is the existence of differences between parameter estimation algorithms in software. In some multilevel (random effects) software, the requirement is made that the covariance matrices of the random effects are positive semi-definite at all levels. In other software, the requirement is made only that the resulting estimated covariance matrix for the observed data is positive semi-definite. This implies that the idea of random coefficients as latent variables is relinquished, and the model specifies a certain covariance structure for the observed data; no more, no less. In that case the cited interpretation of Joop Hox does not apply. Note that Harvey Goldstein long ago used linear variance functions at level one, represented by a zero slope variance and a nonzero slope-intercept correlation at level one; this was and is called "complex variation"; see, e.g., http://www.bristol.ac.uk/media-library/sites/cmm/migrated/documents/modelling-complex-variation.pdf
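
To make the variance-function reading concrete with the output from the question: the variable with the random slope, bw1, takes only the values 0 and 1, so the level-two part of the variance function evaluates to

$$\tau_0^2 + 2\tau_{01}x + \tau_1^2 x^2 = \begin{cases} 0.0480 & x = 0 \text{ (black)} \\ 0.0480 + 2(-0.0120) + 0.0049 \approx 0.0290 & x = 1 \text{ (white)} \end{cases}$$

Both values are positive, so the fitted model is admissible under this reading whether or not $\tau_{01}/(\tau_0\tau_1) \approx -0.78$ is interpreted as a correlation.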

And then, Joop Hox replied:

In the software MLwiN it is actually possible to estimate a covariance term and at the same time constrain one of the variances to zero, which would make the "correlation" infinite. And yes, some software will allow estimates such as negative variances (SEM software usually allows this). So my statements were not completely accurate. I referred to "normal" unstructured random structures. Let me add that if you rescale the variable with the random slope to have a different zero-point, the variances and covariances generally change. So the correlation is only interpretable if the predictor variable has a fixed zero-point, i.e., is measured on a ratio scale. This applies to growth curve models, where the correlation between initial status and rate of growth is sometimes interpreted. In that case the value zero should be the 'real' time point where the process starts.
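
Hox's rescaling point can be verified algebraically (a standard result, sketched here): if the predictor with the random slope is recentered as $x^* = x - c$, then $u_{0j} + u_{1j}x = (u_{0j} + c\,u_{1j}) + u_{1j}x^*$, so the intercept residual becomes $u_{0j} + c\,u_{1j}$ and

$$\tau_0^{2*} = \tau_0^2 + 2c\,\tau_{01} + c^2\tau_1^2, \qquad \tau_{01}^* = \tau_{01} + c\,\tau_1^2, \qquad \tau_1^{2*} = \tau_1^2.$$

The intercept variance, the covariance, and hence the intercept-slope correlation all change with the choice of zero-point; only the slope variance is invariant.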

And he sent another mail:

Anyway, I think Tom's explanation below fits the style of the Snijders/Bosker collaboration better than my more informal style. I would add to page 90 a footnote stating something like: "Note that the parameter values in the random part are estimates. Interpreting the standardized covariances as ordinary correlations assumes that there are no constraints on the variances and that the software does not allow negative estimates. If the random part is unstructured, the interpretation as ordinary (co)variances is generally tenable."

Note that I wrote about the correlation interpretation in the longitudinal chapter. In growth curve modeling it is very tempting to interpret this correlation as a substantive result, and that is dangerous because the value depends on the "metric of time". If you are interested in that, I recommend Lesa Hoffman's website (http://www.lesahoffman.com/).

So I think that in my situation, where I specified an unstructured covariance matrix for the random effects and did not constrain any variance to zero, I can interpret the intercept-slope correlation as an ordinary correlation.
