Regression – Negative Regression Coefficient Despite Positive Raw Plot

Tags: lme4-nlme, mixed model, r, regression coefficients

EDIT
The data are available here:

https://www.dropbox.com/s/ufrqesp1tmeh3ll/my.data.csv?dl=0

My data consist of crop yield values collected across multiple locations and years. This is what the data look like:

  yield admin1 admin2          x1         x2        year
  6000     31  31002  0.61842540  0.5265148 -1.63343256
  7000     31  31002  0.61842540  0.5265148 -1.05893532
  6500     31  31002  0.61842540  0.5265148 -0.48443809
  7800     31  31002  0.03556101  0.1613198 -0.19718947
  7500     31  31002  0.61842540  0.5265148  0.09005915
  8500     31  31002 -0.44165048 -0.1268841  0.37730777

The locations from which yield data are collected are nested within admin2, and admin2 units are nested within admin1. I have two independent variables, x1 and x2.
I did some pre-processing so that x1 and x2 are in standardised units (i.e. from the original x1 and x2 I subtracted the respective mean and divided by the respective SD; the same was done for the year variable).
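For reference, the standardisation was along these lines (a sketch; it assumes the raw columns in dat are named x1, x2 and year):

 #Standardise each variable: subtract its mean, divide by its SD
 dat$x1   <- (dat$x1   - mean(dat$x1))   / sd(dat$x1)
 dat$x2   <- (dat$x2   - mean(dat$x2))   / sd(dat$x2)
 dat$year <- (dat$year - mean(dat$year)) / sd(dat$year)
 #Equivalently: dat$x1 <- as.numeric(scale(dat$x1)), etc.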
Some raw plots:

[image: raw scatterplots of yield against x1 and x2]

There is a weak quadratic relationship between yield and each of x1 and x2. I fitted a mixed model:

 mod <- lmer(log(yield) ~ x1 + x2 + year + (year |admin1/admin2), REML = FALSE, data = dat)
 summary(mod)

 Fixed effects:
             Estimate Std. Error t value
 (Intercept)  8.41458    0.08582  98.054
 x1          -0.07341    0.01559  -4.709
 x2           0.13192    0.01522   8.667
 year         0.11647    0.02992   3.893

One thing I do not understand is why the coefficient of x1 is negative. Given the raw plots, the coefficients of x1 and x2 should both be positive, since each has a positive relationship with yield. Even if x1 and x2 are correlated, the correlation is positive, so it should not reverse the signs of their coefficients.
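For what it is worth, the pairwise (marginal) correlations can be checked directly (a quick sketch, using the same dat as above):

 #Marginal correlations among log-yield, x1 and x2
 cor(cbind(logyield = log(dat$yield), x1 = dat$x1, x2 = dat$x2))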

My ultimate aim is to predict yield as a function of x1 and x2.

EDIT

I followed the suggestion in the comments and plotted x1 against log yield for different ranges of x2; this is what I get. Could anyone tell me what this says about why the signs of x1 and x2 are opposite in the model, and whether it affects my predictions? (I am more interested in the prediction than in the sign of the regression coefficient itself.)

[image: log yield against x1 for different ranges of x2]
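For reference, a conditioned plot like this can be produced with base R's coplot (a sketch; the panel function adds a within-panel least-squares line):

 #Plot log yield against x1 within overlapping ranges of x2
 coplot(log(yield) ~ x1 | x2, data = dat, number = 4,
        panel = function(x, y, ...) {
          points(x, y, ...)
          abline(lm(y ~ x), col = "red")
        })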

EDIT

Following Ben's explanation, I am extending this question to get a better understanding.

x1 and x2 are variables that measure water availability to crops, so as x1 or x2 increases (better water availability), the yield should go up as well (i.e. a positive correlation of x1 and x2 with yield, which the univariate plots show). Does this result mean that I cannot use this model for any prediction, since the coefficient of x1 is wrong (negative, indicating that yield goes down with increasing x1)? Or does it mean that interpreting the regression coefficients at face value is not practical in this case?

Best Answer

What is happening here is essentially just Simpson's "paradox". In this particular case you have observed positive marginal correlation between yield and x1, but the relationship turns negative after you condition on x2 and year in your linear model. You can also see from your plots that x1 and x2 have strong positive correlation, so this is giving you strong multicollinearity which would explain the phenomenon in this case.
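To see how this sign flip can arise, here is a toy simulation (not your data) where x1 and x2 are strongly positively correlated and the true conditional effect of x1 is negative, yet the marginal correlation between the response and x1 comes out positive:

#Toy demonstration of a sign flip under collinearity
set.seed(1);
n  <- 1000;
x2 <- rnorm(n);
x1 <- 0.9*x2 + 0.3*rnorm(n);         #x1 strongly positively correlated with x2
y  <- -0.5*x1 + 1.5*x2 + rnorm(n);   #true conditional effect of x1 is negative
cor(y, x1);                          #marginal correlation: positive
coef(lm(y ~ x1 + x2));               #conditional coefficient of x1: negative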

This type of phenomenon is not unusual when examining relationships between multiple variables, especially when there is strong collinearity. For this reason it is generally misleading to plot crude pairwise comparisons between variables when doing analysis with many variables. If you want to look at the conditional relationship between yield and x1, this would usually be illustrated with a partial regression plot (also called an added-variable plot).


Implementation in R: The effects package has functionality to automatically produce residuals that absorb the lower-order terms marginal to the model variable of interest. This allows you to construct what are effectively partial regression plots for a range of models, including mixed models fitted with lmer. The code below produces such a plot. (Note that the data file you have linked to does not exactly match the model output you presented in your question, so I have included the model output from the linked data.)

#Read data (need to put it in working directory first)
DATA <- read.csv('my.data.csv');

#Fit your model
library(lme4);
MODEL <- lmer(log(yield) ~ x1 + x2 + year + (year |admin1/admin2),
              REML = FALSE, data = DATA);

#Show model output
summary(MODEL);

...
Fixed effects:
            Estimate Std. Error t value
(Intercept)  8.41434    0.08585  98.008
x1          -0.07381    0.01558  -4.736
x2           0.13214    0.01521   8.687
year         0.11642    0.02994   3.888
....

#Generate partial regression plot using effects package
library(effects);
PARTIAL_MODEL <- Effect('x1', partial.residuals = TRUE, mod = MODEL);
plot(PARTIAL_MODEL, main = 'Partial Regression Plot',
     xlab = 'x1', ylab = 'Log-Yield');

[image: partial regression plot of log-yield against x1]
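As to your prediction aim: a coefficient whose sign flips because of collinearity does not by itself invalidate predictions, provided new observations come from a similar joint distribution of x1 and x2. A minimal sketch of obtaining predictions on the original yield scale (the model was fit to log(yield), so the naive exp() back-transform is used here, which ignores retransformation bias):

#Predicted yields for the fitted data (includes estimated random effects)
PRED <- exp(predict(MODEL));
head(PRED);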
