Beta Regression – Why Beta Regression Might Show an Unusual Plot in R

beta-regressionr

I am trying to model the relationship between my response 'crop coverage [%]' ~ weed coverage [%] + Soil Moisture [%] using R. Since I am dealing with proportions, I chose to do a beta regression. Since I had some values which were equal to 1, I substracted like 0.0001 from them, so they would meet the criteria for beta distributions. Then I proceeded to fit a model with the betareg() function and plotted the result with ggplot(). The plot looks pretty weird to me. Did I do something wrong?

My data:

df <- structure(list(date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L), .Label = c("2021-03-17", 
"2021-04-07", "2021-04-13", "2021-04-27", "2021-05-11", "2021-05-27"
), class = "factor"), weed_coverage = c(0, 0, 0, 1.7, 1, 5, 0, 0, 
0.1, 0.2, 1, 2.8, 2.5, 1, 1, 5, 0, 0, 0.9, 0.7, 0, 1.1, 0.5, 
0.5, 0, 0, 0.5, 4, 0, 0.3, 0.8, 4, 1, 2, 2, 6, 0.2, 5, 0, 0, 
3, 1, 0, 2, 0, 0, 0, 3, 3, 0), soil_moisture = c(36.28, 37.6, 38.55, 
34.38, 34.02, 34.88, 34.92, 38.12, 35.38, 36.92, 27.15, 24.95, 
21.38, 22.95, 27.65, 25.7, 27.02, 32.1, 27.18, 26, 14.97, 15.25, 
17.02, 16.12, 15.32, 14.3, 14.5, 12.45, 13.07, 15.4, 14.9, 12, 
16.85, 17.15, 18.52, 10.68, 13.82, 9.5, 15.32, 10.97, 14.8, 17.05, 
26.75, 14.8, 25.75, 19.18, 18.12, 14.22, 18.95, 24.38), crop_coverage = c(0.38, 
0.6, 0.75, 0.5, 0.4, 0.48, 0.74, 0.75, 0.27, 0.45, 0.65, 0.3, 
0.4, 0.38, 0.45, 0.58, 0.48, 0.75, 0.58, 0.4, 0.9999, 0.7, 0.75, 
0.7, 0.85, 0.78, 0.7, 0.91, 0.2, 0.6, 0.95, 0.85, 0.6, 0.7, 0.75, 
0.9, 0.8, 0.85, 0.75, 0.96, 0.85, 0.85, 0.75, 0.73, 0.68, 0.7, 
0.97, 0.7, 0.75, 0.74)), class = c("grouped_df", "tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -50L), groups = structure(list(
    Datum = structure(c(1L, 2L, 4L, 5L, 6L), .Label = c("2021-03-17", 
    "2021-04-07", "2021-04-13", "2021-04-27", "2021-05-11", "2021-05-27"
    ), class = "factor"), .rows = structure(list(1:10, 11:20, 
        21:30, 31:40, 41:50), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = c(NA, -5L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE))

The code for fitting the model

betareg(crop_coverage ~ soil_moisture + weed_coverage, data = df) -> model_a

The code for the plot

ggplot(df, aes(x = soil_moisture, y = crop_coverage)) +
    geom_point(size = 2, shape = 21)+
    geom_line(aes(y = predict(model_a, df)), color = "blue")+
    theme_bw()+
    ylab("Crop coverage [%]")+
    xlab("Soil Moisture [%]")

The plot looks like this and seems pretty weird to me. This doesnt look like a line from a beta regression because its not smooth.. Did I miss something? Any help is really appreciated! 🙂 cheers!

enter image description here

Best Answer

Your regression has two inputs crop coverage ~ soil_moisture + weed_coverage but you've only plotted crop_coverage against soil_moisture while still allowing variability in weed_coverage.

If you want a smooth plot, you need to drop weed_coverage from the analysis entirely (not entirely satisfactory); or you can plot $E(\texttt{crop_coverage} \mid \texttt{soil_moisture})$. That is, average over the "hidden" variable. Finally, you could, since you have two input variables in the model, construct a 3D response surface plot to show how the two variables jointly contribute to crop_coverage.

I've managed to reproduce the same phenomenon for a linear model, by plotting the fitted value while not accounting for another variable included in the model. See the code below:

n = 10
x1 = runif(n)
x2 = runif(n)
y = 0.3*x1 - 2*x2 + rnorm(n)

d = tibble(
  y = y,  x1 = x1, x2 = x2
)

lin_model = lm(y ~ x1 + x2)
d$fitted = lin_model$fitted.values
d %>%
  ggplot(aes(x = x1, y = fitted)) +
  geom_point(aes(x = x1, y = y)) +
  geom_line()

Scatter plot with a jagged line similar to the original post

Related Question