I am trying to model the relationship between my response 'crop coverage [%]' ~ weed coverage [%] + Soil Moisture [%] using R. Since I am dealing with proportions, I chose to do a beta regression. Since I had some values which were equal to 1, I substracted like 0.0001 from them, so they would meet the criteria for beta distributions. Then I proceeded to fit a model with the betareg() function and plotted the result with ggplot(). The plot looks pretty weird to me. Did I do something wrong?
My data:
df <- structure(list(date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L), .Label = c("2021-03-17",
"2021-04-07", "2021-04-13", "2021-04-27", "2021-05-11", "2021-05-27"
), class = "factor"), weed_coverage = c(0, 0, 0, 1.7, 1, 5, 0, 0,
0.1, 0.2, 1, 2.8, 2.5, 1, 1, 5, 0, 0, 0.9, 0.7, 0, 1.1, 0.5,
0.5, 0, 0, 0.5, 4, 0, 0.3, 0.8, 4, 1, 2, 2, 6, 0.2, 5, 0, 0,
3, 1, 0, 2, 0, 0, 0, 3, 3, 0), soil_moisture = c(36.28, 37.6, 38.55,
34.38, 34.02, 34.88, 34.92, 38.12, 35.38, 36.92, 27.15, 24.95,
21.38, 22.95, 27.65, 25.7, 27.02, 32.1, 27.18, 26, 14.97, 15.25,
17.02, 16.12, 15.32, 14.3, 14.5, 12.45, 13.07, 15.4, 14.9, 12,
16.85, 17.15, 18.52, 10.68, 13.82, 9.5, 15.32, 10.97, 14.8, 17.05,
26.75, 14.8, 25.75, 19.18, 18.12, 14.22, 18.95, 24.38), crop_coverage = c(0.38,
0.6, 0.75, 0.5, 0.4, 0.48, 0.74, 0.75, 0.27, 0.45, 0.65, 0.3,
0.4, 0.38, 0.45, 0.58, 0.48, 0.75, 0.58, 0.4, 0.9999, 0.7, 0.75,
0.7, 0.85, 0.78, 0.7, 0.91, 0.2, 0.6, 0.95, 0.85, 0.6, 0.7, 0.75,
0.9, 0.8, 0.85, 0.75, 0.96, 0.85, 0.85, 0.75, 0.73, 0.68, 0.7,
0.97, 0.7, 0.75, 0.74)), class = c("grouped_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -50L), groups = structure(list(
Datum = structure(c(1L, 2L, 4L, 5L, 6L), .Label = c("2021-03-17",
"2021-04-07", "2021-04-13", "2021-04-27", "2021-05-11", "2021-05-27"
), class = "factor"), .rows = structure(list(1:10, 11:20,
21:30, 31:40, 41:50), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE))
The code for fitting the model
betareg(crop_coverage ~ soil_moisture + weed_coverage, data = df) -> model_a
The code for the plot
ggplot(df, aes(x = soil_moisture, y = crop_coverage)) +
geom_point(size = 2, shape = 21)+
geom_line(aes(y = predict(model_a, df)), color = "blue")+
theme_bw()+
ylab("Crop coverage [%]")+
xlab("Soil Moisture [%]")
The plot looks like this and seems pretty weird to me. This doesnt look like a line from a beta regression because its not smooth.. Did I miss something? Any help is really appreciated! 🙂 cheers!
Best Answer
Your regression has two inputs
crop coverage ~ soil_moisture + weed_coverage
but you've only plottedcrop_coverage
againstsoil_moisture
while still allowing variability inweed_coverage
.If you want a smooth plot, you need to drop
weed_coverage
from the analysis entirely (not entirely satisfactory); or you can plot $E(\texttt{crop_coverage} \mid \texttt{soil_moisture})$. That is, average over the "hidden" variable. Finally, you could, since you have two input variables in the model, construct a 3D response surface plot to show how the two variables jointly contribute tocrop_coverage
.I've managed to reproduce the same phenomenon for a linear model, by plotting the fitted value while not accounting for another variable included in the model. See the code below: