Linear regression to predict both mean and SD of dependent variable

Tags: heteroscedasticity, linear model, lognormal distribution

Imagine we were to investigate the relationship between people's annual income and daily food expenditure in a fictional population. The following example is not meant to be realistic, but hopefully serves to illustrate the point.

We define ten income groups: 100K, 200K, 300K etc. up to 1 million. For each group, we find 1000 people who have exactly those incomes and ask them how much they spend on food on an average day. We find the following distributions for each group (jitter is applied for better visualisation):

[Figure: income vs. expenditure for the ten income groups]

We calculate the mean and SD for each group. We then use simple linear regression and discover that there is a linear relationship between income and the means we found, and also a linear relationship between the income and the SDs (i.e. SD increases with increasing income).
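The grouped procedure can be sketched in a few lines of Python. This is only an illustration on simulated data: the "true" coefficients, the seed, and the helper names `lognormal_params` and `fit_line` are assumptions made up for the sketch, not values from the example above. Income is measured in thousands.

```python
import math
import random
import statistics

random.seed(0)

def lognormal_params(mean, sd):
    """Return (mu, sigma) of the log-normal that has the given mean and SD."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    return math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2)

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical "true" relationships (assumed values, income in thousands).
TRUE_MEAN = (10.0, 0.05)  # mean expenditure = 10 + 0.05 * income
TRUE_SD = (2.0, 0.02)     # SD of expenditure = 2 + 0.02 * income

incomes = [100 * k for k in range(1, 11)]  # 100K, 200K, ..., 1 million
group_means, group_sds = [], []
for income in incomes:
    mean = TRUE_MEAN[0] + TRUE_MEAN[1] * income
    sd = TRUE_SD[0] + TRUE_SD[1] * income
    mu, sigma = lognormal_params(mean, sd)
    # 1000 simulated people per income group
    sample = [random.lognormvariate(mu, sigma) for _ in range(1000)]
    group_means.append(statistics.fmean(sample))
    group_sds.append(statistics.stdev(sample))

# One regression for the group means, one for the group SDs.
mean_intercept, mean_slope = fit_line(incomes, group_means)
sd_intercept, sd_slope = fit_line(incomes, group_sds)
print(f"mean ~ {mean_intercept:.2f} + {mean_slope:.4f} * income")
print(f"SD   ~ {sd_intercept:.2f} + {sd_slope:.4f} * income")
```

With 1000 people per group, the two fitted slopes should land close to the assumed true slopes.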

We also discover that a log-normal distribution can be fitted for each group. This allows us to make a model that can predict the percentiles of expenditure for any income (at least within the range):

[Figure: modelled percentiles of expenditure as a function of income]
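One way such a percentile model could be built is by moment matching: take the predicted mean and SD at a given income, convert them to the log-normal parameters, and read off the quantile. The sketch below assumes this approach; the coefficients are hypothetical placeholders (income in thousands) and `expenditure_percentile` is an illustrative name.

```python
import math
from statistics import NormalDist

# Hypothetical fitted lines (income in thousands); substitute your own fits.
MEAN_INTERCEPT, MEAN_SLOPE = 10.0, 0.05
SD_INTERCEPT, SD_SLOPE = 2.0, 0.02

def expenditure_percentile(income, p):
    """Quantile p (0 < p < 1) of expenditure at the given income."""
    mean = MEAN_INTERCEPT + MEAN_SLOPE * income
    sd = SD_INTERCEPT + SD_SLOPE * income
    # Moment matching: log-normal (mu, sigma) with this mean and SD.
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    mu = math.log(mean) - sigma2 / 2.0
    z = NormalDist().inv_cdf(p)  # standard normal quantile
    return math.exp(mu + math.sqrt(sigma2) * z)

for income in (250, 500, 750):
    row = [expenditure_percentile(income, p) for p in (0.05, 0.5, 0.95)]
    print(income, [round(v, 1) for v in row])
```

Because the log-normal is right-skewed, the predicted median sits below the predicted mean, and the upper percentiles fan out faster than the lower ones as income grows.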

Imagine instead that we did not have access to those 10 neat income groups, but instead simply asked e.g. 600 random people (from the same population as before) about their income and their food expenditure, and found this:

[Figure: scatter plot of income vs. food expenditure for the 600 randomly sampled people]

Is it possible to approximate the percentiles shown in the second plot when the income variable isn't divided into discrete, equally sized groups? The residuals are heteroscedastic, and let's assume that expenditure at any given income is still log-normally distributed, as before.

Best Answer

As you are interested in modeling percentiles, you should have a look at quantile regression methods. Instead of modeling conditional means (as in linear regression), quantile regression allows you to model (conditional) quantiles.

As mentioned in the comments, a good introduction to quantile regression is the vignette to the quantreg R package. One of the examples in the vignette illustrates your use case:

[Figure: quantile regression example from the quantreg vignette]
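To make the idea concrete, here is a rough, standard-library-only Python sketch of linear quantile regression on ungrouped data. It minimises the check (pinball) loss by brute force, relying on the fact that for a single predictor at least one optimal line passes through two of the data points. The simulated data, its coefficients, the reduced sample size (120 rather than 600, to keep the brute force fast), and the function names are all assumptions for illustration; this is not how quantreg is implemented.

```python
import math
import random
from itertools import combinations

random.seed(1)

def pinball_loss(residual, tau):
    """Check loss: tau * r for r >= 0, (tau - 1) * r otherwise."""
    return residual * (tau if residual >= 0 else tau - 1.0)

def fit_quantile_line(xs, ys, tau):
    """Linear quantile regression for one predictor, by brute force.

    Tries every line through two data points and keeps the one with the
    smallest total check loss. O(n^3), so only suitable for small samples.
    """
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical line cannot be written as y = a + b * x
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        loss = sum(pinball_loss(y - intercept - slope * x, tau)
                   for x, y in zip(xs, ys))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]

# Hypothetical heteroscedastic, log-normal data (income in thousands).
n = 120
incomes = [random.uniform(100, 1000) for _ in range(n)]
spend = []
for income in incomes:
    mean, sd = 10.0 + 0.05 * income, 0.03 * income  # assumed relationships
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    mu = math.log(mean) - sigma2 / 2.0
    spend.append(random.lognormvariate(mu, math.sqrt(sigma2)))

# One line per percentile of interest; no income groups are needed.
fits = {tau: fit_quantile_line(incomes, spend, tau) for tau in (0.1, 0.5, 0.9)}
for tau, (a, b) in fits.items():
    below = sum(y < a + b * x for x, y in zip(incomes, spend)) / n
    print(f"tau={tau}: spend ~ {a:.2f} + {b:.4f} * income "
          f"(share of points below the line: {below:.2f})")
```

Each fitted line leaves roughly a share tau of the observations below it, and the heteroscedasticity shows up as different slopes for the different percentiles. For real data a dedicated solver is the better choice, for example `rq` from the quantreg package with several values of `tau`.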
