Solved – Is it necessary to plot histogram of dependent variable before running simple linear regression

assumptions, linear model, regression, self-study

I was working on an assignment. The data set was really simple, consisting of only one independent variable $x$ and one dependent variable $y$. Someone suggested that I plot a histogram of $y$ before running the simple linear regression. He told me that plotting frequency on the y-axis and the value of $y$ on the x-axis, and then examining whether it looks like a normal distribution, would help, but I don't see why that is relevant.
This might be a newbie question, but any help would be great!

Update: From what I understand, simple linear regression assumes that the errors $e_i$ follow a normal distribution. Since $y_i=\beta_0 +\beta_1x_i +e_i$ by our assumption, each $y_i$ also follows a normal distribution. But I assume that is not what a histogram would tell us…

Best Answer

Is it necessary to plot histogram of dependent variable before running simple linear regression?

Necessary? The short (but incomplete) answer is no. You can fit a regression without it.

[There are circumstances where it may have some value (such as when you've already concluded that the error term isn't normal and are considering a transformation - since it's the $y$'s, not the $e$'s, that you apply the transformation to, it might occasionally be a factor in that choice, though not the most important one).]

Assessing assumptions generally needs to be done post fit (mostly because many of the diagnostic assessments of model assumptions are in one way or another based on residuals or other modelling outputs).

If you want an exploratory look at the data, I'd see more value in plotting $y$ vs $x$; you'll see more things that might cause problems there.
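As a minimal sketch of why the scatter is more informative (simulated data with a deliberately curved relationship; the plot itself would just be something like matplotlib's `plt.scatter(x, y)`):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + 0.3 * x**2 + rng.normal(0, 1, 50)  # curved on purpose

# A y-vs-x scatter would immediately show the curvature, which a
# histogram of y alone cannot reveal. A quick numeric version of the
# same check: residuals from a straight-line fit still carry structure
# that a quadratic fit removes.
resid_lin = y - np.polyval(np.polyfit(x, y, 1), x)
resid_quad = y - np.polyval(np.polyfit(x, y, 2), x)

# The quadratic fit explains far more of the variation.
print(resid_quad.var() < resid_lin.var())
```

This is just an illustration of the kind of problem (here, nonlinearity) that only shows up when you look at $y$ together with $x$.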

The data set was really simple, consisting of only one independent variable x and one dependent variable y. Someone suggested that I plot a histogram of y before running simple linear regression. He told me that plotting frequency on the y-axis and the value of y on the x-axis, and then examining whether it looks like a normal distribution, would help, but I don't see why that is relevant.

The dependent variable itself (without consideration of the $x$ value) doesn't directly relate to any regression assumption. In that sense, you're correct.

From what I understand, simple linear regression assumes that the $e_i$ follows normal distribution. Since $y_i=β_0+β_1x_i+e_i$ by our assumption, then $y_i$ also follows normal distribution.

Yes, it's $N(β_0+β_1x_i,\sigma^2)$.

But I assume that is not what a histogram would tell us...

Correct, since the different $y_i$'s don't all have the same $x$.

You assess the normality assumption by examining residuals.
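A minimal sketch of getting those residuals, using simulated data and numpy's least squares rather than any particular regression package:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 3.0 + 0.7 * x + rng.normal(0, 2, 100)  # simulated example

# Fit y = b0 + b1*x by ordinary least squares.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# These, not the raw y values, are what the normality assumption is about.
residuals = y - X @ beta
```

With an intercept in the model, the residuals average to (essentially) zero by construction; it's their distributional shape you then examine, e.g. with a histogram or a Q-Q plot.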

While you can do this with a histogram, it may not be the best choice.

Compare the two histograms here for example, which are of the same data, with a (slightly) different choice of binwidth. More extensive discussion (and an example where just changing the bin-origin is sufficient to radically change the impression) is here.

One way to avoid that sort of potential effect (such dramatic effects aren't very common, but less dramatic ones happen often) is to use more bins than typical software defaults.
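To see the mechanics of the bin-width issue (simulated data; `numpy.histogram` does the binning that a plotting call would draw):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(0, 1, 60)

# The same sample binned two different ways.
counts_coarse, edges_coarse = np.histogram(data, bins=6)
counts_fine, edges_fine = np.histogram(data, bins=25)

# Both histograms contain all 60 observations; only the shape you
# *see* changes with the binning choice.
print(counts_coarse.sum(), counts_fine.sum())
```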

A better choice is a Q-Q plot, which eliminates the arbitrary effects of the choice of bin width and bin origin.
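A normal Q-Q plot is just the sorted residuals plotted against theoretical normal quantiles, so there is nothing bin-related to choose. A sketch using only numpy and the standard library (the `resid` array stands in for actual regression residuals):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)
resid = np.sort(rng.normal(0, 1, 80))  # stand-in for regression residuals

# Theoretical normal quantiles at plotting positions (i + 0.5)/n.
n = len(resid)
theo = np.array([NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)])

# The Q-Q plot itself would be scatter(theo, resid). For roughly normal
# residuals the points track a straight line, so their correlation is
# close to 1.
r = np.corrcoef(theo, resid)[0, 1]
```

The correlation `r` here is essentially the statistic behind normal-probability-plot correlation tests; curvature or heavy tails in the residuals pull it away from 1.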


The normality assumption is often less critical to inference than many of the other regression assumptions, and as sample size increases its relative importance decreases for most forms of inference (in large samples, unless you have highly influential observations, the distribution might only matter much for prediction intervals).

The other assumptions, by contrast, are nearly always important.