Solved – Linear model fit seems off in R

r, regression

I'm posting this here because I think the question is stats-related. If it's not, I'm fine with it being moved over to Stack Overflow.

I have data in R and am trying to fit a linear model to it. Here's what the data look like (sorry it's not reproducible; there's just too much data to type out). The colored dots are shaded by the density of points. The solid black line is the best fit line returned by the linear model (via lm(y ~ x), which gives a slope of 0.67 and an intercept of 0.002), and the dashed line is what I would expect the best fit line to be (slope of 5, intercept of -3).
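Here's a minimal sketch of the kind of thing I'm running, with simulated stand-in data since the real data is too big to share:

```r
# Simulated stand-in for the real data (the actual values can't be shared)
set.seed(42)
x <- runif(5000)
y <- pmin(pmax(0.67 * x + 0.002 + rnorm(5000, sd = 0.15), 0), 1)

fit <- lm(y ~ x)
coef(fit)  # slope and intercept, analogous to the 0.67 / 0.002 I'm seeing

plot(x, y, pch = ".", col = "grey50")
abline(fit, lwd = 2)            # lm's best fit line (solid)
abline(a = -3, b = 5, lty = 2)  # the line I'd expect (dashed)
```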

Why is R's lm method giving me a line that doesn't look like it fits at all? Is that line really a better fit than the dashed line I propose?

[Figure: scatter plot of the data, with points colored by density, showing the lm fit (solid black line) and the proposed fit (dashed line).]

Best Answer

Your dashed line doesn't look like least squares (which minimizes the sum of the squared vertical distances) to me; it looks more like a line that attempts to minimize the orthogonal distances. To get an idea of where the least squares line should go, divide the range of the x's into vertical strips and find the average of y in each strip.

[Figure: illustration of vertical strips over a scatter plot, with the mean of y marked within each strip.]

If a straight line is appropriate, those averages should lie relatively close to a straight line... the regression line.
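In R that check is only a few lines; something like the following should work, with x and y standing in for your data:

```r
# Divide the range of x into vertical strips and average y within each
# (x and y stand in for your actual data)
breaks <- seq(min(x), max(x), length.out = 11)        # 10 equal-width strips
strips <- cut(x, breaks, include.lowest = TRUE)
mids   <- (head(breaks, -1) + tail(breaks, -1)) / 2   # strip midpoints
means  <- tapply(y, strips, mean)

plot(x, y, pch = ".", col = "grey50")
abline(lm(y ~ x), lwd = 2)                  # least squares line
points(mids, means, pch = 3, col = "red")   # strip means, drawn as "+"
```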

Below I have taken your plot and marked two such strips (delimited by two red lines for the strip on the left and two purple lines for the strip on the right):

[Figure: the question's scatter plot with two vertical strips marked, red lines for the left strip and purple lines for the right.]

I've also marked a rough (by eye) guess at where the mean y in each strip is, and indicated it with a "+" of the corresponding color.

As you can see, both lie close to the regression line and nowhere near your proposed line.

So the regression line R gave you looks just about exactly right to me.
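You can also check the two criteria numerically: the slope that minimizes orthogonal distances (total least squares) can be read off the first principal component and compared with lm's slope. A sketch, again with x and y standing in for your data:

```r
# Slope that minimizes vertical distances (ordinary least squares)
b_ols <- coef(lm(y ~ x))[2]

# Slope that minimizes orthogonal distances (total least squares),
# taken from the first principal component of the centered data
pc    <- prcomp(cbind(x, y))
b_tls <- pc$rotation[2, 1] / pc$rotation[1, 1]

c(ols = unname(b_ols), orthogonal = b_tls)
```

With scatter about the line, the orthogonal slope is typically steeper than the OLS slope, which is consistent with your dashed line being much steeper than lm's.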


Now, if your data are bounded between 0 and 1 ... why on earth would you fit a straight line? How can that be right?
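If the response really is bounded, one option (my suggestion, not something your output implies) is a model whose fitted values respect the bounds, for example a quasi-binomial GLM with a logit link:

```r
# One possible alternative for a response bounded in [0, 1]:
# a quasi-binomial GLM with a logit link keeps fitted values inside (0, 1)
fit_logit <- glm(y ~ x, family = quasibinomial(link = "logit"))
summary(fit_logit)

# Overlay the fitted curve on the existing scatter plot
xx <- seq(min(x), max(x), length.out = 200)
lines(xx, predict(fit_logit, newdata = data.frame(x = xx), type = "response"),
      col = "blue", lwd = 2)
```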
