Solved – Linear model fit seems off in R

r, regression

I'm posting this here because I think the question is stats-related. If it's not, I'm fine with it being moved over to Stack Overflow.

I have data in R and am trying to fit a linear model to it. Here's what the data look like (sorry it's not reproducible; there's just too much data to type out). The colored dots are shaded by the density of points. The solid black line is the best fit line returned by the linear model (via lm(y ~ x), which gives a slope of 0.67 and an intercept of 0.002), and the dashed line is what I would expect the best fit line to be (slope of 5, intercept of -3).
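Here's a minimal sketch of the kind of thing I'm running, with simulated stand-in data since the real data is too big to share:

```r
# Simulated stand-in for the real data (the actual values can't be shared)
set.seed(42)
x <- runif(5000)
y <- pmin(pmax(0.67 * x + 0.002 + rnorm(5000, sd = 0.15), 0), 1)

fit <- lm(y ~ x)
coef(fit)  # slope and intercept, analogous to the 0.67 / 0.002 I'm seeing

plot(x, y, pch = ".", col = "grey50")
abline(fit, lwd = 2)            # lm's best fit line (solid)
abline(a = -3, b = 5, lty = 2)  # the line I'd expect (dashed)
```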

Why is R's lm method giving me a line that doesn't look like it fits at all? Is that line really a better fit than the dashed line I propose?

[Figure: scatter plot of the data, with points colored by density, showing the lm fit (solid black line) and the proposed fit (dashed line).]

Best Answer

Your dashed line doesn't look like least squares (which minimizes the sum of the squared vertical distances) to me; it looks more like a line that attempts to minimize the orthogonal distances. To get an idea of where the least squares line should go, divide the range of the x's into vertical strips and find the average of y in each strip.

[Figure: illustration of vertical strips over a scatter plot, with the mean of y marked within each strip.]

If a straight line is appropriate, those averages should lie relatively close to a straight line... the regression line.
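In R that check is only a few lines; something like the following should work, with x and y standing in for your data:

```r
# Divide the range of x into vertical strips and average y within each
# (x and y stand in for your actual data)
breaks <- seq(min(x), max(x), length.out = 11)        # 10 equal-width strips
strips <- cut(x, breaks, include.lowest = TRUE)
mids   <- (head(breaks, -1) + tail(breaks, -1)) / 2   # strip midpoints
means  <- tapply(y, strips, mean)

plot(x, y, pch = ".", col = "grey50")
abline(lm(y ~ x), lwd = 2)                  # least squares line
points(mids, means, pch = 3, col = "red")   # strip means, drawn as "+"
```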

Below I have taken your plot and marked two such strips (delimited by two red lines for the strip on the left and two purple lines for the strip on the right):

[Figure: the question's scatter plot with two vertical strips marked, red lines for the left strip and purple lines for the right.]

I've also marked a rough (by eye) guess at where the mean y in each strip is, and indicated it with a "+" of the corresponding color.

As you can see, both lie close to the regression line and nowhere near your proposed line.

So the regression line R gave you looks just about exactly right to me.
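You can also check the two criteria numerically: the slope that minimizes orthogonal distances (total least squares) can be read off the first principal component and compared with lm's slope. A sketch, again with x and y standing in for your data:

```r
# Slope that minimizes vertical distances (ordinary least squares)
b_ols <- coef(lm(y ~ x))[2]

# Slope that minimizes orthogonal distances (total least squares),
# taken from the first principal component of the centered data
pc    <- prcomp(cbind(x, y))
b_tls <- pc$rotation[2, 1] / pc$rotation[1, 1]

c(ols = unname(b_ols), orthogonal = b_tls)
```

With scatter about the line, the orthogonal slope is typically steeper than the OLS slope, which is consistent with your dashed line being much steeper than lm's.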


Now, if your data are bounded between 0 and 1 ... why on earth would you fit a straight line? How can that be right?
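If the response really is bounded, one option (my suggestion, not something your output implies) is a model whose fitted values respect the bounds, for example a quasi-binomial GLM with a logit link:

```r
# One possible alternative for a response bounded in [0, 1]:
# a quasi-binomial GLM with a logit link keeps fitted values inside (0, 1)
fit_logit <- glm(y ~ x, family = quasibinomial(link = "logit"))
summary(fit_logit)

# Overlay the fitted curve on the existing scatter plot
xx <- seq(min(x), max(x), length.out = 200)
lines(xx, predict(fit_logit, newdata = data.frame(x = xx), type = "response"),
      col = "blue", lwd = 2)
```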
