Calculate uncertainty of the slope when dependent variable in a linear regression has substantial error

error-propagation, regression, regression coefficients

I have a dataset in which the dependent variable (y) has known and substantial error, and yet the observations happen to line up quite well along a line when plotted against the independent variable (x). Fitting a linear regression seems to substantially overestimate the precision of the slope estimate for y vs. x.

How can one appropriately propagate the known error in y through to the estimate of the slope?

I think there is a partial answer here, but it assumes the points fit a linear regression exactly: Calculate uncertainty of linear regression slope based on data uncertainty

As a reproducible example in R:

# the data
set.seed(5)
dat <- data.frame(x = 0:8, y = seq(0,16, length.out=9)+rnorm(9, 0, 0.5), y.se = 3)

# fit a naive model, not considering error in y
mod <- lm(y ~ x, dat)
summary(mod)
preds <- predict(mod, se.fit = TRUE)

plot(dat$x, dat$y, ylim=c(-7,22))
arrows(dat$x, dat$y-1.96*dat$y.se, dat$x, dat$y+1.96*dat$y.se, length=0)

# shade the fitted line +/- 1 standard error (narrower than a 95% confidence interval)
polygon(c(dat$x, rev(dat$x)), c(preds$fit+preds$se.fit, rev(preds$fit-preds$se.fit)), col = 'grey')

The slope is estimated very precisely near 2.0:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.04670    0.32783   0.142    0.891    
x            1.97546    0.06886  28.689 1.61e-08 ***

Visually, however, a slope as low as 0.7 or as high as 3.3 would still fit through the error bounds of y quite well.

Best Answer

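This can be handled with structural equation modelling (SEM): treat the true value of y as a latent variable that is predicted by x, and treat the observed y as its single indicator with a fixed, known error variance.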

library(lavaan)

code = '
yhat ~ x     # Latent variable predicted by x
yhat =~ 1*y  # y is the single indicator of yhat
y ~~ 9*y     # y has error variance of 9 (3^2)
'

fit = lavaan(code, dat)
summary(fit)

lavaan 0.6-7 ended normally after 3 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of free parameters                          1
                                                      
  Number of observations                             9
                                                      
Model Test User Model:
                                                      
  Test statistic                                24.572
  Degrees of freedom                                 1
  P-value (Chi-square)                           0.000

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)
  yhat =~                                             
    y                 1.000                           

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  yhat ~                                              
    x                 1.975    0.387    5.101    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .y                 9.000                           
   .yhat              0.000

A path diagram of the fitted model can also be useful:

library(semPlot)
semPaths(fit, what = 'path', whatLabels = 'est', layout = 'circle2')
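As a cross-check on the SEM standard error, here is a minimal sketch that assumes the stated y.se values are the true, known standard deviations of independent errors in y. Under that assumption the slope's standard error follows from weighted least squares with fixed weights w = 1/y.se^2, and it does not use the residual scatter at all. The names w, xbar_w, and slope_se_known are introduced here for illustration only.

```r
# Same data as in the question
set.seed(5)
dat <- data.frame(x = 0:8,
                  y = seq(0, 16, length.out = 9) + rnorm(9, 0, 0.5),
                  y.se = 3)

# Point estimate: with equal known SEs, the weighted fit equals the OLS fit
mod <- lm(y ~ x, dat)
slope <- unname(coef(mod)["x"])

# Assumption: y.se is the known SD of each y, so weights are 1 / variance
w <- 1 / dat$y.se^2
xbar_w <- weighted.mean(dat$x, w)

# Known-variance slope SE: 1 / sqrt(sum(w * (x - xbar_w)^2))
slope_se_known <- 1 / sqrt(sum(w * (dat$x - xbar_w)^2))

# Slope, its SE, and an approximate 95% interval
c(slope = slope, se = slope_se_known)
slope + c(-1.96, 1.96) * slope_se_known
```

With x = 0:8 and y.se = 3 the standard error is 3/sqrt(60), about 0.387, which agrees with the Std.Err that lavaan reports. Note that summary(lm(y ~ x, dat, weights = w)) would not give this number, because lm still estimates the residual scale from the data.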

Related Solutions

Solved – Distance to a regression line, and degrees of freedom

There is a well established theory of prediction intervals in the context of linear regression. New values at $x=x_0$ have a normal distribution with mean $\alpha+\beta x_0$ (not surprisingly) and variance $\sigma^2\left(1+\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum{(x_i-\bar{x})^2}}\right)$.

After plugging in the estimated versions of the parameters, the standardized distribution will be a $t$ distribution with $n-2$ degrees of freedom. That's because the estimate of $\sigma^2$ has that many degrees of freedom, and the df of the chi-squared term in the denominator drives the degrees of freedom.

Intuitively, you can think that you are not using the new data point for estimating anything, so you are not gaining any degrees of freedom.
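The variance formula and the $n-2$ degrees of freedom can be checked numerically. The sketch below uses simulated data (the values of x, y, and x0 are made up for illustration) and compares a hand-computed 95% prediction interval with the one returned by predict().

```r
# Simulated data, for illustration only
set.seed(1)
x <- 1:10
y <- 1 + 2 * x + rnorm(10)
fit <- lm(y ~ x)

x0 <- 5.5                      # hypothetical new x value
n <- length(x)
s <- summary(fit)$sigma        # estimate of sigma, on n - 2 df

# Standard error of a new observation at x0, from the variance formula
se_pred <- s * sqrt(1 + 1 / n + (x0 - mean(x))^2 / sum((x - mean(x))^2))

# Manual interval using a t distribution with n - 2 degrees of freedom
yhat0 <- sum(coef(fit) * c(1, x0))
manual <- yhat0 + c(-1, 1) * qt(0.975, df = n - 2) * se_pred

# Built-in prediction interval
builtin <- predict(fit, newdata = data.frame(x = x0),
                   interval = "prediction", level = 0.95)

all.equal(manual, unname(builtin[1, c("lwr", "upr")]))
```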

Solved – Standard error of slopes in piecewise linear regression with known breakpoints

1. How to easily calculate the intercept and slope of each segment?

The slope of each segment is the sum of the slope coefficients up to and including that segment. So the slope estimate at $x=15$ is $-1.1003 + 1.3760 = 0.2757\,$.

The intercept is a little harder, but it's a linear combination of coefficients (involving the knots).

In your example, the second line meets the first at $x=9.6$, so the red point is on the first line at $21.7057 -1.1003 \times 9.6 = 11.1428$. Since the second line passes through the point $(9.6, 11.1428)$ with slope $0.2757$, its intercept is $11.1428 - 0.2757 \times 9.6 = 8.496$. Of course, you can put those steps together, and the intercept for the second segment simplifies to $\beta_0 - \beta_2 k_1 = 21.7057 - 1.3760 \times 9.6 = 8.496$.

Can the model be reparameterized to do this in one calculation?

Well, yes, but it's probably easier in general to just compute it from the model.
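The arithmetic above can be reproduced in a few lines of R. The coefficient values and the knot are the ones quoted in this answer; the variable names (b0, b1, b2, k1, and so on) are chosen here for illustration.

```r
# Coefficients quoted in the answer: intercept, first slope, change in slope
b0 <- 21.7057
b1 <- -1.1003
b2 <- 1.3760
k1 <- 9.6          # knot (breakpoint)

# Slope of the second segment
slope2 <- b1 + b2

# Step-by-step intercept: value on the first line at the knot,
# then back out the intercept of the line with slope2 through that point
y_knot <- b0 + b1 * k1
int2_steps <- y_knot - slope2 * k1

# Simplified form: beta_0 - beta_2 * k_1
int2_direct <- b0 - b2 * k1

round(c(slope2 = slope2, y_knot = y_knot,
        int2_steps = int2_steps, int2_direct = int2_direct), 4)
```

Both routes give the same intercept, because the $\beta_1 k_1$ terms cancel.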

2. How to calculate the standard error of each slope of each segment?

Since the estimate is a linear combination of regression coefficients $a^\top\hat\beta$, where $a$ consists of 1s and 0s, the variance is $a^\top\text{Var}(\hat\beta)a$. The standard error is the square root of that sum of variance and covariance terms.

e.g. in your example, the standard error of the slope of the second segment is:

Sb <- vcov(mod)[2:3,2:3]
sqrt(sum(Sb))

alternatively in matrix form:

Sb <- vcov(mod)
a <- matrix(c(0, 1, 1), nrow = 3)  # picks out the two slope coefficients
sqrt(t(a) %*% Sb %*% a)
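Since the original data are not shown here, the following self-contained sketch uses simulated data (made up for illustration, with a knot at 9.6) to confirm that the two calculations agree. The object names x, y, se2, and se2_sum are assumptions of this sketch.

```r
# Simulated piecewise-linear data with a known breakpoint at 9.6
set.seed(42)
x <- seq(0, 20, length.out = 40)
y <- 21.7 - 1.1 * x + 1.4 * pmax(x - 9.6, 0) + rnorm(40, sd = 1)

mod <- lm(y ~ x + I(pmax(x - 9.6, 0)))
Sb <- vcov(mod)

# Second-segment slope = first slope + change in slope
a <- c(0, 1, 1)
slope2 <- sum(a * coef(mod))

# Standard error via the quadratic form a' Var(beta-hat) a
se2 <- sqrt(drop(t(a) %*% Sb %*% a))

# Same thing as the sum of the 2 x 2 variance-covariance block
se2_sum <- sqrt(sum(Sb[2:3, 2:3]))

c(slope2 = slope2, se = se2)
all.equal(se2, se2_sum)
```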

3. How to test whether two adjacent segments have the same slope (i.e. whether the breakpoint can be omitted)?

This is tested by the coefficient of that segment's hinge term in the coefficient table. See this line:

I(pmax(x - 9.6, 0))   1.3760     0.2688   5.120 8.54e-05 ***

That's the change in slope at 9.6. If that change is different from 0, the two slopes aren't the same. So the p-value for a test that the second segment has the same slope as the first is right at the end of that line.
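As a hedged illustration on simulated data (made up here, not the data from the question), the t-test on the hinge coefficient is equivalent to an F-test comparing the model with and without the breakpoint. The names full, reduced, hinge, and cmp are introduced for this sketch.

```r
# Simulated piecewise-linear data with a known breakpoint at 9.6
set.seed(42)
x <- seq(0, 20, length.out = 40)
y <- 21.7 - 1.1 * x + 1.4 * pmax(x - 9.6, 0) + rnorm(40, sd = 1)

full <- lm(y ~ x + I(pmax(x - 9.6, 0)))   # with breakpoint
reduced <- lm(y ~ x)                      # single straight line

# Row of the coefficient table for the change in slope
hinge <- summary(full)$coefficients[3, ]

# Nested-model comparison
cmp <- anova(reduced, full)

# For a single added term, t^2 equals F and the p-values coincide
t_sq <- unname(hinge["t value"])^2
f_stat <- cmp$F[2]
p_t <- unname(hinge["Pr(>|t|)"])
p_f <- cmp$`Pr(>F)`[2]

all.equal(t_sq, f_stat)
all.equal(p_t, p_f)
```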
