Solved – Error in a linear regression

error, r, regression, standard error

I have a set of points, each with its own error value, and I would like to fit a linear regression model to them and find the gradient of the regression line. How do I calculate the standard error of the gradient? An answer in R would be particularly helpful.

Edit:
I feel like I am misusing these terms horribly, so here is what my data looks like:

[figure: the data with an lm fitted line in R]

The gradient of the line corresponds to a physical quantity; I need to find the value and standard error of this quantity.

Edit 2: I feel that I should point out that in this case, all the data values have the same precision. I am interested in both this and the general case though.

Best Answer

One approach that comes to mind, looking at your data, is to conduct a simulation study. You have five mean values $\bar y_1,...,\bar y_5$ and five corresponding standard deviations $s_1,...,s_5$. It seems from the plot you provided that the standard errors of the groups do not differ much from each other (so the sample sizes and standard deviations are probably similar across groups), but I assume the more general case in which they differ.

Knowing all this, you can simulate different regression slopes by sampling, $r=1,...,R$ times, new values $y_1^{(r)},..., y_5^{(r)}$ for the groups from normal distributions (other choices are also possible, e.g. a $t$-distribution)

$$ y_i^{(r)} \sim \mathrm{Normal}(\bar y_i, s_i) $$

and then estimating regression using those values

$$ y_i^{(r)} = \beta_0^{(r)} + \beta_1^{(r)} x_i + \varepsilon_i^{(r)} $$

Using the $\beta_0^{(r)}$ and $\beta_1^{(r)}$ values from each simulation repetition, you can compute the average slope and intercept, and compute confidence intervals for those values and around the regression line by applying the same methods you would use with bootstrap results (e.g. quantiles). The standard deviation of the simulated slopes then serves as an estimate of the standard error of the slope.

Below you can see R code for such a simulation.

# generating example data

set.seed(123)

N <- 5
n <- sample(20, N, replace = TRUE)
s <- runif(N, 0, 3)
m <- runif(N, 0, 5)
X <- rnorm(sum(n), rep(m, n), rep(s, n))
x <- tapply(X, rep(1:N, n), mean)
y <- -6.2 * x + 2 + rnorm(N)

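# simulation

# one repetition: draw new y values around the observed means,
# refit the regression, and return coefficients, fitted values and draws
f <- function() {
  ysamp <- rnorm(N, y, s)
  fit <- lm(ysamp ~ x)
  out <- c(coef(fit),
           fitted(fit),
           ysamp)
  names(out)[-c(1:2)] <- paste0(rep(c("yhat", "ysim"), each = N), 1:N)
  out
}

# each column of sim holds the results of one repetition
sim <- replicate(1e3, f())

# average intercept and slope (named coef.mean so it does not mask coef())
coef.mean <- rowMeans(sim[1:2, ])
# 95% quantile interval around the fitted line at each x
quant.ci <- apply(sim[3:(2+N), ], 1, quantile, c(.025, .975))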
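The question asks specifically for the standard error of the gradient, and the same simulation idea can supply it: the standard deviation of the simulated slopes acts as a simulation-based standard error. The following is a minimal sketch, assuming made-up values for `x`, the group means `y` and their standard deviations `s` (illustrative numbers, not the asker's data); `sim_slope` is a hypothetical helper name.

```r
set.seed(1)

# hypothetical data: five x values, group means and their standard deviations
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
s <- c(0.3, 0.4, 0.3, 0.5, 0.4)

# one repetition: resample y around the means, refit, keep only the slope
sim_slope <- function() {
  ysamp <- rnorm(length(y), mean = y, sd = s)
  unname(coef(lm(ysamp ~ x))[2])
}

slopes <- replicate(2000, sim_slope())

slope_est <- mean(slopes)                      # average simulated slope
slope_se  <- sd(slopes)                        # simulation-based standard error
slope_ci  <- quantile(slopes, c(.025, .975))   # 95% percentile interval
```

When every point has the same precision, all entries of `s` are equal and nothing else in the sketch changes.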

Related Solutions

Solved – Obtaining standard error on a data point obtained from linear regression

What you want is either the confidence interval for a predicted mean or the prediction interval for an individual point. Both formulas can be found in any standard regression textbook and in many places on the web.

Deriving the pieces that you need for those formulas by hand, though, is probably more work than it is worth. Gnuplot is a fine plotting program, but it is not a full statistics package. A statistics package will give you the predictions in a fairly straightforward way, and R costs the same as gnuplot (both are free). In R you can fit your regression using the lm (linear model) function, then use the predict function to get either confidence or prediction intervals. You could generate the intervals for a whole sequence of values, then either plot them directly in R, or transfer the predictions back to gnuplot and add them to your plot there.
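As a rough illustration of the lm and predict workflow described here, the sketch below uses hypothetical data; the variable names (`conf_int`, `pred_int`, `out`) are placeholders, not anything from the original question.

```r
# hypothetical data, for illustration only
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1)
fit <- lm(y ~ x)

# points at which predictions are wanted
new <- data.frame(x = seq(1, 8, by = 0.5))

# interval for the mean response at each x
conf_int <- predict(fit, newdata = new, interval = "confidence", level = 0.95)
# interval for a single new observation at each x (wider than the above)
pred_int <- predict(fit, newdata = new, interval = "prediction", level = 0.95)

# standard error of the fitted mean at each x
se_fit <- predict(fit, newdata = new, se.fit = TRUE)$se.fit

# one table with columns x, fit, lwr, upr
out <- cbind(new, conf_int)
```

A table like `out` could be written to a text file with `write.table()` and plotted from gnuplot.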

Solved – How to interpret model diagnostics when doing linear regression in R

This is a long and rambling question, so you are getting a long and rambling answer. Apologies. Using the example from the ?lm help page,

ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2,10,20, labels=c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
summary(lm.D9)
#output#
Call:
lm(formula = weight ~ group)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0710 -0.4938  0.0685  0.2462  1.3690 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.0320     0.2202  22.850 9.55e-15 ***
groupTrt     -0.3710     0.3114  -1.191    0.249    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.6964 on 18 degrees of freedom
Multiple R-squared: 0.07308,    Adjusted R-squared: 0.02158 
F-statistic: 1.419 on 1 and 18 DF,  p-value: 0.249

I don't entirely understand your confusion about the "coefficients." The table simply presents the OLS estimate of $\beta$; the standard error of the estimate, $SE(\beta)$; the $t$ value, which is the estimate divided by its standard error, i.e. the "distance" of the estimate from 0 measured in standard errors; and the $p$-value, the probability of observing an estimate at least that far from 0 if the true $\beta$ were 0, taken from a $t$ distribution with the residual degrees of freedom. Forgive me for the basic statistics review; I can't tell if this is what you are asking for.
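To make the columns concrete, the sketch below recomputes the last two columns of the table from the first two. In R's summary table the t value is the estimate divided by its standard error, and the p-value is the two-sided tail probability of a t distribution with the residual degrees of freedom. The names `tab`, `t_val` and `p_val` are introduced here for illustration.

```r
# same data as the ?lm example
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)

tab <- coef(summary(lm.D9))     # the coefficient table as a matrix
est <- tab[, "Estimate"]
se  <- tab[, "Std. Error"]

# estimate measured in units of its standard error
t_val <- est / se
# two-sided p-value from the t distribution with residual degrees of freedom
p_val <- 2 * pt(abs(t_val), df = df.residual(lm.D9), lower.tail = FALSE)

cbind(t_val, p_val)             # compare with the "t value" and "Pr(>|t|)" columns
```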

Proper OLS-estimated regression modeling (which is what the lm command runs) requires several assumptions, and these diagnostic plots are designed to test them.

The "Residuals vs Fitted" and "Scale-Location" charts are essentially the same, and show whether there is a trend in the residuals. OLS models require that the residuals be "independently and identically distributed," meaning that their distribution does not change substantially for different values of $x$. None of your charts is really satisfactory in this regard. If this assumption is not met, your $\beta$ estimates will still be good, but your $t$-statistics, and corresponding $p$-values, are garbage.

Another assumption is that the errors are approximately normally distributed, which is what the Q-Q plot allows you to see. Again, none of your plots really satisfies me in this regard. The consequences of this assumption not being met are the same as above ($\beta$'s good, $t$'s worthless).
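For reference, the three plots discussed so far can be reproduced for the ?lm example with `plot()` on the fitted model. This is a small sketch using the `which` argument of `plot.lm` to select them; `res_std` is a name introduced here.

```r
# same data as the ?lm example
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)

# which = 1: Residuals vs Fitted, 2: normal Q-Q, 3: Scale-Location
op <- par(mfrow = c(1, 3))
plot(lm.D9, which = 1:3)
par(op)

# standardized residuals, the quantity shown in the Q-Q and Scale-Location plots
res_std <- rstandard(lm.D9)
```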

The "outliers" principle is actually not an assumption of OLS regression. But if you have outliers in certain locations, your $\beta$ parameters will be unduly influenced by them. In this case, both your $\beta$ and $t$ measurements are useless. You can remove an influential observation from a data frame by identifying its row number and issuing the command

data <- data[-offending.row,]

Where offending.row is the number of the row you want to eliminate. The R diagnostic plots label the row numbers of potential outliers.
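A sketch of this removal step on hypothetical data with one planted outlier follows. Using the largest Cook's distance to pick the row is an illustrative choice made here, not a rule from the answer.

```r
# hypothetical data with one planted outlier in the last row
dat <- data.frame(x = 1:10,
                  y = c(1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.0, 8.1, 9.0, 20.0))
fit_all <- lm(y ~ x, data = dat)

# Cook's distance is one common influence measure; take the largest value
offending.row <- unname(which.max(cooks.distance(fit_all)))

# drop that row and refit
dat_clean <- dat[-offending.row, ]
fit_clean <- lm(y ~ x, data = dat_clean)

# slope and its standard error with and without the observation
rbind(all   = coef(summary(fit_all))["x", c("Estimate", "Std. Error")],
      clean = coef(summary(fit_clean))["x", c("Estimate", "Std. Error")])
```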

I don't know what kind of data you have, but you should be very careful about eliminating observations that you don't like. You should instead ask yourself how that observation became this way. If it is due to measurement error, by all means discard it. If not, then is this observation a part of the system you are trying to model? If so, you should keep it in and adapt for it in other ways.

I have two suggestions for your analysis. First, try to use GLS estimators. This method assigns weights to your observations to correct for heteroskedasticity, outliers, and some degree of non-normality. The R command for this is gls(), from the nlme package.
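A minimal sketch of a gls() call on the ?lm example is shown below. gls() lives in the nlme package, which ships with standard R installations as a recommended package. The variance structure used here (a separate residual variance per group via `varIdent`) is an assumption chosen only for illustration; the right structure depends on your data.

```r
library(nlme)

# same data as the ?lm example, collected in a data frame
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
d <- data.frame(weight = c(ctl, trt),
                group  = gl(2, 10, 20, labels = c("Ctl", "Trt")))

fit_ols <- lm(weight ~ group, data = d)

# GLS allowing a different residual variance in each group (assumed structure)
fit_gls <- gls(weight ~ group, data = d,
               weights = varIdent(form = ~ 1 | group))

# coefficient table: estimates, standard errors, t and p values
summary(fit_gls)$tTable
```

In this particular model the coefficients are just group means, so the GLS and OLS estimates coincide and only the standard errors change.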

But it seems from your plots that your data are restricted in some ways. In particular Test-P seems like a variable that is either 1 or 0, or restricted to that range. For such a variable, you may want to look at binary logit or probit models, available with the command

glm(model, family=binomial(link="logit"))
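As a hedged illustration of that call, the sketch below fits a logit model to simulated 0/1 data. The predictor `x`, the outcome `yb` and the true coefficients are all invented for the example.

```r
set.seed(42)

# simulated data: hypothetical predictor x and a 0/1 outcome yb
n  <- 200
x  <- rnorm(n)
p  <- plogis(-0.5 + 1.5 * x)          # true success probabilities
yb <- rbinom(n, size = 1, prob = p)

# logit model; use link = "probit" for a probit model instead
fit_logit <- glm(yb ~ x, family = binomial(link = "logit"))

# estimates with standard errors, on the log-odds scale
coef(summary(fit_logit))

# fitted probabilities, always between 0 and 1
p_hat <- predict(fit_logit, type = "response")
```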

If your data are censored at 0 but not on the upper end, a tobit model is what you want. tobit() from the AER package looks like the right thing (I've never run a tobit model; I have only looked at it theoretically).

Finally, predictions are done with the predict() function. If you want to perturb your data afterwards (to create a distribution of possible predictions), the best way I know of is to add a random number to the prediction. Using the example above,

#base prediction
pred.values <- predict(lm.D9)
# get standard error of residuals (sigma is already a standard deviation, so do not square it)
SER <- summary(lm.D9)$sigma
#perturbations
pert <- rnorm(length(pred.values), mean=0, sd=SER)
SIMULATION.VALUES <- pred.values + pert

You can get multiple alternate simulations by repeating the last two steps. Good luck.
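Repeating those last two steps can be wrapped in `replicate()`. In this sketch, `sigma_hat` and `sims` are names introduced for illustration, and the noise is assumed normal with the residual standard error as its standard deviation.

```r
set.seed(1)

# same data as the ?lm example
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)

pred.values <- predict(lm.D9)
sigma_hat <- summary(lm.D9)$sigma   # residual standard error (a standard deviation)

# each column is one simulated data set: fitted values plus normal noise
sims <- replicate(1000,
                  pred.values + rnorm(length(pred.values), mean = 0, sd = sigma_hat))

dim(sims)   # one row per observation, one column per simulation
```

This perturbs only with residual noise; it does not account for the uncertainty in the estimated coefficients themselves.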
