Solved – I have a line of best fit. I need data points that will not change the line of best fit

excelleast squaresrregression

I'm giving a presentation about fitting lines. I have a simple linear function, $y=1x+b$. I'm trying to get scattered data points that I can put in a scatter plot that will keep my line of best fit the same equation.

I'd love to learn this technique in either R or Excel–whichever is easier.

Best Answer

Pick any $(x_i)$ provided at least two of them differ. Set an intercept $\beta_0$ and slope $\beta_1$ and define

$$y_{0i} = \beta_0 + \beta_1 x_i.$$

This fit is perfect. Without changing the fit, you can modify $y_0$ to $y = y_0 + \varepsilon$ by adding any error vector $\varepsilon=(\varepsilon_i)$ to it provided it is orthogonal both to the vector $x = (x_i)$ and the constant vector $(1,1,\ldots, 1)$. An easy way to obtain such an error is to pick any vector $e$ and let $\varepsilon$ be the residuals upon regressing $e$ against $x$. In the code below, $e$ is generated as a set of independent random normal values with mean $0$ and common standard deviation.

Furthermore, you can even preselect the amount of scatter, perhaps by stipulating what $R^2$ should be. Letting $\tau^2 = \text{var}(y_i) = \beta_1^2 \text{var}(x_i)$, rescale those residuals to have a variance of

$$\sigma^2 = \tau^2\left(1/R^2 - 1\right).$$

This method is fully general: all possible examples (for a given set of $x_i$) can be created in this way.

Examples

Anscombe's Quartet

We can easily reproduce Anscombe's Quartet of four qualitatively distinct bivariate datasets having the same descriptive statistics (through second order).

The code is remarkably simple and flexible.

set.seed(17)
rho <- 0.816                                             # Common correlation coefficient
x.0 <- 4:14
peak <- 10
n <- length(x.0)

# -- Describe a collection of datasets.
x <- list(x.0, x.0, x.0, c(rep(8, n-1), 19))             # x-values
e <- list(rnorm(n), -(x.0-peak)^2, 1:n==peak, rnorm(n))  # residual patterns
f <- function(x) 3 + x/2                                 # Common regression line

par(mfrow=c(2,2))
xlim <- range(as.vector(x))
ylim <- f(xlim + c(-2,2))
s <- sapply(1:4, function(i) {
  # -- Create data.
  y <- f(x[[i]])                                         # Model values
  sigma <- sqrt(var(y) * (1 / rho^2 - 1))                # Conditional S.D.
  y <- y + sigma * scale(residuals(lm(e[[i]] ~ x[[i]]))) # Observed values

  # -- Plot them and their OLS fit.
  plot(x[[i]], y, xlim=xlim, ylim=ylim, pch=16, col="Orange", xlab="x")
  abline(lm(y ~ x[[i]]), col="Blue")

  # -- Return some regression statistics.
  c(mean(x[[i]]), var(x[[i]]), mean(y), var(y), cor(x[[i]], y), coef(lm(y ~ x[[i]])))
})
# -- Tabulate the regression statistics from all the datasets.
rownames(s) <- c("Mean x", "Var x", "Mean y", "Var y", "Cor(x,y)", "Intercept", "Slope")
t(s)

The output gives the second-order descriptive statistics for the $(x,y)$ data for each dataset. All four lines are identical. You can easily create more examples by altering x (the x-coordinates) and e (the error patterns) at the outset.

Simulations

This R function generates vectors $y$ according to the specifications of $\beta=(\beta_0,\beta_1)$ and $R^2$ (with $0 \le R^2 \le 1$), given a set of $x$ values.

simulate <- function(x, beta, r.2) {
  sigma <- sqrt(var(x) * beta[2]^2 * (1/r.2 - 1))
  e <- residuals(lm(rnorm(length(x)) ~ x))
  return (y.0 <- beta[1] + beta[2]*x + sigma * scale(e))
}

(It wouldn't be difficult to port this to Excel--but it's a little painful.)

As an example of its use, here are four simulations of $(x,y)$ data using a common set of $60$ $x$ values, $\beta=(1,-1/2)$ (i.e., intercept $1$ and slope $-1/2$), and $R^2 = 0.5$.

n <- 60
beta <- c(1,-1/2)
r.2 <- 0.5   # Between 0 and 1

set.seed(17)
x <- rnorm(n)

par(mfrow=c(1,4))
invisible(replicate(4, {
  y <- simulate(x, beta, r.2)
  fit <- lm(y ~ x)
  plot(x, y)
  abline(fit, lwd=2, col="Red")
}))

By executing summary(fit) you can check that the estimated coefficients are exactly as specified and the multiple $R^2$ is the intended value. Other statistics, such as the regression p-value, can be adjusted by modifying the values of the $x_i$.

Related Solutions

Exponential Regression – How to Perform Exponential Regression with Multiple Variables in R

As a start:

f <- function(x1,x2,a,b1,b2) {a * (b1^x1) * (b2^x2) }

# generate some data
x1 <- 1:10
x2 <- c(2,3,5,4,6,7,8,10,9,11)
set.seed(44)
y <- 2*exp(x1/4) + rnorm(10)*2
dat <- data.frame(x1,x2, y)

# fit a nonlinear model
fm <- nls(y ~ f(x1,x2,a,b1,b2), data = dat, start = c(a=1, b1=1,b2=1))

# get estimates of a, b
co <- coef(fm)

Solved – How to calculate the confidence interval of a function of a combination of two linear models

(All credit goes to Robert Shriver, I'm just posting this for him!) This is not an analytic solution, but I think that out of all the iterative solutions this ones' the nicest:

Take the mean regression parameters (intercept and slope) for both regressions (f1 and f2) along with the variance-covariance matrix of the intercept and slope parameters for each regression and plug them into a Multivariate normal distribution to simulate some new intercepts and slopes.

nn <- 100000
nx <- 100

bootf1 <- mvrnorm(nn, f1$coefficients, vcov(f1))
bootf2 <- mvrnorm(nn, f2$coefficients, vcov(f2))

This will be 10,000 random draws of the slope and intercept parameters for each regression. Use each one of these to determine the mean value of z across the range of x values you are interest in.

f1a <- bootf1[,2]
f1b <- bootf1[,1]
f2a <- bootf2[,2]
f2b <- bootf2[,1]

f1a <- rep(f1a, 1, nx)
f1b <- rep(f1b, 1, nx)
f2a <- rep(f2a, 1, nx)
f2b <- rep(f2b, 1, nx)

x  <- seq(-100, 1100, length=nx)
xs <- matrix(rep(x,nn), ncol=nx)

f1y <- f1b+f1a*xs
f2y <- f2b+f2a*xs

z = 1 / exp( 0.02254*f1y^2 / f2y^2 )

Then you can find the lower and upper CI of z as well as its mean by finding the .025 and .975 quantile of z at each x value.

setquantile <- function(x){
  r <- quantile(x, probs=c(0.025, 0.975))
  return(r)
}

ci <- apply(z, 2, setquantile)
mu <- apply(z, 2, mean)

plot(x, mu, type="l")
lines(x, ci[1,])
lines(x, ci[2,])

That's it. Please let me know if you find something better. Hope this helped someone!