Solved – Anchoring a linear regression to a specific data point in R

change point, lm, r

This is a different, follow-up question to someone else's question here:

I am very new to R and have no programming experience whatsoever, but I am holding my own for the data analysis we need in my lab. I am detecting changepoints in physiological data using cpt.mean(), and until now I have been doing two separate linear regressions with lm(): one for the first half of my data and another for the second half. I decide where to divide the data based on the changepoint cpt.mean() reports. The first line normally has a slope around 0 and the other can be anywhere from 0.2 to 2. I would then add those lines separately to my graph and clip them so that it appeared I had only one line graphed, but that the line had a "corner." At first this worked really well, but the point where the two lines joined was often far from the changepoint found with cpt.mean(), because what I graphed was determined by where the two equations mathematically intersected.

So I thought it would be better to find regression lines that are anchored to the changepoint. That way, when I graph the two separate lines, they connect at the changepoint that cpt.mean() found. I have read and understand the basic solution to the question linked above, but I can't seem to figure out how to limit the regression so that it only fits half of the data.

In a made-up scenario, my changepoint could be my 26th data point, (10, 20).

Normally I do something like this for the first half of the data (PP is my X variable and MAP is my Y variable):

lm(MAP[1:26]~PP[1:26])

The simple solution to anchoring the best-fit line to a specific point, (x0, y0), was this:

lm(I(y-y0)~I(x-x0) + 0)

with my variables and changepoint:

lm(I(MAP-20)~I(PP-10) + 0)

I figured I could just combine the two as follows:

lm(I(MAP[1:26] - 20) ~ I(PP[1:26] - 10) + 0)

No error came up, but nothing showed up on my scatter plot. Not sure what's happening.
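
One possible reason nothing appears is that the anchored fit returns its single slope in the shifted coordinates (MAP - 20 versus PP - 10), so the line has to be translated back to the original scale before it can be drawn. A minimal sketch of that translation, assuming the changepoint (10, 20):

fit.anchored <- lm(I(MAP[1:26] - 20) ~ I(PP[1:26] - 10) + 0)  # sketch, not from the original post
b <- coef(fit.anchored)[1]        # slope of the line forced through (10, 20)
abline(a = 20 - 10 * b, b = b)    # the same line in the original (PP, MAP) coordinates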

Here is an example of what my graphs normally look like:

[example graph]

Best,
Ian

Best Answer

Your model implicitly is this. You have data $(x_1,y_1), \ldots, (x_m,y_m)$ and other data $(x_{m+1},y_{m+1}), \ldots, (x_n,y_n)$ for which $x_j \ge x_i$ whenever $m \lt j \le n$ and $1 \le i \le m$: the first data set is the "left portion" and the second is the "right portion". You wish to estimate coefficients $\alpha, \alpha^\prime, \beta, \beta^\prime$ such that the "best fit" (presumably least squares) to the left half is

$$y = \alpha + \beta x,$$

the best fit to the right half is

$$y = \alpha^\prime + \beta^\prime x,$$

and the graphs of these two lines meet at some point $x_0$ between $\max_{1\ldots m} x_i$ and $\min_{m+1\ldots n} x_j$.
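
Equivalently, the meeting condition ties the four coefficients together: $$\alpha + \beta x_0 = \alpha^\prime + \beta^\prime x_0,$$ whereas two separately fitted lines intersect wherever $x = (\hat\alpha - \hat\alpha^\prime)/(\hat\beta^\prime - \hat\beta)$ happens to fall, which need not be the changepoint.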

The two-regression procedure you have been doing appears to be an attempt to fit the related model

$$Y_i = \beta_0 + \beta_1 x_i + \beta_2(x_i-x_0)^{+} + \epsilon_i$$

where the $Y_i$ are random variables representing how the $y_i$ were obtained; the $\epsilon_i$ are independent, zero-mean random variables with common variance $\sigma_l^2$ in the left portion and common variance $\sigma_r^2$ in the right portion; and for any number $z$, $$z^{+}=\max(0, z)$$ is the "positive part" of $z$.

This model assumes a slope of $\beta_1$ underlies the left portion (so $\beta_1$ plays the role of $\beta$) and a slope of $\beta_1 + \beta_2$ underlies the right portion (so $\beta_1+\beta_2$ plays the role of $\beta^\prime$). By using a single intercept, it assumes there is no break at $x_0$: the curve is continuous there. This could be called a "linear spline with one knot."
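
Written out piecewise, the model is $$Y_i = \begin{cases} \beta_0 + \beta_1 x_i + \epsilon_i, & x_i \le x_0,\\ (\beta_0 - \beta_2 x_0) + (\beta_1 + \beta_2)\,x_i + \epsilon_i, & x_i \gt x_0, \end{cases}$$ and both pieces take the common value $\beta_0 + \beta_1 x_0$ at $x_i = x_0$.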

The easiest way to fit a model like this is multiple regression of the $y_i$ against the $x_i$ and the $(x_i-x_0)^{+}$ (which you would compute for this purpose). That fit assumes $\sigma_r^2 = \sigma_l^2$ (homoscedasticity), which is slightly different from what you have been doing but might be what you intended. If you want to fit the full (heteroscedastic) model, it looks necessary to work out the Maximum Likelihood solution.
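
In the notation of the question (a sketch, assuming PP is the predictor, MAP is the response, and the changepoint sits at PP = 10), the homoscedastic fit is a single call:

knot <- pmax(0, PP - 10)           # (PP - 10)^+, computed by hand
fit.spline <- lm(MAP ~ PP + knot)  # left slope: coef(fit.spline)["PP"]; right slope: coef(fit.spline)["PP"] + coef(fit.spline)["knot"]

The fitted line then bends exactly at PP = 10, although it is not forced to pass through MAP = 20 there.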

Neither procedure will be the same as intersecting two separate least squares solutions--but if that intersection happens to be close to $x_0$, they will all give very similar fits. One huge advantage of using least squares or Maximum Likelihood is that they afford estimates of uncertainty, such as confidence intervals. These will be appropriate when the change point is found independently of the data; when determined from the data, confidence intervals have to be increased to account for the loss of that "degree of freedom."

Figure

To illustrate these ideas and demonstrate that the various procedures differ, the figure shows a scatterplot of 40 data points with a changepoint at $x_0=15$ (marked with the vertical line). Before the changepoint, errors drawn from a distribution with standard deviation $\sigma_l=3$ were added; after the changepoint, the distribution had SD $\sigma_r=15$. The true underlying lines are shown in solid gray; the separately fitted least squares lines in dotted blue; the homoscedastic solution proposed here is the unbroken red curve, and the heteroscedastic solution is the black curve. Note how--by construction--the bend in the fitted (red and black) lines occurs exactly at the changepoint. Because these data are strongly heteroscedastic, we would expect the heteroscedastic model to be better--and indeed the black curve is closer to the true gray curve than the red curve is (and the dotted blue curve, for the separate regressions, turns out worst in this example).

It might also be of interest to compare the summaries of three models: the homoscedastic fit, the fit to the left portion, and the fit to the right portion. Here they are, without further comment.

Call:
lm(formula = y ~ x + x.plus, data = X)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.3357     5.4842   0.244    0.809    
x            -1.2498     0.4950  -2.525    0.016 *  
x.plus        3.3036     0.6729   4.910 1.86e-05 ***
---
Residual standard error: 10.92 on 37 degrees of freedom
Multiple R-squared:  0.6695,    Adjusted R-squared:  0.6516 
F-statistic: 37.47 on 2 and 37 DF,  p-value: 1.275e-09

Call:
lm(formula = y ~ x, data = X, subset = left)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.3424     1.0319   3.239  0.00646 ** 
x            -1.6260     0.1135 -14.327 2.43e-09 ***
---
Residual standard error: 1.899 on 13 degrees of freedom
Multiple R-squared:  0.9404,    Adjusted R-squared:  0.9359 
F-statistic: 205.3 on 1 and 13 DF,  p-value: 2.429e-09


Call:
lm(formula = y ~ x, data = X, subset = !left)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -43.4024    10.9327  -3.970 0.000606 ***
x             1.9033     0.3781   5.034 4.29e-05 ***
---
Residual standard error: 13.63 on 23 degrees of freedom
Multiple R-squared:  0.5242,    Adjusted R-squared:  0.5035 
F-statistic: 25.34 on 1 and 23 DF,  p-value: 4.29e-05

Here is R code to implement the procedures evaluated here and reproduce the data.

n <- 40              # Sample size
x <- 1:n             # X coordinates
x.0 <- 15            # X-coordinate of the changepoint location
beta <- c(5, -2, 5)  # Parameters: intercept, left slope, change in slope at the changepoint
sigma <- c(3, 15)    # Error standard deviations: left, right
#
# Generate data.
#
set.seed(17)
left <- x <= x.0
m <- sum(left)
x.plus <- pmax(0, x-x.0)
X <- cbind(1, x, x.plus)
y <- rnorm(n, X %*% beta, ifelse(x <= x.0, sigma[1], sigma[2]))
plot(x,y, pch=16, col="#00000040")
abline(v=x.0)
abline(beta[1:2], lwd=2, col="Gray")
abline(c(beta[1] - beta[3]*x.0, sum(beta[2:3])), lwd=2, col="Gray")
#------------------------------------------------------------------------------#
# 
# As fitting occurs, lines are added to the plot of the data to document
# progress.
#
# Linear spline fit.
#
X <- data.frame(x=x, x.plus=x.plus)
fit <- lm(y ~ x + x.plus, X)

x.hat <- seq(min(x), max(x), length.out=1001)
x.hat.plus <- pmax(0, x.hat-x.0)
y.hat <- predict(fit, newdata=data.frame(x=x.hat, x.plus=x.hat.plus))
lines(x.hat, y.hat, col="Red", lwd=2)
#
# Separate least-squares fit to left and right portions.
#
fit.left <- lm(y ~ x, X, subset=left)
fit.right <- lm(y ~ x, X, subset=!left)
abline(fit.left, col="Blue", lty=3, lwd=2)
abline(fit.right, col="Blue", lty=3, lwd=2)
#
# MLE, heteroscedastic case.
#
Lambda. <- function(theta, y, X, left) {
  lambda <- exp(theta)
  w <- ifelse(left, lambda, 1)          # sigma.left/sigma.right
  fit <- lm(y ~ ., data=X, weights=w)
  n <- length(y)
  r2 <- (resid(fit) / w)^2
  s2 <- mean(r2)
  (n + n*log(s2))/2 - sum(left)*theta
}
Lambda <- Vectorize(Lambda., "theta")
theta <- seq(-2, 2, length.out=10)
log.L <- Lambda(theta, y, X, left)
# plot(theta, log.L)

ml <- optimize(Lambda, c(-10, 10), y=y, X=X, left=left)
lambda.hat <- exp(ml$minimum)
w <- ifelse(left, lambda.hat, 1)
fit.ml <- lm(y ~ ., data=X, weights=w)
y.hat <- predict(fit.ml, newdata=data.frame(x=x.hat, x.plus=x.hat.plus))
lines(x.hat, y.hat, col="Black", lwd=2)