Solved – R – how to get standard error for a breakpoint/parameter in a piecewise regression

Tags: jags, nls, piecewise linear, segmented regression, standard error

I'm new to asking questions here, so I hope this is a reasonable place for this.

I am trying to fit a piecewise regression. I expect my response (y) to increase with x up to a breakpoint and to plateau thereafter, i.e., the second slope is 0. The model I would like to fit has 3 parameters: intercept, slope, and breakpoint. In a best-case scenario, I would compare this model to an intercept-only/null model using AIC or some similar method.

I started with the segmented package, using this hint to fix the second slope to 0:
https://stat.ethz.ch/pipermail/r-help/2007-July/137625.html
But after updating a bunch of packages, the whole thing (which was already a bit fussy) stopped working altogether.

So now I'm using optim(), but I'm not sure how to appropriately get an estimate of error, particularly around the breakpoint parameter. See the code below for attempts using bbmle and nls (both work on the simulated data here, but both seem to fail on my real data sets).

According to a SAS-using colleague, proc nlin seems to be able to fit this model easily & generates standard error estimates for the parameters… but, I'm determined to Not use SAS.

This exchange (and others) seems to indicate that there are general concerns with error estimates for this type of model, particularly the breakpoint parameter, and it references a potentially useful post by Venables that is unfortunately no longer available:
Change point analysis using R's nls()

(Note: my data are weighted by a variable that represents effort. Where a weights= argument is available I supply wt to it; otherwise I use an expanded dataset or weight the sum of squares in the model.)

Simulate segmented data with diminishing sample size

set.seed(15)
a.sim <- 0    # intercept
b.sim <- 0.5  # slope for segment 1
n     <- 12
brk   <- 4    # breakpoint

x <- 1:n
y <- numeric(n)
y[1:brk] <- a.sim + b.sim * x[1:brk] + rnorm(brk, 0, .2)
y[(brk + 1):n] <- (a.sim + b.sim * brk) + rnorm(n - brk, 0, .2)  # second, flat segment
y[n] <- y[n] - .30 * y[n]  # subtract from last point, as these are often outside of the normal pattern
wt <- c(rep(50, n - 4), 40, 40, 35, 5)  # weight variable

dat <- data.frame(x, y, wt)                           # make dataframe
dat.expand <- dat[rep(seq_len(nrow(dat)), dat$wt), ]  # second dataset with rows repeated based on weight

with(dat, plot(x, y, ylim = c(0, max(y)), pch = 16, cex = wt/10))  # plot, with symbol size representing the weight variable

Use optim to solve

mod <- function(par, data){
  a      <- par[1]
  b      <- par[2]
  x.star <- par[3]
  yfit <- rep(NA_real_, length(data$y))      # vector of fitted values
  small.x <- data$x <= x.star
  yfit[small.x]  <- a + b * data$x[small.x]  # rising segment
  yfit[!small.x] <- a + b * x.star           # flat segment after the breakpoint
  sum((data$y - yfit)^2)                     # weighted version: sum(data$wt * (data$y - yfit)^2)
}
fit1 <- optim(par = c(a = .5, b = .5, x.star = 3), mod, data = dat, hessian = TRUE)

This almost always works and provides a reasonable fit. For production runs, I feed it a large table of potential start values to find a reasonable starting point before actually optimizing.
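
Since optim() is called with hessian=TRUE, one rough option is to back Wald-type standard errors out of the Hessian. A minimal sketch, assuming the usual least-squares approximation that the covariance matrix is roughly 2 * sigma2 * solve(H) near the minimum, with sigma2 = SSR/(n - p):

n.obs  <- nrow(dat)
p      <- length(fit1$par)
sigma2 <- fit1$value / (n.obs - p)          # residual variance, SSR/(n - p)
vc     <- 2 * sigma2 * solve(fit1$hessian)  # approximate covariance matrix
sqrt(diag(vc))                              # approximate SEs for a, b, x.star

The catch, echoed in the exchange linked above, is that the sum of squares is not smooth in x.star, so the Hessian-based SE for the breakpoint in particular should be treated with suspicion.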

bbmle for CIs

I talked to a statistician, who recommended this as a possible method for generating confidence intervals around the breakpoint. It works for my data in some instances, but usually gives errors.

library(bbmle)

mod2 <- function(a, b, x.star, sig){
  yfit <- rep(NA_real_, length(y))
  small.x <- x <= x.star
  yfit[small.x]  <- a + b * x[small.x]
  yfit[!small.x] <- a + b * x.star
  -sum(dnorm(y, yfit, sig, log = TRUE))  # negative log-likelihood
}

fit3 <- mle2(mod2, start = list(a = 0, b = 0.5, x.star = brk, sig = 0.2),
             data = dat)
ci <- confint(fit3)  # profile confidence intervals
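
When profiling errors out (common when the breakpoint is weakly identified), bbmle can fall back on the quadratic (Wald) approximation, and the fit can be compared to the intercept-only null mentioned above via AIC. A sketch, assuming fit3 converged (null.mod is a name I'm introducing here, not something from my original code):

summary(fit3)                   # estimates with asymptotic standard errors
confint(fit3, method = "quad")  # Wald intervals, no profiling required

# hypothetical intercept-only null model, for the AIC comparison
null.mod <- mle2(y ~ dnorm(mean = a, sd = sig),
                 start = list(a = mean(dat$y), sig = sd(dat$y)), data = dat)
AICtab(fit3, null.mod)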

nls method

This basically never works for my kind of data (it works if I simulate much larger datasets). I get either the singular gradient matrix error or:
Error in …. step factor 0.000488281 reduced below 'minFactor' of 0.000976562…

nls.mod <- nls(y ~ ifelse(x < x.star, a + b * x, a + b * x.star),
               data = dat, weights = wt,
               start = c(x.star = brk, b = 0.4, a = 0))
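
For what it's worth, when nls() does converge, summary() reports standard errors for all three parameters directly. And a sketch of one thing that sometimes helps convergence (an assumption on my part, not a guarantee): the "port" algorithm with box constraints keeping x.star inside the observed x range:

summary(nls.mod)  # parameter estimates with standard errors, when it converges

# constrained refit; bounds are ordered to match the start vector
nls.mod2 <- nls(y ~ ifelse(x < x.star, a + b * x, a + b * x.star),
                data = dat, weights = wt,
                start = c(x.star = brk, b = 0.4, a = 0),
                algorithm = "port",
                lower = c(x.star = min(dat$x), b = -Inf, a = -Inf),
                upper = c(x.star = max(dat$x), b = Inf, a = Inf))

The underlying fragility remains, though: the gradient with respect to x.star is discontinuous, which is what drives the singular-gradient and minFactor errors.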

JAGS?

So, I also have this working in JAGS/r2jags, again with help from a statistician. I would rather not use this approach if there is a straightforward, frequentist way to get a reasonable estimate of error around the breakpoint. I'm not looking for perfection here, just something reasonable.
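
For completeness, a minimal sketch of the kind of model this involves (the priors, iteration counts, and names here are placeholders, not the settings from my actual analysis); the posterior SD of x.star then plays the role of the standard error:

library(R2jags)

# hypothetical minimal model: rising line that goes flat after x.star
piecewise.bugs <- function(){
  for(i in 1:n){
    mu[i] <- a + b * min(x[i], x.star)
    y[i] ~ dnorm(mu[i], tau)
  }
  a ~ dnorm(0, 0.001)
  b ~ dnorm(0, 0.001)
  x.star ~ dunif(min.x, max.x)  # breakpoint constrained to the observed range
  tau ~ dgamma(0.001, 0.001)
}

jags.dat <- list(y = dat$y, x = dat$x, n = nrow(dat),
                 min.x = min(dat$x), max.x = max(dat$x))
fit.jags <- jags(data = jags.dat, parameters.to.save = c("a", "b", "x.star"),
                 model.file = piecewise.bugs, n.chains = 3, n.iter = 5000)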

Best Answer

Hmm, so I think I figured out the issues I was having with segmented. It had to do with the weights: it doesn't work to weight both the lm fit and the segmented model.

Segmented seems like the best bet for me. It does a good job estimating breakpoints, even with my short datasets. It isn't terribly difficult to constrain the second slope to 0, and it has a built-in test for the significance of the change in slope. If anyone feels like explaining the difficulty of estimating error around a breakpoint estimate, I'm all ears!

library(segmented)

with the second segment not constrained to 0

out.lm <- lm(y ~ x, data = dat)
o <- segmented(out.lm, seg.Z = ~x, weights = wt, psi = 10)

with(dat, plot(x, y, ylim = c(0, max(y)), pch = 16, cex = wt/13, main = "segmented"))
lines(x = dat$x, y = fitted(o), col = "blue")
lines.segmented(o, col = "blue")

fix from Vito to constrain the second slope to 0

o2 <- lm(y ~ 1)  # intercept-only fit
xx <- -x         # reverse the sign of x, so the zero-slope piece lands after the breakpoint on the original scale
o3 <- segmented(o2, seg.Z = ~xx, weights = wt, psi = list(xx = -4))

points(x, fitted(o3), col = "green")
ci <- confint(o3, rev.sgn = TRUE)  # CI for the breakpoint, back on the original x scale
lines.segmented(o3, col = "green", rev.sgn = TRUE, lwd = 2)

test for slope change

davies.test(o3,~xx)
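
And, back to the title question: segmented reports a standard error for the breakpoint directly. One way to pull it out (remember the estimate lives on the reversed xx scale, so flip the sign to return to x):

summary(o3)        # prints the breakpoint estimate with its standard error
o3$psi             # matrix with columns Initial, Est., St.Err (on the xx scale)
-o3$psi[, "Est."]  # breakpoint back on the original x scale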