Solved – Dealing with regression of unusually bounded response variable

beta distributionbeta-regressionboundsdata transformationregression

I am attempting to model a response variable that is theoretically bounded between -225 and +225. The variable is the total score that subjects got when playing a game. Although theoretically it is possible for subjects to score +225. Despite this because score depended on not only the subjects actions but also the actions of another actions the maximum anyone scored was 125 (this is the highest 2 players playing each other can both score) this happened with a very high frequency. The lowest score was +35.

This boundary of 125 is causing difficulty with a linear regression. The only thing I can think of doing is re-scaling the response to be between 0 and 1 and using a beta regression. If I do this though I am not sure I can really justify saying 125 is the top boundary (or 1 after transformation) since it is possible to score +225. Furthermore, if I did this what would my bottom boundary,35?

Thanks,

Jonathan

Best Answer

Although I'm not entirely certain of what your problem with linear regression is I'm right now finishing an article about how to analyze bounded outcomes. Since I'm not familiar with Beta regression perhaps someone else will answer that option.

By your question I understand that you get predictions outside the boundaries. In this case I would go for logistic quantile regression. Quantile regression is a very neat alternative to regular linear regression. You can look at different quantiles and get a much better picture of your data than what's possible with regular linear regression. It is also has no assumptions regarding distribution1.

Transformation of a variable can often cause funny effects on linear regression, for instance you have a significance in the logistic transformation but that doesn't translate into the regular value. This is not the case with quantiles, the median is always the median regardless of the transformation function. This allows you to transform back and forth without distorting anything. Prof. Bottai suggested this approach to bounded outcomes2, its an excellent method if you want to do individual predictions but it has some issues when you wan't to look at the beta's and interpret them in a non-logistic way. The formula is simple:

$logit(y) = log(\frac{y + \epsilon}{max(y) - y + \epsilon})$

Where $y$ is your score and $\epsilon$ is an arbitrary small number.

Here's an example that I did a while ago when I wanted to experiment with it in R:

library(rms)
library(lattice)
library(cairoDevice)
library(ggplot2)

# Simulate some data
set.seed(10)
intercept <- 0
beta1 <- 0.5
beta2 <- 1
n = 1000
xtest <- rnorm(n,1,1)
gender <- factor(rbinom(n, 1, .4), labels=c("Male", "Female"))
random_noise  <- runif(n, -1,1)

# Add a ceiling and a floor to simulate a bound score
fake_ceiling <- 4
fake_floor <- -1

# Simulate the predictor
linpred <- intercept + beta1*xtest^3 + beta2*(gender == "Female") + random_noise

# Remove some extremes
extreme_roof <- fake_ceiling + abs(diff(range(linpred)))/2
extreme_floor <- fake_floor - abs(diff(range(linpred)))/2
linpred[ linpred > extreme_roof|
    linpred < extreme_floor ] <- NA

#limit the interval and give a ceiling and a floor effect similar to scores
linpred[linpred > fake_ceiling] <- fake_ceiling
linpred[linpred < fake_floor] <- fake_floor

# Just to give the graphs the same look
my_ylim <- c(fake_floor - abs(fake_floor)*.25, 
             fake_ceiling + abs(fake_ceiling)*.25)
my_xlim <- c(-1.5, 3.5)

# Plot
df <- data.frame(Outcome = linpred, xtest, gender)
ggplot(df, aes(xtest, Outcome, colour = gender)) + geom_point()

This gives the following data scatter, as you can see it is clearly bounded and inconvenient:

Scatter of bounded data

###################################
# Calculate & plot the true lines #
###################################
x <- seq(min(xtest), max(xtest), by=.1)
y <- beta1*x^3+intercept
y_female <- y + beta2
y[y > fake_ceiling] <- fake_ceiling
y[y < fake_floor] <- fake_floor
y_female[y_female > fake_ceiling] <- fake_ceiling
y_female[y_female < fake_floor] <- fake_floor

tr_df <- data.frame(x=x, y=y, y_female=y_female)
true_line_plot <- xyplot(y  + y_female ~ x, 
                         data=tr_df,
                         type="l", 
                         xlim=my_xlim, 
                         ylim=my_ylim, 
                         ylab="Outcome", 
                         auto.key = list(
                           text = c("Male"," Female"),
                           columns=2))

##########################
# Test regression models #
##########################

# Regular linear regression
fit_lm <- Glm(linpred~rcs(xtest, 5)+gender, x=T, y=T)
boot_fit_lm <- bootcov(fit_lm, B=500)
p <- Predict(boot_fit_lm, xtest=seq(-2.5, 3.5, by=.001), gender=c("Male", "Female"))
lm_plot <- plot(p, 
             se=T, 
             col.fill=c("#9999FF", "#BBBBFF"), 
             xlim=my_xlim, ylim=my_ylim)

This results in the following picture where females are clearly above the upper boundary:

Linear regression compared to the true line

# Quantile regression - regular
fit_rq <- Rq(formula(fit_lm), x=T, y=T)
boot_rq <- bootcov(fit_rq, B=500)
# A little disturbing warning:
# In rq.fit.br(x, y, tau = tau, ...) : Solution may be nonunique

p <- Predict(boot_rq, xtest=seq(-2.5, 3.5, by=.001), gender=c("Male", "Female"))
rq_plot <- plot(p, 
             se=T, 
             col.fill=c("#9999FF", "#BBBBFF"), 
             xlim=my_xlim, ylim=my_ylim)

This gives the following plot with similar problems:

Quantile regression compared to the true line

# The logit transformations
logit_fn <- function(y, y_min, y_max, epsilon)
    log((y-(y_min-epsilon))/(y_max+epsilon-y))


antilogit_fn <- function(antiy, y_min, y_max, epsilon)
    (exp(antiy)*(y_max+epsilon)+y_min-epsilon)/
        (1+exp(antiy))

epsilon <- .0001
y_min <- min(linpred, na.rm=T)
y_max <- max(linpred, na.rm=T)

logit_linpred <- logit_fn(linpred, 
                            y_min=y_min,
                            y_max=y_max,
                            epsilon=epsilon)

fit_rq_logit <- update(fit_rq, logit_linpred ~ .)
boot_rq_logit <- bootcov(fit_rq_logit, B=500)

p <- Predict(boot_rq_logit, 
             xtest=seq(-2.5, 3.5, by=.001), 
             gender=c("Male", "Female"))

# Change back to org. scale
# otherwise the plot will be
# on the logit scale
transformed_p <- p
transformed_p$yhat <- antilogit_fn(p$yhat,
                                    y_min=y_min,
                                    y_max=y_max,
                                    epsilon=epsilon)
transformed_p$lower <- antilogit_fn(p$lower, 
                                     y_min=y_min,
                                     y_max=y_max,
                                     epsilon=epsilon)
transformed_p$upper <- antilogit_fn(p$upper, 
                                     y_min=y_min,
                                     y_max=y_max,
                                     epsilon=epsilon)

logit_rq_plot <- plot(transformed_p, 
             se=T, 
             col.fill=c("#9999FF", "#BBBBFF"), 
             xlim=my_xlim)

The logistic quantile regression that has a very nice bounded prediction:

The logistic quantile regression

Here you can see the issue with the Beta's that in the retransformed fashion differ in different regions (as expected):

# Some issues trying to display the gender factor
contrast(boot_rq_logit, list(gender=levels(gender), 
                             xtest=c(-1:1)), 
         FUN=function(x)antilogit_fn(x, epsilon))

   gender xtest Contrast   S.E.       Lower      Upper       Z      Pr(>|z|)
   Male   -1    -2.5001505 0.33677523 -3.1602179 -1.84008320  -7.42 0.0000  
   Female -1    -1.3020162 0.29623080 -1.8826179 -0.72141450  -4.40 0.0000  
   Male    0    -1.3384751 0.09748767 -1.5295474 -1.14740279 -13.73 0.0000  
*  Female  0    -0.1403408 0.09887240 -0.3341271  0.05344555  -1.42 0.1558  
   Male    1    -1.3308691 0.10810012 -1.5427414 -1.11899674 -12.31 0.0000  
*  Female  1    -0.1327348 0.07605115 -0.2817923  0.01632277  -1.75 0.0809  

Redundant contrasts are denoted by *

Confidence intervals are 0.95 individual intervals

References

  1. R. Koenker and G. Bassett Jr, “Regression quantiles,” Econometrica: journal of the Econometric Society, pp. 33–50, 1978.
  2. M. Bottai, B. Cai, and R. E. McKeown, “Logistic quantile regression for bounded outcomes,” Statistics in Medicine, vol. 29, no. 2, pp. 309–317, 2010.

For the curious the plots were created using this code:

# Just for making pretty graphs with the comparison plot
compareplot <- function(regr_plot, regr_title, true_plot){
  print(regr_plot, position=c(0,0.5,1,1), more=T)
  trellis.focus("toplevel")
  panel.text(0.3, .8, regr_title, cex = 1.2, font = 2)
  trellis.unfocus()
  print(true_plot, position=c(0,0,1,.5), more=F)
  trellis.focus("toplevel")
  panel.text(0.3, .65, "True line", cex = 1.2, font = 2)
  trellis.unfocus()
}

Cairo_png("Comp_plot_lm.png", width=10, height=14, pointsize=12)
compareplot(lm_plot, "Linear regression", true_line_plot)
dev.off()

Cairo_png("Comp_plot_rq.png", width=10, height=14, pointsize=12)
compareplot(rq_plot, "Quantile regression", true_line_plot)
dev.off()

Cairo_png("Comp_plot_logit_rq.png", width=10, height=14, pointsize=12)
compareplot(logit_rq_plot, "Logit - Quantile regression", true_line_plot)
dev.off()

Cairo_png("Scat. plot.png")
qplot(y=linpred, x=xtest, col=gender, ylab="Outcome")
dev.off()
Related Question