Solved – Shifted intercepts in logistic regression

logistic, machine learning, nonlinear regression, predictive-models

I have a question about the effects of shifting the intercept in a logistic fit on the mean of a particular transformation of the scores.

Here is the notation I will be using for the question. The logistic regression model is
$$
\begin{align}
\mathbb{P}[Y_i =1 \mid \boldsymbol{X}_i] &= \dfrac{1}{1+\exp(-\boldsymbol{X}_i'\boldsymbol{\beta})} \\[1em]
&\equiv p(\boldsymbol{X}_i; \boldsymbol{\beta})
\end{align}
$$
where $p$ is the score function. I estimate the parameters of the model, and denote them $\widehat{\boldsymbol{\beta}}$, so that the estimated scores of the model are $p(\boldsymbol{X}_i; \widehat{\boldsymbol{\beta}})$. I am interested in the sample mean of all the estimated scores that cross a certain threshold, $\delta$, say. That is, the quantity
$$
\dfrac{\sum_{i=1}^n p(\boldsymbol{X}_i; \widehat{\boldsymbol{\beta}})\bf{1}_{\left[p(\boldsymbol{X}_i; \widehat{\boldsymbol{\beta}})\geq \delta\right]}}{\sum_{i=1}^n\bf{1}_{\left[p(\boldsymbol{X}_i; \widehat{\boldsymbol{\beta}})\geq \delta\right]}}
$$
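
In R terms, this statistic is just a truncated mean of the fitted scores. A minimal sketch (the helper name truncated_mean and its arguments are illustrative, not from any package):

# sample mean of the scores that are at or above the threshold delta
truncated_mean <- function(scores, delta) {
  mean(scores[scores >= delta])
}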

Now suppose that I shift the estimated linear index for all observations by the same quantity $\alpha$, and denote the transformed scores as

$$
\begin{align}
q(\boldsymbol{X}_i; \widehat{\boldsymbol{\beta}}) &= \dfrac{1}{1+\exp(-(\boldsymbol{X}_i'\widehat{\boldsymbol{\beta}}+\alpha))} \\
&= \dfrac{\exp (\alpha)}{\exp (\alpha)+\exp(-\boldsymbol{X}_i'\widehat{\boldsymbol{\beta}})}
\end{align}
$$
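
As a quick check that the two expressions for $q$ agree (in R, with made-up values for the linear index and the shift):

# both forms of the shifted score give the same numbers
eta   <- c(-2, 0, 1.5)   # illustrative values of X'beta
alpha <- 0.7             # illustrative shift

q1 <- 1 / (1 + exp(-(eta + alpha)))
q2 <- exp(alpha) / (exp(alpha) + exp(-eta))
all.equal(q1, q2)        # TRUE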

It is easy to see that if
$$
\begin{alignat}{2}
& \alpha &\geq& 0\\
\implies\,& \exp(\alpha) &\geq& 1\\
\implies\,& q(\boldsymbol{X}_i; \widehat{\boldsymbol{\beta}}) &\geq& p(\boldsymbol{X}_i; \widehat{\boldsymbol{\beta}}) \\
\implies\,& \sum_{i=1}^n q(\boldsymbol{X}_i; \widehat{\boldsymbol{\beta}})\bf{1}_{\left[q(\boldsymbol{X}_i; \widehat{\boldsymbol{\beta}})\geq \delta\right]} &\geq& \sum_{i=1}^n p(\boldsymbol{X}_i; \widehat{\boldsymbol{\beta}})\bf{1}_{\left[p(\boldsymbol{X}_i; \widehat{\boldsymbol{\beta}})\geq \delta\right]}\,\text{and also,}\\
\implies & \sum_{i=1}^n\bf{1}_{\left[q(\boldsymbol{X}_i; \widehat{\boldsymbol{\beta}})\geq \delta\right]} &\geq& \sum_{i=1}^n\bf{1}_{\left[p(\boldsymbol{X}_i; \widehat{\boldsymbol{\beta}})\geq \delta\right]}
\end{alignat}
$$
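
These implications are easy to verify numerically (again with made-up values; any linear index and any $\alpha \geq 0$ will do):

# check the chain of inequalities on simulated values
set.seed(1)
eta   <- rnorm(100)     # illustrative linear index values
delta <- 0.16
alpha <- 0.7            # alpha >= 0

p <- 1 / (1 + exp(-eta))
q <- 1 / (1 + exp(-(eta + alpha)))

all(q >= p)                                   # TRUE
sum(q[q >= delta]) >= sum(p[p >= delta])      # TRUE
sum(q >= delta) >= sum(p >= delta)            # TRUE
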
However, since both the numerator and the denominator become larger for the $q$ function, it is hard to tell what happens to the ratio; that is, whether
$$ \dfrac{\sum_{i=1}^n q(\boldsymbol{X}_i; \widehat{\boldsymbol{\beta}})\bf{1}_{\left[q(\boldsymbol{X}_i; \widehat{\boldsymbol{\beta}})\geq \delta\right]}}{\sum_{i=1}^n\bf{1}_{\left[q(\boldsymbol{X}_i; \widehat{\boldsymbol{\beta}})\geq \delta\right]}}\overset{?}{\lesseqqgtr} \dfrac{\sum_{i=1}^n p(\boldsymbol{X}_i; \widehat{\boldsymbol{\beta}})\bf{1}_{\left[p(\boldsymbol{X}_i; \widehat{\boldsymbol{\beta}})\geq \delta\right]}}{\sum_{i=1}^n\bf{1}_{\left[p(\boldsymbol{X}_i; \widehat{\boldsymbol{\beta}})\geq \delta\right]}}$$

I am pretty sure there is an example like this in Theory of Point Estimation by Lehmann and Casella, for the univariate (non-regression) case, but I have been unable to find it.


Simulations

I decided to simulate the above scenario to see if I could get some intuition for this problem. The data-generating process I have used is slightly non-standard, since I wanted the simulations to be robust to distributional assumptions. It might be useful to try different configurations of the fixed parameters.

Here is the R code for the simulations (I would appreciate it if someone could confirm that this code is indeed doing what I think it is).

# purpose: simulate logistic regression
# date created: 27th October 2012
# comments:
# author:
# dependencies: faraway

library(faraway)

# simulation parameters
vBeta <- rbind(0.1, 0.2, 0.3, 0.4)  # vector of coefficients
sDelta <- 0.16  # threshold for the scores
sAlpha <- 0.7  # shift in linear index

# matrix to hold the results
mResults <- matrix(0, nrow=1000, ncol=2)

for(i in seq(from=1, to=1000, by=1)) {
  # simulate the data
  mX <- matrix(rnorm(400, 4, 1), 100, 4)
  vY <- (0.4 + mX%*%vBeta + rt(n=100, df=7)>=5)
  data <- as.data.frame(cbind(vY,mX))

  # logistic regression
  resLogitFit <- glm(V1~V2+V3+V4+V5, binomial(link = "logit"), data=data)

  # save the mean of the scores that are above a certain value
  mResults[i,1] <- mean(resLogitFit$fitted.values[resLogitFit$fitted.values > sDelta])

  # effect the logit transformation on the shifted linear index
  data$shiftedScores <- ilogit(resLogitFit$linear.predictors+sAlpha)
  mResults[i, 2] <- mean(data$shiftedScores[data$shiftedScores > sDelta])
}

# plot the two means across simulations
matplot(y=mResults, type="l",ylim=c(0.2, 1), xlab="Simulation #", 
        ylab="Estimated statistic")
legend(1, 1, c("(mean) truncated scores", "(mean) shifted truncated scores"),
       cex=0.8, col=c("black","red"), lty=1:2)

# count the number of times the shifted scores exceed the older means
sum(mResults[,1] > mResults[,2])

This produces the following graph, which seems to indicate that the shifted truncated scores have a higher mean — although this depends on the shift parameter.

[Figure: mean truncated scores and mean shifted truncated scores across the 1000 simulations]

I would very much like a theoretical explanation to support these simulation results.

Best Answer

Your shifted, truncated average score is:

\begin{align} M(X,\beta,\delta,\alpha) &= \frac{\sum_{p(X\beta+\alpha)>\delta} p(X\beta+\alpha)}{\sum \mathbf{1}_{p(X\beta+\alpha)>\delta}} \end{align}

Your question is basically "Is $M$ monotonically increasing in $\alpha$?" The answer is that it is not, and once you put the question this way, it is easy to see why. At "most" values of $\alpha$, $M$ is increasing: the numerator is increasing in $\alpha$, while the denominator is unaffected by $\alpha$. At a few special values of $\alpha$ (the points at which $p(X_i\beta+\alpha)=\delta$ for some $i$), $M$ decreases in $\alpha$. Why? Because at such a point the numerator increases by exactly $\delta$ while the denominator increases by exactly 1. Since the fraction is always at least $\delta$, this drags it down at that point (in effect, you are suddenly including a new data point in your average which you know is smaller than the average was before you included it).
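
A tiny numerical example (mine, not from the question) makes the mechanism concrete: with $\delta=0.16$ and one score sitting just below the cut point, a small positive shift pulls that score over the threshold and drags the truncated mean down from $0.9$ to about $0.53$:

# two scores, one just below the threshold
library(faraway)                          # for logit() and ilogit()

delta  <- 0.16
scores <- c(0.159, 0.9)

mean(scores[scores > delta])              # 0.9 (only the second score is included)

shifted <- ilogit(logit(scores) + 0.02)   # small positive shift alpha = 0.02
mean(shifted[shifted > delta])            # about 0.53 (both scores are now included)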

[Figure: truncated mean of the shifted scores as a function of $\alpha$ ("Scores not monotonic in alpha")]

The little down-ticks are where the data lie. To make the graph, I slightly modified your R program: I generate a single 100-observation dataset, then compute the truncated mean of the scores shifted by $\alpha=0$, then by $\alpha=0.001$, and so on up to $\alpha=1$. Here is the R script:

# This program written in response to a Cross Validated question
# http://stats.stackexchange.com/questions/41267/shifted-intercepts-in-logistic-regression

# The program graphs the expectation of a shifted logit score conditional on the score passing
# some threshold.  The conditional mean is not monotonic in the shift.

library(faraway)
library(plyr)

set.seed(12344321)

# simulation parameters
vBeta <- rbind(0.1, 0.2, 0.3, 0.4)  # vector of coefficients
sDelta <- 0.16  # threshold for the scores

# simulate the data
mX <- matrix(rnorm(400, 4, 1), 100, 4)
vY <- (0.4 + mX%*%vBeta + rt(n=100, df=7)>=5)
data <- as.data.frame(cbind(vY,mX))

# logistic regression
resLogitFit <- glm(V1~V2+V3+V4+V5, binomial(link = "logit"), data=data)
raw.scores <- resLogitFit$fitted.values

# mean of scores bigger than delta:
mean(raw.scores[raw.scores>sDelta])


# Create mean greater than delta for a variety of alphas

shift.logit.mean <- function(alpha,delta,scores){
  alpha <- as.numeric(alpha)
  shifted <- ilogit(logit(scores) + alpha)
  return(mean(shifted[shifted>delta]))
}

results <- ddply(data.frame(alpha=seq(from=0,to=1000,by=1)/1000),.(alpha),
                    shift.logit.mean,delta=sDelta,scores=raw.scores)  
names(results)[2]<-"shifted.scores"

plot(results,type="l",main="Scores not monotonic in alpha")

# Now, let's artificially pile up the data right near the delta cut point:

raw.scores[1:10] <- sDelta - 1:10/1000

results <- ddply(data.frame(alpha=seq(from=0,to=1000,by=1)/1000),.(alpha),
                 shift.logit.mean,delta=sDelta,scores=raw.scores)  
names(results)[2]<-"shifted.scores"

plot(results,type="l",main="With scores piled up near delta")

Now that we know that the graph is going to go down wherever there are data, it is easy to make it go down a lot. Just modify the data so that a whole bunch of scores are just a little less than $\delta$ (or, really, just grouped close together anywhere to the left of the original cut point). The R script above does that, and here is what you get:

[Figure: truncated mean of the shifted scores as a function of $\alpha$ ("With scores piled up near delta")]

I got it to go down really fast for low values of $\alpha$ by piling up scores just to the left of $\delta$, the cut point.

OK, so we know that, theoretically, there is not going to be any monotonicity result. What can we say? Not much, I think. Obviously, once $\alpha$ gets big enough that all the scores pass the cut point, the function is going to be monotonic and is going to asymptote at 1. That's about it, though. You can make the function go down, on average, locally by putting a lot of data points there, and you can make it go up locally by not putting any data points there.
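
For instance, reusing shift.logit.mean from the script above on the same raw.scores, the truncated mean heads toward 1 as the shift grows:

# with a large shift, (nearly) every score clears delta and the mean approaches 1
shift.logit.mean(5, sDelta, raw.scores)
shift.logit.mean(10, sDelta, raw.scores)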

Now, suppose we have a really big dataset, so that it would be OK to approximate the sums by integrals ($f(X)$ is the multivariate density of $X$):

\begin{align} M(X,\beta,\delta,\alpha) &= \frac{\int_{p(X\beta+\alpha)>\delta} p(X\beta+\alpha)\, f(X)\, dX}{\int_{p(X\beta+\alpha)>\delta} f(X)\, dX} \end{align}

The derivative of this in $\alpha$ is kind of ugly, but you get the same result: at values of $\alpha$ for which the denominator is increasing quickly (that is, where $f$ puts a lot of mass right at the boundary $p(X\beta+\alpha)=\delta$), the derivative can be negative.
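
To sketch the calculation (my notation, not from the answer: since $M$ depends on $X$ only through the scalar index $t = X\beta$, work with the density $g$ of that index, write $c = \operatorname{logit}(\delta)$ for the cut point and $\sigma$ for the logistic function, so the region of integration is $t > c - \alpha$), Leibniz's rule gives

$$
\frac{\partial M}{\partial \alpha}
= \frac{\displaystyle\int_{t > c-\alpha} \sigma'(t+\alpha)\, g(t)\, dt \;-\; \bigl(M - \delta\bigr)\, g(c-\alpha)}
       {\displaystyle\int_{t > c-\alpha} g(t)\, dt}.
$$

The first term in the numerator is the gain from every included score moving up; the second is the drag from newly included scores entering exactly at $\delta$. The derivative is negative precisely when the density at the boundary, $g(c-\alpha)$, is large relative to the first term, which is the continuum version of piling scores up just below the cut point.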
