R – Visualizing Logistic Regression with Simulations

Tags: logistic, r, regression, simulation

I'm trying to simulate data from a logistic regression and visualize the different parameters that I've put into the model.

Here is a reproducible example and the output.

set.seed(98765)
n = 20
x1 = rnorm(n = n, mean = 6, sd = 1)
# Rescale the data
x1z = scale(x1)
z = 0 + 2*x1z  # linear predictor: the LOG odds, with true intercept 0 and true slope 2
pr = 1/(1+exp(-z)) # inverse logit: transforms the log odds into a probability; note that 1/(1+exp(-z)) == exp(z)/(1+exp(z)), same as pr2 = boot::inv.logit(z)
# pr2 = exp(z)/(1+exp(z)) # equivalent: exp(z) turns the log odds into odds, then $p = odds/(1+odds)$ gives the probability
y = rbinom(n = n, size = 1, prob = pr) # Bernoulli response variable (a special case of the binomial with size = 1)

# Combine the data in a dataframe 
df = data.frame(y = y, x1 = x1)

#now feed it to glm:
glm.logist = glm( y~x1, data=df, family="binomial")
glm.sum = summary(glm.logist)

par(mfrow=c(1,2))
b.5 = scales::alpha("black",.5)
plot(z~x1, ylab = "Log Odds", pch = 19, col = b.5, xlim = c(0,10), ylim = c(-12,11))
abline(a = glm.sum$coefficients[1,1],
       b = glm.sum$coefficients[2,1])
abline(h=0, v=0,lty = 3)
points(x = 0, y=glm.sum$coefficients[1,1], pch = 19, col = "red")
text(x = 0, y=glm.sum$coefficients[1,1], labels = c("Intercept"), pos =4)

glm.sum$coefficients

plot(y~x1, data = df, col = scales::alpha("black",.5), pch = 19)
abline(h=0.5, v=mean(x1),lty = 3)
newdata <- data.frame(x1=seq(min(x1), max(x1),len=n))
newdata$y = predict(object = glm.logist, newdata = newdata, type = "response") 
lines(x = newdata$x1,
      y = newdata$y, col = "red",lwd = 2)

1/(1+exp(-glm.sum$coefficients[1,1])) # inverse logit of the estimated intercept
1/(1+exp(-glm.sum$coefficients[2,1])) # inverse logit of the estimated slope
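
Both lines back-transform an estimated coefficient through the inverse logit. As a quick sanity check of the identities used in the comments above (a minimal sketch; plogis() and qlogis() are base R's inverse-logit and logit functions):

all.equal(1/(1+exp(-z)), exp(z)/(1+exp(z))) # the two inverse-logit forms agree
all.equal(pr, plogis(z))                    # plogis() is the inverse logit
all.equal(qlogis(pr), z)                    # qlogis() recovers the log odds

Note that the inverse logit of the slope is not a probability with a direct interpretation; exp() of the slope is the multiplicative change in odds per unit of x1.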

[Two-panel plot: left, the log odds z against x1 with the fitted line and the estimated intercept marked in red; right, the 0/1 responses against x1 with the fitted logistic curve in red.]

  • The "intercept" I put into z = 0 + 2*x1z does not appear to have the same meaning as the intercept of a linear model, and it is not shown in the first graph. What is the role of the intercept in a logistic regression model? The way it is coded, changing the intercept in z = intercept + 2*x1z changes the height of the line: if this "intercept" is big enough, all the log odds are above 0 and so nearly all the response values are 1 (see the sketch after this list). So what is the meaning of that "intercept"?
  • Also, I know there are only 20 points in the simulation, but why does the line sit so far below the points in the log odds graph?
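
Here is a quick check of the "big intercept" claim (a minimal sketch reusing x1z from above; plogis() is base R's built-in inverse logit):

# Shift the intercept in z = intercept + 2*x1z and watch the probabilities saturate
for (b0 in c(0, 2, 5, 10)) {
  p = plogis(b0 + 2*x1z) # probabilities implied by the shifted log odds
  cat("intercept =", b0, " mean prob =", round(mean(p), 3), " min prob =", round(min(p), 3), "\n")
}

With an intercept of 10, even the smallest probability is essentially 1, so almost every simulated response would be 1 (though never with certainty).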

Best Answer

  • The intercept in a logistic regression model is the log odds of response when all other covariates in the model are equal to 0. A log odds of 0 corresponds to a probability of 0.5. The log odds would theoretically have to go to $+\infty$ for all responses to be 1 with certainty, so a large intercept makes responses of 1 very likely, never guaranteed. Your description of the relation between log odds and probability of response is wrong here and needs to be checked. (See the sketch after the code below.)

  • The first plot's line sits below the points because the points are the true log odds you generated (z = 0 + 2*x1z), while the line is the log odds estimated from only 20 simulated responses. Change the seed and you'll readily see the effect of random variability in the sample:

    par(mfrow=c(1,2))
    b.5 = scales::alpha("black",.5)
    # Left: true log odds on the standardized scale, with the true line (intercept 0, slope 2)
    plot(z~x1z, ylab = "Log Odds", pch = 19, col = b.5, xlim = c(-5,5), ylim = c(-12,11))
    abline(a = 0, b = 2, col = "red")
    abline(h=0, v=0, lty = 3)
    
    # Right: the same true log odds against the raw x1, with the line estimated by glm()
    plot(z~x1, ylab = "Log Odds", pch = 19, col = b.5, xlim = c(0,10), ylim = c(-12,11))
    abline(a = glm.sum$coefficients[1,1],
           b = glm.sum$coefficients[2,1])
    abline(h=0, v=0, lty = 3)
    points(x = 0, y = glm.sum$coefficients[1,1], pch = 19, col = "red")
    text(x = 0, y = glm.sum$coefficients[1,1], labels = "Intercept", pos = 4)
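
To make both points concrete, here is a minimal sketch (assuming x1z from the question is still in scope) that first checks the probability implied by a log odds of 0, then re-simulates the responses under a few seeds and refits; on the standardized scale the true intercept and slope are 0 and 2, and the estimates bounce around them:

    plogis(0) # a log odds of 0 corresponds to a probability of exactly 0.5
    
    # Re-simulate the Bernoulli responses under different seeds and refit
    for (s in c(98765, 1, 2, 3)) {
      set.seed(s)
      y.s = rbinom(n = length(x1z), size = 1, prob = plogis(0 + 2*x1z))
      fit = glm(y.s ~ x1z, family = "binomial")
      cat("seed =", s, " estimates =", round(coef(fit), 2), "\n")
    }

With only n = 20, the estimated line can sit well above or below the true one, which is exactly what the second panel shows.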
    