Diagnosing Logistic Regression with Skewed Predictor Distributions


I'm fitting a logistic regression model Y ~ X1…X10 to 10,000 observations, where my goal is to estimate the effect of each covariate on Y.

My first issue is deciding what transformations to apply to the predictors. They all have very skewed distributions, and in some cases take on very few (<6) distinct values. I've included a typical histogram below. Would it make sense to cast them as factors, or otherwise apply some power transformation?

I'm having some trouble diagnosing the model. The fit is quite poor – the classification rate is only 65%, and the deviance is:

Null deviance: 13568  on 9999  degrees of freedom
Residual deviance: 13143  on 9989  degrees of freedom
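For reference, the deviance drop can be tested directly from these numbers: 13568 − 13143 = 425 on 9999 − 9989 = 10 degrees of freedom, a likelihood-ratio chi-squared test that is highly significant even though the classification rate is modest. In R:

pchisq(13568 - 13143, df = 9999 - 9989, lower.tail = FALSE)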

The diagnostic plots shown below seem to suggest:

  1. There are some very high leverage, high residual points affecting the fit of the model. I think these are the points in the tails of the distributions, which don't represent the rest of the data. Would transforming the data solve this problem?
  2. There is a trend in the residuals that the model fails to capture. How should I find this trend?
  3. The deviance residuals are definitely not normally distributed.

I guess my main question is how to find a suitable transformation of the data, and what the best next step is toward explaining the remaining variance in the data.

[Figure: typical distribution for a predictor]

[Figure: regression diagnostic plots]

Best Answer

The distribution of the predictors is almost irrelevant in regression, as you are conditioning on their values. Changing to factors is not needed unless there are very few unique values and some of them are not well populated.
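As a quick way to decide (a minimal sketch, assuming the predictors live in a data frame d with columns x1 through x10, which are placeholder names):

# How many distinct values does each predictor take?
sapply(d[paste0("x", 1:10)], function(v) length(unique(v)))
# For a predictor with few distinct values, check how well
# populated each value is before deciding on factor()
table(d$x1)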

But with very skewed predictors the model may fit better after a transformation. I tend to use $\sqrt{x}$ and $x^{\frac{1}{3}}$ because these allow zeros, unlike $\log(x)$. Then, when the sample size allows, I expand the transformed variables in a regression spline to make them fit adequately, e.g.

require(rms)
cuber <- function(x) x^(1/3)   # cube-root transform; defined at 0, unlike log
f <- lrm(y ~ rcs(cuber(x1), 4) + rcs(cuber(x2), 4) + rcs(x3, 5) + sex)

rcs means "restricted cubic spline" (natural spline), and the number after the variable or transformed variable is the number of knots (two more than the number of nonlinear terms in the spline). Making the distribution more symmetric first (here with the cube root) frequently means fewer knots are needed for a good fit.
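To see what this expansion produces (a minimal sketch using rcspline.eval from the Hmisc package, with simulated data standing in for a skewed predictor):

library(Hmisc)
set.seed(1)
x <- rexp(200)                                   # right-skewed stand-in predictor
basis <- rcspline.eval(x^(1/3), nk = 4, inclx = TRUE)
dim(basis)   # 200 x 3: the linear term plus 4 - 2 = 2 nonlinear spline terms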

AIC can help in choosing the number of knots $k$ if you force all variables to have the same number of knots. Below, $k=0$ corresponds to linearity after the initial transformation.

for(k in c(0, 3:7)) {
  # rcs() needs at least 3 knots, so treat k = 0
  # (plain linearity after the initial transformation) separately
  f <- if (k == 0)
         lrm(y ~ cuber(x1) + cuber(x2) + x3 + sex)
       else
         lrm(y ~ rcs(cuber(x1), k) + rcs(cuber(x2), k) + rcs(x3, k) + sex)
  print(AIC(f))
}
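A small variation, if you would rather collect the AIC values and pick the minimizing $k$ programmatically (same assumed variables as above):

ks <- c(0, 3:7)
aics <- sapply(ks, function(k)
  AIC(if (k == 0) lrm(y ~ cuber(x1) + cuber(x2) + x3 + sex)
      else lrm(y ~ rcs(cuber(x1), k) + rcs(cuber(x2), k) + rcs(x3, k) + sex)))
ks[which.min(aics)]   # number of knots with the lowest AIC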