[Math] Low Leverage in Residuals, Logistic Regression

descriptive statistics, regression, regression analysis, statistics

I am interpreting a logistic regression and I have an observation with a high residual but low leverage.

I thought that meant it is an outlier (a bad prediction) but not influential (if you drop it, things don't change very much).

The problem is that Cook's distance and the DFBETAS for this observation are higher than for the rest, and if I redo the analysis without the outlier, things change considerably.

Do you know why? I mean, Cook's distance, DFBETAS, etc. depend on leverage.

On the other hand, I also have the opposite result: points with very high leverage and a low residual don't change things if you drop them.

Best Answer

The concepts of a "leverage" point and an "influential" point are not equivalent. High leverage in regression analysis refers to observations with outlying values of the independent variables. More technically, high leverage points can be described as those having no neighbouring points in $\mathbb{R}^n$, where $n$ is the number of independent variables in the regression. When leverage points occur in a regression analysis, the fitted model commonly passes relatively close to them. As a result, high leverage observations have the potential to produce large changes in the parameter estimates when they are deleted. This explains why influential points "often" have high leverage, and vice versa. However, it is common to find observations for which this is not true.

A high leverage point is not necessarily an influential point. The best way to understand this is to imagine a point at the extreme of the range of the independent variables that is nevertheless well aligned with the model fitted to all the other observations: this point would clearly have high leverage, but deleting it would have only a small effect on the model. In other words, it would be an outlier in the independent variables, but not an influential point.

On the other hand, a low leverage point is not necessarily a non-influential point. A way to understand this is to imagine a point in the middle of the range of the independent variables that is not aligned with the model fitted to all the other observations (e.g., an outlier in the dependent variable). This point would have low leverage, but deleting it could have a considerable impact on the regression model. Cook's distance grows with both the squared residual and the leverage, so a large enough residual can produce a high value even when the leverage is low.
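The two cases above can be checked numerically. The sketch below uses simple linear regression rather than logistic regression to keep the formulas short (for a logistic model the same quantities are computed from the weighted hat matrix). The data, the function names `fit_line` and `diagnostics`, and the index constants are all made up for illustration.

```python
# Synthetic data (an assumption for illustration): nine points exactly on
# y = 2x + 1, plus a high-leverage point that follows the line (x = 30)
# and a mid-range point that is an outlier in y (x = 5).
xs = [float(v) for v in range(1, 10)] + [30.0, 5.0]
ys = [2.0 * v + 1.0 for v in xs]
ys[-1] = 25.0  # the line would predict 11 here
HIGH_LEV, OUTLIER = 9, 10  # positions of the two special points


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b


def diagnostics(x, y):
    """Leverage, residual and Cook's distance for each observation."""
    n, p = len(x), 2  # p = number of fitted coefficients
    a, b = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)  # residual variance
    lev = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    cook = [e * e / (p * s2) * h / (1.0 - h) ** 2 for e, h in zip(resid, lev)]
    return lev, resid, cook


lev, resid, cook = diagnostics(xs, ys)
for name, i in (("high leverage, on the line", HIGH_LEV),
                ("low leverage, outlier in y", OUTLIER)):
    print(f"{name}: leverage={lev[i]:.3f} "
          f"residual={resid[i]:.3f} Cook's D={cook[i]:.3f}")
```

In this synthetic data the point at $x=30$ has by far the largest leverage but a small Cook's distance, while the outlier at $x=5$ has low leverage and a much larger Cook's distance, because Cook's distance also scales with the squared residual.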

Related Solutions

[Math] Binary Logistic Regression Model Processing

1. How do I best assess the accuracy of the model?

Besides the usual hypothesis-testing quantities, such as the p-value, you can compute the precision and recall (see Wikipedia) if your response is categorical.
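As a minimal sketch of the precision and recall calculation for a binary response: the labels and fitted probabilities below are hypothetical, as is the 0.5 threshold used to turn probabilities into predicted classes.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical fitted probabilities, thresholded at 0.5 (an assumption).
probs = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3]
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1 if p >= 0.5 else 0 for p in probs]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision is the share of predicted positives that are truly positive, and recall is the share of true positives that the model finds. Both depend on the chosen threshold.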

2. Would it be better to reduce the number of predictor variables to improve the accuracy of the model? If so, how should I best do this?

You can improve the accuracy while also reducing the number of predictors by adding the L1 norm of the regression weights to the objective function. This method is called the LASSO. It introduces an extra parameter that you need to tune to balance the sparsity of the model, in terms of the number of predictor variables, against its accuracy.
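For a logistic model, this penalized objective can be written as follows. The notation is assumed here rather than taken from the question: $\beta_0$ is the intercept, $\beta$ the vector of $k$ coefficients, $x_i$ the predictors of observation $i$, $y_i \in \{0,1\}$ the response, and $\lambda \ge 0$ the tuning parameter.

$$ (\hat\beta_0, \hat\beta) = \arg\min_{\beta_0,\,\beta} \left\{ -\sum_{i=1}^{n} \left[ y_i \log p_i + (1-y_i)\log(1-p_i) \right] + \lambda \sum_{j=1}^{k} |\beta_j| \right\}, \qquad p_i = \frac{1}{1+e^{-(\beta_0 + x_i'\beta)}} $$

With $\lambda = 0$ this is ordinary maximum likelihood. Increasing $\lambda$ shrinks the coefficients and sets more of them exactly to zero, which is what removes predictors from the model.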

3. How do I check whether multicollinearity is significant? If so, what actions should I take to improve the model?

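You can add an interaction term, such as $x_1x_2$, to the set of predictor variables, where $x_1$ and $x_2$ are your original predictor variables. Note that an interaction term models a joint effect of the predictors rather than measuring collinearity itself; a standard check is the variance inflation factor (VIF) of each predictor.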
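For two predictors, the variance inflation factor (VIF), a common multicollinearity diagnostic, reduces to $1/(1-r^2)$, where $r$ is the correlation between $x_1$ and $x_2$. The sketch below illustrates only that special case. The data and function names are hypothetical, and the cutoff of 10 in the comment is a rule of thumb rather than a formal test.

```python
import math


def pearson_r(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    abar, bbar = sum(a) / n, sum(b) / n
    cov = sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b))
    var_a = sum((ai - abar) ** 2 for ai in a)
    var_b = sum((bi - bbar) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


# Hypothetical predictors: x2_collinear is roughly 2 * x1, x2_unrelated is not.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2_collinear = [2.1, 3.9, 6.2, 7.8, 10.1]
x2_unrelated = [3.0, 1.0, 4.0, 1.0, 5.0]

vif_high = vif_two_predictors(x1, x2_collinear)
vif_low = vif_two_predictors(x1, x2_unrelated)
# A VIF above about 10 is often read as a sign of strong multicollinearity.
print(f"collinear pair: VIF={vif_high:.1f}, unrelated pair: VIF={vif_low:.2f}")
```

With more than two predictors, the VIF of predictor $j$ is $1/(1-R_j^2)$, where $R_j^2$ comes from regressing that predictor on all the others. When VIFs are large, common remedies are dropping or combining the correlated predictors, or using a penalized fit.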

4. What outputs/plots should I produce to demonstrate the above?

You can try an ROC curve; see Wikipedia for details.
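An ROC curve plots the true positive rate against the false positive rate as the classification threshold is lowered. The following sketch computes the points of the curve and the area under it (AUC) by the trapezoidal rule. The labels, scores, and function names are hypothetical.

```python
def roc_curve_points(y_true, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold over distinct scores."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels and fitted probabilities from a logistic model.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_curve_points(y_true, scores)
area = auc(points)
print(points)
print(f"AUC={area:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.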

5. Finally, is there a better way of doing things?

I think it depends on your specific problem.

[Math] Variance of residuals from simple linear regression

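The estimated variance of the residuals in an OLS regression is simply: $$ \operatorname{Var}(e)=\frac{e'e}{n-(k+1)} $$ where $e$ is the vector of residuals, $n$ is the number of observations, and $k+1$ is the number of estimated coefficients ($k$ regressors plus a constant).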
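As a small worked example of this formula for simple linear regression ($k = 1$), with made-up data and a hypothetical function name:

```python
def residual_variance(x, y):
    """Estimated residual variance e'e / (n - (k + 1)) for one regressor (k = 1)."""
    n, k = len(x), 1
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return sum(e * e for e in residuals) / (n - (k + 1))


# Made-up data: the fitted line is y = 0.5 + 0.8x.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]
var_e = residual_variance(x, y)
print(f"estimated residual variance: {var_e:.3f}")
```

Here the residuals are $-0.3, 0.9, -0.9, 0.3$, so $e'e = 1.8$ and the estimate is $1.8 / (4 - 2) = 0.9$.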
