Solved – Difference Between Endogeneity and Multicollinearity in Logistic Regression

endogeneity, logistic, multicollinearity, regression

I am currently working with logistic regression and testing my model over and over again. However, I am still unsure about the terms endogeneity and multicollinearity. As I understand it, multicollinearity is correlation between one independent variable and another independent variable. Endogeneity is correlation between an independent variable and the error term. Is this correct so far?
Thank you!

Best Answer

Endogeneity:

I will try to use an example here. Let’s say there is a group of students planning to sit for the GRE.

• Some of them decided to register for online training courses prior to the GRE exam.

• Naturally, you would want to know whether online training courses help students obtain good scores.

• To answer this question, it is tempting to use GRE scores as the dependent variable and a dummy variable to indicate whether or not someone took online training courses prior to the exam (you will have other independent variables in the regression as well):

o Scores = b0 + b1*Online Course Indicator + b2*Age + b3*math_major + ... + error

• The twist here is: what if, on average, it was the weaker students who chose to go through the online training program? In that case you might see a negative coefficient on the dummy variable, Online Course Indicator, because the weaker students will have lower scores on average than the stronger ones who did not take online courses.

• This coefficient might be very misleading, because it would suggest that the online courses are ineffective and yield lower scores on average, which is not true in this scenario.

• The true story here is that the ‘Online Course Indicator’ also measures the intelligence level of students to some degree. Why? Because the weaker ones are more likely to take online courses prior to the exam.

• Now think about the error term. What does it measure? It also captures unobservable things such as intelligence, motivation, etc. So your error is correlated with one of the independent variables, specifically ‘Online Course Indicator’. THIS IS CALLED ENDOGENEITY.
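To make the story above concrete, here is a minimal simulation sketch in Python (NumPy only). Everything in it is an assumption made for illustration: the variable names, the sample size, the "true" course effect of +5 points, and the way unobserved ability drives both course-taking and scores. It fits a plain linear model for the score, matching the equation above rather than a logistic regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000  # assumed sample size

# Unobserved ability: in real data this would sit inside the error term.
ability = rng.normal(size=n)

# Assumption: weaker students are more likely to take the online course.
p_course = 1.0 / (1.0 + np.exp(2.0 * ability))
course = (rng.random(n) < p_course).astype(float)

# Assumption: the course truly ADDS 5 points; ability adds 10 per unit.
true_effect = 5.0
score = 150.0 + true_effect * course + 10.0 * ability + rng.normal(size=n)


def ols(y, *columns):
    """Least-squares coefficients for y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Ability omitted: the course dummy is correlated with the error term.
b_naive = ols(score, course)

# Ability included (only possible here because we simulated it).
b_full = ols(score, course, ability)

print("course coefficient, ability omitted :", round(b_naive[1], 2))
print("course coefficient, ability included:", round(b_full[1], 2))
```

Under these assumed numbers the first coefficient should come out negative even though the course helps by construction, while the second should land close to the assumed +5. In real data ability is not observed, so the second regression is not available; that is exactly why endogeneity is a problem.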

Multicollinearity:

I am going to use the same example. Let’s say you decided to include family income and a neighborhood dummy indicating whether or not the student is from a wealthy neighborhood.

o Scores = b0 + b1*Online Course Indicator + b2*Age + b3*family income + b4*Dummy for Wealthy Neighborhood + ... + error

Notice that family income and the Dummy for Wealthy Neighborhood are correlated. That is, students from wealthy neighborhoods will have higher family income on average, so both variables are measuring the same thing to some extent. This is what we call multicollinearity.
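A rough way to see this in numbers is to check how well each independent variable can be predicted from the others, which is what the variance inflation factor (VIF) summarizes. The sketch below uses simulated data; the income distribution, the rule that generates the wealthy-neighborhood dummy, and the helper name `vif` are all assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000  # assumed sample size

# Assumed data: family income (in thousands) and student age.
income = rng.normal(60.0, 20.0, size=n)
age = rng.normal(24.0, 3.0, size=n)

# Assumption: living in a wealthy neighborhood is mostly driven by income.
wealthy = (income + rng.normal(0.0, 10.0, size=n) > 75.0).astype(float)

X = np.column_stack([income, wealthy, age])
names = ["family income", "wealthy neighborhood dummy", "age"]


def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)


corr = np.corrcoef(income, wealthy)[0, 1]
vifs = [vif(X, j) for j in range(X.shape[1])]

print("corr(income, wealthy dummy):", round(corr, 2))
for name, value in zip(names, vifs):
    print(f"VIF for {name}: {value:.2f}")
```

Here income and the dummy should show VIFs clearly above 1, while age, which was generated independently, should stay near 1. Note the contrast with endogeneity: multicollinearity is visible in the observed regressors themselves, whereas endogeneity involves the unobserved error and cannot be checked this directly.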
