Solved – Question about using a multiplicative dummy variable

econometrics, estimation, linear model, regression

In many econometric models, a change in the response variable is harder to achieve over some intervals than over others. But I believe this is often not considered when estimating the model.

For example, suppose $Y_{st}$ represents the proportion of students in a certain school $s$ passing a standardized test in year $t$. Let $R_{st}$ be the academic resources available to students (e.g. books in the library), and let $I_{st}$ represent the average parental income of the students. In this case $Y_{st} \in [0,1]$, and we would like to estimate the effect of $R_{st}$ on $Y_{st}$.

We could model this as follows:

$Y_{st} = \alpha_{0} +\alpha_{1}R_{st} + \alpha_{2}I_{st} + \delta_{t}+ u_{st}$, where $u_{st}$ is an additive error term and $\delta_{t}$ are time dummies. In this context of pass rates, it is intuitively more difficult for a school to increase its pass rate from 95% to 100% than to go from 45% of students passing to 50%. Consequently, the effect of $R_{st}$ on $Y_{st}$ should be given less weight in the latter situation (45% to 50%) than in the former (95% to 100%). If we compared two schools in which the same increase in $R_{st}$ led to these two results, the school that went from 95% to 100% clearly invested more efficiently.

My idea is to interact $R_{st}$ with a multiplicative dummy variable $\beta_{t}$, where $\beta_{t}$ takes on different values depending on the initial value of $Y_{st}$. Is there a standard way to take this into account in the model? Are there other factors that could improve this model?

Best Answer

In your setting, logistic regression seems to be the natural way to go, since your percentages are derived from a count (the number of successful students per school). Interpreting effects through odds ratios addresses your concern that it is more difficult to move from 90% to 95% than from 50% to 55%. Moreover, predicted percentages cannot fall below 0 or rise above 100, and the binomial variance accounts for the heteroscedasticity near the boundaries.
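To make the odds-ratio point concrete, here is a minimal sketch using only the Python standard library. It does not fit a model; it only compares pass rates on the log-odds scale. The helper names `logit` and `inv_logit` and the effect size `effect = 0.5` are illustrative assumptions, not something taken from the question's data.

```python
import math

def logit(p):
    """Log-odds of a proportion p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Map a log-odds value back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Log-odds change needed for the same 5-point gain at two baselines.
shift_mid = logit(0.55) - logit(0.50)    # 50% -> 55%
shift_high = logit(0.95) - logit(0.90)   # 90% -> 95%

# Conversely, apply one fixed log-odds effect (hypothetical value)
# at both baselines and see how many percentage points it buys.
effect = 0.5
gain_mid = inv_logit(logit(0.50) + effect) - 0.50
gain_high = inv_logit(logit(0.90) + effect) - 0.90

print(f"log-odds shift 50%->55%: {shift_mid:.3f}")
print(f"log-odds shift 90%->95%: {shift_high:.3f}")
print(f"gain from effect={effect} at 50%: {gain_mid:.3f}")
print(f"gain from effect={effect} at 90%: {gain_high:.3f}")
```

On the log-odds scale the 90% to 95% move is a larger shift than the 50% to 55% move, and a fixed log-odds effect yields fewer percentage points near the boundary. That is the behaviour the question tries to obtain with a multiplicative dummy, and the logit link provides it without one.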

You might want to have a look at "What are the issues with using percentage outcome in linear regression?" for models with a percentage response.
