Though I agree with Glen_b that rates like this are scaled counts, whether or not you want to use a count model depends on what the denominator in that scaled count is. If $y$ is something like the market share of Ford in the US, then the denominator is in the millions, and you should probably treat $y$ as continuous.
So, I'll answer the question of what you should do when it is OK to treat $y$ as a continuous variable. Specifically, $y_{it}$ is then the probability that a randomly selected member of group $i$ passes the test at time $t$. We want to let $y$ depend on some variable(s) $x$ but in a way which respects the facts that 1) $x\beta$ can be any real number and 2) $y$ nevertheless is a probability and must stay between 0 and 1.
What we want to do, I guess, is come up with a function $g(x\beta)$ so that we can model $y=g(x\beta)$ in a way which respects the nature of $y$ as a probability and will accept any real number as its argument. In addition, so that the relationship between $y$ and $x$ is not too hard to interpret, let's also require that $g$ be monotone increasing. So, do we know of any functions which have the real line as their domain, the interval $(0,1)$ as their range, and are strictly increasing?
That's an easy question, right? The cumulative distribution function of every single continuous random variable (with density strictly positive on the real line) is such a function. So, let's consider $F$ as the CDF for some continuous random variable. We might then model:
\begin{align}
y_{it} &= F(x_{it}\beta)
\end{align}
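As a quick sanity check that such functions exist, take the logistic CDF (one candidate for $F$, available in base R as `plogis`): it maps any real number into $(0,1)$ and is strictly increasing.

```r
# plogis() is the logistic CDF: F(z) = 1/(1 + exp(-z)).
# It maps any real number into (0,1) and is strictly increasing.
z <- c(-30, -1, 0, 1, 30)
p <- plogis(z)
print(p)
all(p > 0 & p < 1)   # TRUE: range stays inside (0,1)
all(diff(p) > 0)     # TRUE: strictly increasing
```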
Hmmm. There is no error term. Two observations with the exact same $x$ will have to have the exact same $y$. That's no good. So, we need an error term. Do we put it inside the $F$ or outside? If we put it outside, then we are back to having to worry about giving it some weird distribution which keeps $y$ between 0 and 1, no matter what $F(x\beta)$ turns out to be. So, let's put it inside the $F$ and not worry about its distribution:
\begin{align}
y_{it} &= F(x_{it}\beta+\epsilon_{it})
\end{align}
Now, how do we estimate it? Not with OLS because $F$ isn't linear. Not with NLS because the error term is in the wrong place (gotta be outside the $F$ for that). Maximum likelihood, maybe, if we are willing to assume a distribution for $\epsilon$. I'm allergic to assuming distributions for error terms, so not that. I like OLS, and I stubbornly want to use it. The right-hand-side of the equation above looks almost OK for OLS---the stuff inside the $F$ is just right. If only we could dig out that stuff inside the $F$. But, since $F$ is strictly increasing, it has an inverse $F^{-1}$ and this means we can dig out that good right-hand-side, hiding there inside the icky $F$:
\begin{align}
y_{it} &= F(x_{it}\beta+\epsilon_{it})\\
F^{-1}(y_{it}) &= F^{-1}(F(x_{it}\beta+\epsilon_{it}))\\
F^{-1}(y_{it}) &= x_{it}\beta+\epsilon_{it}
\end{align}
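To see the inversion step concretely, take the logistic CDF as $F$ again: its quantile function `qlogis` plays the role of $F^{-1}$, and applying it exactly undoes $F$.

```r
# With F = plogis (logistic CDF) and F^{-1} = qlogis (its quantile function),
# F^{-1}(F(z)) recovers z for any real z.
z <- c(-3.2, 0, 0.5, 7)
recovered <- qlogis(plogis(z))
all.equal(recovered, z)   # TRUE (up to floating-point error)
```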
As long as you know $F$, you can just run this regression. Read in $y$ and $x$. Transform $y$ by running it through $F^{-1}$. Run the regression by OLS. Furthermore, you can use all the various techniques you know to deal with various problems with your data. Fix heteroskedasticity the way you always would, with Huber-White standard errors. Correct for clustering as you normally would. Is one of the $x$s endogenous? Use instrumental variables in the usual way. Or, in your case, I guess you are worried about either serial correlation or unobserved heterogeneity in your groups, so you want to estimate in first differences. No problem:
\begin{align}
F^{-1}(y_{it}) &= x_{it}\beta+\epsilon_{it}\\
F^{-1}(y_{it}) - F^{-1}(y_{it-1}) &= (x_{it}-x_{it-1})\beta+\epsilon_{it}-\epsilon_{it-1}\\
\Delta F^{-1}(y_{it}) &= \Delta x_{it}\beta+\Delta \epsilon_{it}
\end{align}
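To make the "fix problems the way you always would" point concrete, here is a minimal base-R sketch of the Huber-White correction on the transformed regression. The simulated data and $\beta=0.5$ are invented for illustration; in practice you would use a package such as `sandwich` (`vcovHC`) rather than computing the sandwich estimator by hand.

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- plogis(0.5 * x + rnorm(n))   # simulate y = F(x*beta + eps) with beta = 0.5
fit <- lm(qlogis(y) ~ x)          # OLS on the logit-transformed outcome

# Huber-White (HC1) standard errors by hand:
# V = n/(n-k) * (X'X)^{-1} X' diag(u^2) X (X'X)^{-1}
X <- model.matrix(fit)
u <- residuals(fit)
k <- ncol(X)
bread <- solve(crossprod(X))
meat  <- crossprod(X * u)         # X' diag(u^2) X
V <- (n / (n - k)) * bread %*% meat %*% bread
robust_se <- sqrt(diag(V))
robust_se
```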
What to use for $F$? The most common choice is the logistic distribution, whose inverse is the logit function $\ln\left( \frac{y_{it}}{1-y_{it}} \right)$. This regression is then called a grouped data logit or a grouped data logistic regression. The second most common choice is the standard normal, whose inverse has no closed form. That regression is called a grouped data probit. Here is how it goes in R:
mydata <- data.frame(y = c(0.5, 0.3, 0.2, 0.8, 0.1, 0.4),
                     x = c(17, 4, -12, 1, 3, 5),
                     i = c(1, 1, 1, 2, 2, 2),
                     t = c(1, 2, 3, 1, 2, 3))
attach(mydata)
# Apply the logit transform F^{-1}(y) = log(y/(1-y))
logity <- log(y / (1 - y))
# First-difference within groups: Delta y_t = y_t - y_{t-1}
Dly <- logity[2:6] - logity[1:5]
Dx  <- x[2:6] - x[1:5]
# Drop the differences that straddle the boundary between groups i
Dly <- Dly[i[2:6] == i[1:5]]
Dx  <- Dx[i[2:6] == i[1:5]]
summary(lm(Dly ~ Dx))
There are a couple of caveats. First, this will not work if you have any observations with either $y=1$ or $y=0$. Second, although you can interpret the sign and significance of the coefficients from your regression just the way you would for a normal regression model, you cannot interpret their magnitude in the same way (because the model is non-linear). Third, you cannot make predicted values in the way you naturally want to, as $\hat{y}=F(x\hat{\beta}_{\text{OLS}})$. This, again, is because $F$ is non-linear, so you can't just pass an expectation through it to get $\epsilon$ to go away. These latter two caveats (especially the last one) are called the re-transformation problem. You can find questions and answers on it at this site.
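The retransformation caveat shows up even in a two-line simulation: with mean-zero errors, $E[F(x\beta+\epsilon)]$ and $F(x\beta)$ differ because $F$ is non-linear. A hedged illustration with the logistic $F$ and an arbitrary value $x\beta = 2$:

```r
set.seed(2)
eps <- rnorm(1e5)                   # mean-zero errors
xb  <- 2                            # some fixed value of x*beta
F_of_xb <- plogis(xb)               # F(x*beta): the "natural" prediction
mean_y  <- mean(plogis(xb + eps))   # E[F(x*beta + eps)]: the actual mean of y
c(naive = F_of_xb, truth = mean_y)  # the two differ noticeably
```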
Best Answer
In your setting, logistic regression seems to be the natural way to go, since your percentages are derived from counts (number of successful students per school). Interpreting effects through odds ratios addresses your point that it is harder to move from 90% to 95% than from 50% to 55%. Moreover, predicted percentages can't fall below 0 or above 100, and you avoid the heteroscedasticity problems near the boundaries.
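A minimal sketch of that suggestion in R, assuming you have the per-school counts of passing and failing students (the data and variable names here are invented for illustration): a two-column `cbind(successes, failures)` response fits a binomial logistic regression that weights each school by its number of students.

```r
# Illustrative data: per-school counts of passing/failing students
schooldata <- data.frame(pass = c(45, 30, 80, 12),
                         fail = c(5, 20, 10, 28),
                         x    = c(2.1, 0.3, 3.5, -1.0))
# Binomial logistic regression on the counts
fit <- glm(cbind(pass, fail) ~ x, family = binomial, data = schooldata)
summary(fit)
exp(coef(fit))   # effects on the odds-ratio scale
```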
You might want to have a look at What are the issues with using percentage outcome in linear regression? for models with a percentage response.