Ordinary Least Squares – Estimation Issues for OLS with Bounded Response Variables

econometrics, estimation, least squares, panel data, regression

Suppose we have a first-differences regression with a bounded response variable,
$$\Delta y_{i} = \beta_1 \Delta X_{i}+\Delta\epsilon_i.$$ For example, suppose $y_{it} \in [0,1]$, where $y_{it}$ could represent the participation rate of a population in a given survey, or the pass rate of students on a standardized test in a given school $i$ in a given year $t$. $X_{it}$ is some variable that changes over time; for example, if $y_{it}$ is the pass rate in a given school, $X_{it}$ could be average class size. Our goal is to get an unbiased estimate of $\beta_1$.

If we estimate the above model by OLS, the following problem can occur. For simplicity, suppose we have two periods, $t = 1, 2$. If $y_{i1} = 1$ then $\Delta y_{i} \in [-1,0]$, whereas if $y_{i1} = 0$ then $\Delta y_{i} \in [0,1]$. So if $y_{i1}$ is close to $1$, $y_i$ can only increase a little, whereas if $y_{i1}$ is close to $0$, it can increase a lot. Realistically, this can lead to biased results: if $y_{it}$ were not bounded, an increase in $X_i$ in the second period might increase $y_i$ in the second period, but with a bounded response this is impossible when $y_{i1} = 1$. The issue is symmetric if increasing $X_i$ decreases $y_i$ when $y_{i1} = 0$.
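To make the concern concrete, here is a toy simulation (the data-generating process and all numbers are just made up for illustration): the latent slope is 0.3, but capping $y$ at 0 and 1 attenuates the first-differences OLS estimate:

# Toy simulation of the capping problem (all numbers invented):
# the latent slope is 0.3, but bounding y in [0,1] attenuates OLS
set.seed(1)
n  <- 5000
x1 <- rnorm(n, sd=2); dx <- rnorm(n)     # X in period 1 and its change
e1 <- rnorm(n, sd=0.2); e2 <- rnorm(n, sd=0.2)
ystar1 <- 0.6 + 0.3*x1 + e1              # latent, unbounded outcome
ystar2 <- 0.6 + 0.3*(x1 + dx) + e2
cap <- function(v) pmin(pmax(v, 0), 1)   # impose y in [0,1]
dy  <- cap(ystar2) - cap(ystar1)
coef(lm(dy ~ dx))                        # slope attenuated below the true 0.3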

Hence I suspect OLS might not be an appropriate estimator in this case, though under other circumstances it may work. My idea to fix this problem is to build the following into the model: given an initially low $y_{i1}$ (less than 0.5), it is easier for a change in $X_i$ to produce a higher $y_{i2}$ than it is when $y_{i1}$ is initially high (more than 0.5). I am not sure exactly how to do this; I suspect an indicator variable might work. Would this type of approach work? Any other ideas to fix this capping issue?

Best Answer

Though I agree with Glen_b that rates like this are scaled counts, whether or not you want to use a count model depends on what the denominator in that scaled count is. If $y$ is something like the market share of Ford in the US, then the denominator is in the millions, and you should probably treat $y$ as continuous.

So, I'll answer the question of what you should do when it is OK to treat $y$ as a continuous variable. Specifically, $y_{it}$ is then the probability that a randomly selected member of group $i$ passes the test at time $t$. We want to let $y$ depend on some variable(s) $x$ but in a way which respects the facts that 1) $x\beta$ can be any real number and 2) $y$ nevertheless is a probability and must stay between 0 and 1.

What we want to do, I guess, is come up with a function $g(x\beta)$ so that we can model $y=g(x\beta)$ in a way which respects the nature of $y$ as a probability and will accept any real number as its argument. In addition, so that the relationship between $y$ and $x$ is not too hard to interpret, let's also require that $g$ be monotone increasing. So, do we know of any functions which have the real line as their domain, the interval $(0,1)$ as their range, and are strictly increasing?

That's an easy question, right? The cumulative distribution function of every single continuous random variable (with density strictly positive on the real line) is such a function. So, let's consider $F$ as the CDF for some continuous random variable. We might then model:

\begin{align} y_{it} &= F(x_{it}\beta) \end{align}

Hmmm. There is no error term. Two observations with the exact same $x$ will have to have the exact same $y$. That's no good. So, we need an error term. Do we put it inside the $F$ or outside? If we put it outside, then we are back to having to worry about giving it some weird distribution which keeps $y$ between 0 and 1, no matter what $F(x\beta)$ turns out to be. So, let's put it inside the $F$ and not worry about its distribution:

\begin{align} y_{it} &= F(x_{it}\beta+\epsilon_{it}) \end{align}
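As a quick sanity check, here is a simulation from this model, taking $F$ to be the logistic CDF (plogis in R) and, purely to have something to draw from, a standard normal $\epsilon$ (the model itself leaves the distribution unspecified): $y$ lands strictly inside $(0,1)$ no matter how large $x\beta + \epsilon$ gets.

# Simulate from y = F(x*beta + eps) with F = plogis, the logistic CDF.
# The normal draw for eps is only for illustration; the model above
# leaves its distribution unspecified.
set.seed(1)
x <- rnorm(200)
y <- plogis(2*x + rnorm(200))   # beta = 2, say
range(y)                        # always strictly inside (0,1)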

Now, how do we estimate it? Not with OLS because $F$ isn't linear. Not with NLS because the error term is in the wrong place (gotta be outside the $F$ for that). Maximum likelihood, maybe, if we are willing to assume a distribution for $\epsilon$. I'm allergic to assuming distributions for error terms, so not that. I like OLS, and I stubbornly want to use it. The right-hand-side of the equation above looks almost OK for OLS---the stuff inside the $F$ is just right. If only we could dig out that stuff inside the $F$. But, since $F$ is strictly increasing, it has an inverse $F^{-1}$ and this means we can dig out that good right-hand-side, hiding there inside the icky $F$:

\begin{align} y_{it} &= F(x_{it}\beta+\epsilon_{it})\\ F^{-1}(y_{it}) &= F^{-1}(F(x_{it}\beta+\epsilon_{it}))\\ F^{-1}(y_{it}) &= x_{it}\beta+\epsilon_{it} \end{align}

As long as you know $F$, you can just run this regression. Read in $y$ and $x$. Transform $y$ by running it through $F^{-1}$. Run the regression by OLS. Furthermore, you can use all the various techniques you know to deal with various problems with your data. Fix heteroskedasticity the way you always would, with Huber-White standard errors. Correct for clustering as you normally would. Is one of the $x$s endogenous? Use instrumental variables in the usual way. Or, in your case, I guess you are worried about either serial correlation or unobserved heterogeneity in your groups, so you want to estimate in first differences. No problem:

\begin{align} F^{-1}(y_{it}) &= x_{it}\beta+\epsilon_{it}\\ F^{-1}(y_{it}) - F^{-1}(y_{it-1}) &= (x_{it}-x_{it-1})\beta+\epsilon_{it}-\epsilon_{it-1}\\ \Delta F^{-1}(y_{it}) &= \Delta x_{it}\beta+\Delta \epsilon_{it} \end{align}

What to use for $F$? The most common choice is the logistic distribution, whose CDF has inverse $F^{-1}(y_{it}) = \ln\left( \frac{y_{it}}{1-y_{it}} \right)$ (the logit). This regression is then called a grouped data logit or a grouped data logistic regression. The second most common choice is the normal, whose inverse CDF has no closed form; that regression is called a grouped data probit. Here is how it goes in R:

# Toy panel: two groups i, three periods t each
mydata <- data.frame(y=c(0.5,0.3,0.2,0.8,0.1,0.4), x=c(17,4,-12,1,3,5),
                     i=c(1,1,1,2,2,2), t=c(1,2,3,1,2,3))

# Apply the logit transform F^{-1}(y) = log(y/(1-y))
logity <- log(mydata$y/(1-mydata$y))

# First-difference (period t minus t-1) and drop the pair that
# straddles the boundary between the two i's
Dly  <- logity[2:6] - logity[1:5]
Dx   <- mydata$x[2:6] - mydata$x[1:5]
keep <- mydata$i[2:6] == mydata$i[1:5]

fd <- lm(Dly[keep] ~ Dx[keep])
summary(fd)
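And the usual fixes do carry over to the transformed regression. For instance, a sketch of Huber-White standard errors using the sandwich and lmtest packages (assuming they are installed; fd is the fitted model from above):

library(sandwich)  # heteroskedasticity-consistent variance estimators
library(lmtest)    # coeftest() for re-testing coefficients

# Same point estimates, Huber-White (HC1) standard errors
coeftest(fd, vcov = vcovHC(fd, type = "HC1"))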

There are a couple of caveats. First, this will not work if any observations have $y=1$ or $y=0$, since $F^{-1}$ is $\pm\infty$ at those points. Second, although you can interpret the sign and significance of the coefficients just as you would in a normal regression model, you cannot interpret their magnitude the same way (because the model is non-linear). Third, you cannot make predicted values the way you naturally want to, as $\hat{y}=F(x\hat{\beta}_{\text{OLS}})$. This, again, is because $F$ is non-linear, so you can't just pass an expectation through it to make $\epsilon$ go away. These latter two caveats (especially the last one) are called the re-transformation problem. You can find questions and answers on it at this site.
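The last caveat is easy to see numerically. Here is a quick check with a logistic $F$ and, just for illustration, a standard normal $\epsilon$ and an arbitrary value of $x\beta$: the expectation does not pass through the non-linear $F$.

# E[F(xb + eps)] differs from F(xb) because F is non-linear
set.seed(1)
eps <- rnorm(1e6)
xb  <- 1.5                 # an arbitrary value of x*beta
mean(plogis(xb + eps))     # roughly 0.78
plogis(xb)                 # roughly 0.82 -- not the same thing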
