Though I agree with Glen_b that rates like this are scaled counts, whether or not you want to use a count model depends on what the denominator in that scaled count is. If $y$ is something like the market share of Ford in the US, then the denominator is in the millions, and you should probably treat $y$ as continuous.
So, I'll answer the question of what you should do when it is OK to treat $y$ as a continuous variable. Specifically, $y_{it}$ is then the probability that a randomly selected member of group $i$ passes the test at time $t$. We want to let $y$ depend on some variable(s) $x$ but in a way which respects the facts that 1) $x\beta$ can be any real number and 2) $y$ nevertheless is a probability and must stay between 0 and 1.
What we want to do, I guess, is come up with a function $g(x\beta)$ so that we can model $y=g(x\beta)$ in a way which respects the nature of $y$ as a probability and will accept any real number as its argument. In addition, so that the relationship between $y$ and $x$ is not too hard to interpret, let's also require that $g$ be monotone increasing. So, do we know of any functions which have the real line as their domain, the interval $(0,1)$ as their range, and are strictly increasing?
That's an easy question, right? The cumulative distribution function of every single continuous random variable (with density strictly positive on the real line) is such a function. So, let's consider $F$ as the CDF for some continuous random variable. We might then model:
\begin{align}
y_{it} &= F(x_{it}\beta)
\end{align}
Hmmm. There is no error term. Two observations with the exact same $x$ will have to have the exact same $y$. That's no good. So, we need an error term. Do we put it inside the $F$ or outside? If we put it outside, then we are back to having to worry about giving it some weird distribution which keeps $y$ between 0 and 1, no matter what $F(x\beta)$ turns out to be. So, let's put it inside the $F$ and not worry about its distribution:
\begin{align}
y_{it} &= F(x_{it}\beta+\epsilon_{it})
\end{align}
Now, how do we estimate it? Not with OLS, because $F$ isn't linear. Not with NLS, because the error term is in the wrong place (it has to be outside the $F$ for that). Maximum likelihood, maybe, if we are willing to assume a distribution for $\epsilon$. I'm allergic to assuming distributions for error terms, so not that. I like OLS, and I stubbornly want to use it. The right-hand side of the equation above looks almost OK for OLS: the stuff inside the $F$ is just right. If only we could dig out that stuff inside the $F$. But, since $F$ is strictly increasing, it has an inverse $F^{-1}$, and this means we can dig out that good right-hand side, hiding there inside the icky $F$:
\begin{align}
y_{it} &= F(x_{it}\beta+\epsilon_{it})\\
F^{-1}(y_{it}) &= F^{-1}(F(x_{it}\beta+\epsilon_{it}))\\
F^{-1}(y_{it}) &= x_{it}\beta+\epsilon_{it}
\end{align}
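As a quick aside, the inversion is concrete in R: the built-in CDF/quantile pairs for the two distributions we will meet below already do exactly this (an illustration only, not part of the estimation itself):

# The logistic CDF and its inverse (the logit)
plogis(0.5)           # F maps the real line into (0,1)
qlogis(plogis(0.5))   # F^{-1}(F(z)) recovers z = 0.5
# The standard normal CDF and its inverse (the probit)
qnorm(pnorm(0.5))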
As long as you know $F$, you can just run this regression. Read in $y$ and $x$. Transform $y$ by running it through $F^{-1}$. Run the regression by OLS. Furthermore, you can use all the various techniques you know to deal with various problems with your data. Fix heteroskedasticity the way you always would, with Huber-White standard errors. Correct for clustering as you normally would. Is one of the $x$s endogenous? Use instrumental variables in the usual way. Or, in your case, I guess you are worried about either serial correlation or unobserved heterogeneity in your groups, so you want to estimate in first differences. No problem:
\begin{align}
F^{-1}(y_{it}) &= x_{it}\beta+\epsilon_{it}\\
F^{-1}(y_{it}) - F^{-1}(y_{it-1}) &= (x_{it}-x_{it-1})\beta+\epsilon_{it}-\epsilon_{it-1}\\
\Delta F^{-1}(y_{it}) &= \Delta x_{it}\beta+\Delta \epsilon_{it}
\end{align}
What to use for $F$? The most common choice is the logistic distribution, whose inverse is the logit function $\ln\left( \frac{y_{it}}{1-y_{it}} \right)$. This regression is then called a grouped data logit or a grouped data logistic regression. The second most common choice is the standard normal CDF, whose inverse (the probit) has no closed form. That regression is called a grouped data probit. Here is how it goes in R:
mydata <- data.frame(y=c(0.5,0.3,0.2,0.8,0.1,0.4), x=c(17,4,-12,1,3,5),
                     i=c(1,1,1,2,2,2), t=c(1,2,3,1,2,3))
attach(mydata)
# Apply the logit transform F^{-1}(y) = log(y/(1-y))
logity <- log(y/(1-y))
# First-difference the data: observation t minus observation t-1
Dly <- logity[2:6] - logity[1:5]
Dx  <- x[2:6] - x[1:5]
# Drop the differences that straddle a boundary between groups i
keep <- i[2:6] == i[1:5]
Dly <- Dly[keep]
Dx  <- Dx[keep]
# Estimate the differenced equation by OLS
summary(lm(Dly ~ Dx))
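And, to make good on the earlier claim that heteroskedasticity is handled exactly as in any OLS regression, here is one way to get Huber-White standard errors, using the sandwich and lmtest packages (my addition, not part of the original recipe):

library(sandwich)
library(lmtest)
fit <- lm(Dly ~ Dx)
# Huber-White (heteroskedasticity-robust) standard errors
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))
# For a grouped data probit, replace the logit transform above with
# the inverse of the standard normal CDF:
# logity <- qnorm(y)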
There are a couple of caveats. First, this will not work if you have any observations with either $y=1$ or $y=0$. Second, although you can interpret the sign and significance of the coefficients from your regression just the way you would for a normal regression model, you cannot interpret their magnitude in the same way (because the model is non-linear). Third, you cannot make predicted values in the way you naturally want to, as $\hat{y}=F(x\hat{\beta}_{\text{OLS}})$. This, again, is because $F$ is non-linear, so you can't just pass an expectation through it to get $\epsilon$ to go away. These latter two caveats (especially the last one) are called the re-transformation problem. You can find questions and answers on it at this site.
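Neither of the last two caveats has a clean fix, but two common workarounds deserve a sketch (my own additions, shown for a levels rather than differenced fit): an ad hoc squeeze, often attributed to Smithson & Verkuilen, pulls boundary observations slightly inside $(0,1)$; and Duan's smearing estimator handles the retransformation problem by averaging the back-transformed residuals:

# Ad hoc squeeze for observations with y = 0 or y = 1
# (note: this slightly changes the estimand, so use with care)
n <- length(y)
y_squeezed <- (y*(n - 1) + 0.5)/n
# Duan-style smearing for predicted probabilities:
# estimate E[y|x] by averaging F(x*beta_hat + e_i) over the residuals e_i
# (here F = plogis, for the grouped logit)
smear_predict <- function(fit, newdata) {
  xb <- predict(fit, newdata = newdata)
  sapply(xb, function(b) mean(plogis(b + residuals(fit))))
}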
Least-squares estimation of a general SUR model is not the same as Pooled OLS on a Panel Data model (at least as these model labels are usually used).
Both are "system of equations" models, but the SUR model sprang from the observation that in many cases the disturbance vector of each regression equation is correlated with the disturbance vectors of the other equations. So the general SUR model does not impose the restriction that the unknown coefficients under estimation are identical across equations, as is the case in the usual Panel Data setup with Pooled OLS. Also, in benchmark Panel Data models, errors are assumed independent across equations.
In symbols, the general SUR model is ($N$ is number of equations/cross sectional samples, $T$ the number of available observations for each cross section)
$$y_{it} = X_{it}\beta_i + u_{it},\;\; i=1,...,N,\;\; t=1,...,T,\;\; E[u_{it}u_{jt}]\neq 0$$
while for a Panel data model and Pooled OLS we usually have
$$y_{it} = X_{it}\beta + u_{it},\;\; i=1,...,N,\;\; t=1,...,T,\;\; E[u_{it}u_{jt}]= 0$$
Note the two differences.
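To see the pooled restriction in code (an illustrative sketch only; plm is one standard R implementation, and paneldata, x1, x2 are hypothetical names):

library(plm)
# Pooled OLS: a single common coefficient vector for all cross sections
pooled <- plm(y ~ x1 + x2, data = paneldata,
              index = c("i", "t"), model = "pooling")
summary(pooled)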
A typical example of SUR: you have data on a few big firms from the same national economy, covering their financials (e.g. sales, profits, market share, credit lines), their investment spending (the dependent variable), and certain macroeconomic indicators for that economy. You can reasonably argue that
a) at least some of the coefficients on an explanatory variable are not the same across equations, since how, say, profitability affects investment spending may depend on the long-term strategy of each company, which is formulated by different decision makers, etc.,
and
b) at least some of the shocks/disturbances/other factors bundled into the "error term" of each regression are common to all equations.
So you are led to the SUR model. Cross-equation restrictions on SUR models may arise, but again, they are not usually of the kind imposed in the Panel Data literature (ideally we would have one unified "systems of equations" estimation theory with all the various models as special cases, but we don't).
The hoped-for benefit of a SUR specification is a gain in estimator efficiency, and the usual estimation method is (Feasible) Generalized Least Squares.
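In R, the systemfit package implements this FGLS procedure. A minimal sketch, with the equations and data names invented purely for illustration:

library(systemfit)
# Each equation keeps its own coefficient vector; FGLS exploits the
# estimated cross-equation correlation of the residuals
eq1 <- invest1 ~ profit1 + sales1
eq2 <- invest2 ~ profit2 + sales2
sur <- systemfit(list(firmA = eq1, firmB = eq2),
                 method = "SUR", data = firmdata)
summary(sur)   # includes the residual correlation matrix across equations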
The specific paper you mention suffers from a serious deficiency: the authors neither write down the theoretical equations describing their model nor describe their data set clearly. At first I thought they had indeed essentially implemented Pooled OLS, but this is not so; they have run SUR after all, as follows:
Their sample does not have a time dimension (so the index $t$ here does not represent time). Their dependent variables are "Infant mortality", "Child mortality", and "Child malnutrition" indices, so they have three equations, $N=3$. For each of the 43 countries they have $5$ values of each dependent variable (sub-sample averages), one per asset quantile. So the $T$ dimension for each equation should have, as they say, $5 \times 43 = 215$ observations for the dependent variables (in practice they have only $T=175$, due to missing values of the dependent variables).
For each equation $i$, and for each observation $t$, they have as regressors a constant, dummies indicating the asset quantile from which the value of the dependent variable came, etc., without indicating that a specific value of the dependent variable came from a specific country (though the regressor values are of course related to the country that the value of the dependent variable came from). So there is no idiosyncratic "country-specific" effect as such (since in any case there are regressors that reflect various aspects of each country).
So in, say, Table 3 of the paper, the first three columns (labeled 1, 2, 3) form one SUR system, while the next three columns (labeled 4, 5, 6) form another SUR system with some changes in the regressors. So the coefficient on, say, the regressor "GDP per capita" is indeed different across the three equations of the first SUR system.
The data series of the regressors for the three equations in each SUR system are identical except for the dummies indicating "asset quantile". So here SUR estimation is not numerically equivalent to equation-by-equation OLS.
The commentator "Fixed Effects" suggestion proposes a different approach: Specify $45$ cross sections/equations, each cross section a dependent variable from a specific country. For each cross-section, take the other dimension $T=5$ to be the $5$ asset quantiles (so no dummy variables for them here). (or vice versa). Etc. This would appear to suggest that we should specify three distinct FE models for each of the three dependent variables, i.e. not estimate jointly the three dependent variables... which again brings us back to the need to validate the SUR specification, by showing that indeed there appears to be correlation between the disturbance vector across equations -something that the authors appear that they haven't done/presented in the paper.
A control variable is also an independent variable, so it should be listed as an independent variable in the model in EViews. You then interpret the coefficient on the rgdp of all other countries as the elasticity of exports with respect to the rgdp of all other countries (if you put both in log form), other things remaining the same (i.e. controlling for all other factors).
The model should be as follows:
$$ \begin{align} \log(\text{export})_{it} &= \beta_0 + \beta_1 \log(\text{real gdp})_{it} + \beta_2 \log(\text{population})_{it} \\ &\quad + \beta_3\,\text{political stability}_{it} + \beta_4\,\text{real exchange rate}_{it} + \beta_5\log(\text{china gdp})_{t}\\ &\quad + \beta_6\log(\text{china popn})_{t} + \varepsilon_{it} \end{align} $$ where $\log(\text{export})_{it}$ is the log of exports from China to country $i$ and $t=1988, \ldots, 2011$. (China's GDP and population vary only over $t$, not across destination countries, hence the $t$ subscript on those two terms.)
In EViews, import the data as panel data and run OLS (under Estimate Equation), listing the variables in the same order as in the model, with c standing for the constant (and assuming the variables have been log-transformed where necessary).
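The command line would look something along these lines; the series names are placeholders, so substitute whatever your workfile actually uses (and drop the inline log() if you have already created log series):

ls log(export) c log(rgdp) log(pop) polstab rexch log(chinagdp) log(chinapop)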