Is it good practice to standardize your data in a regression with panel/longitudinal data?

r, regression, standardization

In general, I standardize my independent variables in regressions, in order to properly compare the coefficients (this way they have the same units: standard deviations). However, with panel/longitudinal data, I'm not sure how I should standardize my data, especially if I estimate a hierarchical model.

To see why this can be a problem, assume you have $i = 1, \ldots, n$ individuals measured over $t = 1, \ldots, T$ periods, with a dependent variable $y_{i,t}$ and one independent variable $x_{i,t}$. If you run a complete-pooling regression, it is fine to standardize over the whole dataset, $x.z = (x - \text{mean}(x))/\text{sd}(x)$, since this does not change the t-statistic of the slope. On the other hand, if you fit an unpooled regression, i.e., one regression for each individual, then you should standardize within each individual rather than over the whole dataset (in R code):

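x.z <- matrix(NA_real_, nrow = n, ncol = T)  # x is an n-by-T matrix
for (i in 1:n) {
  # standardize individual i's series by its own mean and sd
  x.z[i, ] <- (x[i, ] - mean(x[i, ])) / sd(x[i, ])
}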
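
The statement about the pooled case can be checked on simulated data. The following is only a sketch: the data, the panel dimensions, and the names `n_periods`, `x_z`, `t_raw` and `t_std` are made up for illustration and are not part of the question.

```r
set.seed(1)                      # simulated data, purely illustrative
n <- 20                          # hypothetical number of individuals
n_periods <- 5                   # hypothetical number of periods
x <- rnorm(n * n_periods, mean = 10, sd = 3)
y <- 2 + 0.5 * x + rnorm(n * n_periods)

# Pooled standardization: one mean and one sd for the whole dataset
x_z <- (x - mean(x)) / sd(x)

fit_raw <- lm(y ~ x)
fit_std <- lm(y ~ x_z)

# t-statistic of the slope in each fit
t_raw <- summary(fit_raw)$coefficients["x", "t value"]
t_std <- summary(fit_std)$coefficients["x_z", "t value"]

# Compare the two slope t-statistics
all.equal(t_raw, t_std)
```

Only the slope is invariant here. The slope estimate itself is multiplied by `sd(x)`, and the intercept and its t-statistic change because the intercept now refers to the mean of `x`.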

However, if you fit a simple hierarchical model with an intercept that varies by individual, then you are using a shrinkage estimator, i.e., you are estimating a model that lies between the pooled and unpooled regressions. How should I standardize my data in that case? Over the whole dataset, as in the pooled regression? Or within each individual, as in the unpooled case?

Best Answer

I can't see that standardization is a good idea, either in ordinary regression or in a longitudinal model. It makes predictions harder to obtain and usually doesn't solve a problem that needs solving. And what if you have $x$ and $x^2$ in the model? How do you standardize $x^2$? What if you have a continuous variable and a binary variable in the model? How do you standardize the binary variable? Certainly not by its standard deviation, which would cause low-prevalence variables to have greater importance.

In general it's best to interpret model effects on the original scale of $x$.
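
To make the binary-variable concern concrete, here is a small sketch in R. A 0/1 indicator with prevalence $p$ has population standard deviation $\sqrt{p(1-p)}$, so dividing it by its standard deviation rescales its coefficient by a factor that depends on prevalence alone. The helper `binary_sd`, the raw effect `beta_raw` and the prevalences are hypothetical values chosen for illustration.

```r
# Population standard deviation of a 0/1 indicator with prevalence p
binary_sd <- function(p) sqrt(p * (1 - p))

beta_raw <- 1.0                    # hypothetical effect on the original 0/1 scale
prevalence <- c(0.50, 0.10, 0.02)  # hypothetical prevalences

# Coefficient per one standard deviation of the indicator
beta_per_sd <- beta_raw * binary_sd(prevalence)

data.frame(
  prevalence  = prevalence,
  sd          = binary_sd(prevalence),
  beta_per_sd = beta_per_sd
)
```

The same raw effect produces a different standardized coefficient at each prevalence, so a comparison of standardized coefficients across indicators mixes prevalence with effect size.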
