I'd like to perform a fixed effects panel regression with two IVs (x1 and x2) and one DV (y), using robust standard errors. In Python I used the following command:
from linearmodels.panel import PanelOLS
import statsmodels.api as sm2

result = PanelOLS(data.y, sm2.add_constant(data[['x1', 'x2']]), entity_effects=True).fit(cov_type='robust')
print(result)
resulting in:
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                      y   R-squared:                        0.0008
Estimator:                   PanelOLS   R-squared (Between):             -0.0212
No. Observations:               34338   R-squared (Within):               0.0008
Date:                Tue, May 05 2020   R-squared (Overall):           3.076e-05
Time:                        11:29:40   Log-likelihood                -4.647e+05
Cov. Estimator:                Robust
                                        F-statistic:                      13.569
Entities:                        1304   P-value                           0.0000
Avg Obs:                       26.333   Distribution:                 F(2,33805)
Min Obs:                       0.0000
Max Obs:                       75.000   F-statistic (robust):             71.477
                                        P-value                           0.0000
Time periods:                      88   Distribution:                 F(2,33805)
Avg Obs:                       390.20
Min Obs:                       0.0000
Max Obs:                       499.00

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const       5.472e+05     6478.7     84.469     0.0000   5.346e+05   5.599e+05
x1            -758.82     70.912    -10.701     0.0000     -897.81     -619.83
x2            -322.77     60.629    -5.3238     0.0000     -441.61     -203.94
==============================================================================

F-test for Poolability: 1337.3
P-value: 0.0000
Distribution: F(530,33805)

Included effects: Entity
Because the results seemed a bit off, I tried to replicate them in Stata, using:
xtreg y x1 x2, fe vce(robust)
resulting in:
Fixed-effects (within) regression               Number of obs      =     34338
Group variable: ID                              Number of groups   =       531

R-sq:  within  = 0.0008                         Obs per group: min =         1
       between = 0.0010                                        avg =      64.7
       overall = 0.0004                                        max =        75

                                                F(2,530)           =      7.99
corr(u_i, Xb)  = -0.0205                        Prob > F           =    0.0004

                              (Std. Err. adjusted for 531 clusters in ID)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |  -758.8212   202.0153    -3.76   0.000     -1155.67   -361.9723
          x2 |  -322.7749   219.6023    -1.47   0.142    -754.1727    108.6229
       _cons |   547249.1    22976.5    23.82   0.000     502112.9    592385.3
-------------+----------------------------------------------------------------
     sigma_u |  1266542.2
     sigma_e |  183793.32
         rho |  .97937616   (fraction of variance due to u_i)
------------------------------------------------------------------------------
The results are different. In particular, the p-value for x2 as well as the average and minimum observations per group seem to be off. I do not understand what I am doing wrong in the Python version. Is the command I'm using correct? Have I missed a fundamental difference between the two models?
EDIT: As @Jesper for President pointed out, there are some differences in the way Stata and Python interpret the data. Here is what I have found out so far: my time variable consists of dates. As some dates are missing, Python seems to fill in the missing ones (Stata Obs per group max: 75 vs. Python Time periods: 88). Further, Stata's vce(robust) does not seem to do the same as Python's cov_type='robust'. From the manuals, I understand that both use a White sandwich estimator of variance. Nevertheless, while the results without robust standard errors are almost identical (the difference in observations is the same as with robust SEs), including them leads to the difference in p-values shown above. Can anybody help me understand the problem?
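The time-period discrepancy can be reproduced with a toy unbalanced panel (synthetic data, pandas only): linearmodels reports the number of distinct dates in the whole sample, while Stata's "Obs per group: max" is the largest observation count within any single group, so the former can exceed the latter whenever no single entity observes every date.

```python
import pandas as pd

# Synthetic unbalanced panel: each entity observes only a subset of
# dates, so no single entity sees every date in the sample.
df = pd.DataFrame({
    'entity': ['A', 'A', 'B', 'B', 'C'],
    'date': pd.to_datetime(['2020-01-31', '2020-02-29',
                            '2020-02-29', '2020-03-31', '2020-04-30']),
})

# What linearmodels reports as "Time periods": distinct dates overall.
time_periods = df['date'].nunique()
# What Stata reports as "Obs per group: max": largest count per entity.
max_obs_per_group = df.groupby('entity').size().max()

print(time_periods, max_obs_per_group)  # 4 distinct dates, max 2 obs per entity
```

Here 4 > 2 for the same data, mirroring 88 vs. 75 above; neither program drops any observations, they just summarize the panel differently.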
Best Answer
If anybody has a similar question: I found my mistake. According to the linearmodels documentation, the robust SEs used above (White's robust covariance) are not robust in the presence of fixed effects; a clustered covariance is required instead. Stata seems to apply this automatically. The fix is to fit with entity-clustered standard errors (cov_type='clustered' together with cluster_entity=True).
This leads to output that is close enough to the Stata version for me to trust the results.