Solved – How to correctly fill in missing values in panel data

missing datapanel datastatatime series

So, I have panel data that look like this:

Panel data with missing values

The data that are missing, is because we were not able to find full data in the annual reports of the banks listed in the dataset. There is no real pattern for missing values, apart from some periods as the one illustrated in the image, the missing values are mostly random. For example, one missing value in 2000, other missing value in 2002, and so on. The banks are five in total, and we include quarterly data for the period 1998Q1 to 2013Q1.

We have a full series for one of the variables, beta. The other four are all missing some values.

    Variable |       Obs        
-------------+--------------
        beta |       305
    leverage |       290
         roa |       283
       r_rwa |       277
  rwa_assets |       277

I have searched, but haven't been able to get an answer to the following questions:

1) Is it important that my dataset has missing values in some of the variables?

2) What is the right method to use for filling in those missing values? If it's possible, can you illustrate it using Stata?

Best Answer

You asked two questions

1) Is it important that my dataset has missing values in some of the variables?

This is difficult to answer well because "importance" depends on your objectives and what we are missing. It's not clear how we can judge either. The good news is that the fraction of missing values seems fairly small.

2) What is the right method to use for filling in those missing values? If it's possible, can you illustrate it using Stata?

I doubt there is agreement on a single "right" method here. Given the panel character of the data, you could try anything from numerical interpolation to multiple imputation. Interpolation could use ipolate (official Stata), cipolate (SSC), csipolate (SSC), pchipolate (SSC), nnipolate (SSC). Interpolation will inevitably not restore all the variability lost. Multiple imputation is supported by the very extensive mi suite, but taking account of both cross-sectional and time dependencies is challenging.

Related Question