Solved – How to correctly fill in missing values in panel data

missing datapanel datastatatime series

So, I have panel data that look like this:

Panel data with missing values

The data that are missing, is because we were not able to find full data in the annual reports of the banks listed in the dataset. There is no real pattern for missing values, apart from some periods as the one illustrated in the image, the missing values are mostly random. For example, one missing value in 2000, other missing value in 2002, and so on. The banks are five in total, and we include quarterly data for the period 1998Q1 to 2013Q1.

We have a full series for one of the variables, beta. The other four are all missing some values.

    Variable |       Obs        
-------------+--------------
        beta |       305
    leverage |       290
         roa |       283
       r_rwa |       277
  rwa_assets |       277

I have searched, but haven't been able to get an answer to the following questions:

1) Is it important that my dataset has missing values in some of the variables?

2) What is the right method to use for filling in those missing values? If it's possible, can you illustrate it using Stata?

Best Answer

You asked two questions

1) Is it important that my dataset has missing values in some of the variables?

This is difficult to answer well because "importance" depends on your objectives and what we are missing. It's not clear how we can judge either. The good news is that the fraction of missing values seems fairly small.

2) What is the right method to use for filling in those missing values? If it's possible, can you illustrate it using Stata?

I doubt there is agreement on a single "right" method here. Given the panel character of the data, you could try anything from numerical interpolation to multiple imputation. Interpolation could use ipolate (official Stata), cipolate (SSC), csipolate (SSC), pchipolate (SSC), nnipolate (SSC). Interpolation will inevitably not restore all the variability lost. Multiple imputation is supported by the very extensive mi suite, but taking account of both cross-sectional and time dependencies is challenging.

Related Solutions

Solved – Missing data and feature selection

I don't think eliminating data is a good idea. Let me ask you this -- the features or variables that you are trying to eliminate, how do you know if they can be ignored or not? They could play vital roles in your model. So, I would consider imputation if I were you. The paper suggested by @Ben is good but this is also a great paper on missing data and multiple imputation. It will answer and/or guide you how to deal with imputing dependent variable as well.
What are the values (1, 3 , and 5) mean? I have experience with imputation with only SAS PROC MI and if the observed values are all identical, like in your case (value=5), in any variables, PROC MI will automatically drop that variable and not impute it.

Solved – Panel data descriptives, plots and ‘feel for the data’

There's a variety of stuff you could do. Try this first:

    local rnd = floor( uniform()*1000)
    line xad xrd sale fyear if mod( ticn, 1000 ) == `rnd', by( ticn )

Try this several times, as obviously you will be getting different firms filtered in.

Best Answer

Related Solutions

Solved – Missing data and feature selection

Solved – Panel data descriptives, plots and ‘feel for the data’

Related Question