Imputing Missing Income Variables

Consider a situation in which a country's government has information on everyone who paid personal income tax (PIT) in a recent year, but many people legally do not file a return because the tax on their income is withheld at the source. Employers withhold tax on wage and salary income and forward it to the government along with the tax identification number of each employee. Tax on dividend and other non-employment income is also withheld at the source, but those doing the withholding are not required to submit the tax identification number of each recipient. This quirk in the tax administration system means that we can piece together a micro sample of those with employment income (including both filers and nonfilers), but we have no micro data on dividend and other income for those who do not file an annual tax return. We do, however, have an aggregate total of tax collected on dividend income and an aggregate total of tax collected on other income.

Ignoring the missing micro data on dividend and other income leads to a large underestimate of annual PIT revenue by the microsimulation model. This is a missing data problem that needs to be solved in order to get reasonable tax revenue estimates from the data set that combines those who filed an annual return and those who did not file an annual return (but paid taxes through the withholding process). This kind of problem was encountered when developing a PIT model for Albania.

Solution Strategy

A substantial literature has developed over the past several decades on how to impute missing data in sensible ways. The key insight in this literature is that there is no way to know whether the imputed values are exactly correct, so the goal of the imputation process should be to generate plausible values from sensible assumptions.

The current state of the art in missing data imputation is the MICE algorithm, where the MICE acronym stands for Multivariate Imputation by Chained Equations. The most accessible account of the MICE algorithm is the van Buuren book: Flexible Imputation of Missing Data, Second Edition (Chapman and Hall, 2018). An earlier account of the same kind of missing data imputation method is in Trivellore Raghunathan, Missing Data Analysis in Practice (Chapman & Hall, 2015), where the "multivariate imputation by chained equations" approach is called the "sequential regression imputation method". Because it is essential to understand a number of basic concepts in these books, a reading guide to the van Buuren book has been prepared. The rest of this discussion assumes the reader is familiar with these basic concepts. Before delving into the van Buuren book, this short explanation of the MICE algorithm (illustrated with a simple data example) may provide a helpful introduction.

In this problem, the missing data are missing not at random (MNAR) and have a monotone missing data pattern.

The monotone missing data pattern means our imputation process does not need to be iterative: we can impute missing dividend income for all taxpayers with missing values, and then proceed to impute missing other income for all taxpayers with missing values of that variable. Also, our goal is to generate a single data set with imputed values replacing the missing values, so we are not doing multiple imputation as is sometimes done for public use data sets like the US Survey of Consumer Finances.
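The single-pass, ordered nature of imputation under a monotone pattern can be illustrated with a minimal sketch. Everything in it is hypothetical: the toy records, the field names, and the helper `impute_from_nearest_donor`. A simple nearest-donor rule stands in for the regression models that a real MICE implementation would use; the point is only that each variable is imputed once, in order, with earlier imputed variables available as predictors for later ones.

```python
from math import dist

# Hypothetical toy records: wages are observed for everyone, while
# dividends and other income are observed only for filers (None = missing).
records = [
    {"wages": 900.0, "dividends": 50.0, "other": 20.0},     # filer
    {"wages": 1500.0, "dividends": 200.0, "other": 80.0},   # filer
    {"wages": 2400.0, "dividends": 600.0, "other": 150.0},  # filer
    {"wages": 1000.0, "dividends": None, "other": None},    # nonfiler
    {"wages": 2300.0, "dividends": None, "other": None},    # nonfiler
]


def impute_from_nearest_donor(rows, target, predictors):
    """Fill missing `target` values by copying from the closest complete row.

    This nearest-donor rule is an illustrative stand-in for the regression
    step of a chained-equations imputation, not the method used in the source.
    """
    # Donors are fixed before the loop, so only observed values are copied.
    donors = [r for r in rows if r[target] is not None]
    for row in rows:
        if row[target] is None:
            point = [row[p] for p in predictors]
            nearest = min(
                donors, key=lambda d: dist(point, [d[p] for p in predictors])
            )
            row[target] = nearest[target]


# One pass in monotone order, with no iteration between variables:
# first dividends from wages, then other income from wages and dividends.
impute_from_nearest_donor(records, "dividends", ["wages"])
impute_from_nearest_donor(records, "other", ["wages", "dividends"])

for row in records:
    print(row)
```

Because the second call uses the dividends imputed by the first call as a predictor, the order of the two calls matters, which is what the monotone pattern permits without iterating.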

The MNAR feature of the data means we have a selection bias problem, which was first addressed by James Heckman in a different context. Generating plausible imputed values in the presence of this bias requires additional external aggregate information and the use of calibration techniques. The selection bias means that imputing missing dividend and other income using chained equations estimated on filers will lead to a large overestimate of those incomes among nonfilers (because nonfilers typically have lower incomes). The solution to this problem is twofold: nonfilers with very low employment income were assumed to have no dividend or other income, and the remaining nonfilers had their imputed values adjusted downward by an arbitrary amount. This arbitrary amount was varied until the imputed incomes generated the known aggregate tax revenue collected on dividend and other income. The calibration process, which uses the data set as model input to find the adjustment amount that reproduces the known aggregate tax revenues, relies on standard root-finding methods as described in chapter 9 of Press, et al., Numerical Recipes: The Art of Scientific Computing, Third Edition (Cambridge University Press, 2007).
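The calibration idea can be sketched with a small root-finding example. All of the numbers, the flat tax rate, the low-wage cutoff, and the choice of a multiplicative (rather than additive) adjustment factor are assumptions made for illustration; none of them come from the Albania data, and the function names are hypothetical. Bisection is used here as one standard bracketing method of the kind covered in Numerical Recipes.

```python
# All numbers below are hypothetical illustrations, not Albanian data.
DIVIDEND_TAX_RATE = 0.08   # assumed flat rate on dividend income
LOW_WAGE_CUTOFF = 500.0    # assumed: nonfilers below this get zero dividends
TARGET_REVENUE = 40.0      # assumed aggregate tax the imputed records must yield

# (wages, uncalibrated imputed dividends) for hypothetical nonfilers
nonfilers = [(300.0, 120.0), (800.0, 250.0), (1500.0, 400.0), (2600.0, 900.0)]


def simulated_revenue(factor):
    """Dividend tax implied by the imputed data for a given adjustment factor."""
    total = 0.0
    for wages, dividends in nonfilers:
        # Very-low-wage nonfilers are assumed to have no dividend income;
        # everyone else has the imputed value scaled by the factor.
        adjusted = 0.0 if wages < LOW_WAGE_CUTOFF else factor * dividends
        total += DIVIDEND_TAX_RATE * adjusted
    return total


def bisect_root(func, lo, hi, tol=1e-10, max_iter=200):
    """Find x in [lo, hi] with func(x) close to zero by repeated halving."""
    f_lo = func(lo)
    if f_lo * func(hi) > 0:
        raise ValueError("root is not bracketed by [lo, hi]")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if abs(f_mid) < tol or (hi - lo) < tol:
            return mid
        if f_lo * f_mid <= 0:
            hi = mid              # sign change lies in the lower half
        else:
            lo, f_lo = mid, f_mid  # sign change lies in the upper half
    return 0.5 * (lo + hi)


# Search for the downward adjustment (a factor between 0 and 1) at which
# simulated revenue matches the known aggregate.
factor = bisect_root(lambda f: simulated_revenue(f) - TARGET_REVENUE, 0.0, 1.0)
print(f"calibrated factor: {factor:.4f}")
print(f"simulated revenue: {simulated_revenue(factor):.4f}")
```

In this toy setup revenue is linear in the factor, so the answer could be computed directly. A root finder is shown because it keeps working when the revenue function is the output of a full microsimulation model, where no closed-form solution is available.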

A complete account of how these methods were used to solve the Albania missing data problem is available, but the Solution Methods section below gives a more practical account of how to implement them.

Solution Results

Solving this problem in the manner described above produced a plausible data set that could be used as input for the Albania PIT microsimulation model. Using these input data, the model estimates 2019 PIT revenue of 46.961 billion Albanian lek (ALL), which compares to the published government statistic of 46.226 billion ALL. This difference, which is caused entirely by a difference in revenues from the taxation of employment income, is +0.735 billion ALL: about 1.6 percent above the government statistic. By construction (via the calibration process), these input data produce exact estimates of the revenue generated by the taxation of dividend and other income. The difference in revenue from the taxation of employment income is caused by duplicate tax identification numbers being assigned to foreign nationals.
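The comparison between the model estimate and the published statistic is simple arithmetic, reproduced here as a short check using only the two revenue figures quoted above (the variable names are illustrative).

```python
model_revenue = 46.961      # billion ALL, model estimate of 2019 PIT revenue
published_revenue = 46.226  # billion ALL, published government statistic

difference = model_revenue - published_revenue
percent_difference = 100.0 * difference / published_revenue

print(f"difference: {difference:+.3f} billion ALL")
print(f"relative difference: {percent_difference:.1f} percent")
```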

Solution Methods

To make it easier to use the MICE algorithm to solve tax data preparation problems, the Tax Analyzer Framework includes a Python implementation of the MICE algorithm that supports additive and multiplicative adjustment factors. The use of this Python MICE capability is illustrated in several worked examples, one of which is quite similar to the Albania missing data problem discussed here.
