Tax-Analyzer-Framework

Income Survey Data Preparation Guide

Correcting Underestimation of High Incomes

Consider a situation in which the country's government has not made tax return data available, but there are income data from a recent sample survey of the country's population. In addition, there are several aggregate statistics about the number of tax filers and aggregate personal income tax (PIT) revenues for the survey year. In particular, assume that for one or more high-income subgroups there are aggregate statistics on the number of tax returns and the taxes paid by the subgroup. The challenge is to use the available aggregate statistics to correct the underestimation of high incomes that often plagues household surveys.

First, note that by using aggregate tax statistics as the reference point, none of these methods can shed any light on the extent of tax evasion in the country. Completely different aggregate statistics not derived from tax records would be required to measure the extent of tax evasion, as described by Alstadsæter, Johannesen, and Zucman.

Second, the most appropriate method may well vary from country to country depending on the quality of the household survey and the efficiency of tax administration. There is even a chance that the household survey does about as well as the tax authorities in eliciting income reporting, which would mean no correction of high incomes would be necessary when using survey data rather than tax returns as model input data.

We have experimented with several methods of correcting survey data for the underestimation of high incomes. We have done some of this experimentation using synthetic data created with Monte Carlo methods, and we have also conducted experiments using real-world survey data from Malaysia, a country for which we have detailed tax return data and a realistic PIT microsimulation model. This section of the data preparation guide summarizes the results of these experiments. Each summary points the reader to more detailed explanations that include working Python scripts that implement the correction methods.

Finding High-Income Correction is Unnecessary

As described in the work with Ethiopia data, an essential step in preparing any income survey data for use as input into a microsimulation model is imputing a nonfiler variable. The Ethiopia work did not have any aggregate statistics on high-income filers and their tax liability, so we did not know whether or not high incomes were underestimated in the calibrated data set of filers used as model input. We did know that the simulated value of overall aggregate tax revenue was correct, but the distribution of tax revenue could have been unrealistic.

When experimenting with a Malaysia income survey sample, we did have access to aggregate statistics on high income subgroups as well as for all tax filers. In particular, four aggregate tax statistics were available: total tax revenue collected from all tax filers (TAXREV), the number of tax filers (ALLFILERS), the number of tax filers with a tax liability above 20,000 ringgit (FILERS>20), and the number of tax filers with a tax liability above 40,000 ringgit (FILERS>40). Here are those aggregate tax statistics (with revenue in billions of ringgit and filers in millions of individuals):

TAXREV   ALLFILERS   FILERS>20   FILERS>40
 26.01       3.861       0.282       0.134

Notice that the mean tax liability among filers is about 6,737 ringgit while only about 7.3% of the filers have a liability above 20,000 and only about 3.5% are above 40,000. In addition to these four aggregate tax statistics, the government tax authority made it clear that a nontrivial number of filers have a zero tax liability.
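The figures in the preceding paragraph follow directly from the four aggregate statistics. A minimal sketch of that arithmetic is shown here; the variable names are illustrative and are not part of the Tax Analyzer Framework.

```python
# Aggregate tax statistics quoted in the table above, converted to base units.
TAXREV = 26.01e9       # total PIT revenue in ringgit (26.01 billion)
ALLFILERS = 3.861e6    # number of tax filers (3.861 million)
FILERS_GT20 = 0.282e6  # filers with tax liability above 20,000 ringgit
FILERS_GT40 = 0.134e6  # filers with tax liability above 40,000 ringgit

# Mean liability is total revenue divided by the number of filers.
mean_liability = TAXREV / ALLFILERS
# Shares are subgroup counts divided by the count of all filers.
share_gt20 = FILERS_GT20 / ALLFILERS
share_gt40 = FILERS_GT40 / ALLFILERS

print(f"mean tax liability: {mean_liability:,.0f} ringgit")  # 6,737 ringgit
print(f"share above 20,000: {share_gt20:.1%}")               # 7.3%
print(f"share above 40,000: {share_gt40:.1%}")               # 3.5%
```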

The details of transforming the Malaysia household income survey data into pseudo tax return data and then imputing a nonfiler variable are presented in this discussion.

After completing this two-step transform-data/impute-nonfiler process, there was no high-income underestimation in the adjusted income survey data when measured relative to the tax return data. So there was no obvious need for further work to correct the high incomes.

High-Income Correction Using Parametric and Non-Parametric Methods

This section does not present a problem based on real-world data because the Malaysia income survey data discussed in the prior section turned out to not have any major underestimation of high incomes relative to Malaysia tax return data. Given our current lack of experience with real-world data needing high-income correction, we discuss several methods that are available to solve this problem and link to examples of their use with synthetic and real-world data. We anticipate that this section will be revised as we gain more experience with these methods.

This is an active research area and there is an excellent literature review that is a useful starting point. The basic idea is to use aggregate statistics derived from tax return data to impute higher incomes that are underrepresented or underreported in the household survey data. If there are aggregate statistics on many high-income subgroups, then this imputation can be non-parametric (that is, it does not have to make any assumption about the shape of the distribution of high incomes). If there are fewer aggregate statistics on high-income subgroups, then one would need to assume high incomes are distributed according to some statistical distribution (such as the Pareto distribution).
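To make the parametric case concrete, here is a minimal sketch that assumes a simple Pareto tail; it is not the method of any paper or package cited in this guide. Under a Pareto tail the number of units above income x is proportional to x**(-alpha), so counts above two thresholds are enough to pin down alpha. The thresholds, counts, and function names below are all hypothetical.

```python
import math
import random


def pareto_alpha(n_above_low, n_above_high, low, high):
    """Pareto tail index implied by counts above two income thresholds.

    With N(x) proportional to x**(-alpha),
    alpha = ln(N_low / N_high) / ln(high / low).
    """
    return math.log(n_above_low / n_above_high) / math.log(high / low)


def draw_pareto_incomes(n, threshold, alpha, seed=12345):
    """Draw n incomes at or above threshold by inverting the Pareto CDF."""
    rng = random.Random(seed)
    # rng.random() is in [0, 1), so (1 - u) is in (0, 1] and never zero.
    return [threshold * (1.0 - rng.random()) ** (-1.0 / alpha)
            for _ in range(n)]


# Hypothetical aggregate statistics (NOT taken from any tax authority).
LOW, HIGH = 100_000.0, 200_000.0   # income thresholds
N_LOW, N_HIGH = 400_000, 100_000   # number of units above each threshold

alpha = pareto_alpha(N_LOW, N_HIGH, LOW, HIGH)
incomes = draw_pareto_incomes(10_000, LOW, alpha)
share_above_high = sum(x > HIGH for x in incomes) / len(incomes)

print(f"alpha = {alpha:.3f}")
print(f"simulated share above HIGH = {share_above_high:.3f} "
      f"(target {N_HIGH / N_LOW:.3f})")
```

The simulated share above the higher threshold should be close to the target ratio of the two counts, which is a quick check that the assumed tail reproduces the aggregate statistics it was fitted to.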

One parametric correction method is described in a 2017 paper by Thomas Blanchet, Juliette Fournier, and Thomas Piketty, "Generalized Pareto Curves: Theory and Applications", which is available here. In addition to the paper, the authors have provided an R package that implements this method and a web interface to that R package.

A different non-parametric correction method is described in a 2018 paper by Thomas Blanchet, Ignacio Flores, and Marc Morgan, "The Weight of the Rich: Improving Surveys Using Tax Data", which is available here. In addition to the paper, the authors have provided a Stata module bfmcorr that implements this method.

Both of these methods are well grounded in theory and have been applied to both real-world data and synthetic data generated by Monte Carlo methods. Both methods also attempt to deal internally with the sometimes substantial differences in the type of income reported on tax returns and in household income surveys, as well as differences between the individual (the typical tax filing unit) and the households in the survey. But it could be that household incomes corrected using these methods generate noticeably different tax revenues for the high-income subgroups after the corrected household data are translated into tax filing units and processed by a tax microsimulation model. If that occurs, it is not clear how flexible these two methods are in adjusting the correction process so that the corrected observations produce more accurate aggregate tax statistics for the high-income subgroups. This is an important area for future research.
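One way to detect the mismatch described above is to compare weighted subgroup statistics from the microsimulation output with the aggregate targets. The following is a minimal sketch under stated assumptions: the tax liabilities, sampling weights, target counts, and function names are hypothetical and do not come from the Tax Analyzer Framework or from any real data.

```python
def subgroup_stats(tax, weight, threshold):
    """Weighted filer count and tax revenue for liabilities above threshold."""
    count = sum(w for t, w in zip(tax, weight) if t > threshold)
    revenue = sum(t * w for t, w in zip(tax, weight) if t > threshold)
    return count, revenue


def relative_gap(simulated, target):
    """Relative difference between a simulated statistic and its target."""
    return (simulated - target) / target


# Hypothetical simulated tax liabilities and sampling weights.
tax = [0.0, 5_000.0, 25_000.0, 45_000.0, 90_000.0]
weight = [1000.0, 800.0, 100.0, 40.0, 10.0]

# Hypothetical target filer counts keyed by tax liability threshold.
targets = {20_000.0: 160.0, 40_000.0: 50.0}

for threshold, target_count in targets.items():
    count, revenue = subgroup_stats(tax, weight, threshold)
    gap = relative_gap(count, target_count)
    print(f"liability > {threshold:,.0f}: simulated count {count:,.0f}, "
          f"target {target_count:,.0f}, gap {gap:+.1%}, "
          f"revenue {revenue:,.0f}")
```

A large gap for any high-income subgroup would indicate that the correction needs further adjustment before the data are used as model input.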

Another parametric correction method has been implemented in the IncomeUnderReporting class provided by the Tax Analyzer Framework, which is written in Python. This class has been documented and has been used with several sets of synthetic data generated by Monte Carlo methods, as described for top-coded income and for the missing rich and underreported high incomes. In both of those situations, the IncomeUnderReporting class method synthetic_adjustment_values_mean and its class method adjust have been used to generate corrected data.

Because these IncomeUnderReporting class methods are more parameterized, it appears that they would be more flexible in adjusting the corrected high incomes to produce known aggregate tax statistics for high-income subgroups. But again, this is an active research topic and more experience using these methods with real-world data is needed before even tentative recommendations can be made.

If the country for which you are developing a PIT microsimulation model has substantially underestimated high incomes, you will need to investigate the links above and try to use one or more of these three approaches to correct the underestimation of high incomes in the household survey data.