Adding Untaxed People and Imputing Missing Demographic Variables

Consider a situation in which the available tax return data are complete, but there is an interest in adding nonfilers to the model's input data set. The main motivation for doing this is to be able to use a microsimulation model to simulate the effects of tax reforms that will increase the number of tax filers. Analyzing reforms such as lowering the standard deduction or introducing a refundable tax credit would require such an expanded model input data set. The usefulness of such expanded input data is higher in countries where only a modest fraction of the population are tax filers.

Household income survey data are needed in order to add nonfilers to a tax return data set. The key issue is to identify those in a household survey who are likely to be tax filers rather than nonfilers. This is essential because only those viewed as nonfilers will be added to the tax return data set; otherwise there would be double counting of tax filers in the merged data set. But, because household income surveys generally don't ask if an individual is a tax filer, the missing tax filer variable needs to be imputed.

And once we have an imputed nonfiler variable, the demographic information in the household survey data can be used to impute demographic variables to the tax return records in the merged data set. Such demographic variables are rarely included in tax return data because they have no direct bearing on tax liability.

This situation was encountered when experimenting with Malaysian household income survey and tax return data as part of Task 1 of that experimental work.

Solution Strategy

Creating the nonfiler variable and imputing demographic variables to the tax return records are missing data problems. But the nonfiler variable cannot be imputed using the MICE algorithm because the tax return data set has no variation in the value of the nonfiler variable: it is false for all the tax return records. Because some tax returns have zero tax liability, it is not as simple as specifying an income threshold below which people are nonfilers. Instead, we construct a nonfiler probability function (which is used to impute the nonfiler variable in the household data) and then use the model to compute the simulated number of PIT filers and their aggregate tax liability. A calibration process is used to search for the nonfiler probability function that generates a household sample of filers whose simulated number and tax liability are close to the known values for those two aggregate statistics.

After finding a plausible filing probability function, the missing demographic variables in the tax return data can be easily imputed using the MICE algorithm because they can be considered to be missing at random. There is little reason to expect there to be systematic differences between the household records that are imputed to be filers and the tax return records after controlling for the employment income and other common variables in the chained equations used by the MICE algorithm.

Solution Results

The non-decreasing filing probability function of tax liability is assumed to be piecewise linear and to be defined by four parameters: the filing probability at zero tax liability (FP0), the filing probability at some middle tax liability (FPM), that middle tax liability amount (TXM), and the higher tax liability amount at which the filing probability becomes one (TX1). So, everyone with a simulated tax liability at or above TX1 is imputed to be a filer. And everyone with a simulated tax liability below TX1 is a filer only if a random number (uniformly distributed between zero and one) for that individual is less than that individual's filing probability.
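The functional form described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code in the project's scripts: the function names filing_probability and impute_filer are hypothetical, and the example parameter values are the fitted values reported in this section.

```python
import random


def filing_probability(tax, fp0, fpm, txm, tx1):
    """Piecewise-linear, non-decreasing filing probability.

    Assumes tax >= 0, 0 < txm < tx1, and fp0 <= fpm <= 1.
    """
    if tax >= tx1:
        return 1.0
    if tax <= txm:
        # first segment: from fp0 at zero liability to fpm at txm
        return fp0 + (fpm - fp0) * tax / txm
    # second segment: from fpm at txm to one at tx1
    return fpm + (1.0 - fpm) * (tax - txm) / (tx1 - txm)


def impute_filer(tax, draw, fp0, fpm, txm, tx1):
    """Impute filer status from a uniform [0, 1) draw for the individual."""
    return draw < filing_probability(tax, fp0, fpm, txm, tx1)


# Fitted parameter values reported in this section.
PARAMS = dict(fp0=0.035, fpm=0.060, txm=200.0, tx1=350.0)

rng = random.Random(2024)  # arbitrary seed, for reproducibility only
for tax in (0.0, 100.0, 200.0, 275.0, 350.0, 1000.0):
    prob = filing_probability(tax, **PARAMS)
    is_filer = impute_filer(tax, rng.random(), **PARAMS)
    print(f"tax={tax:7.1f}  prob={prob:.4f}  filer={is_filer}")
```

Holding each individual's random draw fixed while the four parameters vary keeps the simulated filer count a deterministic function of the parameters, which is convenient when searching over a grid.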

The values of those four parameters were found by brute-force optimization over a grid that contained about six thousand points (as described in more detail below). The optimization objective function was the sum of the squared differences between the model-simulated and known values of two statistics: the total number of tax filers (3.861 million) and aggregate PIT revenue (26.01 billion ringgit). The solution to this optimization is:

FP0	FPM	TXM	TX1
0.035	0.060	200	350

So, the filing probability is 3.5% at zero tax liability, rises linearly to 6.0% when tax liability is 200 ringgit (which was roughly equivalent to forty-eight US dollars), and then rises sharply in a linear fashion to 100% when tax liability is 350 ringgit or higher.

Then the household survey individuals who are imputed to be filers are used to impute a number of demographic variables for the tax return individuals using standard MICE algorithm methods for data that are missing at random and have a monotone missing data pattern.

Finally, the survey individuals who are simulated to be nonfilers are merged with all the tax return data (which, by definition, represent tax filers). The resulting data set contains 832,725 records: 446,640 are from the uniform-weight (of 20) sample of survey individuals who are simulated to be nonfilers and 386,085 are from a 10% stratified sample of all tax returns. This merged sample represents the whole population and has tax return information on those with high incomes (for accurate analysis of most tax reforms), supplemented with imputed demographic information and non-taxable incomes. The records that represent nonfilers have employment income and transfer income, which is likely to be the bulk of nonfiler income in most cases.
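The merge step itself is simple, as the following sketch suggests. It is only an illustration under stated assumptions: the function merge_population, the field names (emp_income, weight, is_filer, source), and the toy records and weights are all hypothetical and are not taken from the confidential data or from the project's scripts.

```python
def merge_population(survey_records, tax_records):
    """Stack survey records imputed to be nonfilers with all tax return records.

    Survey records imputed to be filers are dropped to avoid double
    counting, because filers are already represented by the tax returns.
    """
    merged = [
        {**rec, "source": "survey"}
        for rec in survey_records
        if not rec["is_filer"]
    ]
    merged += [
        {**rec, "is_filer": True, "source": "tax_return"}
        for rec in tax_records
    ]
    return merged


# Toy records; all field names and values are hypothetical.
survey = [
    {"emp_income": 18000.0, "weight": 20.0, "is_filer": False},
    {"emp_income": 95000.0, "weight": 20.0, "is_filer": True},  # dropped
    {"emp_income": 12000.0, "weight": 20.0, "is_filer": False},
]
tax_returns = [
    {"emp_income": 88000.0, "weight": 10.0},
    {"emp_income": 240000.0, "weight": 4.0},  # stratified weights can differ
]

merged = merge_population(survey, tax_returns)
population = sum(rec["weight"] for rec in merged)
print(len(merged), population)
```

Keeping a source field on each record makes it easy to check afterwards that the weighted number of tax return records matches the known number of filers.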

This solution can support analysis of tax reforms that affect individuals who are not filing under current tax law. But this solution does not generate a sample that can support household inequality measurement, which typically involves computing some income inequality measure (for example, a Gini coefficient) for household incomes adjusted by household size. While this solution does support that kind of calculation for the nonfilers added to the tax returns, PIT return data almost never contain information about the tax filer's spouse and household. As a result, there is no way to combine the individual tax return observations into household units. So, if computing household income inequality statistics (before and after taxes) is the main objective, it would be better to start with household income survey data even if tax return data are available. Alternatively, one could develop some kind of statistical method to pair individuals in the micro tax return data into pseudo households. We have not experimented with such an approach because it is likely to encounter a number of other problems, such as missing data on the age and marital status of tax filers and possibly the number of dependents, all of which would need to be resolved in order to generate a plausible pairing of individual filers into households.

Solution Methods

Even though the Malaysian tax return and household income survey data are confidential, the Python script filer.py that fits the filing probability function is available as a guide for how to do this task. This script illustrates how to use the scipy.optimize package to find the point in the four-parameter grid space that produces the most realistic filer/nonfiler division in the household data. This is an important task because it will need to be done anytime tax return and survey data are being merged into a data set that represents the whole population, or anytime a nonfiler variable is being imputed to survey data that are being prepared to be used as model input.
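The logic of the grid search can be sketched without the confidential data. The example below is not filer.py and does not use scipy; it is a plain-Python stand-in that assumes synthetic survey records, a much smaller grid, and an objective measured in millions of filers and billions of currency units. All names, grid values, and the synthetic targets are hypothetical.

```python
import itertools
import random


def filing_probability(tax, fp0, fpm, txm, tx1):
    """Piecewise-linear filing probability as a function of tax liability."""
    if tax >= tx1:
        return 1.0
    if tax <= txm:
        return fp0 + (fpm - fp0) * tax / txm
    return fpm + (1.0 - fpm) * (tax - txm) / (tx1 - txm)


def simulate_aggregates(people, params):
    """Return (filers in millions, revenue in billions) for one parameter set."""
    filers = revenue = 0.0
    for tax, weight, draw in people:
        if draw < filing_probability(tax, *params):
            filers += weight
            revenue += weight * tax
    return filers / 1e6, revenue / 1e9


def loss(people, params, targets):
    """Sum of squared differences between simulated and target aggregates."""
    simulated = simulate_aggregates(people, params)
    return sum((s - t) ** 2 for s, t in zip(simulated, targets))


# Synthetic stand-in for survey records: (tax liability, weight, uniform draw).
rng = random.Random(12345)
people = []
for _ in range(2000):
    tax = 0.0 if rng.random() < 0.6 else rng.uniform(0.0, 2000.0)
    people.append((tax, 20.0, rng.random()))

# Synthetic "known" aggregates generated from a parameter set on the grid.
true_params = (0.04, 0.06, 200.0, 350.0)
targets = simulate_aggregates(people, true_params)

# Small illustrative grid over (FP0, FPM, TXM, TX1); every point satisfies
# FP0 <= FPM and TXM < TX1, so the function stays non-decreasing.
grid = itertools.product(
    [0.02, 0.03, 0.04, 0.05],
    [0.06, 0.08, 0.10],
    [100.0, 200.0, 300.0],
    [350.0, 400.0, 500.0],
)
best_params = min(grid, key=lambda p: loss(people, p, targets))
best_loss = loss(people, best_params, targets)
print(best_params, best_loss)
```

Because each individual's random draw is fixed before the search starts, the objective is a deterministic function of the four parameters. With real data the two target statistics would be the known number of filers and aggregate PIT revenue, and the relative scaling of the two squared differences is a modeling choice that affects which grid point is selected.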

Also available is the Python script impute_i2t.py, which imputes demographic variables from the imputed filer subsample of the household income survey to tax return records. This is the only role for the filer subsample, as it is not included in the merged data set that represents the whole population. This application of the MICE algorithm is more straightforward because the demographic variables can be considered missing at random in the tax return data.
