Tax-Analyzer-Framework

Data Preparation Tools

Since the beginning of 2022, the Framework has included tools that can help in the preparation of model input data. This document provides a brief description of each data-preparation tool and examples of their use. Using these tools involves writing a Python script that imports the taf package, which gives the script access to all the tools. After describing each tool, there is a discussion of different data-preparation problems that include links to Python scripts that use a tool to solve the problem.


Problem 5 is the best place to start reading about how to use the IncomeUnderReporting tool; after reading that tool's description, Problem 5 can be read on its own without reading the other problems.


Tool Descriptions

Each data-preparation tool is implemented as a Python class whose logic implements algorithms that are widely used, and have been extensively researched, in the statistics literature.

MICE (Multivariate Imputation by Chained Equations) class

The most general tool is the MICE class, which implements the MICE algorithm for imputing missing data. This algorithm can be used to handle a variety of data problems, including file matching and imputing data that are missing not at random (MNAR) when the missing data pattern is monotone. If your data-preparation problem can be characterized as a missing data problem, the MICE class is the preferred tool.

In order to use the MICE class effectively, you should be familiar with the material covered by the van Buuren book: Flexible Imputation of Missing Data, Second Edition (Chapman and Hall, 2018). Pay particular attention to the definition of the monotone missing data pattern and the MICE algorithm simplification that is possible when the missing data pattern is monotone. Also pay attention to the discussion of the missing data mechanism in the book's sections 2.2.4 through 2.2.6, and to the need for additional aggregate data when imputing micro data that are missing not at random (MNAR), in which case the mechanism that generates the missing data is nonignorable.

Before using the MICE class, be sure to read the documentation at the top of the source code as well as the documentation for the arguments of each class method.
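The chained-equations idea behind the MICE algorithm can be illustrated with a short, self-contained sketch. This is not the taf MICE class (its API differs); it only shows the core loop: fill missing values with column means, then repeatedly regress each incomplete variable on the others and replace its missing entries with predictions plus randomly drawn residuals.

```python
import numpy as np

def mice_impute(data, n_iter=10, seed=0):
    """Minimal chained-equations imputation sketch: each variable with
    missing values is regressed (OLS) on all other variables, and its
    missing entries are replaced by predicted values plus a random draw
    from the observed residuals.  Illustrative only, not the taf class."""
    rng = np.random.default_rng(seed)
    X = np.array(data, dtype=float)
    miss = np.isnan(X)
    # start by filling every gap with its column mean
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]
    # cycle through the incomplete variables several times
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            obs = ~miss[:, j]
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            resid = X[obs, j] - A[obs] @ beta
            pred = A[~obs] @ beta
            # prediction plus a resampled residual preserves dispersion
            X[~obs, j] = pred + rng.choice(resid, size=pred.size)
    return X
```

A production implementation would add per-variable models and convergence checks, but the fill-regress-redraw cycle above is the essence of the algorithm.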

IncomeUnderReporting class

This class can be used to adjust for missing or under-reported incomes (typically in survey data) at or above a threshold using either actual micro data (typically tax-return data) or synthetic micro data that are assumed to have a Pareto distribution at or above the threshold. The parameter of the Pareto distribution can be estimated given three assumptions: the threshold level, the number of (post-adjustment) incomes at or above the threshold, and the mean (post-adjustment) income at or above the threshold. Sensible values for these three assumptions can be found if three aggregate tax statistics are available (as described in one of the problems below).

The IncomeUnderReporting class implements a replacement-type algorithm that is often used in the research literature. A recent review of this literature is provided in this Lustig paper.

Before using the IncomeUnderReporting class, be sure to read the documentation at the top of the source code as well as the documentation for the arguments of each class method.

SampleReWeighting class

This class can be used to transform a typical variable-weight survey sample into a uniform-weight sample that has many more observations, but is statistically equivalent to the variable-weight sample. This capability is helpful if you want to process the sample data with a tool that requires a sample in which each observation has the same weight (such as the IncomeUnderReporting tool). The class also can be used to reduce the number of observations in a sample by conducting stratified sampling.
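The weight-expansion idea can be sketched in a few lines. This is an illustrative stand-in for the SampleReWeighting class, not its actual API: each observation is replicated in proportion to its weight, producing a larger sample in which every observation carries the same weight.

```python
import numpy as np

def uniform_weight_sample(values, weights, grain=1.0):
    """Expand a variable-weight sample into a (larger) uniform-weight
    sample by replicating each observation in proportion to its weight.
    `grain` is the weight represented by one replicated observation;
    a smaller grain yields more observations and a closer statistical
    equivalence to the original variable-weight sample."""
    counts = np.rint(np.asarray(weights) / grain).astype(int)
    return np.repeat(np.asarray(values), counts)

# usage: three observations with weights 2, 4, 6 become a
# uniform-weight sample of six observations at grain=2
sample = uniform_weight_sample([10.0, 20.0, 30.0], [2.0, 4.0, 6.0], grain=2.0)
```

The weighted mean of the original sample equals the unweighted mean of the expanded sample, which is exactly what tools requiring equal weights (such as the IncomeUnderReporting tool) need.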

Before using the SampleReWeighting class, be sure to read the documentation at the top of the source code as well as the documentation for the arguments of each class method.

Illustrative Problems

The data-preparation problems discussed below are meant to illustrate the use of the data-preparation tools. After each problem is described, there is a link to a Python script that solves the problem. Each of these scripts is structured as a test of the tool's ability to solve the problem. This means that each script is similar to a Monte Carlo study, a technique often used in the econometrics literature to assess the ability of an estimation technique to recover the true parameters of the model generating the sample data.

In these illustrative scripts, the income distribution is usually generated by drawing from a double Pareto log Normal (dPlN) distribution, which is log-normal with an optional fatter upper tail and an optional fatter lower tail. For more details, see the original article by the inventors of the dPlN distribution.
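Drawing from a dPlN distribution can be sketched using the standard representation in which log(income) is a normal draw plus the difference of two exponential draws, whose rates give the upper- and lower-tail Pareto indexes. The default parameter values below are the ones listed in Problem 5; the function name is ours, not part of taf.

```python
import numpy as np

def draw_dpln(n, mean=13.2, sdev=0.15, alpha=2.3, beta=0.85, seed=0):
    """Draw n incomes from a double Pareto log-Normal distribution via
    log(income) = Normal(mean, sdev) + Exp(alpha) - Exp(beta),
    where the exponentials have rates alpha and beta.  The result has a
    log-normal body, a Pareto upper tail with index alpha, and a Pareto
    lower tail with index beta."""
    rng = np.random.default_rng(seed)
    z = rng.normal(mean, sdev, n)
    e_upper = rng.exponential(1.0 / alpha, n)  # fattens the upper tail
    e_lower = rng.exponential(1.0 / beta, n)   # fattens the lower tail
    return np.exp(z + e_upper - e_lower)
```

Smaller values of alpha (or beta) produce a fatter upper (or lower) tail; letting either rate go to infinity removes that tail and leaves a plain log-normal.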

Problem 1

Imagine a situation in which micro tax-return data contain income values and micro household-survey data contain income values that are capped at a known level. In the case where the two data sets contain a common variable other than income, how can tax-return incomes be used to impute household-survey incomes above the cap level?

The common variable in both data sets implies that the MICE class can be used to impute missing (that is, under-reported) survey incomes at or above the capped level. This Python script shows how to do this. That script generates these expected results.

Note that this same problem is revisited in Problem 4 where it is solved using the IncomeUnderReporting class.

Problem 2

Imagine a situation in which micro individual data are available from a household survey and macro statistics on the number of tax filers and the aggregate income of tax filers are known. How can these macro data be used to impute the missing tax-filer variable in the survey data?

The solution is to fit a tax-filer logit probability model that generates the two known macro statistics. This solution does not use a Framework data-preparation tool; instead it illustrates a technique that uses the function minimization capabilities provided by the Python scipy package. This technique is roughly analogous to estimating the parameters of an econometric model using the method of moments.
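A minimal sketch of this moment-matching technique, assuming a logit in log income with two parameters (the linked script's actual specification may differ): the parameters are varied until the implied filing probabilities reproduce the two macro statistics.

```python
import numpy as np
from scipy.optimize import root

def calibrate_filer_logit(income, target_filers, target_filer_income):
    """Calibrate logit parameters (a, b) so that the implied filing
    probabilities reproduce two macro targets: the number of tax filers
    and their aggregate income.  Each moment gap is normalized by its
    target so that both equations are on the same scale."""
    lninc = np.log(income)

    def moment_gaps(params):
        a, b = params
        prob = 1.0 / (1.0 + np.exp(-(a + b * lninc)))
        return [prob.sum() / target_filers - 1.0,
                (prob * income).sum() / target_filer_income - 1.0]

    solution = root(moment_gaps, x0=[0.0, 0.0])
    return solution.x
```

With two parameters and two targets, the calibration is exactly identified, which is why this resembles a method-of-moments estimator.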

This Python script shows how to do this. First the script generates a large sample in which ten percent are tax filers and those ten percent earn fifty percent of aggregate income. Then it draws a two percent sample containing 20,000 people and sets the filing-status variable for each of those 20,000 to np.nan, which signifies missing to the MICE impute method. Finally, the MICE-imputed values of the filing-status variable are compared with the actual values in the sample of 20,000.

That script generates these expected results. Notice at the very bottom of the expected results is a cross-tabulation of the actual and imputed values of the filing-status variable. It shows that roughly one percent of the 20,000 are imputed to have an incorrect value with the incorrect imputations being about evenly split between the two types of error. Of course, if the actual values of the filing-status variable are not distributed with a logit probability equation in income, then assuming that in the imputation would not generate results that are this accurate.

Problem 3

Imagine a situation in which some income variables are missing not at random (MNAR), but the missing data pattern is monotone and aggregate information on the missing income values is known. That aggregate information (for each income variable with missing values) is the number of missing income values that are zero and the aggregate total of missing incomes. This situation is similar to one encountered recently when developing a personal income tax model for Albania. How can the aggregate statistics for each variable with missing values be used to impute the missing data values?

The solution is to use the MICE class because the missing data pattern is monotone and because the aggregate statistics needed to adjust the imputed values are available.

This Python script shows how to do this. Notice that the results of several different kinds of adjustments are documented at the top of the script. The adjustment factors used in each kind are the result of a calibration process in which the adjustment factors are changed until the imputed data generate the aggregate statistics. That script generates these expected results after calibration when ADJ_SCALE_ZERO=True.

Problem 4

Imagine a situation like that in Problem 1, but instead of using the MICE class, here we use the IncomeUnderReporting class. This means we ignore the common variable observed in both the tax and survey data. This Python script shows how to do this. That script generates these expected results.

This solution to the under-reporting of high incomes is inferior to the solution using the MICE class because the positive correlation between the common variable and high incomes is completely lost.

However, if the data situation were different in that there were no common variable, use of the MICE class would be impossible, and the solution described in this problem would be the only choice.

Problem 5

Imagine a situation in which there are no micro tax data available to adjust for missing or under-reported high incomes in survey data, but some external aggregate data on high incomes (say, from tabulations of tax returns) are available. There are no micro tax data available, so the MICE class cannot be used. How can we adjust the high incomes in the survey data to overcome the problems caused by the missing rich?

The solution is to use the IncomeUnderReporting class to generate synthetic high incomes and use those synthetic data to adjust survey incomes above a threshold level. The generation of synthetic data can be done using the synthetic_adjustment_values_mean class method, which (in addition to assuming that post-adjustment incomes above the threshold have a Pareto distribution) requires three adjustment assumptions:

  1. the threshold level above which incomes are assumed to be missing or under-reported (LEVEL);

  2. the number of (post-adjustment) incomes at or above the threshold (NUM); and

  3. the mean (post-adjustment) income at or above the threshold (MEAN).

Obviously, these three assumptions are not knowable from the survey data. It is necessary to base these assumptions on other data, such as a few aggregate statistics tabulated from tax returns by the government tax authority.
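To make the synthetic-generation step concrete: if post-adjustment incomes at or above LEVEL follow a Pareto distribution, the mean-above formula a*LEVEL/(a-1) = MEAN pins down the shape parameter as a = MEAN/(MEAN-LEVEL). The sketch below is a hypothetical illustration of this logic, not the taf synthetic_adjustment_values_mean implementation.

```python
import numpy as np

def synthetic_high_incomes(level, num, mean, seed=0):
    """Generate `num` synthetic incomes at or above `level` from a
    Pareto distribution whose mean above `level` equals `mean`.
    For a Pareto with minimum `level` and shape `a`, the mean above
    the minimum is a*level/(a-1), so a = mean/(mean - level).
    Hypothetical sketch of the tool's LEVEL/NUM/MEAN assumptions."""
    assert mean > level > 0
    a = mean / (mean - level)
    rng = np.random.default_rng(seed)
    # inverse-CDF draw: level / u**(1/a) with u uniform on (0, 1]
    u = 1.0 - rng.uniform(size=num)
    return level / u ** (1.0 / a)
```

Note that the formula requires MEAN > LEVEL, and that MEAN close to LEVEL implies a large shape parameter (a thin tail), while MEAN far above LEVEL implies a fat tail.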

How to do this is illustrated in this Python script, which generates these expected results. Among other things, the script shows that if the three adjustment assumptions (1)-(3) were known accurately, the adjustment algorithm used by the IncomeUnderReporting tool would do an effective job of correcting the under-representation of high incomes in the survey data. Below we explain how to proceed in the real-world situation where the adjustment assumptions are unknown: first we describe the nature of the survey data, and then an approach for finding plausible values for the adjustment assumptions (1)-(3).

But before providing this explanation, the importance of the few aggregate tax statistics needs to be stressed. Without that outside information there is no guide to adjusting the survey data. The serious problems encountered when trying to adjust the survey data without any non-survey information are illustrated in this work.

Now back to the explanation of how to proceed in the real-world situation where the above adjustment assumptions (1)-(3) are unknown.

Survey Data Characteristics

Here we assume that incomes are distributed in a realistic way: similar to the way employment income (from micro tax data) is distributed in Albania. Before simulating missing and under-reported high incomes in the survey data, we assume incomes follow a double-Pareto log-Normal (dPlN) distribution. In this distribution there are four parameters for log(income): two for the log-normal body of the distribution (MEAN=13.2 and SDEV=0.15), one for the upper Pareto tail (ALPHA=2.3), and one for the lower Pareto tail (BETA=0.85). We use these dPlN parameter values to generate a sample of 100,000 incomes.

Next we assume that at or above an income threshold of 850,000 (approximately the 89.9th income percentile) the probability of unit non-response (that is, being completely missing from the survey data) rises linearly from 30% to 100% somewhere near the top of the last income percentile (being nearly 68% at the 99.0 percentile income and nearly 87% at the 99.5 percentile income). The non-response probability is assumed to be zero below the threshold. This assumed non-response probability schedule implies that about 4.5 percent of the original sample of 100,000 is absent from the survey data. Because those missing people are all in the top 10.1 percent of the income distribution, nearly 45 percent of those at or above the income threshold are completely missing from the survey data.

In addition to missing high incomes, we assume that the high incomes provided by those interviewed for the survey are under-reported beginning at 1,275,000 (which is fifty percent higher than the income at which non-response begins). The under-reporting rate is assumed to increase as incomes rise above 1,275,000.

Tax Microsimulation Model

The objective of the income adjustment is to generate adjusted survey data that, when used as input to a tax microsimulation model, produce realistic tax revenue estimates. It is, therefore, sensible to assume the availability of a model that can process (adjusted or unadjusted) survey data to produce aggregate tax statistics.

In Problem 5 we specify a simple progressive income tax with three brackets: a 0% marginal rate up to 360,000, a 15% marginal tax rate in the bracket from 360,000 to 720,000, and a marginal tax rate of 30% on all income above 720,000. The model estimates the tax liability for each individual in the survey data and tabulates three aggregate tax statistics: the total tax revenue (TREVENUE), the number of taxpayers with a positive tax liability (ALLPAYERS), and the number of taxpayers with a tax liability greater than or equal to a HIGH_TAX level (HIPAYERS). Here we have set the value of HIGH_TAX to 150,000.
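The tax schedule and the three statistics are simple to code directly; here is a minimal sketch (not the script's actual model):

```python
import numpy as np

def tax_liability(income):
    """Problem 5 tax schedule: 0% marginal rate up to 360,000, 15%
    from 360,000 to 720,000, and 30% on all income above 720,000."""
    income = np.asarray(income, dtype=float)
    return (0.15 * np.clip(income - 360_000, 0, 360_000)
            + 0.30 * np.clip(income - 720_000, 0, None))

def aggregate_tax_statistics(income, high_tax=150_000):
    """Tabulate the three aggregate statistics used as targets:
    total revenue, taxpayers with positive liability, and taxpayers
    with liability at or above the HIGH_TAX level."""
    tax = tax_liability(income)
    return {"TREVENUE": tax.sum(),
            "ALLPAYERS": int((tax > 0).sum()),
            "HIPAYERS": int((tax >= high_tax).sum())}
```

Because the top rate is 30% and the tax on 720,000 of income is 54,000, the HIPAYERS threshold of 150,000 in liability corresponds to an income of 1,040,000.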

The value of each of these tax statistics is shown in the table below for the true incomes (as seen in tax returns tabulated by the government tax authority) and for the survey incomes (which have too few high incomes as described above).

Aggregate Tax Statistic                Tax Data   Survey Data
total revenue (billions), TREVENUE        4.026         2.108
all taxpayers (thousands), ALLPAYERS      48.00         45.52
high-tax taxpayers (thds), HIPAYERS        6.40          3.22

In this table, the Tax Data statistics are generated by processing the incomes before simulating non-response and under-reporting by the rich. These statistics correspond to the aggregate statistics we assume would be available from the government tax authority.

Notice that using the unadjusted survey data as tax model input would generate only about 52% of aggregate tax revenue seen by the government tax authority and only about half of the taxpayers who owe at least 150,000 in tax liability.

Finding the Adjustment Assumptions

In this section we explain how the aggregate tax statistics from the government tax authority (shown in the Tax Data column of the table above) can be used to find values for the three adjustment assumptions (1)-(3) required by the IncomeUnderReporting tool.

The basic idea is to assume a value for each of the three adjustment assumptions (1)-(3), use the tool to adjust the survey data, and then use those adjusted survey data as input into the tax microsimulation model to generate aggregate tax statistics. A calibration process can be used to vary the three assumption values until the adjusted survey data produces the aggregate tax statistics provided by the government tax authority. If we think of the government-provided tax statistics as targets, then the calibration process finds the assumptions that make the adjusted survey data produce aggregate tax statistics that hit those targets.

Given a threshold (LEVEL) assumption (1), it is easy to find the number-above (NUM) assumption (2) and mean-above (MEAN) assumption (3) that hit the total revenue (TREVENUE) and all taxpayers (ALLPAYERS) targets. It is easy because the value of MEAN affects only TREVENUE and not ALLPAYERS. This means that one can simply vary NUM until the ALLPAYERS target is hit, and then vary MEAN until the TREVENUE target is hit (without undoing the ALLPAYERS hit).

The table below shows calibration results assuming four arbitrary threshold LEVEL assumptions with widely differing percentile locations in the unadjusted survey income distribution.

LEVEL                      NUM       MEAN   TREVENUE  ALLPAYERS  HIPAYERS
  600,000 (81.5 %tile)  22,260  1,066,600      4.026      48.00      6.30
  740,000 (90.2 %tile)  13,960  1,307,800      4.026      48.00      6.34
  900,000 (95.0 %tile)   9,430  1,548,200      4.026      48.00      6.73
1,200,000 (98.8 %tile)   5,800  1,865,000      4.026      48.00      7.68
TARGETS:                                       4.026      48.00      6.40

Notice that given a threshold LEVEL, the NUM and MEAN assumptions can always be adjusted to hit the TREVENUE and ALLPAYERS targets. But the HIPAYERS statistic does not hit its target because we are varying just two assumptions (NUM and MEAN), and therefore, can hit only two targets. However, the HIPAYERS statistic implied for a threshold LEVEL (given the calibration of the NUM and MEAN assumptions) is below its target for low LEVEL assumptions and above its target for high LEVEL assumptions. This suggests that a threshold LEVEL somewhere between 740,000 and 900,000 would hit the HIPAYERS target of 6.40 thousand. If we think of the mapping between LEVEL and HIPAYERS as a function, then the LEVEL that hits the HIPAYERS target can be found using basic root-finding techniques: bisection or linear interpolation.

Here we show how to hit the HIPAYERS target with two bisections followed by a final linear interpolation. The root-finding approach is the same for both techniques. Once a bracket for the target is found (LEVEL 740,000 and LEVEL 900,000 in the table above), we select a new LEVEL between the brackets and compute the HIPAYERS statistic for that new LEVEL. We then narrow the bracket: the lower bracket is replaced if the new LEVEL implies a HIPAYERS statistic below the target, and the upper bracket is replaced if it implies a statistic above the target. This process is repeated until the bracket is narrow enough to provide an estimate of the LEVEL that hits the HIPAYERS target. The only difference between the two techniques is in how the new LEVEL is chosen. With bisection, the new LEVEL is simply the midpoint between the two bracket LEVELs. With linear interpolation, we assume the mapping of LEVEL into the HIPAYERS statistic is a linear function in the bracket range, which means the distances between the target and the statistic values implied by the two brackets can be used to compute a new LEVEL.
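The bracket-narrowing procedure can be sketched as a short function. The hipayers argument is a stand-in for the full adjust-then-simulate pipeline (adjust the survey data at a given LEVEL, run the tax model, return the HIPAYERS statistic):

```python
def find_level(hipayers, target, lo, hi, n_bisect=2):
    """Find the LEVEL whose implied HIPAYERS statistic hits `target`,
    assuming `hipayers` is increasing on [lo, hi] and the target lies
    between hipayers(lo) and hipayers(hi).  Performs a few bisection
    steps to narrow the bracket, then finishes with one linear
    interpolation, mirroring the procedure described in the text."""
    f_lo, f_hi = hipayers(lo), hipayers(hi)
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        f_mid = hipayers(mid)
        if f_mid < target:
            lo, f_lo = mid, f_mid   # replace the lower bracket
        else:
            hi, f_hi = mid, f_mid   # replace the upper bracket
    # linear interpolation within the final bracket
    return lo + (hi - lo) * (target - f_lo) / (f_hi - f_lo)
```

Each call to hipayers here costs a full calibration of NUM and MEAN plus a model run, which is why only a handful of bracket-narrowing steps are taken rather than iterating bisection to convergence.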

Below, we show the starting bracket information from the above table and then perform two bisections.

BEFORE FIRST BISECTION:

LEVEL        NUM       MEAN   TREVENUE  ALLPAYERS  HIPAYERS
740,000   13,960  1,307,800      4.026      48.00      6.34
900,000    9,430  1,548,200      4.026      48.00      6.73
TARGETS:                         4.026      48.00      6.40

AFTER FIRST BISECTION:

LEVEL        NUM       MEAN   TREVENUE  ALLPAYERS  HIPAYERS
820,000   11,030  1,448,600      4.026      48.00      6.39
900,000    9,430  1,548,200      4.026      48.00      6.73
TARGETS:                         4.026      48.00      6.40

AFTER SECOND BISECTION:

LEVEL        NUM       MEAN   TREVENUE  ALLPAYERS  HIPAYERS
820,000   11,030  1,448,600      4.026      48.00      6.39
860,000    9,900  1,519,200      4.026      48.00      6.46
TARGETS:                         4.026      48.00      6.40

After the two bisection steps, the target is one-seventh of the way between the HIPAYERS statistics implied by the lower LEVEL bracket (820,000) and the upper LEVEL bracket (860,000). Next we use linear interpolation to choose a new LEVEL of 826,000, which is approximately one-seventh of the way from 820,000 to 860,000. The following table shows the results of using this new LEVEL assumption.

AFTER LINEAR INTERPOLATION:

LEVEL        NUM       MEAN   TREVENUE  ALLPAYERS  HIPAYERS
820,000   11,030  1,448,600      4.026      48.00      6.39
826,000   10,830  1,460,800      4.026      48.00      6.40
860,000    9,900  1,519,200      4.026      48.00      6.46
TARGETS:                         4.026      48.00      6.40
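As an arithmetic check, the interpolated LEVEL can be reproduced from the bracket rows above:

```python
# bracket LEVELs and their implied HIPAYERS statistics, from the tables
lo_level, hi_level = 820_000, 860_000
lo_stat, hi_stat, target = 6.39, 6.46, 6.40

# the new LEVEL lies the same fraction of the way through the bracket
# as the target lies between the two HIPAYERS statistics (about 1/7)
new_level = lo_level + (hi_level - lo_level) * (target - lo_stat) / (hi_stat - lo_stat)
# about 825,714, i.e. 826,000 when rounded to the nearest thousand
```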

Notice that this whole survey-data adjustment process produces an estimate of the LEVEL assumption of 826,000, which is fairly close to the LEVEL assumption of 850,000 that was used to construct the biased survey data. The difference probably has several causes, including that incomes above the threshold might not have exactly a Pareto distribution, and that the stream of random numbers used to construct the survey data differs from the stream used by the IncomeUnderReporting tool to conduct the survey data adjustments.

Problem 6

Suppose you are lucky enough that the country's government supplies all personal income tax (PIT) returns for a given year. This happened, for example, when developing the Malaysia PIT microsimulation model. With nearly four million tax records, model execution times stretched beyond a few seconds and the dump output was so large that it made post-simulation data analysis unwieldy. The solution was to draw a relatively large stratified random sample from all the tax records and use that sample as input for the model.

We did that using a custom Python script before the Framework's data-preparation tools were developed. This produced a sample that generated essentially the same PIT liabilities by income decile as when using all the tax records, but model execution time was reduced by roughly ninety percent because the sample contained only one-tenth the number of tax records. The key to maintaining this accuracy with only a ten percent sample is to draw a stratified sample in which 100% of the highest-income taxpayers are included, 25% of upper-income taxpayers are included, and only 5% of the lower- and middle-income taxpayers are included.
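The stratification just described (100%/25%/5% sampling rates with inverse-rate weight scaling) can be sketched with pandas; the stratum labels and column names here are illustrative, not the linked script's actual ones:

```python
import numpy as np
import pandas as pd

def stratified_sample(df, seed=0):
    """Draw a stratified sample that keeps 100% of the highest-income
    stratum, 25% of the upper stratum, and 5% of everyone else, scaling
    each retained record's weight by the inverse of its sampling rate
    so aggregate weighted statistics are preserved in expectation."""
    rates = {"high": 1.00, "upper": 0.25, "lowmid": 0.05}
    rng = np.random.default_rng(seed)
    parts = []
    for stratum, rate in rates.items():
        grp = df[df["stratum"] == stratum]
        keep = grp.sample(frac=rate, random_state=int(rng.integers(2**31)))
        parts.append(keep.assign(weight=keep["weight"] / rate))
    return pd.concat(parts, ignore_index=True)
```

Oversampling the rich in this way keeps the top of the income distribution fully represented, which is where most PIT liability is concentrated.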

The availability of the SampleReWeighting class now makes it relatively easy to draw a stratified random sample. This Python script shows how to do this. That script generates these expected results assuming that we draw nine different stratified samples that differ only because the sampling starts with a different random number seed in each case. Notice that all nine samples generate about the same decile distribution of PIT liabilities, and that the fifth sample generates exactly the same aggregate and ten decile liabilities as generated by using all the tax returns as model input (when the PIT liabilities are rounded to the nearest one-hundredth of a billion ringgit).

When doing this stratified sampling, the aggregate current-law revenue in the nine samples (drawn using different random number seeds) varied little from the revenue estimate using the full sample: the mean absolute deviation was just 0.020 billion ringgit. By comparison, the mean absolute deviation across nine non-stratified samples of the same size using the same random number seeds was 0.311 billion ringgit, which is over fifteen times larger than when using the three sampling strata and oversampling the rich. These results suggest that stratified sampling is a sensible analysis efficiency, because people in Malaysia typically round revenues to the nearest one-tenth of a billion ringgit when writing or talking about the effects of tax reforms.