
Experimental Use of Household Income Survey (HIS) Data with the Tax Analyzer

Typically the Tax Analyzer uses a sample of tax returns as input data. This usage of the model serves well when estimating the aggregate revenue and distributional effects of tax reforms that affect only those who are filing tax returns under current-law policy.

However, using tax returns as model input data is not ideal in at least three other non-standard analysis situations. First, if considering a tax reform (such as a smaller zero-rate tax bracket or a new refundable tax credit) that will bring new filers into the tax system, a sample of returns under current-law policy is not adequate to estimate the reform's impact on government tax revenues or after-tax household incomes. Second, if considering the effect of taxes on the distribution of after-tax household income, a sample of returns is not adequate to compute population-wide distributional statistics (such as a Gini coefficient) for pretax and after-tax incomes. And third, if considering development of a tax microsimulation model for a country that has not made a sample of tax returns available, obviously another source of input data must be used.

The shortcomings of using tax returns as model input are magnified in countries where only a modest fraction of working people file a tax return.

World Bank staff are experimenting with different ways of using household survey data (that represent the whole population) as tax model input. This web page serves as a place where staff working on this project can exchange information.

Here is an outline of what is covered in the rest of this document: an overview of the experimentation, the data-preparation techniques used, the four strategies (Tasks 1, 2a, 2b, and 2c) for combining those techniques, and the files available for download.

Overview

The experimentation with the MYI-Tax-Analyzer microsimulation model, which so far has used tax returns as input, has focused on different ways survey data (from a 2019 Household Income Survey sample provided to the World Bank by the Department of Statistics Malaysia) could be used as model input. Four approaches to using the survey data have been identified, and each task listed below involves implementing one of these experimental approaches. Each of the tasks involves similar preliminary steps, so those common steps are described first, before the steps unique to each of the four tasks.

Note that the household survey data are available in all tasks. But only Task 1 and Task 2a assume the availability of a representative 10% sample of tax filing units; Task 2b assumes the availability of just a few aggregate tax statistics provided by the government tax authority; and Task 2c assumes there is no tax information of any type. Except in Task 2c, the following aggregate tax statistics are known: total tax revenue collected from all tax filers (TAXREV), the number of tax filers (ALLFILERS), the number of tax filers with a tax liability above 20,000 ringgit (FILERS>20), and the number of tax filers with a tax liability above 40,000 ringgit (FILERS>40). Here are those aggregate tax statistics (with revenue in billions of ringgit and filers in millions of individuals):

TAXREV ALLFILERS FILERS>20 FILERS>40
26.01 3.861 0.282 0.134

Notice that the mean tax liability among filers is about 6,737 ringgit while only about 7.3% of the filers have a liability above 20,000 and only about 3.5% are above 40,000.
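
These derived figures follow directly from the table above; for readers who want to verify them, here is the arithmetic (illustrative Python with hypothetical variable names):

```python
# Quick arithmetic check of the figures quoted above
# (revenue in billions of ringgit, filer counts in millions of individuals).
TAXREV, ALLFILERS, FILERS_GT20, FILERS_GT40 = 26.01, 3.861, 0.282, 0.134

mean_liability = TAXREV * 1e9 / (ALLFILERS * 1e6)  # roughly 6,737 ringgit
share_gt20 = 100 * FILERS_GT20 / ALLFILERS         # roughly 7.3 percent
share_gt40 = 100 * FILERS_GT40 / ALLFILERS         # roughly 3.5 percent
print(round(mean_liability), round(share_gt20, 1), round(share_gt40, 1))
```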

In addition to these four aggregate tax statistics, the tax authority has made it clear that a nontrivial number of filers have a zero tax liability.

The goal of the experimentation described below is to determine which data preparation tools are useful in these four tasks and which tasks produce data sets that are appropriate for the kinds of non-standard analysis situations mentioned above.

We start with descriptions of four data-preparation techniques that have been used in this experimental work. Then we explain four different strategies for combining these techniques to produce data sets suitable for use with the Tax Analyzer.

Techniques

This section describes four data-preparation techniques that have been used in one or more of the data-preparation strategies described below. The techniques are: preparing individual survey data, imputing tax filing status, imputing missing variables, and calibrating the data preparation.

Preparing Individual Survey Data

Preparing the household survey data for use with the Tax Analyzer involves two different activities: identifying survey variables that are conceptually the same as a variable appearing on the tax form, and deciding how to split married couples into two tax filing units (because the Malaysian income tax is personal, not family based).

The 2019 HIS sample provided to the World Bank includes 24,871 households, which is about thirty percent of the households in the full survey sample. The weights in the thirty-percent sample have been adjusted so that it produces national estimates. More importantly, the provided sample contains only four income variables for the household, with no information about who in the household received the income. One of the income variables, employment income, is a close match to the employment income variable on the tax form. But the other three household income variables do not correspond directly to any income variable on the tax form. The household business income variable includes imputed rent on the household's owner-occupied house, which is not taxable. The household property income variable includes interest income and rental income, both of which appear on the tax form, but also includes dividends, which do not appear on the tax form. The fourth household income variable is transfer income, which is not taxable.

In addition to the common employment income variable (which represents the bulk of total survey income), we have been able to construct from the survey data a variable indicating a disabled taxpayer (which is rare) and a variable that approximates the amount of tax relief given for young dependents. So, there are only three variables that are conceptually the same in the household survey data and in the tax return data.

The second survey data preparation activity involves constructing tax filing units from the married-couple households. (In principle, we should split out older children who have a job into their own tax filing units, but as described above the provided survey sample does not indicate who in the household receives income.) We assume that the dependents of a married couple are reported on the tax return for the higher earning individual. And we split household employment income between the two married individuals in different ways depending on whether aggregate tax statistics are available. After splitting the married couples there are 43,626 individual tax filing units.

In Task 2c, where no aggregate tax statistics are available, we assume the household head receives sixty percent of the total household employment income and the spouse receives the remaining forty percent.

In the other three tasks, we split employment income using a simple rule that has three parameters: a household employment income level (LEVEL), a spouse share (SFRACLO) that is used for households with employment income below LEVEL, and another spouse share (SFRACHI) that is used for households with employment income above LEVEL. The value of SFRACHI (above the LEVEL) is smaller than the value of SFRACLO (below the LEVEL). We explain below how the values of these three employment income splitting parameters are determined in each strategy.
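
As a concrete illustration, here is a minimal sketch of this splitting rule; the function and argument names are illustrative and this is not the actual prep.py code:

```python
# Minimal sketch of the three-parameter employment income splitting rule.
def split_employment_income(hh_emp_income, level, sfraclo, sfrachi):
    """Split household employment income between head and spouse."""
    spouse_share = sfraclo if hh_emp_income < level else sfrachi
    spouse_income = spouse_share * hh_emp_income
    head_income = hh_emp_income - spouse_income
    return head_income, spouse_income
```

For example, with the Task 2b parameter values reported below (SFRACLO=0.223, LEVEL=217,000, SFRACHI=0.0135), a household with 100,000 ringgit of employment income would assign 22,300 ringgit to the spouse and the rest to the head.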

Read the prep.py script for detailed information on how the individual survey data are derived from the household survey data.

Imputing Tax Filing Status

Given that we know a nontrivial number of tax filers have a zero tax liability, the imputation of a filing status variable is more complicated than specifying an income threshold above which individuals are assumed to file. In our experimentation, we have found that it is better to specify a filing probability function (of tax liability) that is piece-wise linear and has four parameters: the filing probability at zero tax liability (FP0), the filing probability at some middle tax liability (FPM), the middle tax liability amount (TXM), and the higher tax liability amount at which the filing probability becomes one (TX1).

Everyone with a simulated tax liability above TX1 is a filer. Everyone below TX1 is a filer only if a random number (uniformly distributed between zero and one) for that individual is less than that individual's filing probability.
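
A minimal sketch of such a piece-wise linear filing probability function, assuming linear interpolation between the points (0, FP0), (TXM, FPM), and (TX1, 1), might look like this (illustrative code, not the actual filer.py implementation):

```python
import numpy as np

def filing_probability(tax, fp0, fpm, txm, tx1):
    """Piece-wise linear filing probability as a function of tax liability."""
    if tax >= tx1:
        return 1.0
    if tax <= txm:
        # interpolate between (0, FP0) and (TXM, FPM)
        return fp0 + (fpm - fp0) * tax / txm
    # interpolate between (TXM, FPM) and (TX1, 1.0)
    return fpm + (1.0 - fpm) * (tax - txm) / (tx1 - txm)

def is_filer(tax, fp0, fpm, txm, tx1, rng):
    """True if a uniform random draw falls below the filing probability."""
    return rng.uniform() < filing_probability(tax, fp0, fpm, txm, tx1)
```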

Any sample of survey individuals can be used as Tax Analyzer input. By default the Tax Analyzer calculates a tax liability only for individuals that file a return, leaving each nonfiler with a zero tax liability. But that default behavior can be changed so that a tax liability is calculated for every sample individual, which is exactly what is needed if we are to compute a filing probability for each sample individual.

In all tasks except Task 2c, we know the number of tax filers. We can use that number as a target, and find the values of the four filing probability function parameters that cause the simulated number of filers to be close to that target number. The finding of the four parameter values is done by brute force: a sequence of values is specified for each of the four parameters and at each four-dimensional grid point the squared distance between the simulated number of filers and the target number of filers is computed. The grid point with the smallest squared distance specifies the calibrated tax filing probability function.

The sequence of parameter values assumed for each of the four parameters is as follows:

These four sequences define more than 4000 grid points, each one of which specifies a different tax filing probability function. The brute-force grid search simply identifies the grid point that produces the closest match between (1) the number of filers calculated using the simulated tax liabilities and the filing probability function at that grid point and (2) the target number of filers.

Our experience has been that this simple scheme does a good job at finding tax filing probability functions that have a plausible shape and that hit the target. And while the brute-force grid search is not an elegant algorithm, it does automatically find the parameters of the best filing probability function.
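
Here is a hedged sketch of the brute-force search, reusing the filing_probability function sketched above and a single fixed vector of random draws so that every grid point is evaluated against the same random numbers; the actual filer.py code may be organized differently:

```python
import itertools
import numpy as np

def calibrate_filing_parameters(tax, weight, target_filers,
                                fp0_grid, fpm_grid, txm_grid, tx1_grid,
                                seed=123):
    """Return the (FP0, FPM, TXM, TX1) grid point whose simulated
    (weighted) number of filers is closest to target_filers."""
    rng = np.random.default_rng(seed)
    draws = rng.uniform(size=len(tax))  # one draw per individual, reused at every grid point
    best_dist, best_params = None, None
    for fp0, fpm, txm, tx1 in itertools.product(fp0_grid, fpm_grid,
                                                txm_grid, tx1_grid):
        probs = np.array([filing_probability(t, fp0, fpm, txm, tx1) for t in tax])
        filers = np.sum(np.where(draws < probs, weight, 0.0))
        dist = (filers - target_filers) ** 2
        if best_dist is None or dist < best_dist:
            best_dist, best_params = dist, (fp0, fpm, txm, tx1)
    return best_params
```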

Read the filer.py script for detailed information on how the tax filing status variable is imputed.

Imputing Missing Variables

In two of the four strategies, we use standard missing-data techniques to impute the values of missing variables. In Task 1 we impute seven variables to the tax return records from the individual survey data. And in Task 2a, we impute six variables to individual survey records from the tax return data. In both cases the missing-data pattern is monotone, so a non-iterative univariate version of Multivariate Imputation by Chained Equations (MICE), which is widely recognized as the gold standard for dealing with missing-data problems, can be used to impute the missing variables sequentially. Rather than use the MICE R package, we use the MICE class included in the Tax-Analyzer-Framework, for which documentation and use examples are available.
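
To make the monotone idea concrete, here is a minimal sketch of sequential regression imputation under a monotone missing-data pattern. It uses scikit-learn rather than the Tax-Analyzer-Framework MICE class, and the variable handling is illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def impute_monotone(donor, recipient, observed_vars, missing_vars, seed=123):
    """Sequentially impute missing_vars (a monotone pattern) to recipient
    records, using regressions estimated on donor records."""
    rng = np.random.default_rng(seed)
    result = recipient.copy()
    predictors = list(observed_vars)
    for var in missing_vars:
        model = LinearRegression().fit(donor[predictors], donor[var])
        pred = model.predict(result[predictors])
        resid = donor[var].to_numpy() - model.predict(donor[predictors])
        # add a randomly drawn donor residual to preserve variability
        result[var] = pred + rng.choice(resid, size=len(result))
        predictors.append(var)  # each imputed variable becomes a predictor
    return result
```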

Read the impute_i2t.py script for detailed information on the Task 1 imputation of data missing in the tax return data, and read the impute2filer.py script for detailed information on the Task 2a imputation of data missing in the individual survey data.

Calibrating the Data Preparation

In all tasks except Task 2c, we calibrate the three employment income splitting parameters and the four filing probability function parameters so that the resulting sample of survey individuals produces simulated tax statistics that match the aggregate tax statistics provided by the government tax authority. The simulated tax statistics are calculated by taking the following steps:

  1. specify values for the three employment income splitting parameters
  2. use the prep.py script to generate a sample of survey individuals
  3. [Task 1 only] use the uwind.py script to convert individual sample to have a uniform weight
  4. use the filer.py script to impute individual tax filing status
  5. [Task 2a only] use the impute2filer.py script to impute tax-return variables to individual filers
  6. use the Tax Analyzer to compute the simulated tax liability of each filer
  7. use the taxstats.py script to tabulate simulated tax statistics

We can iterate through this sequence of steps, varying the three parameter values in the first step so that the simulated statistics tabulated in the last step get closer to the aggregate tax statistics provided by the government tax authority.
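
The outer loop can be organized as another small grid search. In the sketch below, simulate_tax_statistics() is a hypothetical stand-in for running steps 2 through 7 for one parameter triple, and the squared relative distance objective is one possible choice, not necessarily the one used in this project:

```python
import itertools

# target aggregate tax statistics from the government tax authority
TARGETS = {"TAXREV": 26.01, "ALLFILERS": 3.861,
           "FILERS>20": 0.282, "FILERS>40": 0.134}

def calibrate_split_parameters(sfraclo_grid, level_grid, sfrachi_grid,
                               simulate_tax_statistics):
    """Return the splitting parameters whose simulated statistics are
    closest (in squared relative distance) to the target statistics."""
    best_dist, best_params = None, None
    for sfraclo, level, sfrachi in itertools.product(
            sfraclo_grid, level_grid, sfrachi_grid):
        stats = simulate_tax_statistics(sfraclo, level, sfrachi)
        dist = sum(((stats[k] - TARGETS[k]) / TARGETS[k]) ** 2 for k in TARGETS)
        if best_dist is None or dist < best_dist:
            best_dist, best_params = dist, (sfraclo, level, sfrachi)
    return best_params
```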

As an example, here are the results of this calibration process for Task 2b:

parameters:

SFRACLO LEVEL SFRACHI FP0 FPM TXM TX1
0.223 217,000 0.0135 0.035 0.075 100 500

imply:

TAXREV ALLFILERS FILERS>20 FILERS>40
26.01 3.861 0.282 0.134

Notice that all four of the simulated tax statistics are the same as the aggregate tax statistics provided by the government tax authority.

Strategies

This section describes four strategies for using household survey data with the Tax Analyzer. The strategies are: merging micro tax data and nonfiler survey data in Task 1, adjusting survey data and imputing tax variables in Task 2a, adjusting survey data using just aggregate tax data in Task 2b, and adjusting survey data without any tax information in Task 2c. Notice that Task 1 is essentially tax return data (with some imputed survey variables) augmented by a subsample of individual survey data to represent nonfilers. In contrast, the three Task 2 variants are all survey data that have been adjusted in ways that depend on how much information is available about tax returns (ranging from full sample information in 2a, to aggregate information in 2b, to no tax information in 2c).

Task 1

Task 1 uses the survey data to identify individuals who do not file a tax return (by imputing a nonfiler variable), and then merges those survey nonfilers with the tax return sample, generating a sample that represents the whole population. Obviously, this approach is feasible only if one has access to a sample of tax returns. This strategy is suitable when the goal is to conduct tax analysis for reforms that expand the number of individuals who would file a tax return. But this strategy is not appropriate for constructing income distribution statistics based on size-adjusted household incomes because there is no way to combine the individual tax returns of a married couple to get their household information.

Task 1 involves the following steps to calibrate the data preparation parameters:

  1. specify values for the three employment income splitting parameters
  2. use the prep.py script to generate a sample of survey individuals
  3. use the uwind.py script to convert individual sample to have a uniform weight
  4. use the filer.py script to impute individual tax filing status
  5. use the Tax Analyzer to compute the simulated tax liability of each filer
  6. use the taxstats.py script to tabulate simulated tax statistics

These parameters:

SFRACLO LEVEL SFRACHI FP0 FPM TXM TX1
0.222 215,300 0.0135 0.035 0.060 200 350

imply:

TAXREV ALLFILERS FILERS>20 FILERS>40
26.01 3.861 0.282 0.134

Notice that all four of the simulated tax statistics are the same as the aggregate tax statistics provided by the government tax authority.

Then seven survey variables (gender, age, urban-rural, married, property income, business income and imputed rent on owner-occupied home, transfer income), which are drawn from the survey individuals who are simulated to be tax filers, are imputed to the tax return records. This is done using the MICE technique described above.

And finally, the survey individuals who are simulated to be nonfilers are merged with all the tax return data (that, by definition, represent tax filers). The resulting data set contains 832,725 records: 446,640 are from the uniform-weight (of 20) sample of survey individuals who are simulated to be nonfilers and 386,085 are from the 10% sample of all tax returns. This sample represents the whole population and has tax return information on those with high incomes (for accurate analysis of most tax reforms), supplemented with imputed demographic information and non-taxable incomes. And the records that represent nonfilers have employment income and transfer income, which is likely to be the bulk of nonfiler income in most cases.

Task 2a

Like Task 1, Task 2a assumes access to a sample of tax returns, but uses that tax return sample to impute tax-related variables to each individual in the survey sample. The final data set does not contain any tax return records, but six key tax-return variables are imputed to individual survey records that are simulated to be tax filers. The imputed tax-return variables include the three kinds of taxable income and the three largest itemized deduction (or "relief") amounts not included in the household survey. This imputation of the six variables is done using the MICE technique described above.

Task 2a involves the following steps to calibrate the data preparation parameters:

  1. specify values for the three employment income splitting parameters
  2. use the prep.py script to generate a sample of survey individuals
  3. use the filer.py script to impute individual tax filing status
  4. use the impute2filer.py script to impute tax-return variables to individual filers
  5. use the Tax Analyzer to compute the simulated tax liability of each filer
  6. use the taxstats.py script to tabulate simulated tax statistics

These parameters:

SFRACLO LEVEL SFRACHI FP0 FPM TXM TX1
0.230 220,000 0.0150 0.015 0.085 150 350

imply:

TAXREV ALLFILERS FILERS>20 FILERS>40
26.01 3.861 0.283 0.138

Notice that in this calibration the aggregate tax revenue and the number of filers are the same as the aggregate tax statistics provided by the government tax authority (26.01 billion ringgit and 3.861 million people), but the numbers of filers with a high tax liability are slightly overestimated. The number with a tax liability above 20,000 ringgit is 0.283, which is about 0.4 percent above the 0.282 provided by the government tax authority. And the number with a tax liability above 40,000 ringgit is 0.138, which is about 3.0 percent above the 0.134 provided by the government tax authority.

Our work on Task 2a has revealed that taxable income amounts (for interest income, rental income, and self-employment business income) and "relief" amounts (for life insurance payments, for other insurance payments, and for "lifestyle" expenses) are highly uneven among tax filers, with many filers having a zero value. This means that the MICE stochastic imputation procedure can impute noticeably different values to a given individual depending on the stream of random numbers being used in the imputation process. This is not a shortcoming of the MICE technique, but is a feature of the real-world data with which we are dealing. It does mean, however, that the calibration of the employment income splitting parameters is not as accurate as in the other tasks (where we are not imputing taxable incomes and relief amounts to individual survey records that are tax filers).

The resulting data set contains 43,626 individual survey records, and the individuals who are married can be recombined into their households. Despite this feature, the data set is not suitable for computing after-tax income distribution statistics because of the effects of the imputed non-employment incomes. It turns out that a nontrivial number of survey individuals who have low or moderate employment income are imputed to have much larger non-employment incomes, leaving them with a negative after-tax income. The solution to this problem would be survey data that included variables on all types of taxable income, in which case there would be no need to impute those variables. Another solution with the current data would be to pursue a less ambitious strategy: Task 2b.

Task 2b

Task 2b assumes access to only the four aggregate tax statistics described above. Because there is no access to a tax return sample, this approach is forced to rely on employment income in the survey data as the only type of taxable income and is forced to assume the only type of tax relief is based on the number of young dependents.

Task 2b involves the following steps to calibrate the data preparation parameters:

  1. specify values for the three employment income splitting parameters
  2. use the prep.py script to generate a sample of survey individuals
  3. use the filer.py script to impute individual tax filing status
  4. use the Tax Analyzer to compute the simulated tax liability of each filer
  5. use the taxstats.py script to tabulate simulated tax statistics

These parameters:

SFRACLO LEVEL SFRACHI FP0 FPM TXM TX1
0.223 217,000 0.0135 0.035 0.075 100 500

imply:

TAXREV ALLFILERS FILERS>20 FILERS>40
26.01 3.861 0.282 0.134

Notice that all four of the simulated tax statistics are the same as the aggregate tax statistics provided by the government tax authority.

The resulting data set contains 43,626 individual survey records, and the individuals who are married can be recombined into their households. And because no imputed incomes have been used to calculate tax liability, the problems associated with the Task 2a data set are not present.

To illustrate how the Task 2b data set could be used to compute income distribution statistics, we use the recombine.py script to construct a household data set that includes income tax liability. As described above, we get household tax liability by splitting households into individual tax filing units, using the Tax Analyzer to compute individual taxes, and then recombining the split households, summing the taxes of the split individuals. So, for example, using the recombined sample, we can compute full population Gini coefficients for different definitions of household size-adjusted income. Using a simple square-root size adjustment, the incdist.py script produces the following results:

Income definition Gini coefficient
before-tax, before-transfers 0.4564
before-tax, after-transfers 0.3986
after-tax, after-transfers 0.3823
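
For reference, here is a minimal sketch of the two calculations involved: recombining split individuals into households and computing a weighted Gini coefficient on square-root size-adjusted income. The column names and the person-weighting convention are illustrative assumptions; this is not the recombine.py or incdist.py code:

```python
import numpy as np
import pandas as pd

def recombine_households(indiv):
    """Sum income and tax of split individuals back into household records."""
    return indiv.groupby("household_id", as_index=False).agg(
        income=("income", "sum"),
        tax=("tax", "sum"),
        size=("person_id", "count"),
        weight=("weight", "first"),
    )

def weighted_gini(income, weight):
    """Weighted Gini coefficient via the area under the Lorenz curve."""
    income = np.asarray(income, dtype=float)
    weight = np.asarray(weight, dtype=float)
    order = np.argsort(income)
    inc, wgt = income[order], weight[order]
    cum_pop = np.concatenate(([0.0], np.cumsum(wgt))) / wgt.sum()
    cum_inc = np.concatenate(([0.0], np.cumsum(inc * wgt))) / np.sum(inc * wgt)
    area = np.sum((cum_pop[1:] - cum_pop[:-1]) * (cum_inc[1:] + cum_inc[:-1]) / 2.0)
    return 1.0 - 2.0 * area

# toy example with two households, one of them a split married couple
indiv = pd.DataFrame({
    "household_id": [1, 1, 2],
    "person_id":    [1, 2, 3],
    "income":       [80_000.0, 20_000.0, 30_000.0],
    "tax":          [3_000.0, 0.0, 0.0],
    "weight":       [20.0, 20.0, 20.0],
})
hh = recombine_households(indiv)
adjusted = (hh["income"] - hh["tax"]) / np.sqrt(hh["size"])  # square-root size adjustment
print(weighted_gini(adjusted, hh["weight"] * hh["size"]))    # person-weighted (one common convention)
```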

The modest effect of income taxes on the Gini coefficient stems from the fact that only 3.861 million individuals out of the 12.806 million single householders and married couples (only about 30 percent) file a tax return. (That percentage would have been even lower if the survey data had permitted identification of older children with earnings.) Another factor that has a bearing on these income distribution statistics is that, during the course of this experimental work, we learned from World Bank staff in Malaysia that there is employer withholding of income tax and there is an employment income threshold below which an individual is not required to file a tax return. So, many lower-income workers are paying some income tax even though they do not file a tax return.

Task 2c

Task 2c assumes no access to aggregate tax statistics and no access to a tax return sample. Given a complete lack of information about taxes, this approach is forced to adjust the survey data using arbitrary assumptions to correct for the under-reporting of high employment incomes, which is a common problem with survey data. The adjustment method and adjustment assumptions used here are those used in a recent OECD working paper. The method of adjustment has been included as the infer method of the Tax-Analyzer-Framework IncomeUnderReporting class and has been used in a Monte Carlo study. The assumptions used in the OECD working paper are two-fold: that incomes have a Pareto distribution above the 50th percentile (which is called the lower bound) and that incomes are under-reported, and hence need to be adjusted upwards, above the 90th percentile (which is called the upper threshold). The method infers a Pareto parameter from incomes between the lower bound and the upper threshold, which are assumed to be correctly reported, and uses that inferred Pareto parameter to specify adjusted incomes above the upper threshold.
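
Here is a hedged sketch of the idea behind the method; it is not the IncomeUnderReporting.infer code, and the Hill-type shape estimator and rank-for-rank quantile replacement shown are one simple possibility:

```python
import numpy as np

def pareto_adjust(income, lower_bound, upper_threshold):
    """Adjust incomes above upper_threshold using a Pareto tail whose shape
    parameter is inferred from incomes in (lower_bound, upper_threshold]."""
    income = np.asarray(income, dtype=float)
    mid = income[(income > lower_bound) & (income <= upper_threshold)]
    # Hill-type estimate of the Pareto shape (alpha) parameter
    alpha = len(mid) / np.sum(np.log(mid / lower_bound))
    adjusted = income.copy()
    top = income > upper_threshold
    n_top = np.count_nonzero(top)
    # replace top incomes, rank for rank, with quantiles of the Pareto
    # distribution conditional on exceeding the upper threshold
    ranks = np.argsort(np.argsort(income[top]))  # 0 = smallest top income
    probs = (ranks + 0.5) / n_top                # plotting positions in (0, 1)
    adjusted[top] = upper_threshold / (1.0 - probs) ** (1.0 / alpha)
    return adjusted
```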

Because there is no tax information that would allow the calibration of the income splitting parameters, we assume when using the prep.py script that spouse employment incomes are always forty percent of married household employment income. And the lack of tax information also means that we have to skip using the filer.py script to impute a tax filing status for each individual.

Task 2c involves the following steps to produce adjusted individual survey data:

  1. use the prep.py script to generate a sample of survey individuals
  2. use the uwind.py script to convert individual sample to have a uniform weight
  3. use the hisstats.py script to generate employment income statistics
  4. use the adjust_infer.py script to correct under-reporting of high incomes

Step 2, which is exactly the same as step 3 in Task 1, is taken because the IncomeUnderReporting class works only with samples whose sampling weights are the same across all sample observations. As in Task 1, we convert the variable-weight sample of survey individuals into a larger sample in which each observation has a weight of 20. This raises the number of unweighted observations from 43,626 to 639,399, leaving all the weighted totals unchanged.
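
A minimal sketch of one way such a conversion could be implemented is to replicate each record in proportion to its sampling weight; the actual uwind.py script may work differently, and the column name is an assumption:

```python
import numpy as np
import pandas as pd

UNIFORM_WEIGHT = 20.0  # the uniform weight mentioned in the text

def to_uniform_weight(sample, weight_col="weight", seed=123):
    """Replicate each record roughly weight/UNIFORM_WEIGHT times so that
    weighted totals are preserved (in expectation) with a constant weight."""
    rng = np.random.default_rng(seed)
    ratio = sample[weight_col] / UNIFORM_WEIGHT
    # floor(ratio) copies, plus one extra copy with probability frac(ratio)
    copies = np.floor(ratio).astype(int) + (rng.uniform(size=len(sample)) < ratio % 1)
    out = sample.loc[sample.index.repeat(copies.to_numpy())].copy()
    out[weight_col] = UNIFORM_WEIGHT
    return out.reset_index(drop=True)
```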

The data set produced by Task 2c has the following weighted totals: about 12.81 million individuals (all of whom must be assumed to be tax filers, although only 7.43 million have a positive tax liability) and an enormous aggregate tax revenue of about 180 billion billion ringgit. Yes, you read that correctly: instead of something like 26 billion ringgit, this method and its assumptions produce tax revenue that is about seven billion times larger. Clearly the top ten percent of incomes are being over-adjusted by an enormous amount.

Why is the Task 2c method, which here is assuming a lower bound at the 50th percentile of employment income (21,260 ringgit) and an upper threshold for adjusting employment income for under reporting at the 90th percentile of employment income (73,397 ringgit), producing such large high-income adjustments?

The first thing to be said in response to this question is that the 50th percentile bound and 90th percentile threshold are arbitrary assumptions that are not based on any empirical information about Malaysian employment incomes. So, at some level, it should not be surprising that the resulting estimated tax revenue is not close to actual tax revenue.

The second thing to be said is that this method of adjusting under-reported high incomes is known to generate incorrect results when there is a violation of the assumption that incomes between the lower bound and upper threshold have a Pareto distribution. For an example of this kind of problem, see the discussion at the top of [this file][dp-p5], which is part of the Tax-Analyzer-Framework documentation of its data preparation tools. In particular, the assumption being made here that the Pareto tail of the income distribution starts at the 50th percentile of employment income seems dubious. If that were true, the implied Pareto alpha parameter estimate in each subset of the 50th-to-90th percentile range would be about the same. But when we look at the alpha estimates produced by the hisstats.py script, we see that is not even close to being true.

And the third thing to be said in response to this question is that using this infer method with different bound/threshold assumptions can produce high-income adjustments that imply aggregate tax revenue roughly equal to 26 billion ringgit. In other words, the problem is not with the infer method but with the 50/90 assumptions. For example, assuming a lower bound equal to the midpoint between the 75th and 80th percentile employment incomes (75m80, which is 46,382 ringgit) and an upper threshold equal to the 99th percentile employment income (which is 176,987 ringgit) produces many fewer high-income adjustments, and those adjustments are much smaller in magnitude; as a result, the 75m80/99 assumptions produce an aggregate tax liability of 26.03 billion ringgit. But the Task 2c strategy eschews use of any non-survey information when making the high employment income adjustments, so there is no way for it to discover that the 75m80/99 assumptions produce more plausible high-income adjustments.

These Task 2c results underscore the risks of trying to adjust survey data without the use of any additional information other than what is in the survey.

Files

Below there is a link to one zip file that contains four zip files, one for each of the above tasks. Each of the four embedded task zip files contains all the computer programs used to generate the data set for that task, and the resulting data set itself. The zip file available for download from this page is encrypted and password protected. Click on this link to download the zip file.

Note that the computer programs in these zip files work only with Tax-Analyzer-Framework version 4.21.0 or later and MYI-Tax-Analyzer version 1.30.0 or later.