Sampling Complete Data to Speed Up Model Execution

Consider an enviable situation in which a country's government has supplied all the personal income tax (PIT) returns for a recent year and these tax return data are complete. That is, the data contain all the information needed to go through the process of filling out a tax return and determining each filer's tax liability. Furthermore, assume that the government is satisfied with using just current taxpayers to estimate the effects of the tax reforms it is considering. The only data preparation problem is that the number of tax returns supplied is so large that model execution is slow and the model's dump output files are too large for convenient post-simulation analysis. This problem arose during development of a Malaysia PIT model, when the government supplied all (nearly four million) PIT records for 2018.

Solution Strategy

The obvious solution to this problem is to draw a random sample from the full set of tax returns. The key questions are what kind of random sample to draw and how large that sample should be.

A random sample with a single sampling probability is not a good solution because most tax revenue comes from high-income filers, particularly when the tax rate schedule is progressive, as it is in Malaysia. A better solution is to divide the full set of tax returns into income strata and sample the high-income strata at higher rates than the low-income strata. This is a kind of importance sampling that overrepresents returns with high incomes and underrepresents returns with lower incomes, which are given higher sampling weights to compensate. Stratified sampling of this kind allows even a relatively small sample to represent aggregate tax revenue accurately.
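The link between sampling probabilities, sampling weights, and sample size can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weight of a sampled return is taken to be the inverse of its stratum's sampling probability (the usual choice for this kind of sampling), the function names are hypothetical, and the stratum counts are invented rather than taken from any actual tax data.

```python
def sampling_weights(probabilities):
    """Return the inverse-probability weight for each stratum."""
    return [1.0 / p for p in probabilities]


def expected_sample_size(stratum_counts, probabilities):
    """Return the expected number of sampled returns across all strata."""
    return sum(n * p for n, p in zip(stratum_counts, probabilities))


# Example sampling probabilities, ordered from the low-income stratum
# to the high-income stratum.
probabilities = [0.05, 0.25, 1.00]
# Hypothetical stratum sizes, invented for illustration only.
stratum_counts = [3_400_000, 550_000, 50_000]

weights = sampling_weights(probabilities)
size = expected_sample_size(stratum_counts, probabilities)
share = size / sum(stratum_counts)
print(weights)
print(f"expected sample: {size:,.0f} returns ({share:.1%} of all returns)")
```

A return sampled with a low probability stands in for many unsampled returns, so it carries a large weight, while a return in a fully sampled stratum represents only itself.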

Solution Results

Specifying three income strata with sampling probabilities of 0.05, 0.25, and 1.00 (that is, including all the returns in the highest income stratum) produces samples containing slightly less than ten percent of the tax returns. Such a sample reduces model run time by roughly ninety percent, yet produces revenue estimates by decile and in aggregate that are close to those produced by the model when using the full data set as input. This was done with the 2018 Malaysia PIT data before the tools discussed in the following section were developed, so the details of that earlier custom work are mostly of historical interest.

Solution Methods

A detailed example of drawing a stratified random sample consisting of about ten percent of the total number of Malaysia PIT returns is discussed in Problem 6 of the data preparation tools documentation. That example discusses a Python script that uses the stratified_sample method of the Tax Analyzer Framework's SampleReWeighting class to conduct the sampling. Notice that the sampling is conducted by specifying a sampling probability for the tax returns in each of three income strata. A stream of random numbers (uniformly distributed between zero and one) is then generated, one for each tax return, and a return is selected for the sample only if its random number is less than its sampling probability. Because of this selection method, each random number seed (that is, the starting point of the stream of random numbers) generates a different sample.
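The selection rule described above can be sketched with the Python standard library alone. This is not the framework's stratified_sample method, whose signature is not shown here. The function name, its parameters, and the income cutoffs between strata are all hypothetical, and the inverse-probability weight is an assumption about how sampled returns are weighted.

```python
import bisect
import random


def draw_stratified_sample(incomes, upper_bounds, probabilities, seed):
    """Return (index, weight) pairs for the returns selected into the sample.

    upper_bounds holds the ascending income cutoffs that separate the
    strata, so there must be one more probability than there are cutoffs.
    """
    if len(probabilities) != len(upper_bounds) + 1:
        raise ValueError("need exactly one probability per stratum")
    rng = random.Random(seed)  # the seed fixes the random number stream
    sample = []
    for index, income in enumerate(incomes):
        # Stratum 0 holds incomes below the first cutoff, and so on.
        stratum = bisect.bisect_right(upper_bounds, income)
        prob = probabilities[stratum]
        # Select the return only if its uniform draw is below its probability.
        if rng.random() < prob:
            sample.append((index, 1.0 / prob))
    return sample


# Made-up incomes and hypothetical cutoffs, for illustration only.
incomes = [10_000.0] * 1000 + [100_000.0] * 200 + [500_000.0] * 50
cutoffs = [50_000.0, 200_000.0]
probs = [0.05, 0.25, 1.00]
for seed in (1, 2):
    sample = draw_stratified_sample(incomes, cutoffs, probs, seed)
    print(seed, len(sample))
```

Because random.random() returns values in the half-open interval from zero up to but not including one, a stratum with probability 1.00 is always fully included, while the number of returns drawn from the other strata varies with the seed.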

The key questions are how many income strata to specify and what sampling probability to set for each stratum. These two decisions jointly determine the overall sample size. The best answers to these questions depend on the desired precision of the sample-generated tax revenue estimates. In the case of Malaysia, people involved in analyzing tax reforms typically describe the revenue effect of a reform using amounts that are expressed to the nearest one-tenth of a billion ringgit: for example, a reform is said to increase revenue by 1.2 billion ringgit. In this context, it would be unnecessary to require a sample to generate revenue that is within 0.001 billion ringgit of the revenue generated by using the full set of tax returns. The tool cited in the prior paragraph can be used to experiment with different numbers of strata and different sampling probabilities, using each generated sample as model input to see how sensitive the tax revenue estimates are to the random number seed. Read Problem 6 for more details.
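The kind of seed-sensitivity experiment described above might look roughly like the following sketch. It is a simplification under stated assumptions: in real use each sample would be run through the tax model, whereas here a list of made-up tax amounts stands in for model output. The function name and the toy data are hypothetical, and each sampled return is assumed to be weighted by the inverse of its sampling probability.

```python
import random


def weighted_revenue_by_seed(taxes, probabilities, seeds):
    """Return one weighted revenue estimate per random number seed.

    taxes[i] is the tax liability of return i, and probabilities[i] is
    the sampling probability of the stratum that return i belongs to.
    """
    estimates = []
    for seed in seeds:
        rng = random.Random(seed)
        total = 0.0
        for tax, prob in zip(taxes, probabilities):
            if rng.random() < prob:
                total += tax / prob  # sampled return weighted by 1 / prob
        estimates.append(total)
    return estimates


# Toy data: made-up tax amounts, not actual model output.
taxes = [100.0] * 9000 + [2_000.0] * 900 + [50_000.0] * 100
probabilities = [0.05] * 9000 + [0.25] * 900 + [1.00] * 100
seeds = [1, 2, 3, 4, 5]

full_revenue = sum(taxes)
estimates = weighted_revenue_by_seed(taxes, probabilities, seeds)
for seed, estimate in zip(seeds, estimates):
    # Difference between the sample estimate and the full-data revenue.
    print(seed, round(estimate - full_revenue, 2))
```

If the differences across seeds are small relative to the precision at which revenue effects are reported, the chosen strata and sampling probabilities are probably adequate. If they are not, adding strata or raising the sampling probabilities would be the natural next experiment.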
