Tax Data Preparation Overview
This document provides an overview of how the three CSV-formatted files required by the model have been prepared. The methods used in this initial preparation are simple and can be refined later after gaining experience using the model. This overview contains several sections:
- Preparing Tax Data for 2018
- Missing Tax Data for 2018
- Sampling Tax Data for 2018
- Extrapolating Tax Data beyond 2018
Preparing Tax Data for 2018
The Ministry of Finance has provided complete personal income tax (PIT) return data for 2018: 3,116,422 Form BE returns and 744,425 Form B returns for a total of 3,860,847 returns. Returns for each Form are provided in two files: one containing returns that were filed on time and another containing returns that were filed late.
The merging of the late and on-time files for each Form and the creation of a `late` variable are accomplished by the `preprep.py` script, which writes several intermediate data files for additional processing.
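The merge of each Form's on-time and late files can be sketched as follows. This is only an illustration of the approach, not the actual `preprep.py` code: the `agginc` column name and the DataFrame-based interface are assumptions.

```python
import pandas as pd

def merge_ontime_late(ontime: pd.DataFrame, late: pd.DataFrame) -> pd.DataFrame:
    """Combine on-time and late returns, flagging late filers."""
    ontime = ontime.assign(late=0)  # filed by the deadline
    late = late.assign(late=1)      # filed after the deadline
    return pd.concat([ontime, late], ignore_index=True)
```

The `late` flag preserves filing-status information that would otherwise be lost once the two files are stacked into one.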
Variables on Form BE and Form B have been mapped into a common set of variables and segregated into read variables (that are required model input) and calc variables (that are calculated by the model). This data-preparation step is accomplished by the `prep.py` script, which writes several intermediate data files for additional processing.
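A minimal sketch of this mapping step is shown below. The Form-specific column names are hypothetical, and the real `prep.py` mapping covers many more variables; only `businc_net` and `agginc` are taken from this document.

```python
import pandas as pd

# Hypothetical Form B column names mapped to the common variable names.
FORM_B_MAP = {"b_business_income": "businc_net", "b_aggregate_income": "agginc"}

# Read variables are required model input; calc variables (anything the
# model computes itself) are dropped at this stage.
READ_VARS = ["businc_net", "agginc"]

def map_to_common(df: pd.DataFrame, name_map: dict) -> pd.DataFrame:
    """Rename Form-specific columns and keep only the read variables."""
    renamed = df.rename(columns=name_map)
    return renamed[[v for v in READ_VARS if v in renamed.columns]]
```

Keeping only the read variables guarantees that both Forms end up with the same column set before they are merged.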
The final step in the data-preparation process is to merge the data from the two Forms (which are now represented using a single set of variable names) into a single data frame and then write that data frame to a CSV-formatted file. This final step is accomplished by the `merge.py` script, which writes both a data file containing the 2018 values of the read variables for each return included in the sample and a weights file containing the sampling weights for each return included in the sample for each year in the model projection period. The script also writes a data file and a weights file containing all the returns.
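This final merge-and-write step might look roughly like the sketch below. The `weight2018` column and the one-`WT<year>`-column-per-projection-year layout of the weights file are assumptions for illustration, not the actual `merge.py` design.

```python
import pandas as pd

def write_model_files(form_b: pd.DataFrame, form_be: pd.DataFrame,
                      years: list, data_path: str, weights_path: str):
    """Stack the two (already harmonized) Forms and write the model's
    data file plus a weights file with one weight column per year."""
    data = pd.concat([form_b, form_be], ignore_index=True)
    data.to_csv(data_path, index=False)
    # Every projection-year weight starts from the 2018 sampling weight;
    # a later step grows these weights over time.
    weights = pd.DataFrame({f"WT{y}": data["weight2018"] for y in years})
    weights.to_csv(weights_path, index=False)
    return data, weights
```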
Missing Tax Data for 2018
A few of the returns contain data used as model input that are clearly inconsistent with other data on the return. But the number of returns with such inconsistencies and their monetary amounts are very small. The small number and size of these data inconsistencies implies that imputing missing data is a low-priority activity that could be undertaken at a later time. This section simply documents these inconsistencies.
By inconsistency, we mean returns where the components of a form total are all zero but the form total is positive. Such cases could be fixed by imputing a value to one of the components equal to the form total. Again, this imputation has not been done because the benefits of doing so seem tiny.
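A check for this kind of inconsistency is straightforward to sketch; the component column names used here are illustrative.

```python
import pandas as pd

def find_inconsistencies(df: pd.DataFrame, total_col: str,
                         component_cols: list) -> tuple:
    """Count returns whose component fields are all zero while the
    reported total is positive, and sum the inconsistent totals."""
    mask = (df[component_cols] == 0).all(axis=1) & (df[total_col] > 0)
    return int(mask.sum()), float(df.loc[mask, total_col].sum())
```

Running a check like this for each form total (`businc_net`, `agginc`, `gift_total`, `relief_total`) yields the counts and amounts reported in the tables below.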
The following tables show the number of returns with inconsistencies and their aggregate size (in billions of ringgit) among the Form B and Form BE returns.
Form B Inconsistencies (744,425 total returns):
Variable | Number | Size (RMb) |
---|---|---|
businc_net | 116 | 0.007 |
agginc | 42 | 0.002 |
gift_total | 86 | 0.000 |
relief_total | 5777 | 0.049 |
Form BE Inconsistencies (3,116,422 total returns):
Variable | Number | Size (RMb) |
---|---|---|
businc_net | 0 | 0.000 |
agginc | 369 | 0.026 |
gift_total | 549 | 0.001 |
relief_total | 423 | 0.003 |
Sampling Tax Data for 2018
In the final data-preparation step, the `merge.py` script writes data and weights files for two different samples: a large complete sample containing all the returns and a stratified random sample containing roughly one-tenth of the total number of returns. Having both sets of these data/weights files enables the model to be run using either the complete sample or the random sample.
The random sample was drawn using different sampling probabilities in each of three strata, which are defined by gross income (which is all income reported on the tax form ignoring current and prior year business losses). The gross income floor and ceiling and the sampling probability for each of the three strata are as follows:
Stratum | Floor | Ceiling | Probability |
---|---|---|---|
1 | 0 | 100,000 | 0.05 |
2 | 100,000 | 500,000 | 0.25 |
3 | 500,000 | infinity | 1.00 |
Of course, these sampling assumptions can be changed if necessary.
Using the above stratified sampling assumptions produces a sample with these return counts:
Stratum | AllData | SampledData |
---|---|---|
1 | 3,130,612 | 156,531 |
2 | 688,648 | 172,162 |
3 | 41,587 | 41,587 |
Total | 3,860,847 | 370,280 |
Each of the sampled returns is assigned a 2018 weight equal to the inverse of its sampling probability. So, for example, each sampled return in stratum 2 is assigned a weight of 4.
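The stratified draw and the inverse-probability weights can be sketched as follows, using the income bounds and sampling probabilities from the table above; the `gross_income` column name is an assumption.

```python
import numpy as np
import pandas as pd

# Gross-income floor, ceiling, and sampling probability for each stratum.
STRATA = [(0, 100_000, 0.05),
          (100_000, 500_000, 0.25),
          (500_000, np.inf, 1.00)]

def draw_sample(df: pd.DataFrame, income_col: str = "gross_income",
                seed: int = 0) -> pd.DataFrame:
    """Draw the stratified random sample and attach 2018 weights equal
    to the inverse of each return's sampling probability."""
    rng = np.random.default_rng(seed)
    prob = pd.Series(np.nan, index=df.index)
    for floor, ceiling, p in STRATA:
        in_stratum = (df[income_col] >= floor) & (df[income_col] < ceiling)
        prob[in_stratum] = p
    keep = rng.random(len(df)) < prob
    sample = df[keep].copy()
    sample["weight2018"] = 1.0 / prob[keep]  # e.g., 1/0.25 = 4 in stratum 2
    return sample
```

Because stratum 3 has probability 1.00, every high-income return is retained with weight 1, which is why the stratum 3 counts are identical in both samples.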
This sample produces model results that are virtually indistinguishable from the results produced when using the complete sample, and the model runs much faster when using the sample because it has to do fewer than one-tenth as many tax calculations.
So, for example, when using the complete sample, the model produces (in 34.0 seconds) an aggregate 2018 PIT liability estimate of 26.0497 billion ringgit for current-law policy. When using the random sample described above, the model produces (in 2.9 seconds) an aggregate 2018 PIT liability estimate of 26.1002 billion ringgit. We can compare these two model estimates with the total PIT liability reported on the forms (total tax charged minus credits), which is 26.0971 billion ringgit. All three numbers are quite close: the model estimate using the random sample is only about 0.01 percent higher than the total tax reported on all the forms.
Extrapolating Tax Data beyond 2018
The prior section describes how the 2018 input data file and the 2018 weights have been generated. In order to use the model in years after 2018, the weights and values of each monetary variable need to be extrapolated to subsequent years.
There are many sensible ways of doing this. We start the modeling process by specifying a simple set of extrapolation assumptions that have been made with the idea that they would be easy to explain and would be modified as experience with the model is gained.
The weights are assumed to grow at the same rate as the number of employed people is projected to grow in recent MOF projections. The projected rates are in the `grow.csv` file under the heading `Lgrate`. All monetary variables are assumed to grow each year at the same rate as nominal per-worker GDP grows. The projected growth rates in real GDP per worker and in the GDP deflator are in the `grow.csv` file under the headings `Qgrate` and `Pgrate`, respectively. These projected rates are converted into growth factors and written to the `growth_factors.csv` file by the `grow.py` script.
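The rate-to-factor conversion can be sketched like this. The decimal-fraction rate units, the `YEAR` column, and the output column names are assumptions for illustration, not the actual `grow.py` code; the key relationship is that nominal per-worker GDP growth compounds the real per-worker growth rate with the deflator growth rate.

```python
import csv

def growth_factors(grow_csv_path: str, out_path: str) -> list:
    """Convert projected growth rates into multiplicative growth factors.

    Assumes grow.csv has YEAR, Lgrate, Qgrate, Pgrate columns with rates
    expressed as decimal fractions (e.g., 0.03 for 3 percent).
    """
    with open(grow_csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    out_rows = []
    for r in rows:
        lrate = float(r["Lgrate"])  # employment growth rate
        qrate = float(r["Qgrate"])  # real GDP per worker growth rate
        prate = float(r["Pgrate"])  # GDP deflator growth rate
        out_rows.append({
            "YEAR": r["YEAR"],
            # weights grow with projected employment
            "WFACTOR": 1.0 + lrate,
            # monetary amounts grow with nominal per-worker GDP
            "AFACTOR": (1.0 + qrate) * (1.0 + prate),
        })
    with open(out_path, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=["YEAR", "WFACTOR", "AFACTOR"])
        w.writeheader()
        w.writerows(out_rows)
    return out_rows
```

For example, 2 percent real per-worker growth combined with 1 percent deflator growth gives a monetary growth factor of 1.02 × 1.01 = 1.0302.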
After gaining experience using the model, these simple assumptions could be made more complicated. And certainly, as time passes and the Ministry of Finance revises its baseline macroeconomic projections, the values in the `grow.csv` file will be updated.
In early April 2021, the baseline growth factors were revised to reflect a new Ministry of Finance macroeconomic projection.
And again in April 2023, the baseline growth factors were revised to reflect a new Ministry of Finance macroeconomic projection.