Tax Data Preparation Overview
This document provides an overview of how the three CSV-formatted files required by the model have been prepared. The methods used in this initial preparation are simple and can be refined later after gaining experience using the model. This overview contains several sections:
- Preparing Tax Data for 2018
- Missing Tax Data for 2018
- Sampling Tax Data for 2018
- Extrapolating Tax Data beyond 2018
Preparing Tax Data for 2018
The Ministry of Finance has provided complete personal income tax (PIT) return data for 2018: 3,116,422 Form BE returns and 744,425 Form B returns for a total of 3,860,847 returns. Returns for each Form are provided in two files: one containing returns that were filed on time and another containing returns that were filed late.
The merging of the late and on-time files for each Form and the creation of a `late` variable are accomplished by the `preprep.py` script, which writes several intermediate data files for additional processing.
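The merge of each Form's on-time and late files can be sketched as follows. This is only an illustration of the approach, not the actual `preprep.py` code: the `agginc` column name and the DataFrame-based interface are assumptions.

```python
import pandas as pd

def merge_ontime_late(ontime: pd.DataFrame, late: pd.DataFrame) -> pd.DataFrame:
    """Combine on-time and late returns, flagging late filers."""
    ontime = ontime.assign(late=0)  # filed by the deadline
    late = late.assign(late=1)      # filed after the deadline
    return pd.concat([ontime, late], ignore_index=True)
```

The `late` flag preserves filing-status information that would otherwise be lost once the two files are stacked into one.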
Variables on Form BE and Form B have been mapped into a common set of variables and segregated into read variables (that are required model input) and calc variables (that are calculated by the model). This data-preparation step is accomplished by the `prep.py` script, which writes several intermediate data files for additional processing.
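A minimal sketch of this mapping step is shown below. The Form-specific column names are hypothetical, and the real `prep.py` mapping covers many more variables; only `businc_net` and `agginc` are taken from this document.

```python
import pandas as pd

# Hypothetical Form B column names mapped to the common variable names.
FORM_B_MAP = {"b_business_income": "businc_net", "b_aggregate_income": "agginc"}

# Read variables are required model input; calc variables (anything the
# model computes itself) are dropped at this stage.
READ_VARS = ["businc_net", "agginc"]

def map_to_common(df: pd.DataFrame, name_map: dict) -> pd.DataFrame:
    """Rename Form-specific columns and keep only the read variables."""
    renamed = df.rename(columns=name_map)
    return renamed[[v for v in READ_VARS if v in renamed.columns]]
```

Keeping only the read variables guarantees that both Forms end up with the same column set before they are merged.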
The final step in the data-preparation process is to merge the data from the two Forms (which are now represented using a single set of variable names) into a single data frame and then write that data frame to a CSV-formatted file. This final step is accomplished by the `merge.py` script, which writes both a data file containing the 2018 values of the read variables for each return included in the sample and a weights file containing the sampling weights for each return included in the sample for each year in the model projection period. The script also writes a data file and a weights file containing all the returns.
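This final merge-and-write step might look roughly like the sketch below. The `weight2018` column and the one-`WT<year>`-column-per-projection-year layout of the weights file are assumptions for illustration, not the actual `merge.py` design.

```python
import pandas as pd

def write_model_files(form_b: pd.DataFrame, form_be: pd.DataFrame,
                      years: list, data_path: str, weights_path: str):
    """Stack the two (already harmonized) Forms and write the model's
    data file plus a weights file with one weight column per year."""
    data = pd.concat([form_b, form_be], ignore_index=True)
    data.to_csv(data_path, index=False)
    # Every projection-year weight starts from the 2018 sampling weight;
    # a later step grows these weights over time.
    weights = pd.DataFrame({f"WT{y}": data["weight2018"] for y in years})
    weights.to_csv(weights_path, index=False)
    return data, weights
```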
Missing Tax Data for 2018
A few of the returns contain data used as model input that are clearly inconsistent with other data on the return. But the number of returns with such inconsistencies and their monetary amounts are very small. The small number and size of these data inconsistencies implies that imputing missing data is a low-priority activity that could be undertaken at a later time. This section simply documents these inconsistencies.
By inconsistency, we mean returns where the components of a form total are all zero but the form total is positive. Such cases could be fixed by imputing a value to one of the components equal to the form total. Again, this imputation has not been done because the benefits of doing so seem tiny.
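A check for this kind of inconsistency is straightforward to sketch; the component column names used here are illustrative.

```python
import pandas as pd

def find_inconsistencies(df: pd.DataFrame, total_col: str,
                         component_cols: list) -> tuple:
    """Count returns whose component fields are all zero while the
    reported total is positive, and sum the inconsistent totals."""
    mask = (df[component_cols] == 0).all(axis=1) & (df[total_col] > 0)
    return int(mask.sum()), float(df.loc[mask, total_col].sum())
```

Running a check like this for each form total (`businc_net`, `agginc`, `gift_total`, `relief_total`) yields the counts and amounts reported in the tables below.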
The following tables show the number of returns with inconsistencies and their aggregate size (in billions of ringgit) among the Form B and Form BE returns.
Form B Inconsistencies (744,425 total returns):
Variable | Number | Size (RMb) |
---|---|---|
businc_net | 116 | 0.007 |
agginc | 42 | 0.002 |
gift_total | 86 | 0.000 |
relief_total | 5777 | 0.049 |
Form BE Inconsistencies (3,116,422 total returns):
Variable | Number | Size (RMb) |
---|---|---|
businc_net | 0 | 0.000 |
agginc | 369 | 0.026 |
gift_total | 549 | 0.001 |
relief_total | 423 | 0.003 |
Sampling Tax Data for 2018
In the final data-preparation step, the `merge.py` script writes data and weights files for two different samples: a large complete sample containing all the returns and a stratified random sample containing roughly one-tenth of the total number of returns. Having both sets of these data/weights files enables the model to be run using either the complete sample or the random sample.
The random sample was drawn using different sampling probabilities in each of three strata, which are defined by gross income (which is all income reported on the tax form ignoring current and prior year business losses). The gross income floor and ceiling and the sampling probability for each of the three strata are as follows:
Stratum | Floor | Ceiling | Probability |
---|---|---|---|
1 | 0 | 100,000 | 0.05 |
2 | 100,000 | 500,000 | 0.25 |
3 | 500,000 | infinity | 1.00 |
Of course, these sampling assumptions can be changed if necessary.
Using the above stratified sampling assumptions produces a sample with these return counts:
Stratum | AllData | SampledData |
---|---|---|
1 | 3,130,612 | 156,531 |
2 | 688,648 | 172,162 |
3 | 41,587 | 41,587 |
Total | 3,860,847 | 370,280 |
Each of the sampled returns is assigned a 2018 weight equal to the inverse of its sampling probability. So, for example, each sampled return in stratum 2 is assigned a weight of 4.
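The stratified draw and the inverse-probability weights can be sketched as follows, using the income bounds and sampling probabilities from the table above; the `gross_income` column name is an assumption.

```python
import numpy as np
import pandas as pd

# Gross-income floor, ceiling, and sampling probability for each stratum.
STRATA = [(0, 100_000, 0.05),
          (100_000, 500_000, 0.25),
          (500_000, np.inf, 1.00)]

def draw_sample(df: pd.DataFrame, income_col: str = "gross_income",
                seed: int = 0) -> pd.DataFrame:
    """Draw the stratified random sample and attach 2018 weights equal
    to the inverse of each return's sampling probability."""
    rng = np.random.default_rng(seed)
    prob = pd.Series(np.nan, index=df.index)
    for floor, ceiling, p in STRATA:
        in_stratum = (df[income_col] >= floor) & (df[income_col] < ceiling)
        prob[in_stratum] = p
    keep = rng.random(len(df)) < prob
    sample = df[keep].copy()
    sample["weight2018"] = 1.0 / prob[keep]  # e.g., 1/0.25 = 4 in stratum 2
    return sample
```

Because stratum 3 has probability 1.00, every high-income return is retained with weight 1, which is why the stratum 3 counts are identical in both samples.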
This sample produces model results that are virtually indistinguishable from the results produced when using the complete sample, and the model runs much faster when using the sample because it has to do fewer than one-tenth as many tax calculations.
So, for example, when using the complete sample, the model produces (in 34.0 seconds) an aggregate 2018 PIT liability estimate of 26.0497 billion ringgit for current-law policy. When using the random sample described above, the model produces (in 2.9 seconds) an aggregate 2018 PIT liability estimate of 26.1002 billion ringgit. We can compare these two model estimates with the total PIT liability reported on the forms (total tax charged minus credits), which is 26.0971 billion ringgit. All three numbers are quite close: the model estimate using the random sample is only about 0.01 percent higher than the total tax reported on all the forms.
Extrapolating Tax Data beyond 2018
The prior section describes how the 2018 input data file and the 2018 weights have been generated. In order to use the model in years after 2018, the weights and values of each monetary variable need to be extrapolated to subsequent years.
There are many sensible ways of doing this. We start the modeling process by specifying a simple set of extrapolation assumptions that have been made with the idea that they would be easy to explain and would be modified as experience with the model is gained.
The weights are assumed to grow at the same rate as the number of employed people is projected to grow in recent MOF projections. The projected rates are in the `grow.csv` file under the heading `Lgrate`. All monetary variables are assumed to grow each year at the same rate as nominal per-worker GDP grows. The projected growth rates in real GDP per worker and in the GDP deflator are in the `grow.csv` file under the headings `Qgrate` and `Pgrate`, respectively. These projected rates are converted into growth factors and written to the `growth_factors.csv` file by the `grow.py` script.
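The rate-to-factor conversion can be sketched like this. The decimal-fraction rate units, the `YEAR` column, and the output column names are assumptions for illustration, not the actual `grow.py` code; the key relationship is that nominal per-worker GDP growth compounds the real per-worker growth rate with the deflator growth rate.

```python
import csv

def growth_factors(grow_csv_path: str, out_path: str) -> list:
    """Convert projected growth rates into multiplicative growth factors.

    Assumes grow.csv has YEAR, Lgrate, Qgrate, Pgrate columns with rates
    expressed as decimal fractions (e.g., 0.03 for 3 percent).
    """
    with open(grow_csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    out_rows = []
    for r in rows:
        lrate = float(r["Lgrate"])  # employment growth rate
        qrate = float(r["Qgrate"])  # real GDP per worker growth rate
        prate = float(r["Pgrate"])  # GDP deflator growth rate
        out_rows.append({
            "YEAR": r["YEAR"],
            # weights grow with projected employment
            "WFACTOR": 1.0 + lrate,
            # monetary amounts grow with nominal per-worker GDP
            "AFACTOR": (1.0 + qrate) * (1.0 + prate),
        })
    with open(out_path, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=["YEAR", "WFACTOR", "AFACTOR"])
        w.writeheader()
        w.writerows(out_rows)
    return out_rows
```

For example, 2 percent real per-worker growth combined with 1 percent deflator growth gives a monetary growth factor of 1.02 × 1.01 = 1.0302.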
After gaining experience using the model, these simple assumptions could be made more complicated. And certainly, as time passes and the Ministry of Finance revises its baseline macroeconomic projections, the values in the `grow.csv` file will be updated.
In early April 2021, the baseline growth factors were revised to reflect a new Ministry of Finance macroeconomic projection.
And again in April 2023, the baseline growth factors were revised to reflect a new Ministry of Finance macroeconomic projection.