Tax-Analyzer-Framework

Income Survey Data Preparation Guide

Adjusting Data to Generate Known Aggregate Totals

Consider a situation in which the country's government has not made tax return data available, but income data exist from a recent sample survey of the country's population. In addition, several basic aggregate statistics are known about the population and personal income tax (PIT) revenues for the survey year. This kind of problem was encountered when developing a PIT microsimulation model for Ethiopia.

Solution Strategy

Given the scarcity of data, the best that can be done is to pursue a two-step strategy: (1) translate the income survey data into pseudo tax return data, and (2) use simple calibration methods to adjust the pseudo weights and variables so that the adjusted pseudo data can generate the known aggregate statistics.

The first translation step involves breaking households into pseudo tax filing units, because the tax unit is the person rather than the family. Survey income variables are then combined to represent (as closely as possible) taxable income variables. The Ethiopia PIT taxes employment income (that is, wages and salaries), building rent, and self-employment income separately, applying the same progressive rate structure to each. A number of other types of income are taxed at various flat rates.
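The household-to-person split can be sketched as follows. This is a minimal illustration, not the guide's actual translation code: the field names (`hh_weight`, `members`, `wages`, `rent`, `business`) and the `TaxUnit` class are hypothetical, and the sketch assumes that each person inherits the household's sampling weight.

```python
from dataclasses import dataclass


@dataclass
class TaxUnit:
    """One pseudo tax filing unit (one person). All names are hypothetical."""
    weight: float    # sampling weight, assumed inherited from the household
    urban: bool      # True if the household is in an urban area
    wageinc: float   # employment income (wages and salaries)
    rentinc: float   # building rent
    sempinc: float   # self-employment income


def household_to_units(household: dict) -> list[TaxUnit]:
    """Split one surveyed household into one pseudo tax unit per member."""
    units = []
    for person in household["members"]:
        units.append(
            TaxUnit(
                weight=household["hh_weight"],
                urban=household["urban"],
                wageinc=person.get("wages", 0.0),
                rentinc=person.get("rent", 0.0),
                sempinc=person.get("business", 0.0),
            )
        )
    return units


# Made-up household with two members.
household = {
    "hh_weight": 250.0,
    "urban": True,
    "members": [
        {"wages": 60000.0},
        {"business": 35000.0, "rent": 12000.0},
    ],
}
units = household_to_units(household)
```

Each resulting unit carries only the income components that the PIT taxes separately, so the tax calculation can treat every person independently of the rest of the household.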

After constructing these pseudo tax filing units and assuming that everyone with positive income files, the microsimulation model produces a total population that is too small, employment-income revenue that is too small, building-rent revenue that is too small, and self-employment-income revenue that is too large, all relative to the known aggregate statistics.

The second adjustment step involves calibrating the value of each of three adjustment parameters: the URBAN_WEIGHT_SCALING_FACTOR, the URBAN_WAGEINC_SCALING_FACTOR, and the SEMPINC_NONFILER_THRESHOLD. The multiplicative weight factor is applied to the sampling weight of each urban tax filing unit. The multiplicative wageinc factor is applied to the employment income of each urban tax filing unit. The threshold defines the level of self-employment income below which pseudo tax units with positive self-employment income are assumed not to file.
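How the three parameters act on a single pseudo tax unit can be sketched as below. The parameter names come from the text; the unit field names (`urban`, `weight`, `wageinc`, `sempinc`) and the `semp_nonfiler` flag are hypothetical, and treating "below" as a strict inequality is an assumption.

```python
def adjust_unit(unit: dict, params: dict) -> dict:
    """Return a copy of one pseudo tax unit with the adjustments applied."""
    adjusted = dict(unit)
    if unit["urban"]:
        # Both multiplicative factors apply to urban units only.
        adjusted["weight"] = unit["weight"] * params["URBAN_WEIGHT_SCALING_FACTOR"]
        adjusted["wageinc"] = unit["wageinc"] * params["URBAN_WAGEINC_SCALING_FACTOR"]
    # Units with positive self-employment income below the threshold
    # are flagged as assumed non-filers (strict "<" is an assumption).
    sempinc = unit["sempinc"]
    adjusted["semp_nonfiler"] = 0.0 < sempinc < params["SEMPINC_NONFILER_THRESHOLD"]
    return adjusted


# Made-up units; the parameter values shown are illustrative inputs.
params = {
    "URBAN_WEIGHT_SCALING_FACTOR": 1.9308,
    "URBAN_WAGEINC_SCALING_FACTOR": 1.3380,
    "SEMPINC_NONFILER_THRESHOLD": 80_000.0,
}
urban_unit = {"urban": True, "weight": 100.0, "wageinc": 1000.0, "sempinc": 50_000.0}
rural_unit = {"urban": False, "weight": 100.0, "wageinc": 1000.0, "sempinc": 90_000.0}

adjusted_urban = adjust_unit(urban_unit, params)
adjusted_rural = adjust_unit(rural_unit, params)
```

With the unadjusted values (both factors equal to 1.0 and a threshold of 0), the function returns every unit unchanged and flags no unit as a non-filer.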

Applying these three adjustment parameters sequentially is a minimally intrusive adjustment procedure. The weight factor is used to hit the total population target, the wageinc factor is used to hit the target revenue from the taxation of employment income, and the threshold is used to hit the target revenue from the taxation of self-employment income. Because the calibration is sequential, with one parameter adjusted against one target at a time, it can be done by hand, without an automated numerical solver, following the procedure described in the Solution Methods section below.

Solution Results

Solving this problem in the manner described above produced a plausible data set that could be used as input for the Ethiopia PIT microsimulation model.

After the first translation step, the unadjusted input data files produced the following results for 2019 population estimates (in millions of individuals) and 2019 aggregate PIT revenue estimates (in billions of Ethiopian birr):

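| Statistic | Model Estimate (millions) | Aggregate Statistic (millions) |
|---|---|---|
| Population | 89.964 | 112.079 |
| Tax Filers | 30.223 | ? |

| Tax Category | Model Estimate (billions of birr) | Aggregate Statistic (billions of birr) |
|---|---|---|
| employment income | 14.613 | 41.202 |
| building rent | 1.510 | 2.138 |
| self-employment income | 19.254 | 14.738 |
| other income | 0.020 | ? |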

The calibration process produced these values for the adjustment factors:

| Adjustment Parameter | Unadjusted Value | Adjusted Value |
|---|---|---|
| URBAN_WEIGHT_SCALING_FACTOR | 1.0 | 1.9308 |
| URBAN_WAGEINC_SCALING_FACTOR | 1.0 | 1.3380 |
| SEMPINC_NONFILER_THRESHOLD | 0 | 80,000 |
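As a rough consistency check on the weight factor, suppose that the model's population estimate is the sum of the unit weights and that only urban weights are rescaled. Both are assumptions made for this illustration. Writing U and R (hypothetical symbols) for the unadjusted urban and rural populations in millions, the unadjusted and target population totals give:

```latex
\begin{aligned}
U + R &= 89.964 \\
1.9308\,U + R &= 112.079 \\
\Rightarrow\quad U &= \frac{112.079 - 89.964}{1.9308 - 1}
                    = \frac{22.115}{0.9308} \approx 23.76
\end{aligned}
```

Under these assumptions, roughly 23.8 million of the unadjusted 89.964 million people would be in urban units. The figure is approximate because the reported factor is rounded to four decimal places.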

After applying the above factors, the adjusted input data files produced the following results:

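| Statistic | Model Estimate (millions) | Aggregate Statistic (millions) |
|---|---|---|
| Population | 112.079 | 112.079 |
| Tax Filers | 21.579 | ? |

| Tax Category | Model Estimate (billions of birr) | Aggregate Statistic (billions of birr) |
|---|---|---|
| employment income | 41.201 | 41.202 |
| building rent | 2.475 | 2.138 |
| self-employment income | 14.917 | 14.738 |
| other income | 0.026 | ? |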

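The tables above show that the adjusted input files match the population and employment-income revenue targets almost exactly (112.079 versus 112.079 million people, and 41.201 versus 41.202 billion birr) and come within about 1.2 percent of the self-employment-income revenue target (14.917 versus 14.738 billion birr). Building-rent revenue, which none of the three adjustment parameters targets, rises from 1.510 to 2.475 billion birr and now exceeds the known total of 2.138 billion birr by about 16 percent.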

Solution Methods

Details of the first data translation step for the 2019 Ethiopia income survey data are available here, and details of the second data adjustment process are described here.

The calibration of each of the three adjustment parameters uses a numerical technique called root finding. The process of generating adjusted pseudo tax return data from the income survey data, and then using those data as microsimulation model input to generate an aggregate statistic for which we have a known value, can be thought of as a simple mathematical function with one input variable (the adjustment parameter) and one output value (the simulated statistic). We can try a number of values for the input variable, looking for the value that drives the difference between the simulated and known values of the aggregate statistic to zero. The input value at which this difference is zero is called the root of the difference function.
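In symbols, using hypothetical notation in which x is the adjustment parameter, S(x) is the statistic simulated with that parameter value, and S* is the known aggregate value, the calibration problem for one parameter is:

```latex
f(x) = S(x) - S^{*}, \qquad \text{find } x^{*} \text{ such that } f(x^{*}) = 0 .
```

Each of the three parameters has its own S(x) and S*: total population for the weight factor, employment-income revenue for the wageinc factor, and self-employment-income revenue for the threshold.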

The basic technique for root finding is simple. Find an input variable value that produces a simulated value of the statistic that is below the known value of the statistic, and find a second input variable value that produces a simulated value of the statistic that is above the known value. Once that is done, the root has been bracketed. Finding the root then involves reducing the width of the bracket until the value of the input variable produces a simulated value of the statistic that is as close to the known value as desired. The simplest way of doing this, known as bisection, is to repeatedly try a new input variable value that is halfway between the two bracket values, and then use this new value to replace one of them (depending on whether its simulated value is above or below the known value). For more on root-finding methods, consult Chapter 9 of Press et al., Numerical Recipes: The Art of Scientific Computing, Third Edition (Cambridge University Press, 2007).
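The bracketing-and-halving procedure described above can be sketched in a few lines. This is an illustrative implementation of bisection, not the code used for the Ethiopia calibration; the function `bisect_root` and the toy `population_gap` function (with made-up numbers) are hypothetical.

```python
from typing import Callable


def bisect_root(f: Callable[[float], float], lo: float, hi: float,
                tol: float = 1e-6, max_iter: int = 200) -> float:
    """Find x in [lo, hi] with f(x) close to zero by repeated halving."""
    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0.0:
        return lo
    if f_hi == 0.0:
        return hi
    if (f_lo > 0) == (f_hi > 0):
        raise ValueError("root is not bracketed: f(lo) and f(hi) have the same sign")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        # Stop on an exact hit or when the bracket is narrow enough.
        if f_mid == 0.0 or 0.5 * (hi - lo) < tol:
            return mid
        # Replace the bracket end whose difference has the same sign as f(mid).
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def population_gap(factor: float) -> float:
    """Toy difference function: simulated minus known population (made-up numbers)."""
    rural, urban, target = 60.0, 20.0, 100.0
    return rural + urban * factor - target


weight_factor = bisect_root(population_gap, lo=1.0, hi=4.0)
```

In a real calibration, each call to `f` would regenerate the adjusted data and rerun the microsimulation model, which is why a method needing only a handful of evaluations, and simple enough to carry out by hand, is attractive.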