ETI-Tax-Analyzer: Data Preparation

Tax Data Preparation Overview

This document provides an overview of how the CSV-formatted input files required by the model have been prepared. The methods used in this initial preparation are simple and can be refined later after gaining experience using the model. This overview contains several sections:

Source of Income Tax Information for 2019
Source of Model Input Data for 2019
Preparing Model Input Data for 2019
--- Unadjusted Input Data for 2019
--- Adjusted Input Data for 2019
Extrapolating Input Data beyond 2019

Source of Income Tax Information for 2019

Our understanding of the Ethiopia personal income tax has be gained from reading several documents, the most important of which is a recent Institute for Fiscal Studies paper by Tom Harris and Edris Seid, entitled 2019/20 survey of the Ethiopian tax system. In addition to a helpful description of the personal income tax, this paper contains some aggregate totals for 2019 tax revenue in several different categories. These aggregates totals, which were obtained from the Ethiopia Ministry of Finance, will be used in the data preparation process described below.

Also, the KPGM 2019 fiscal guide contains a useful summary of the four Ethiopia personal income tax schedules.

Tax Category	Income Type	Tax Schedule
Schedule A	employment income	progressive rate schedule on monthly income
Schedule B	net building rent	progressive rate schedule on annual income
Schedule C	net business income	progressive rate schedule on annual income
Schedule D	other income	various flat tax rates

The progressive rate schedules are the same for all income types: there is a zero rate bracket, followed by six brackets with rates increasing from 10% to 35% in five-percentage-point increments. Each type of income is taxed in isolation from the other types, so, for example, business losses cannot be used to reduce other positive incomes.

After an investigation into whether employment (that is, wage and salary) income could be estimated only for the months that individuals were actually working, it was concluded that the survey data could not support this kind of detail. As a result, annual employment income is used to simulation Schedule A tax liability.

Source of Model Input Data for 2019

The ultimate source of model input data is the Ethiopia Socioeconomic Survey 2018-2019. Rather than using the original survey data, we have used a harmonized version of these data that appear in the UN Food and Agriculture Organization (FAO) Rural Livelihoods Information System, which is known as RuLIS. In order to understand how the survey variables are coded in the two confidential (household and individual) Stata files, we used the publicly available RuLIS codebook. Also, an FAO country brief based on these survey data was useful.

Preparing Model Input Data for 2019

The step in preparing survey data for use as tax model input is to map variables in the survey data into variables that can be used as input for the tax model. The following table show the mapping used here, where the sum of the survey variables in each tax category are set equal to the model variable.

Tax Category	Survey Variable(s)	Model Variable
Schedule A	`wge1`, `wge2`, `wge3`, `wge4`	`wageinc`
Schedule B	`nofarmrnt`	`rentinc`
Schedule C	`onfarmincome`, `selfemp`	`sempinc`
Schedule D	`otherinc`	`finainc`

The wage variables in the survey data are for the agricultural sector and for three non-agricultural sectors, all four of which are reported for each individual in the survey sample. The other survey variables are reported only at the household level. When there are both adult males and adult females in the household, the household amount is evenly split among the two oldest household members, otherwise the complete household amount is assigned to the oldest household member. Notice that the survey data include both self-employment income from farming and self-employment income from non-agricultural business. Given the limitation of the survey data, the only components of Schedule D income available were income from financial assets.

In the rest of this section we describe the simulated tax results from using the survey data constructed as described above. These unadjusted input data are in the raw.csv and raw_weights.csv files. The aggregate population and personal income tax revenue estimates generated using these unadjusted data do not always come close to the aggregate values we know from administrative sources. So, we then use the same variable mapping and add some adjustments to sampling weights and variable values to generate adjusted input data that are in the pop.csv and pop_weights.csv files. This section concludes with a comparison of the aggregate population and personal income tax revenue estimates generated using these adjusted data with those from administrative sources.

In addition to the above income variables, the model input data includes three demographic variables (none of which are involved in income tax calculations): an integer age variable, an urban (rather than rural) dummy variable, and a female (rather than male) dummy variable. Also included, and not used in tax calculations, are three household variables: the id of the individual's household (hhid), the number of people in the individual's household (hhsize), and the adult composition of the individual's household (hhcomp). These three household variables are included to facilitate post-simulation analysis conducted at the household level.

And finally, each of the two sets of input variables includes a dummy variable called nonfiler, which will be explained below. Both sets of input data files contain 28,719 individuals (many of whom are children) living in 6770 households.

The details of the data preparation process can be found in the data/prep.py script.

--- Unadjusted Input Data for 2019

The unadjusted input files are created by using the variable mapping described above and making no other changes to the individual variable amounts (in the raw.csv file) or to the individual sampling weights (in the raw_weights.csv file). The nonfiler variable is defined so that it equals one if the individual has no positive income of any type and equals zero if the individual has some positive income. Only individuals with a nonfiler value of zero are assumed to file and pay income taxes.

Using these unadjusted input data files produces the following 2019 population estimates (in millions of individuals) and 2019 aggregate personal income tax revenue estimates (in billions of Ethiopian birr):

Statistic	Model Estimate	Administrative Total
Population	89.964	112.079
Tax Filers	30.223	?

Tax Category	Model Estimate	Administrative Total
Schedule A	14.613	41.202
Schedule B	1.510	2.138
Schedule C	19.254	14.738
Schedule D	0.020	?

The administrative population total is an official United Nations statistic and the administrative aggregate tax revenues are from the IFS paper. The Schedule D administrative total for the personal income tax is not known because the total in the IFS paper includes other types of taxes.

These tables show three substantial differences between the unadjusted estimates and the administrative totals.

First, the model population estimate is too low, being only about 80% of the actual population.

Second, the model estimated Schedule A (wage and salary income) tax revenue is too low, being only about 35% of the administrative total.

And third, the model estimated Schedule C (self-employment income) tax revenue is too large, being about 30% larger that the administrative total.

In the next section, we discuss the three assumptions that we make to generate adjusted model input files, and show how the adjusted input files generate model estimates that are much closer to administrative totals.

--- Adjusted Input Data for 2019

The minimally-intrusive adjustment procedure involves a multiplicative scaling up of urban weights and urban wageinc amounts, and then establishing a positive sempinc threshold below which individuals are assumed to be nonfilers. This approach requires the least adjustment of the raw data because, even though the urban population is smaller than the rural population, the urban population has the vast bulk of wage and salary income. The values of the adjustment parameters used to generate the adjusted input files are shown in the following table:

Adjustment Parameter	Unadjusted Value	Adjusted Value
`URBAN_WEIGHT_SCALING_FACTOR`	1.0	1.9308
`URBAN_WAGEINC_SCALING_FACTOR`	1.0	1.3380
`SEMPINC_NONFILER_THRESHOLD`	0	80,000

The threshold value in the above table is expressed in Ethiopian birr (ETB), which had an exchange rate against the US dollar that, over the course of 2019, started at 0.0356, ended at 0.0313, and averaged about 0.0343. This means that the SEMPINC_NONFILER_THRESHOLD value was equivalent to roughly 2750 US dollars.

Using the input data files that have been adjusted using the above parameters, produces the following 2019 population estimates (in millions of individuals) and 2019 aggregate personal income tax revenue estimates (in billions of Ethiopian birr):

Statistic	Model Estimate	Administrative Total
Population	112.079	112.079
Tax Filers	21.579	?

Tax Category	Model Estimate	Administrative Total
Schedule A	41.201	41.202
Schedule B	2.475	2.138
Schedule C	14.917	14.738
Schedule D	0.026	?

The tables above show that the adjusted input files generate model estimates that are fairly close to the administrative totals. Further experience with the model, or the availability of additional administrative totals, may suggest a better adjustment procedure.

Extrapolating Input Data beyond 2019

The prior sections describe how the 2019 input data files and the 2019 weights files have been generated. In order to use the model in years after 2019, the weights and values of each monetary variable need to be extrapolated to subsequent years.

There are many sensible ways of doing this. We start the modeling process by specifying a simple set of extrapolation assumptions that have been made with the idea that they would be easy to explain and would be modified as experience with the model is gained.

The weights are assumed to grow at the same rate as the number of people is projected to grow in a recent UN population projection (see below). The projected rates are in the grow.csv file under the heading Lgrate. All monetary variables are assumed to grow each year at the same rate as nominal per-worker GDP grows. The projected growth rates in real GDP per capita and in a price deflator are in the grow.csv file under the headings Qgrate and Pgrate, respectively. These projected rates, which are from the African Economic Outlook 2022 (see below), are converted into growth factors and written to the growth_factors.csv file by the grow.py script.

After gaining experience using the model, these simple assumptions could be made more complicated. And certainly, as time passes and baseline macro projections change, the values in the grow.csv file will be updated.

UN Population Projection: Ethiopia

UN-pop-projection

African Economic Outlook 2022: Ethiopia

AEO-econ-projection