Tax Data Preparation Overview
This document provides an overview of how the CSV-formatted input files required by the model have been prepared. The methods used in this initial preparation are simple and can be refined later after gaining experience using the model. This overview contains several sections:
- Source of Income Tax Information for 2019
- Source of Model Input Data for 2019
- Preparing Model Input Data for 2019
- --- Unadjusted Input Data for 2019
- --- Adjusted Input Data for 2019
- Extrapolating Input Data beyond 2019
Source of Income Tax Information for 2019
Our understanding of the Ethiopia personal income tax has been gained from reading several documents, the most important of which is a recent Institute for Fiscal Studies (IFS) paper by Tom Harris and Edris Seid, entitled 2019/20 survey of the Ethiopian tax system. In addition to a helpful description of the personal income tax, this paper contains some aggregate totals for 2019 tax revenue in several different categories. These aggregate totals, which were obtained from the Ethiopia Ministry of Finance, are used in the data preparation process described below.
Also, the KPMG 2019 fiscal guide contains a useful summary of the four Ethiopia personal income tax schedules.
Tax Category | Income Type | Tax Schedule |
---|---|---|
Schedule A | employment income | progressive rate schedule on monthly income |
Schedule B | net building rent | progressive rate schedule on annual income |
Schedule C | net business income | progressive rate schedule on annual income |
Schedule D | other income | various flat tax rates |
The progressive rate schedules are the same for all income types: there is a zero rate bracket, followed by six brackets with rates increasing from 10% to 35% in five-percentage-point increments. Each type of income is taxed in isolation from the other types, so, for example, business losses cannot be used to reduce other positive incomes.
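The following minimal sketch illustrates how a progressive schedule of this shape can be applied to each income type in isolation. The rates follow the description above, but the bracket thresholds in `BRACKETS` are hypothetical placeholders, not the statutory values, and the function names are illustrative only.

```python
# Marginal rates follow the text: a zero-rate bracket, then 10% to 35% in
# five-point steps. The thresholds (annual ETB) are HYPOTHETICAL placeholders.
BRACKETS = [
    (0.0, 0.00),
    (10_000.0, 0.10),
    (20_000.0, 0.15),
    (40_000.0, 0.20),
    (60_000.0, 0.25),
    (90_000.0, 0.30),
    (130_000.0, 0.35),
]


def progressive_tax(income, brackets=BRACKETS):
    """Return the tax on one income amount under a progressive schedule."""
    if income <= 0:
        return 0.0  # losses and zero income generate no tax liability
    tax = 0.0
    for i, (lower, rate) in enumerate(brackets):
        upper = brackets[i + 1][0] if i + 1 < len(brackets) else float("inf")
        if income > lower:
            # tax only the slice of income that falls inside this bracket
            tax += (min(income, upper) - lower) * rate
    return tax


def schedule_taxes(wageinc, rentinc, sempinc):
    """Tax each income type separately, so a loss never offsets other income."""
    return {
        "A": progressive_tax(wageinc),
        "B": progressive_tax(rentinc),
        "C": progressive_tax(sempinc),
    }
```

Because each schedule is computed separately, a business loss passed as `sempinc` yields zero Schedule C tax and leaves the Schedule A and Schedule B amounts unchanged.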
After an investigation into whether employment (that is, wage and salary) income could be estimated only for the months that individuals were actually working, it was concluded that the survey data could not support this kind of detail. As a result, annual employment income is used to simulate Schedule A tax liability.
Source of Model Input Data for 2019
The ultimate source of model input data is the Ethiopia Socioeconomic Survey 2018-2019. Rather than using the original survey data, we have used a harmonized version of these data that appears in the UN Food and Agriculture Organization (FAO) Rural Livelihoods Information System, which is known as RuLIS. In order to understand how the survey variables are coded in the two confidential (household and individual) Stata files, we used the publicly available RuLIS codebook. Also, an FAO country brief based on these survey data was useful.
Preparing Model Input Data for 2019
The first step in preparing survey data for use as tax model input is to map variables in the survey data into variables that can be used as input for the tax model. The following table shows the mapping used here, where the sum of the survey variables in each tax category is set equal to the model variable.
Tax Category | Survey Variable(s) | Model Variable |
---|---|---|
Schedule A | wge1, wge2, wge3, wge4 | wageinc |
Schedule B | nofarmrnt | rentinc |
Schedule C | onfarmincome, selfemp | sempinc |
Schedule D | otherinc | finainc |
The wage variables in the survey data are for the agricultural sector and for three non-agricultural sectors, all four of which are reported for each individual in the survey sample. The other survey variables are reported only at the household level. When there are both adult males and adult females in the household, the household amount is split evenly between the two oldest household members; otherwise the complete household amount is assigned to the oldest household member. Notice that the survey data include both self-employment income from farming and self-employment income from non-agricultural business. Given the limitations of the survey data, the only component of Schedule D income available is income from financial assets.
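The mapping and the household-splitting rule described above can be sketched as follows. This is an illustration only, not the code in the preparation script: the `ADULT_AGE` cutoff, the treatment of missing values as zero, and the helper names are all assumptions.

```python
ADULT_AGE = 18  # ASSUMPTION: the text does not say how "adult" is defined

WAGE_VARS = ("wge1", "wge2", "wge3", "wge4")  # individual-level survey variables
HOUSEHOLD_MAPPING = {  # model variable -> household-level survey variables
    "rentinc": ("nofarmrnt",),
    "sempinc": ("onfarmincome", "selfemp"),
    "finainc": ("otherinc",),
}


def total(record, names):
    """Sum the named survey variables, treating missing values as zero."""
    return sum(float(record.get(name) or 0.0) for name in names)


def split_household_amount(members, amount):
    """Return each member's share of a household-level amount.

    members is a list of dicts with 'age' and 'female' (0 or 1) keys.
    """
    if not members:
        return []
    adults = [m for m in members if m["age"] >= ADULT_AGE]
    both_sexes = {m["female"] for m in adults} == {0, 1}
    # positions of household members, oldest first
    oldest_first = sorted(
        range(len(members)), key=lambda i: members[i]["age"], reverse=True
    )
    shares = [0.0] * len(members)
    if both_sexes:
        for i in oldest_first[:2]:  # split evenly between the two oldest
            shares[i] = amount / 2.0
    else:
        shares[oldest_first[0]] = amount  # all of it goes to the oldest member
    return shares
```

For a household with a 40-year-old man, a 35-year-old woman, and a 10-year-old child, a household amount of 1000 is split 500/500 between the two adults and the child receives nothing.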
In the rest of this section we describe the simulated tax results from using the survey data constructed as described above. These unadjusted input data are in the raw.csv and raw_weights.csv files. The aggregate population and personal income tax revenue estimates generated using these unadjusted data do not always come close to the aggregate values we know from administrative sources. So, we then use the same variable mapping and add some adjustments to sampling weights and variable values to generate adjusted input data that are in the pop.csv and pop_weights.csv files. This section concludes with a comparison of the aggregate population and personal income tax revenue estimates generated using these adjusted data with those from administrative sources.
In addition to the above income variables, the model input data include three demographic variables (none of which are involved in income tax calculations): an integer age variable, an urban (rather than rural) dummy variable, and a female (rather than male) dummy variable. Also included, and not used in tax calculations, are three household variables: the id of the individual's household (hhid), the number of people in the individual's household (hhsize), and the adult composition of the individual's household (hhcomp). These three household variables are included to facilitate post-simulation analysis conducted at the household level. And finally, each of the two sets of input variables includes a dummy variable called nonfiler, which will be explained below. Both sets of input data files contain 28,719 individuals (many of whom are children) living in 6,770 households.
The details of the data preparation process can be found in the data/prep.py script.
--- Unadjusted Input Data for 2019
The unadjusted input files are created by using the variable mapping described above and making no other changes to the individual variable amounts (in the raw.csv file) or to the individual sampling weights (in the raw_weights.csv file). The nonfiler variable is defined so that it equals one if the individual has no positive income of any type and equals zero if the individual has some positive income. Only individuals with a nonfiler value of zero are assumed to file and pay income taxes.
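The nonfiler definition above amounts to a one-line rule. The sketch below shows that rule together with a weighted count of filers; the record layout (a dict per individual and a parallel list of weights) and the function names are assumptions made for illustration.

```python
INCOME_VARS = ("wageinc", "rentinc", "sempinc", "finainc")


def nonfiler_flag(person):
    """Return 1 if the person has no positive income of any type, else 0."""
    return 0 if any(person[var] > 0 for var in INCOME_VARS) else 1


def weighted_filers(people, weights):
    """Sum the sampling weights of individuals whose nonfiler flag is zero."""
    return sum(w for p, w in zip(people, weights) if nonfiler_flag(p) == 0)
```

Note that an individual whose only nonzero income is a loss is still flagged as a nonfiler, because the rule looks only at positive amounts.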
Using these unadjusted input data files produces the following 2019 population estimates (in millions of individuals) and 2019 aggregate personal income tax revenue estimates (in billions of Ethiopian birr):
Statistic | Model Estimate | Administrative Total |
---|---|---|
Population | 89.964 | 112.079 |
Tax Filers | 30.223 | ? |
Tax Category | Model Estimate | Administrative Total |
---|---|---|
Schedule A | 14.613 | 41.202 |
Schedule B | 1.510 | 2.138 |
Schedule C | 19.254 | 14.738 |
Schedule D | 0.020 | ? |
The administrative population total is an official United Nations statistic and the administrative aggregate tax revenues are from the IFS paper. The Schedule D administrative total for the personal income tax is not known because the total in the IFS paper includes other types of taxes.
These tables show three substantial differences between the unadjusted estimates and the administrative totals.
First, the model population estimate is too low, being only about 80% of the actual population.
Second, the model estimated Schedule A (wage and salary income) tax revenue is too low, being only about 35% of the administrative total.
And third, the model estimated Schedule C (self-employment income) tax revenue is too large, being about 30% larger than the administrative total.
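The three percentages quoted above follow directly from the table values. A short check, using only numbers copied from the tables (the dictionary and variable names are illustrative):

```python
# (model estimate, administrative total) copied from the unadjusted tables
UNADJUSTED = {
    "Population": (89.964, 112.079),   # millions of individuals
    "Schedule A": (14.613, 41.202),    # billions of ETB
    "Schedule C": (19.254, 14.738),    # billions of ETB
}

# ratio of the model estimate to the administrative total
ratios = {name: model / admin for name, (model, admin) in UNADJUSTED.items()}

for name, ratio in ratios.items():
    print(f"{name}: model estimate is {ratio:.1%} of the administrative total")
```

The population ratio is about 0.80, the Schedule A ratio about 0.35, and the Schedule C ratio about 1.31.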
In the next section, we discuss the three assumptions that we make to generate adjusted model input files, and show how the adjusted input files generate model estimates that are much closer to administrative totals.
--- Adjusted Input Data for 2019
The minimally-intrusive adjustment procedure involves a multiplicative scaling up of urban weights and urban wageinc amounts, and then establishing a positive sempinc threshold below which individuals are assumed to be nonfilers. This approach requires the least adjustment of the raw data because, even though the urban population is smaller than the rural population, the urban population has the vast bulk of wage and salary income. The values of the adjustment parameters used to generate the adjusted input files are shown in the following table:
Adjustment Parameter | Unadjusted Value | Adjusted Value |
---|---|---|
URBAN_WEIGHT_SCALING_FACTOR | 1.0 | 1.9308 |
URBAN_WAGEINC_SCALING_FACTOR | 1.0 | 1.3380 |
SEMPINC_NONFILER_THRESHOLD | 0 | 80,000 |
The threshold value in the above table is expressed in Ethiopian birr (ETB), which had an exchange rate against the US dollar (in US dollars per birr) that, over the course of 2019, started at 0.0356, ended at 0.0313, and averaged about 0.0343. This means that the SEMPINC_NONFILER_THRESHOLD value was equivalent to roughly 2,750 US dollars.
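The adjustment procedure can be sketched as below, using the parameter values from the table. The scaling steps follow the text directly. The nonfiler rule is one possible reading of the threshold and is an assumption: an individual is treated as a filer if they have positive wage, rent, or financial income, or if their sempinc exceeds the threshold. With a threshold of zero this reading reduces to the unadjusted nonfiler definition. The function name and record layout are also illustrative.

```python
URBAN_WEIGHT_SCALING_FACTOR = 1.9308
URBAN_WAGEINC_SCALING_FACTOR = 1.3380
SEMPINC_NONFILER_THRESHOLD = 80_000.0  # ETB
AVG_USD_PER_ETB_2019 = 0.0343          # average 2019 rate quoted in the text


def adjust(person, weight):
    """Return an adjusted copy of one individual's record and weight."""
    adjusted = dict(person)
    if adjusted["urban"] == 1:
        # scale up both the sampling weight and the wage income of urban people
        weight = weight * URBAN_WEIGHT_SCALING_FACTOR
        adjusted["wageinc"] = adjusted["wageinc"] * URBAN_WAGEINC_SCALING_FACTOR
    # ASSUMPTION: the threshold matters only when sempinc is the sole income
    other_income = any(adjusted[v] > 0 for v in ("wageinc", "rentinc", "finainc"))
    large_sempinc = adjusted["sempinc"] > SEMPINC_NONFILER_THRESHOLD
    adjusted["nonfiler"] = 0 if (other_income or large_sempinc) else 1
    return adjusted, weight


# the threshold expressed in US dollars at the average 2019 exchange rate
threshold_usd = SEMPINC_NONFILER_THRESHOLD * AVG_USD_PER_ETB_2019
```

The value of `threshold_usd` is 2,744, which is consistent with the rounded figure given above.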
Using the input data files adjusted with the above parameters produces the following 2019 population estimates (in millions of individuals) and 2019 aggregate personal income tax revenue estimates (in billions of Ethiopian birr):
Statistic | Model Estimate | Administrative Total |
---|---|---|
Population | 112.079 | 112.079 |
Tax Filers | 21.579 | ? |
Tax Category | Model Estimate | Administrative Total |
---|---|---|
Schedule A | 41.201 | 41.202 |
Schedule B | 2.475 | 2.138 |
Schedule C | 14.917 | 14.738 |
Schedule D | 0.026 | ? |
The tables above show that the adjusted input files generate model estimates that are fairly close to the administrative totals. Further experience with the model, or the availability of additional administrative totals, may suggest a better adjustment procedure.
Extrapolating Input Data beyond 2019
The prior sections describe how the 2019 input data files and the 2019 weights files have been generated. In order to use the model in years after 2019, the weights and values of each monetary variable need to be extrapolated to subsequent years.
There are many sensible ways of doing this. We start the modeling process by specifying a simple set of extrapolation assumptions that have been made with the idea that they would be easy to explain and would be modified as experience with the model is gained.
The weights are assumed to grow at the same rate as the number of people is projected to grow in a recent UN population projection (see below). The projected rates are in the grow.csv file under the heading Lgrate.
All monetary variables are assumed to grow each year at the same rate as nominal per-worker GDP grows. The projected growth rates in real GDP per capita and in a price deflator are in the grow.csv file under the headings Qgrate and Pgrate, respectively. These projected rates, which are from the African Economic Outlook 2022 (see below), are converted into growth factors and written to the growth_factors.csv file by the grow.py script.
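One way to turn annual growth rates into cumulative growth factors is sketched below. It is not a copy of the script mentioned above, and it rests on several assumptions: rates are expressed as decimal fractions, the rate for a year describes growth from the prior year, real and price growth are compounded multiplicatively to get nominal growth, and factors are cumulative relative to 2019. The rate values in the usage example are invented for illustration.

```python
def growth_factors(rates, base_year=2019):
    """Convert annual growth rates into cumulative factors from base_year.

    rates maps year -> {"Lgrate": ..., "Qgrate": ..., "Pgrate": ...}, with
    each rate a decimal fraction (ASSUMPTION; the file may use percentages).
    """
    factors = {base_year: {"weight": 1.0, "money": 1.0}}
    weight_factor = 1.0
    money_factor = 1.0
    for year in sorted(y for y in rates if y > base_year):
        rate = rates[year]
        # weights grow with the projected population
        weight_factor *= 1.0 + rate["Lgrate"]
        # monetary amounts grow with real growth compounded with inflation
        money_factor *= (1.0 + rate["Qgrate"]) * (1.0 + rate["Pgrate"])
        factors[year] = {"weight": weight_factor, "money": money_factor}
    return factors


# Usage with HYPOTHETICAL rates (not the values in the growth-rate file)
example_rates = {
    2020: {"Lgrate": 0.02, "Qgrate": 0.03, "Pgrate": 0.10},
    2021: {"Lgrate": 0.02, "Qgrate": 0.03, "Pgrate": 0.10},
}
example_factors = growth_factors(example_rates)
```

Under this sketch, a 2019 weight would be multiplied by `example_factors[year]["weight"]` and a 2019 monetary amount by `example_factors[year]["money"]` to obtain the corresponding value for a later year.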
After gaining experience using the model, these simple assumptions could be made more complicated. And certainly, as time passes and baseline macro projections change, the values in the grow.csv file will be updated.
UN Population Projection: Ethiopia
African Economic Outlook 2022: Ethiopia