Tax-Analyzer-Framework

Data Preparation Guide

Preparing Data for Personal Income Tax Microsimulation Models

Since 2020 the World Bank Group's global tax team has sponsored the development of an open-source Tax Analyzer Framework that makes it easier to develop microsimulation models. Models developed using this Framework have a full-featured graphical user interface (GUI), so the models can be used by people who have no computer programming skills. The Framework has matured over the years and has been used to develop models in several countries.

Since 2022 the global tax team has sponsored work that would make it easier to construct the data set used as input for a model.

Typically a model uses a sample of tax returns as input data. This approach works well when estimating the aggregate revenue and distributional effects of tax reforms that affect only those who are filing tax returns under current-law policy. While adequate in many cases, there can be a variety of problems with using tax return data as model input:

  1. Tax return data might result in such a large input data file that model execution times are too long and the dump output of a model run is so large that it makes post-simulation data analysis unwieldy. In such cases, drawing a stratified random sample from the full data file can be a useful approach.

  2. Even when tax return data are available, they may be incomplete given the country's tax administration system. In these cases, certain tax-relevant income variables need to be imputed for filers who have missing values.

  3. Even complete tax return data will be inadequate if there is interest in analyzing the effects of a tax reform (such as a new refundable tax credit) that will bring new filers into the tax system, or if there is a need to compute population-wide distributional statistics (such as a Gini coefficient) in countries where only a modest fraction of the population are tax filers. In such cases, survey data needs to be combined with the tax return data and, ideally, demographic variables (which are typically missing in tax return data) need to be imputed to the tax filers.

  4. Tax return data may be unavailable because the government wants to see a demonstration of what a microsimulation model can do before sharing tax return data. In this case, the input data needs to be constructed from income or consumption survey data.
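
As an illustration of the first problem in the list above, here is a minimal sketch of drawing a stratified random sample from a full file of tax returns. It is not one of the Framework's data preparation tools: the record layout, the `gross_income` field, the two income strata, the sampling rates, and the function names are all hypothetical, and the input data are synthetic. The idea is that each sampled return carries a weight equal to the number of returns in its stratum divided by the number drawn, so that weighted totals computed from the sample approximate totals in the full file.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_of, rates, seed=12345):
    """Draw a stratified random sample and attach a sampling weight.

    records    : list of dicts, one per tax return (hypothetical layout)
    stratum_of : function mapping a record to its stratum label
    rates      : dict mapping stratum label to a sampling rate in (0, 1]
    """
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    by_stratum = defaultdict(list)
    for rec in records:
        by_stratum[stratum_of(rec)].append(rec)
    sample = []
    for label, members in by_stratum.items():
        n_draw = max(1, round(rates[label] * len(members)))
        # Each sampled return represents this many returns in the full file.
        weight = len(members) / n_draw
        for rec in rng.sample(members, n_draw):
            sample.append({**rec, "weight": weight})
    return sample


def income_stratum(rec):
    # Hypothetical two-stratum split on gross income.
    return "high" if rec["gross_income"] >= 100_000 else "low"


# Synthetic data for illustration only: 1,000 made-up returns.
data_rng = random.Random(1)
returns = [{"id": i, "gross_income": data_rng.lognormvariate(10.5, 1.0)}
           for i in range(1000)]

# Keep every high-income return, but only 10 percent of the others.
sample = stratified_sample(returns, income_stratum, {"high": 1.0, "low": 0.1})
weighted_count = sum(rec["weight"] for rec in sample)
```

In this sketch the weighted count of sampled returns equals the number of returns in the full file, apart from floating-point rounding. Sampling high-income returns at a higher rate is a common design choice because those returns tend to account for a large share of tax liability, but the appropriate strata and rates depend on the country's data.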

Since 2022 a number of statistical techniques have been used to solve data preparation problems. This guide presents what has been learned from this experience. The presentation involves a discussion of a series of examples, each one of which illustrates a data preparation problem. Each discussion describes the nature of the problem, how it can be solved, and what software tools can be used to solve the problem. But typically each country presents a unique set of data problems, so the degree of automation possible in this area is limited.

Several data preparation tools have been added to the Tax Analyzer Framework to facilitate the solving of various data preparation problems. Those tools are discussed in the data preparation tools documentation.

While these tools do make it easier to solve a range of problems that come up in the preparation of model input data, the preparation work does require the development of computer programs and an understanding of the statistical techniques used by the tools. These tools are written in Python, so using them requires writing Python programs. Custom programs could also be developed in another open-source language such as R, or using proprietary software such as Stata or SAS.

In addition to these tools, all the data preparation examples assume that a working microsimulation model is available to process the prepared data and generate aggregate and distributional statistics on tax liability. Having a working model provides an essential way to assess the results of each possible approach to solving a data preparation problem.

Experience has shown that the nature of the data preparation problems in each country is likely to be unique. So there is no way to build a "data preparation framework" that can solve all problems. The best that can be done is to discuss a variety of real-world situations and explain how various methods can be used to solve the problems.

We do that by first discussing several problems that have come up when the country's government has made tax returns available. Then we discuss situations in which micro tax return data is not available, but the government has provided some aggregate tax statistics and a household income survey is available for that year. And finally, we discuss situations in which the only micro data available is a household consumption survey.

Contents: