Tax-Analyzer-Framework

Data Preparation Guide

Income Survey Data Examples

If tax return data are not available, the next best alternative is to use income survey data to prepare the input needed by a personal income tax (PIT) microsimulation model. Depending on the nature of the PIT and the scope of the income survey data, a wide range of problems is likely to arise in this kind of data preparation. All of these problems boil down to some kind of missing data problem. The variable that indicates whether or not a person is a tax filer is always missing from income survey data. Taxable kinds of non-employment income are often missing as well. And income survey data usually do not include the information on deductible expenses that is required to compute taxable income. All these missing data problems are compounded by the fact that micro tax return data are unavailable, and hence the MICE algorithm cannot be used to impute the missing variables.

Experience has shown that these missing data problems cannot be solved in a plausible way without access to aggregate statistics on the PIT for the year corresponding to the survey data. Such aggregate statistics include basic ones such as the total number of PIT returns and the total PIT liability. Ideally, return counts and liability totals would also be available for several filer groups, especially high-income filers. The aggregate statistics on high-income filers are particularly helpful because it is well known that high incomes are underrepresented in most income surveys. On the dangers of using methods that rely on assumptions about the shape of the income distribution without relying on additional aggregate statistics, read this harrowing account of using such a method with real-world Malaysia data.
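One simple way to see how much under-representation is present is to compare the weighted survey count of high-income units with the aggregate count of high-income returns. The sketch below is illustrative only: the records, the threshold, the aggregate count, and the function name `coverage_ratio` are all hypothetical and are not taken from any real survey or tax statistics.

```python
def coverage_ratio(records, threshold, aggregate_count):
    """Weighted survey count of incomes at or above threshold,
    divided by the aggregate count of returns in that group."""
    survey_count = sum(weight for weight, income in records if income >= threshold)
    return survey_count / aggregate_count


# Hypothetical survey records: (sampling_weight, income)
survey = [(200.0, 5_000.0), (150.0, 40_000.0), (10.0, 250_000.0)]

# Hypothetical aggregate statistic for the high-income filer group
HIGH_INCOME_THRESHOLD = 200_000.0
AGGREGATE_HIGH_INCOME_RETURNS = 40.0

ratio = coverage_ratio(survey, HIGH_INCOME_THRESHOLD, AGGREGATE_HIGH_INCOME_RETURNS)
print(f"survey covers {ratio:.0%} of aggregate high-income returns")
```

A ratio well below one suggests that the survey is missing high-income units relative to the tax statistics, which is the situation that aggregate statistics on high-income filers help to diagnose.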

The examples discussed in this section of the guide are presented in order of increasing availability of additional aggregate information.

The first example assumes that only the bare minimum of aggregate tax statistics is available: a population total, and tax revenue totals for three different types of income. With such limited aggregate information, only the simplest adjustments to sampling weights and reported wage income can be made, along with a simple assumption about tax filing behavior. This data preparation situation was encountered when developing a PIT microsimulation model for Ethiopia.
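To make the idea of simple adjustments concrete, here is a minimal sketch under stated assumptions. It is not necessarily the procedure used for Ethiopia. It scales all sampling weights by one factor to hit a population total, then searches for a single factor applied to reported wages so that simulated wage tax matches a revenue total, and finally assumes that anyone with positive simulated tax is a filer. The records, targets, tax schedule, and function names are all hypothetical, and only one of the three income types is shown.

```python
# Hypothetical survey records: (sampling_weight, reported_wage)
survey = [(100.0, 0.0), (150.0, 12_000.0), (120.0, 30_000.0), (30.0, 90_000.0)]

# Hypothetical aggregate targets (not real statistics for any country)
POPULATION_TOTAL = 500.0
WAGE_TAX_REVENUE_TOTAL = 1_200_000.0


def toy_wage_tax(wage):
    """Hypothetical schedule: flat 20% on wages above a 10,000 exemption."""
    return 0.20 * max(0.0, wage - 10_000.0)


def scale_weights(records, population_total):
    """Scale every weight by one factor so the weights sum to the target."""
    factor = population_total / sum(weight for weight, _ in records)
    return [(weight * factor, wage) for weight, wage in records]


def weighted_tax(records, wage_factor):
    """Weighted total of simulated tax after scaling wages by wage_factor."""
    return sum(weight * toy_wage_tax(wage * wage_factor) for weight, wage in records)


def find_wage_factor(records, revenue_total, lo=0.0, hi=10.0, tol=1e-9):
    """Bisect for the uniform wage factor that reproduces revenue_total.

    Works because weighted tax never decreases as the factor rises;
    assumes the target lies between the tax at lo and the tax at hi.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if weighted_tax(records, mid) < revenue_total:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


scaled = scale_weights(survey, POPULATION_TOTAL)
wage_factor = find_wage_factor(scaled, WAGE_TAX_REVENUE_TOTAL)

# Simple filing assumption: a person files if simulated tax is positive.
prepared = [
    (weight, wage * wage_factor, toy_wage_tax(wage * wage_factor) > 0.0)
    for weight, wage in scaled
]
print(f"wage factor: {wage_factor:.4f}")
```

A single uniform factor is about the most that one revenue total can pin down. It shifts every reported wage by the same proportion and cannot correct under-representation that is concentrated at the top of the distribution.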

The second set of examples assumes that more aggregate tax statistics are available, particularly regarding the number of high-income tax returns and the tax revenue generated from those high-income returns. There is a considerable literature on how to correct for the under-representation of rich families and the under-reporting of high incomes in household surveys, of which Lustig provides an excellent review. These problems may be acute when using an income survey sample in place of a tax return sample if there are fewer or smaller high incomes in the survey data than in the return data. Our experience indicates that which methods are adequate to solve this kind of problem depends on the nature of the data. So this second set of examples discusses a variety of methods and our experiences using them with both synthetic data and real-world data for Malaysia.