MYI-Tax-Analyzer: Using Data Analysis Software

Using Data Analysis Software to Conduct Custom Analysis

This document describes how to conduct custom analysis using procedural programming. The alternative, declarative programming approach uses SQL to construct custom tables.

The GUI and CLI provide basic tables and several kinds of graphs that show the effects of a reform, so they can meet the needs of most tax reform analysis projects.

However, sometimes there is a need to produce a table or graph that is specific to the reform being analyzed. The tax model itself has no way of knowing in advance what kind of custom table or graph is needed for a particular reform, so its approach is to provide the basic information required for any custom analysis and let users perform custom analysis using the software of their choice.

The model provides this basic information by writing out a CSV-formatted dump output file containing, for each tax filing unit in the sample, the input variables and the variables calculated by the model under the specified reform. Two such output files, typically one for current-law policy and the other for some reform policy, can support any kind of custom analysis of the effects of a reform.
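As a minimal sketch of what two such dump files make possible, the following Python code computes the weighted aggregate tax liability under each policy and the change between them. The small in-memory tables are hypothetical stand-ins for the two dump files, which would normally be read from disk with pd.read_csv. The column names weight, fullinc, and pitax are the ones used in the example program later in this document, and all the numbers are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two dump output files; real dump files
# would be read from disk with pd.read_csv instead.
CLP_CSV = """weight,fullinc,pitax
100,20000,0
50,80000,4000
10,400000,60000
"""
REF_CSV = """weight,fullinc,pitax
100,20000,0
50,80000,4500
10,400000,70000
"""
clp = pd.read_csv(io.StringIO(CLP_CSV))
ref = pd.read_csv(io.StringIO(REF_CSV))

# weighted aggregate tax liability under each policy
clp_total = (clp.pitax * clp.weight).sum()
ref_total = (ref.pitax * ref.weight).sum()
change = ref_total - clp_total

# weighted number of filing units whose liability rises under the reform
# (assumes both files list the same filing units in the same order)
num_losers = clp.weight[ref.pitax > clp.pitax].sum()

print(f'current-law total: {clp_total:,.0f}')
print(f'reform total:      {ref_total:,.0f}')
print(f'change:            {change:,.0f}')
print(f'units paying more: {num_losers:,.0f}')
```

The same pattern extends to any custom tabulation: compute a statistic from each dump file, then compare the two results.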

This approach depends on having experience with some kind of data analysis software. Python and R are among the most popular open-source data analysis environments, while Stata and SAS are among the most popular proprietary tools. If you don't have any prior experience with this kind of software and you want to do custom tax analysis, the most sensible approach is to learn how to use Python and its matplotlib graphing package, which are already installed on your computer. The Tax-Analyzer-Framework analyzer.py and utils.py modules contain code that can help you get started making a time-series graph or a cross-section graph.

Below we illustrate this approach using Python, but users of other data analysis tools (e.g., R, Stata, or SAS) will see immediately how to conduct this kind of custom tax analysis in the software of their choice.

The basic approach is to use the GUI or CLI to produce a CSV-formatted dump output file for each of two tax policies, and then to use the data analysis software to produce the custom tables and graphs needed to understand the effects of moving from the first policy to the second policy.

The dump output files can be produced in one of two ways. The first approach is to run the GUI, or the CLI at the command line, before starting the custom analysis. The second approach is to create the dump output files by calling the CLI as a preliminary step in the custom data analysis program. The advantage of the second approach is that all the information about the nature of the two policies being compared is contained in one place. Either approach is fine; we illustrate the second, all-in-one-place approach below so that users can decide how they want to work.
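The second approach can be sketched for two policies as follows. This is an illustration under stated assumptions, not part of the model: the helper functions write_policy_file, dump_command, and make_dumps are hypothetical names, the ref.json file name is made up, and the reform provisions are left as an empty placeholder to be filled in by the user. The CLI arguments are the ones used in the example program later in this document.

```python
import json
import subprocess


def write_policy_file(filename, provisions):
    """Write a JSON policy file; an empty dict implies current-law policy."""
    with open(filename, 'w') as jsonfile:
        json.dump(provisions, jsonfile)


def dump_command(policy_filename, year=2018, input_filename='tax.csv'):
    """Return the CLI argument list that generates a minimal dump file."""
    return ['myita', input_filename, str(year), policy_filename,
            '--notable', '--noparam', '--silent']


def make_dumps(policies):
    """Write each policy file and call the CLI once per policy."""
    for filename, provisions in policies.items():
        write_policy_file(filename, provisions)
        subprocess.run(dump_command(filename), check=True)


# both policies are specified in one place
POLICIES = {
    'clp.json': {},  # no reform provisions implies current-law policy
    'ref.json': {},  # placeholder: put the reform provisions here
}
# make_dumps(POLICIES)  # uncomment where myita and tax.csv are available
```

Keeping the policy specifications in a single dictionary means the analysis program documents exactly which two policies it compares.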

The rest of this document contains the following examples:

Visualizing Distribution of Returns and Taxes
Visualizing Horizontal and Vertical Equity

Visualizing the Distribution of Returns and Taxes under Current-Law Policy

Here is a simple Python program that does three things: (1) calls the CLI to generate a minimal dump output file under current-law policy, (2) tabulates the dump output and writes the tabulation results to a file, and (3) calls gnuplot to visualize the tabulation results in a graph and save the graph in an SVG-formatted file:

$ tacat bin.py
"""
This Python program calls the MYI-Tax-Analyzer CLI and the gnuplot program
to generate a graph of the distribution across full-income groups of the
number of PIT returns, the full-income total, and the tax liability total.
The distribution is characterized by the fraction of the aggregate all-group
total that is in each full-income group; that is, the probability density
function (pdf), or histogram, for each of the three variables.
"""

import subprocess
import pandas as pd
import taf  # Tax Analyzer Framework


# specify the JSON policy file needed for the CLI call
clp = '{}'  # no reform provisions implies current-law policy
with open('clp.json', 'w') as jsonfile:
    jsonfile.write(clp)

# call CLI to generate a minimal dump output file for current-law policy
cmd = 'myita tax.csv 2018 clp.json --notable --noparam --silent'
subprocess.run(cmd.split(), check=True)

# read in the CLI-generated output file as a pandas dataframe
odf = pd.read_csv('tax-18-clp.csv')

# delete the no-longer-needed myita files
taf.delete_file('clp.json')
taf.delete_file('tax-18-clp.csv')

# compute aggregate (all-group) totals
agg_returns = odf.weight.sum()
agg_fullinc = (odf.fullinc * odf.weight).sum()
agg_pitax = (odf.pitax * odf.weight).sum()

# write the bin.dat file with information suitable for the graph;
# each group is written twice (at its low and high edge) so that
# gnuplot draws a flat line segment across the group
dataline = '{:.1f} {:.3f} {:.3f} {:.3f}\n'
bin_edges = [-9e99, -0.005, 25e3, 50e3, 75e3, 100e3, 150e3,
             200e3, 300e3, 400e3, 500e3, 700e3, 1e6, 1.5e6, 9e99]
# observed=False states the long-standing pandas default explicitly
groupby = odf.groupby(pd.cut(odf.fullinc, bin_edges), observed=False)
with open('bin.dat', 'w') as datafile:
    datafile.write('# x returns fullinc pitax\n')
    for binrange, bindf in groupby:
        # replace the open-ended outer edges with finite plotting values
        lo = binrange.left
        if lo < -1e99:
            lo = -10e6
        hi = binrange.right
        if hi > 1e99:
            hi = 10e6
        returns = bindf.weight.sum() / agg_returns
        fullinc = (bindf.fullinc * bindf.weight).sum() / agg_fullinc
        pitax = (bindf.pitax * bindf.weight).sum() / agg_pitax
        datafile.write(
            dataline.format(lo * 1e-3, returns, fullinc, pitax)
        )
        datafile.write(
            dataline.format(hi * 1e-3, returns, fullinc, pitax)
        )

# specify the gnuplot program that graphs the bin.dat file
plt = """
set terminal svg fixed size 700,480
set output 'bin.svg'
set title "Distribution of PIT Returns, Full Income, and Tax Liability" 
set xlabel "Full Income (thousands of ringgit)"
set xrange[-100:2000]
set xtics out nomirror
set mxtics 5
set ylabel "Fraction in each Full-Income Group"
set yrange[0:0.35]
set mytics 5
set style line 1 lw 2 lc rgb "black"
set style line 2 lw 2 lc rgb "sea-green"
set style line 3 lw 2 lc rgb "salmon"
plot 'bin.dat' using 1:2 with lines ls 1 title "returns", \
            '' using 1:3 with lines ls 2 title "income", \
            '' using 1:4 with lines ls 3 title "taxes"
"""
with open('bin.plt', 'w') as pltfile:
    pltfile.write(plt)

# call gnuplot to generate the graph and write it in the bin.svg file
cmd = 'gnuplot bin.plt'
subprocess.run(cmd.split(), check=True)

# delete the no-longer-needed bin.dat and bin.plt files
taf.delete_file('bin.dat')
taf.delete_file('bin.plt')

Instead of writing a gnuplot program and calling it from bin.py, we could have used, for example, the matplotlib Python package to generate the graph and write it to a file. Either approach will work fine; the choice depends on the personal preference and experience of the model user.
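As a rough sketch of the matplotlib alternative, the code below draws the same kind of three-line graph from rows laid out like those in bin.dat. The data rows are hypothetical, the styling only approximates the gnuplot settings, and the graph is written to an in-memory buffer so the sketch leaves no files behind; passing a file name such as 'bin.svg' to savefig would write the graph to disk instead.

```python
import io
import matplotlib
matplotlib.use('Agg')  # render without needing a display
import matplotlib.pyplot as plt

# Hypothetical rows in the bin.dat layout:
# x (thousands), returns share, income share, tax share
rows = [
    (0, 0.50, 0.20, 0.05), (50, 0.50, 0.20, 0.05),
    (50, 0.30, 0.30, 0.25), (100, 0.30, 0.30, 0.25),
    (100, 0.20, 0.50, 0.70), (200, 0.20, 0.50, 0.70),
]
x = [row[0] for row in rows]

fig, ax = plt.subplots(figsize=(7, 4.8))
ax.plot(x, [row[1] for row in rows], color='black', lw=2, label='returns')
ax.plot(x, [row[2] for row in rows], color='seagreen', lw=2, label='income')
ax.plot(x, [row[3] for row in rows], color='salmon', lw=2, label='taxes')
ax.set_title('Distribution of PIT Returns, Full Income, and Tax Liability')
ax.set_xlabel('Full Income (thousands of ringgit)')
ax.set_ylabel('Fraction in each Full-Income Group')
ax.legend()

# write the graph in SVG format to an in-memory text buffer
svg = io.StringIO()
fig.savefig(svg, format='svg')
plt.close(fig)
```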

When bin.py is executed at the command prompt (using the python bin.py command), it writes the graph in the bin.svg file. The graph looks like this:

[Figure: distribution-graph]
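To make the tabulation behind the graph concrete, here is a minimal sketch that computes group shares for a tiny hypothetical sample. It is a vectorized variant of the loop in bin.py, not the program's own code: the four filing units and the three income groups are made up, and only the column names weight, fullinc, and pitax come from bin.py.

```python
import pandas as pd

# hypothetical four-unit sample using the column names from bin.py
odf = pd.DataFrame({
    'weight': [100.0, 60.0, 30.0, 10.0],
    'fullinc': [10e3, 40e3, 90e3, 300e3],
    'pitax': [0.0, 1e3, 8e3, 60e3],
})
bin_edges = [-9e99, 25e3, 75e3, 9e99]  # three illustrative income groups

# assign each filing unit to a full-income group
groups = pd.cut(odf.fullinc, bin_edges)

# weighted amounts for each filing unit
weighted = pd.DataFrame({
    'returns': odf.weight,
    'pitax': odf.pitax * odf.weight,
})

# sum within each group, then divide by the all-group totals
group_totals = weighted.groupby(groups, observed=False).sum()
shares = group_totals / group_totals.sum()
print(shares)
```

Each column of shares sums to one, which is why the three lines in the graph can share a single vertical axis.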

Visualizing the Horizontal and Vertical Equity of Current Law

Explanation and code to be added ...

[Figure: hvequity-graph]