Using Data Analysis Software to Conduct Custom Analysis
This document describes using procedural programming to conduct custom analysis. A declarative programming approach uses SQL to construct custom tables.
The GUI and CLI provide basic tables and several kinds of graphs that show the effects of a reform, so they can meet the needs of most tax reform analysis projects.
However, sometimes there is a need to produce a table or graph that is specific to the reform being analyzed. The tax model itself has no way of knowing in advance what kind of custom table or graph is needed for a particular reform, so its approach is to provide the basic information required for any custom analysis and let users perform custom analysis using the software of their choice.
The model provides this basic information by writing out a CSV-formatted dump output file containing, for each tax filing unit in the sample, the input variables and the variables calculated by the model under the specified reform. Two such output files, typically one for current-law policy and the other for some reform policy, can support any kind of custom analysis of the effects of a reform.
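As a minimal sketch of how two dump output files support a comparison, the pandas code below computes the weighted aggregate PIT liability under each policy and the change between them. The variable names weight, fullinc, and pitax are the ones used in the example program later in this document; the tiny in-memory tables are made-up stand-ins for real dump output files, and the helper name weighted_total is hypothetical.

```python
import io
import pandas as pd

# Made-up stand-ins for two dump output files (current law and a reform).
# Real files would be read from disk, e.g. pd.read_csv('tax-18-clp.csv').
CLP_CSV = """weight,fullinc,pitax
100.0,20000.0,0.0
50.0,80000.0,4000.0
10.0,400000.0,60000.0
"""
REF_CSV = """weight,fullinc,pitax
100.0,20000.0,0.0
50.0,80000.0,4400.0
10.0,400000.0,70000.0
"""


def weighted_total(df, varname):
    """Return the population-weighted sum of a dump output variable."""
    return (df[varname] * df['weight']).sum()


clp_df = pd.read_csv(io.StringIO(CLP_CSV))
ref_df = pd.read_csv(io.StringIO(REF_CSV))

clp_pitax = weighted_total(clp_df, 'pitax')
ref_pitax = weighted_total(ref_df, 'pitax')
change = ref_pitax - clp_pitax

print(f'current-law PIT total: {clp_pitax:.0f}')
print(f'reform PIT total:      {ref_pitax:.0f}')
print(f'change:                {change:.0f}')
```

Each filing unit's tax is multiplied by its sampling weight before summing, so the totals are population estimates rather than sample sums.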
The key to this approach to using the tax model is having experience with some kind of data analysis software. Python and R are among the most popular open-source data analysis environments, while Stata and SAS are among the most popular proprietary tools. If you don't have any prior experience with this kind of software and you want to do custom tax analysis, the most sensible approach is to learn how to use Python and its matplotlib graphing package, which are already installed on your computer. And there is code in the Tax-Analyzer-Framework analyzer.py and utils.py modules that can help you get started making a time-series graph or a cross-section graph.
Below we illustrate this approach using Python, but users of other data analysis tools (such as R, Stata, or SAS) will see immediately how to conduct this kind of custom tax analysis in the software of their choice.
The basic approach is to use the GUI or CLI to produce a CSV-formatted dump output file for each of two tax policies, and then to use the data analysis software to produce the custom tables and graphs needed to understand the effects of moving from the first policy to the second policy.
The dump output files can be produced with the GUI or with the CLI at the command line. A second approach is to create the dump output files by calling the CLI as a preliminary step in the custom data analysis program. The advantage of the second approach is that all the information about the two policies being compared is contained in one place. Either approach is fine; we illustrate the second, all-in-one-place approach below so that users can decide how they want to work.
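The all-in-one-place approach for two policies might be organized roughly as follows. This is a sketch under stated assumptions: the myita arguments and flags are copied from the example program later in this document, while the helper names, the ref.json file name, and the output-file naming rule (inferred only from the single example in which tax.csv, 2018, and clp.json give tax-18-clp.csv) are hypothetical and should be checked against the CLI documentation. The reform provisions are left empty because this document does not list any policy parameter names.

```python
import json
import subprocess
from pathlib import Path


def dump_command(input_csv, year, policy_json):
    """Build the CLI argument list for a minimal dump output run."""
    return ['myita', input_csv, str(year), policy_json,
            '--notable', '--noparam', '--silent']


def dump_filename(input_csv, year, policy_json):
    """Guess the dump output file name (ASSUMED naming rule)."""
    return '{}-{}-{}.csv'.format(
        Path(input_csv).stem, str(year)[-2:], Path(policy_json).stem)


def write_policy_and_dump(input_csv, year, policy_json, provisions):
    """Write the JSON policy file, call the CLI, return the dump file name."""
    with open(policy_json, 'w') as jsonfile:
        json.dump(provisions, jsonfile)
    subprocess.run(dump_command(input_csv, year, policy_json), check=True)
    return dump_filename(input_csv, year, policy_json)


# Both policies are specified in one place; an empty dictionary means
# current-law policy, and the reform provisions still need to be filled in.
POLICIES = {'clp.json': {}, 'ref.json': {}}

# Show what would be run, without actually calling the CLI.
for policy_json in POLICIES:
    print(' '.join(dump_command('tax.csv', 2018, policy_json)))
    print('  expected output:', dump_filename('tax.csv', 2018, policy_json))
```

Calling write_policy_and_dump for each entry in POLICIES would then produce the two dump output files that the rest of the analysis program reads.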
The rest of this document contains the following examples:
Visualizing the Distribution of Returns and Taxes under Current-Law Policy
Here is a simple Python program that does three things: (1) calls the CLI to generate a minimal dump output file under current-law policy, (2) tabulates the dump output and writes the tabulation results to a file, and (3) calls gnuplot to visualize the tabulation results in a graph and save the graph in an SVG-formatted file:
$ tacat bin.py
"""
This Python program calls the MYI-Tax-Analyzer CLI and the gnuplot program
to generate a graph of the distribution across full-income groups of the
number of PIT returns, the full-income total, and the tax liability total.
The distribution is characterized by the fraction of the aggregate
all-group total that is in each full-income group; that is, the
probability density function (pdf), or histogram, for each of the
three variables.
"""
import subprocess
import pandas as pd
import taf  # Tax Analyzer Framework

# specify the JSON policy file needed for the CLI call
clp = '{}'  # no reform provisions implies current-law policy
with open('clp.json', 'w') as jsonfile:
    jsonfile.write(clp)

# call CLI to generate a minimal dump output file for current-law policy
cmd = 'myita tax.csv 2018 clp.json --notable --noparam --silent'
subprocess.run(cmd.split(), check=True)

# read in the CLI-generated output file as a pandas dataframe
odf = pd.read_csv('tax-18-clp.csv')

# delete the no-longer-needed myita files
taf.delete_file('clp.json')
taf.delete_file('tax-18-clp.csv')

# compute aggregate (all-group) totals
agg_returns = odf.weight.sum()
agg_fullinc = (odf.fullinc * odf.weight).sum()
agg_pitax = (odf.pitax * odf.weight).sum()

# write the bin.dat file with information suitable for the graph
datafile = open('bin.dat', 'w')
datafile.write('# x returns fullinc pitax\n')
dataline = '{:.1f} {:.3f} {:.3f} {:.3f}\n'
bin_edges = [-9e99, -0.005, 25e3, 50e3, 75e3, 100e3, 150e3, 200e3,
             300e3, 400e3, 500e3, 700e3, 1e6, 1.5e6, 9e99]
groupby = odf.groupby(pd.cut(odf.fullinc, bin_edges))
for binrange, bindf in groupby:
    lo = binrange.left
    if lo < -1e99:
        lo = -10e6
    hi = binrange.right
    if hi > 1e99:
        hi = 10e6
    returns = bindf.weight.sum() / agg_returns
    fullinc = (bindf.fullinc * bindf.weight).sum() / agg_fullinc
    pitax = (bindf.pitax * bindf.weight).sum() / agg_pitax
    datafile.write(
        dataline.format(lo * 1e-3, returns, fullinc, pitax)
    )
    datafile.write(
        dataline.format(hi * 1e-3, returns, fullinc, pitax)
    )
datafile.close()

# specify the gnuplot program that graphs the bin.dat file
plt = """
set terminal svg fixed size 700,480
set output 'bin.svg'
set title "Distribution of PIT Returns, Full Income, and Tax Liability"
set xlabel "Full Income (thousands of ringgit)"
set xrange[-100:2000]
set xtics out nomirror
set mxtics 5
set ylabel "Fraction in each Full-Income Group"
set yrange[0:0.35]
set mytics 5
set style line 1 lw 2 lc rgb "black"
set style line 2 lw 2 lc rgb "sea-green"
set style line 3 lw 2 lc rgb "salmon"
plot 'bin.dat' using 1:2 with lines ls 1 title "returns", \
     '' using 1:3 with lines ls 2 title "income", \
     '' using 1:4 with lines ls 3 title "taxes"
"""
with open('bin.plt', 'w') as pltfile:
    pltfile.write(plt)

# call gnuplot to generate the graph and write it in the bin.svg file
cmd = 'gnuplot bin.plt'
subprocess.run(cmd.split(), check=True)

# delete the no-longer-needed bin.dat and bin.plt files
taf.delete_file('bin.dat')
taf.delete_file('bin.plt')
Instead of writing a gnuplot program and calling it from bin.py, we could have used, for example, the matplotlib Python package to generate the graph and write it to a file. Either approach will work fine; the choice depends on the personal preference and experience of the model user.
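A minimal sketch of the matplotlib alternative is shown here. It assumes matplotlib 3.4 or later (for Axes.stairs) and uses a small made-up table in place of the dump output; the function name bin_shares, the three bin edges, and the sample values are illustrative and are not the ones bin.py uses.

```python
import matplotlib
matplotlib.use('Agg')  # render to a file without needing a display
import matplotlib.pyplot as plt
import pandas as pd


def bin_shares(df, bin_edges):
    """Return each bin's share of returns, full income, and tax."""
    bins = pd.cut(df.fullinc, bin_edges)
    wdf = pd.DataFrame({
        'returns': df.weight,
        'income': df.fullinc * df.weight,
        'taxes': df.pitax * df.weight,
    })
    totals = wdf.groupby(bins, observed=False).sum()
    return totals / totals.sum()  # divide each column by its all-bin total


# made-up sample standing in for a dump output file
odf = pd.DataFrame({
    'weight': [100.0, 50.0, 10.0],
    'fullinc': [20e3, 80e3, 400e3],
    'pitax': [0.0, 4e3, 60e3],
})
edges = [0, 50e3, 100e3, 500e3]
shares = bin_shares(odf, edges)

# draw one step line per variable, with bin edges in thousands
kedges = [edge * 1e-3 for edge in edges]
fig, ax = plt.subplots(figsize=(7, 4.8))
for name, color in [('returns', 'black'),
                    ('income', 'seagreen'),
                    ('taxes', 'salmon')]:
    ax.stairs(shares[name].to_numpy(), kedges,
              color=color, linewidth=2, label=name)
ax.set_title('Distribution of PIT Returns, Full Income, and Tax Liability')
ax.set_xlabel('Full Income (thousands of ringgit)')
ax.set_ylabel('Fraction in each Full-Income Group')
ax.legend()
fig.savefig('bin.svg')
```

With this approach there is no intermediate bin.dat or bin.plt file to write and delete, because the tabulation and the graph are produced in the same Python process.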
When bin.py is executed at the command prompt (using the python bin.py command), it writes the graph in the bin.svg file.
The graph looks like this:
Visualizing the Horizontal and Vertical Equity of Current Law
Explanation and code to be added ...