Tax-Analyzer-Framework: Model-Use Tools

The Tax-Analyzer-Framework package not only provides a framework for developing and using tax microsimulation models, but also includes several tools that make it easier to use the models. These tools are all optional, but they can increase productivity when working at the command prompt with a Framework-derived model (especially on Windows).

Here is the table of contents for this document:

Working Efficiently at the Command Prompt
CSV-file Tools for All Users
- csvshow
- csvjoin
Unix-like Tools Mostly for Windows Users
- tals
- tacat
- tahead
- tatail
- tagrep
- tadiff
- tagdiff for all users
- taunzip
JSON-file Validation Tool for All Users
SQLite as Data Analysis Software
AWK as Data Analysis Software

Working Efficiently at the Command Prompt

The first concern many people have about working at the command prompt is that there will be too much typing. But this concern is misplaced as the command prompt on all computers comes with a command history feature that allows quick recall, and editing if necessary, of commands that were previously entered. Be sure you understand how to use this feature of your operating system, which is accessed using the up (and down) arrow keys. Also, be sure you know how to use the tab key to have the operating system automatically complete a file name you have started typing.

In addition, you should be familiar with basic concepts: listing the files in the current directory (using dir on Windows or ls -l otherwise), changing directories (using cd), printing to stdout the complete contents of a text file (using type on Windows or cat on Mac/Linux or the tacat command described below), redirecting stdout to a file (using the IO-redirection character >), passing the stdout from one tool to the stdin of the next tool in a pipeline (using the pipe character |), and quickly generating lists of many file names (using the file name wildcard characters ? and *). In addition, you should know how to script several commands in a Windows batch file or in a bash script on Mac or Linux.

CSV-file Tools for All Users

This section describes two tools included in the Framework package that make it easy to look at the contents of a row in CSV file and to merge two (or more) CSV files. These are particularly useful when working with large CSV files with hundreds of variables and thousands of rows. These operations can be done in any data analysis software, but these tools give users quick and easy options when working at the command line.

csvshow

The csvshow tool shows a row in a CSV-formatted file that has the specified id value by printing a line for each variable. Each line contains three fields: variable index position (index begins at 1), variable name, and variable value. If no id is specified, then only the first two fields are printed. There is a --nozeros option that suppresses printing of lines where the variable value is zero. And there is a --idvar IDVAR option that allows you to use this tool on files in which the id variable name is something other than the default recid.

This tool is useful for finding the index position of variables (which is needed when writing a one-line AWK program) or for simply looking (in a more readable format) at one row in a CSV-formatted file.

csvjoin

The csvjoin tool joins (in SQL parlance) variables row-by-row from different CSV-formatted files in a new CSV-formatted output file. You must specify three types of information: CSV-file names using the required --csv argument, the variables using the required --var argument, and the name of the output file using the required --out argument. The csvjoin tool checks to make sure that the join is sensible, in that each --csv file has the same id values, and has the same values for a list of other variables that should be the same in order to join the files (such as year and weight).

When you specify just one --csv file, the csvjoin tool does an simple extract including just the specified --var variables in the specified --out file.

Merging or joining two or more CSV-formatted files can be done in your favorite data analysis software, but you would be more productive using the csvjoin tool and then importing the resulting --out file into your data analysis software. This eliminates having to write custom join code every time you want to compare taxes under a reform and current-law policy or under two or more different reforms.

Here is what you get after entering csvjoin --help on the command line:

usage: csvjoin [--help] --csv CSV [CSV ...] [--idvar IDVAR]
               --var VAR [VAR ...] --out OUT [--svar SV [SV ...]]

Write CSV-formatted data to file named OUT that joins data specified by --var
variables using the IDVAR variable, which defaults to recid. In the OUT file
there will be only one IDVAR field but there will be as many VAR variables as
there are --csv files being joined for only the variables that have different
values in the --csv files. So, in the OUT file, for each VAR variable that
differs, the VAR variable from the 1st CSV is called VAR_1, the VAR variable
from the 2nd CSV is called VAR_2, and so forth. Each variable that is the same
across all the --csv files, will appear in the OUT file just once without any
suffix.

optional arguments:
  -h, --help           show this help message and exit
  --csv CSV [CSV ...]  REQUIRED names of CSV files that are to be joined. Note
                       that specifying just one file implies a simple
                       extraction of the VAR variables without any suffixes.
  --idvar IDVAR        OPTIONAL name of id variable used to conduct the join.
                       No --idvar option implies recid is used.
  --var VAR [VAR ...]  REQUIRED names of variables to include in the OUT file.
                       Note that variables will appear in the OUT file in the
                       order they are listed on the command line (except that
                       IDVAR is always first).
  --out OUT            REQUIRED name of the joined CSV-formatted output file,
                       which will be overwritten if it already exists.
  --svar SV [SV ...]   OPTIONAL names of variables (in addition to IDVAR) that
                       must be the same across all the CSV files (if they
                       appear in the VAR list). No --svar option implies an
                       empty list.

Unix-like Tools Mostly for Windows Users

The tools discussed in this section are included in the Framework package mostly for the benefit of Windows users. Other computers come with full-featured versions of these tools built into their operating systems (except for the graphical diff tool).

tals

The tals tool, which is a limited-capability version of the Unix ls -l tool, prints to stdout information about each of the files and subdirectories in the current working directory. Enter tals --help for more information.

tacat

The tacat tool, which is a limited-capability version of the Unix cat tool, prints to stdout all lines in a text file. Like the Unix tool, more than one file name can be specified. Enter tacat --help for more information.

tahead

The tahead tool, which is a limited-capability version of the Unix head tool, prints to stdout the first N lines of a text file. The default value of N is ten. Like the Unix tool, more than one file name can be specified or wildcard characters can be used. Enter tahead --help for more information.

tatail

The tatail tool, which is a limited-capability version of the Unix tail tool, prints to stdout the last N lines of a text file. The default value of N is ten. Like the Unix tool, more than one file name can be specified or wildcard characters can be used. Enter tatail --help for more information.

tagrep

The tagrep tool, which is a limited-capability version of the Unix grep -F or fgrep tool, prints to stdout the lines of a text file that include one or more of the specified search strings (-e STR). Like the Unix tool, more than one file name can be specified or wildcard characters can be used. There are options to turn off case sensitivity in the matching (-i), to invert the matching logic (-v), and to print a specified number of context lines before each matching line (-B). Enter tagrep --help for more information.

If you are working on a Windows computer, consider downloading a full-featured version of grep from the ezwinports website.

tadiff

The tadiff tool, which is a limited-capability version of the Unix diff tool, prints to stdout the lines that differ between two text files. Enter tadiff --help for more information. .

tagdiff for all users

The tagdiff tool, which is a limited-capability version of the many graphical diff tools that can be downloaded for any computer, shows differences (as produced by the tadiff command) side-by-side in your default web browser with full context. Enter tagdiff --help for more information.

All users, not just those on Windows, should give serious consideration to downloading a full-featured graphical diff tool.

taunzip

The taunzip tool, which is a limited-capability version of the standard unzip utility, unzips a .zip file without decompressing any .gz files included in the .zip file. This feature is allows Windows users to avoid the unwanted behavior of drag-and-drop not only unzipping the .zip file but also decompressing any .gz files included in the .zip file. Enter taunzip --help for more information.

JSON-file Validation Tool for All Users

The jsoncheck tool reports JSON language syntax errors in the specified file. This tool is useful when using the JSON CLI, which requires model users to write JSON files in a text editor. Enter jsoncheck --help for more information.

SQLite as Data Analysis Software

Some model users may be familiar with Structured Query Language (SQL), which is the declarative language used by relational database systems such as Oracle, SQL Server, and DB2. If so, then writing SQL SELECT statements that are run against the database version of the dump output generated by Framework-derived tax models can be an efficient way to conduct custom tabulation of model output. Each Framework-derived tax model comes with the sqlite3 command-line shell program that can process SQL queries against the dump output when it is written as an SQLite3 database file using either a model's GUI or CLI. The sqlite3 program can be used interactively or can be used to execute SQL commands contained in a text file. This ability to execute SQL commands in a file means that a complex sequence of SQL commands can be applied to the output of each model run without any additional work.

AWK as Data Analysis Software

Model users may be familiar with some data analysis software such as R, Python, Stata, SAS, etc., all of which have a procedural programming approach. Any of these can import the CSV-formatted output files generated by Framework-derived tax models. And familiarity is key to working efficiently.

But there are situations when the non-CSV table output needs to be tabulated or situations where using one-line, throw-away tabulation programs executed at the command prompt can speed-up data analysis. An example of repetitive use of a simple multiple-line AWK program to tabulate tax model output can be found in the documentation of the US wealth tax model, which is built using the Tax-Analyzer-Framework.

If you want to explore this approach, you need to learn how to write simple AWK programs. AWK comes built-in on Linux and Mac computers; Windows users can download the GNU version of awk (called gawk) from the ezwinports website. A comprehensive manual for AWK has been written by the head gawk maintainer and is entitled GAWK: Effective AWK Programming. It is available in both HTML format and PDF format.