The Tax-Analyzer-Framework package not only provides a framework for developing and using tax microsimulation models, but also includes several tools that make it easier to use the models. These tools are all optional, but they can increase productivity when working at the command prompt with a Framework-derived model (especially on Windows).
Here is the table of contents for this document:
- Working Efficiently at the Command Prompt
- CSV-file Tools for All Users
- Unix-like Tools Mostly for Windows Users
- JSON-file Validation Tool for All Users
- SQLite as Data Analysis Software
- AWK as Data Analysis Software
Working Efficiently at the Command Prompt
The first concern many people have about working at the command prompt is that there will be too much typing. But this concern is misplaced as the command prompt on all computers comes with a command history feature that allows quick recall, and editing if necessary, of commands that were previously entered. Be sure you understand how to use this feature of your operating system, which is accessed using the up (and down) arrow keys. Also, be sure you know how to use the tab key to have the operating system automatically complete a file name you have started typing.
In addition, you should be familiar with basic concepts: listing the
files in the current directory (using dir
on Windows or ls -l
otherwise), changing directories (using cd
), printing to stdout the
complete contents of a text file (using type
on Windows or cat
on
Mac/Linux or the tacat
command described below), redirecting stdout
to a file (using the IO-redirection character >
), passing the stdout
from one tool to the stdin of the next tool in a pipeline (using the
pipe character |
), and quickly generating lists of many file names
(using the file name wildcard characters ?
and *
). In addition,
you should know how to script several commands in a Windows batch file
or in a bash script on Mac or Linux.
CSV-file Tools for All Users
This section describes two tools included in the Framework package that make it easy to look at the contents of a row in CSV file and to merge two (or more) CSV files. These are particularly useful when working with large CSV files with hundreds of variables and thousands of rows. These operations can be done in any data analysis software, but these tools give users quick and easy options when working at the command line.
csvshow
The csvshow
tool shows a row in a CSV-formatted file that has the
specified id value by printing a line for each variable. Each line
contains three fields: variable index position (index begins at 1),
variable name, and variable value. If no id is specified, then only
the first two fields are printed. There is a --nozeros
option that
suppresses printing of lines where the variable value is zero. And
there is a --idvar IDVAR
option that allows you to use this tool on
files in which the id variable name is something other than the
default recid
.
This tool is useful for finding the index position of variables (which is needed when writing a one-line AWK program) or for simply looking (in a more readable format) at one row in a CSV-formatted file.
csvjoin
The csvjoin
tool joins (in SQL parlance) variables row-by-row from
different CSV-formatted files in a new CSV-formatted output file. You
must specify three types of information: CSV-file names using the
required --csv
argument, the variables using the required --var
argument, and the name of the output file using the required --out
argument. The csvjoin
tool checks to make sure that the join is
sensible, in that each --csv
file has the same id values, and has
the same values for a list of other variables that should be the same
in order to join the files (such as year and weight).
When you specify just one --csv
file, the csvjoin
tool does an
simple extract including just the specified --var
variables in the
specified --out
file.
Merging or joining two or more CSV-formatted files can be done in your
favorite data analysis software, but you would be more productive
using the csvjoin
tool and then importing the resulting --out
file
into your data analysis software. This eliminates having to write
custom join code every time you want to compare taxes under a reform
and current-law policy or under two or more different reforms.
Here is what you get after entering csvjoin --help
on the command line:
usage: csvjoin [--help] --csv CSV [CSV ...] [--idvar IDVAR] --var VAR [VAR ...] --out OUT [--svar SV [SV ...]] Write CSV-formatted data to file named OUT that joins data specified by --var variables using the IDVAR variable, which defaults to recid. In the OUT file there will be only one IDVAR field but there will be as many VAR variables as there are --csv files being joined for only the variables that have different values in the --csv files. So, in the OUT file, for each VAR variable that differs, the VAR variable from the 1st CSV is called VAR_1, the VAR variable from the 2nd CSV is called VAR_2, and so forth. Each variable that is the same across all the --csv files, will appear in the OUT file just once without any suffix. optional arguments: -h, --help show this help message and exit --csv CSV [CSV ...] REQUIRED names of CSV files that are to be joined. Note that specifying just one file implies a simple extraction of the VAR variables without any suffixes. --idvar IDVAR OPTIONAL name of id variable used to conduct the join. No --idvar option implies recid is used. --var VAR [VAR ...] REQUIRED names of variables to include in the OUT file. Note that variables will appear in the OUT file in the order they are listed on the command line (except that IDVAR is always first). --out OUT REQUIRED name of the joined CSV-formatted output file, which will be overwritten if it already exists. --svar SV [SV ...] OPTIONAL names of variables (in addition to IDVAR) that must be the same across all the CSV files (if they appear in the VAR list). No --svar option implies an empty list.
Unix-like Tools Mostly for Windows Users
The tools discussed in this section are included in the Framework package mostly for the benefit of Windows users. Other computers come with full-featured versions of these tools built into their operating systems (except for the graphical diff tool).
tals
The tals
tool, which is a limited-capability version of the
Unix ls -l
tool, prints to stdout information about each of
the files and subdirectories in the current working directory.
Enter tals --help
for more information.
tacat
The tacat
tool, which is a limited-capability version of the Unix
cat
tool, prints to stdout all lines in a text file. Like
the Unix tool, more than one file name can be specified.
Enter tacat --help
for more information.
tahead
The tahead
tool, which is a limited-capability version of the Unix
head
tool, prints to stdout the first N lines of a text file.
The default value of N is ten. Like the Unix tool, more than one file
name can be specified or wildcard characters can be used. Enter
tahead --help
for more information.
tatail
The tatail
tool, which is a limited-capability version of the Unix
tail
tool, prints to stdout the last N lines of a text file.
The default value of N is ten. Like the Unix tool, more than one file
name can be specified or wildcard characters can be used. Enter
tatail --help
for more information.
tagrep
The tagrep
tool, which is a limited-capability version of the Unix
grep -F
or fgrep
tool, prints to stdout the lines of a text file
that include one or more of the specified search strings (-e STR
).
Like the Unix tool, more than one file name can be specified or
wildcard characters can be used. There are options to turn off case
sensitivity in the matching (-i
), to invert the matching logic
(-v
), and to print a specified number of context lines before each
matching line (-B
). Enter tagrep --help
for more information.
If you are working on a Windows computer, consider downloading a
full-featured version of grep
from the ezwinports website.
tadiff
The tadiff
tool, which is a limited-capability version of the Unix
diff
tool, prints to stdout the lines that differ between two text
files. Enter tadiff --help
for more information. .
tagdiff for all users
The tagdiff
tool, which is a limited-capability version of the many
graphical diff tools that can be downloaded for any computer, shows
differences (as produced by the tadiff
command) side-by-side in your
default web browser with full context. Enter tagdiff --help
for
more information.
All users, not just those on Windows, should give serious consideration to downloading a full-featured graphical diff tool.
taunzip
The taunzip
tool, which is a limited-capability version of the
standard unzip utility, unzips a .zip file without decompressing any
.gz files included in the .zip file. This feature is allows Windows
users to avoid the unwanted behavior of drag-and-drop not only unzipping
the .zip file but also decompressing any .gz files included in the .zip
file. Enter taunzip --help
for more information.
JSON-file Validation Tool for All Users
The jsoncheck
tool reports JSON language syntax errors in the
specified file. This tool is useful when using the JSON CLI, which
requires model users to write JSON files in a text editor. Enter
jsoncheck --help
for more information.
SQLite as Data Analysis Software
Some model users may be familiar with Structured Query Language (SQL),
which is the declarative language used by relational database systems
such as Oracle, SQL Server, and DB2. If so, then writing SQL SELECT
statements that are run against the database version of the dump
output generated by Framework-derived tax models can be an efficient
way to conduct custom tabulation of model output. Each
Framework-derived tax model comes with the sqlite3
command-line
shell program that can process SQL queries against the dump output
when it is written as an SQLite3 database file using either a model's
GUI or CLI. The sqlite3
program can be used interactively or can be
used to execute SQL commands contained in a text file. This ability
to execute SQL commands in a file means that a complex sequence of SQL
commands can be applied to the output of each model run without any
additional work.
AWK as Data Analysis Software
Model users may be familiar with some data analysis software such as R, Python, Stata, SAS, etc., all of which have a procedural programming approach. Any of these can import the CSV-formatted output files generated by Framework-derived tax models. And familiarity is key to working efficiently.
But there are situations when the non-CSV table output needs to be tabulated or situations where using one-line, throw-away tabulation programs executed at the command prompt can speed-up data analysis. An example of repetitive use of a simple multiple-line AWK program to tabulate tax model output can be found in the documentation of the US wealth tax model, which is built using the Tax-Analyzer-Framework.
If you want to explore this approach, you need to learn how to write
simple AWK programs. AWK comes built-in on Linux and Mac computers;
Windows users can download the GNU version of awk (called gawk
) from
the ezwinports website. A comprehensive manual for AWK has
been written by the head gawk
maintainer and is entitled GAWK:
Effective AWK Programming. It is available in both HTML
format and PDF format.