Words Vote.
"Words Vote." employs Sunlight Labs' "capitolwords" python library to obtain lists of words used by individual congresspeople during an interval of time (usually before a major vote). Also grabbing data for the selected and associated single roll call vote from Govtrack's XML records, it uses Bayesian statistics to determine which words are most informative in predicting a congressperson's vote.
Purpose
We live in an age closely attuned to rhetoric. Political agents, when speaking on the floors of Congress, carefully deploy language to have the maximum impact in a media age. Although we may feel that we know which words and types of language are associated with certain positions, statistical analysis can surface associations we had not considered, and these reveal much about our political discourse.
Although linguistics researchers already have access to established corpora, the rise of syndicated web technologies is permitting machine-friendly corpora to emerge in real time and to be used by non-professionals for political or casual research. Combined with other syndicated categorization systems (here, the vote records maintained by Govtrack), machines can help us identify many interesting and hidden associations in political speech and elsewhere.
In this project, we deploy two such technologies. First, Sunlight Labs has developed an API and associated Python library that identifies the most commonly used words in the Congressional Record by speaker. Second, Govtrack maintains machine-readable records of Congressional voting that can be scraped and processed.
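To give a sense of the second step, here is a minimal sketch of reading a roll call record with xml.etree. The element and attribute names (`roll`, `voter`, `id`, `vote`) and the "+" / "-" vote codes are assumptions made for illustration; check them against the actual Govtrack files before relying on them. The member ids are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record; the real Govtrack schema may differ.
SAMPLE_ROLL_XML = """
<roll where="house" datetime="2002-10-10">
  <voter id="A000001" vote="+" />
  <voter id="B000002" vote="-" />
  <voter id="C000003" vote="+" />
  <voter id="D000004" vote="0" />
</roll>
"""


def votes_by_member(xml_text):
    """Map each member id to "Yea" or "Nay", skipping any other vote code."""
    meaning = {"+": "Yea", "-": "Nay"}  # assumed encoding of the vote attribute
    root = ET.fromstring(xml_text.strip())
    votes = {}
    for voter in root.findall("voter"):
        code = voter.get("vote")
        if code in meaning:
            votes[voter.get("id")] = meaning[code]
    return votes


votes = votes_by_member(SAMPLE_ROLL_XML)
print(sorted(votes.items()))
```

Members who did not cast a Yea or Nay are dropped in this sketch, since they give the classifier no label to learn from.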
Dependencies
The program is a single Python script (wordsvote.py) plus a pickled file that includes a dictionary. Run it with "python wordsvote.py".
- Python 2.5 or later with the standard Python libraries, plus pygtk and xml.etree (which should be included in most Python installations)
- nltk (the natural language toolkit) http://www.nltk.org
- python-capitolwords http://github.com/sunlightlabs/python-capitolwords/tree/master
- simplejson (required for python-capitolwords) http://pypi.python.org/pypi/simplejson/
To-Do
- Currently the software relies entirely on Internet calls for data and pure Python for processing. This is very slow, but acceptable for an experimental tool where a user wants to scan a particular debate to get a sense of its key discursive elements; great efficiency and polish are not the priority here. Speeding up the machine learning matters, but the data calls are currently the most time-consuming part.
- People should not need to download all of nltk. The statistics code should move away from it.
- Add additional statistical measures of discourse and speech elements. This is difficult with only words, but there are a few possibilities. Because the Sunlight Labs feed already cuts some words, some of these measures would not be completely statistically valid.
- Add search for bills
- GTK does not operate (well?) on Windows. Fix this and create a Windows installer.
- Include automatic references to a number of key votes.
- Store the most recent data download so that it doesn't have to be reloaded if a single parameter is changed (currently the function that processes the word feeds automatically cuts the speakers with an insufficient number of words).
- Sunlight Labs' API goes back to 2000 or so. Is it possible to obtain corpora indexed by speaker farther back?
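The item above about storing the most recent download could be approached along these lines: cache the raw word feed on disk and apply the minimum-word filter afterwards, so changing that parameter does not trigger a new download. This is only a sketch; `load_or_fetch`, `filter_speakers`, `fake_fetch`, and the cache file name are hypothetical and not part of wordsvote.py.

```python
import os
import pickle
import tempfile


def load_or_fetch(cache_path, fetch):
    """Return cached data from `cache_path`, calling `fetch()` only on a miss."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    data = fetch()
    with open(cache_path, "wb") as handle:
        pickle.dump(data, handle)
    return data


def filter_speakers(raw_words, min_words):
    """Drop speakers with fewer than `min_words` words; `raw_words` is untouched."""
    return dict((name, words) for name, words in raw_words.items()
                if len(words) >= min_words)


calls = []  # records how many times the slow fetch actually ran


def fake_fetch():
    # Stand-in for the slow network calls; the data is made up.
    calls.append(1)
    return {"Member A": ["tax", "relief"], "Member B": ["deficit"]}


cache_file = os.path.join(tempfile.mkdtemp(), "wordfeed.pickle")

strict = filter_speakers(load_or_fetch(cache_file, fake_fetch), 2)
loose = filter_speakers(load_or_fetch(cache_file, fake_fetch), 1)
print(len(calls), sorted(strict), sorted(loose))
```

The second call reads from the cache file instead of fetching again, and because filtering happens after loading, a looser threshold can recover speakers that a stricter one cut.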
Interesting Votes to Check
- March 8, 2001 (H) -- Bush Tax Cuts
- May 23, 2001 (H), June 14, 2001 (S) -- No Child Left Behind
- April 24, 2002 -- Sarbanes-Oxley
- February 14, 2002 (H), March 20, 2002 (S) -- McCain-Feingold
- October 10, 2002 (H), October 11, 2002 (S) -- Iraq War Authorization
- March 13, 2003 (S), June 4, 2003 (H) -- Partial Birth Abortion
- July 12, 2007 (H) -- Responsible Redeployment for Iraq Act