
This project contains resources for conducting natural language processing research using data from Twitter (www.twitter.com).

Currently available are a set of 1827 tweets manually annotated with part-of-speech tags, a tweet tokenizer, a part-of-speech tagger trained on the annotated data, and the simple browser-based annotation interface that was used to perform the annotation. These can be downloaded from the Downloads tab. There is also a GitHub repository hosting the tagger code. The home of this project is here.

Milestone releases that include different annotations or give different tagging accuracy will be explicitly made available here. For the latest version, which may include other changes that do not affect accuracy (speed-ups, minor bug fixes, etc.), see the GitHub page. These releases can be downloaded from the project downloads page (click on "Download as zip" or "Download as tar.gz").

The data and tools are described in the ACL 2011 paper "Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments" by Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. [pdf]

To receive announcements about major updates to these releases, join the ARK-tools mailing list.

Please note that not all components are released under the same license, hence our use of "Other Open Source" as the code license shown to the left. Each download contains full license information within it.

Update! 11/8/2011: Major speed improvement to the tagger. Get the latest version from the GitHub downloads page. Direct links: zip tar.gz
