This summarization system, built for the Text Analysis Conference (TAC) evaluation campaign, generates summaries by extracting sentences that contain the most frequent word bigrams (called concepts) in the input documents. It uses Integer Linear Programming (ILP) to select that set of sentences under a length constraint. With this system, we obtained very good scores for the update task at the TAC'08 evaluation and among the best scores at TAC'09.
So far, we have released only the raw code, which depends heavily on internal ICSI infrastructure, but we plan to add a cleaned-up, standalone version for public use.
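To make the selection objective concrete, here is a minimal sketch of concept-based sentence selection on toy data. It is not the released code. The function names (bigrams, concept_weights, summarize), the whitespace tokenization, the document-frequency weighting, and the min_count threshold are all assumptions made for illustration. The exhaustive search is a stand-in for the ILP solver and is only practical for a handful of sentences.

```python
from collections import Counter
from itertools import combinations


def bigrams(sentence):
    """Return the set of lowercase word bigrams in a sentence."""
    words = sentence.lower().split()  # assumption: whitespace tokenization
    return set(zip(words, words[1:]))


def concept_weights(documents, min_count=2):
    """Weight each bigram by the number of documents it occurs in.

    Assumption: bigrams seen in fewer than min_count documents are dropped.
    """
    counts = Counter()
    for doc in documents:
        seen = set()
        for sentence in doc:
            seen |= bigrams(sentence)
        counts.update(seen)  # each document counts a bigram at most once
    return {c: n for c, n in counts.items() if n >= min_count}


def summarize(documents, max_words):
    """Pick the sentence subset that maximizes the total weight of covered
    concepts while staying within max_words.

    Exhaustive search stands in for the ILP solver here; each concept is
    counted once no matter how many selected sentences contain it.
    """
    weights = concept_weights(documents)
    sentences = [s for doc in documents for s in doc]
    best_score, best_subset = 0, ()
    for size in range(1, len(sentences) + 1):
        for subset in combinations(range(len(sentences)), size):
            length = sum(len(sentences[i].split()) for i in subset)
            if length > max_words:
                continue  # violates the length constraint
            covered = set().union(*(bigrams(sentences[i]) for i in subset))
            score = sum(weights.get(c, 0) for c in covered)
            if score > best_score:
                best_score, best_subset = score, subset
    return [sentences[i] for i in best_subset], best_score
```

Because a concept contributes to the score only once, redundant sentences add length without adding value, which is why the objective tends to favor diverse sentences.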
NEWS:
- 2009-12-28 Added TAC'09 code
DEPENDENCIES:
- glpsol, ILP solver
- splitta, sentence splitter
- icsiboost, a classifier
- nltk, for tokenization and stemming
- Berkeley Parser, a constituency parser
- mate, a dependency parser and SRL system (not yet released; use your own)
Note that the SRL system is only needed if you want to use sentence compression in the TAC'09 system.
REFERENCES:
- Dan Gillick, Benoit Favre, Dilek Hakkani-Tür, Bernd Bohnet, Yang Liu, Shasha Xie, "The ICSI/UTD Summarization System at TAC 2009", to appear in Text Analysis Conference, Gaithersburg, MD (USA) - 2009
- Daniel Gillick, Benoit Favre, "A Scalable Global Model for Summarization", NAACL/HLT 2009 Workshop on Integer Linear Programming for Natural Language Processing - 2009
- Dan Gillick, Benoit Favre, Dilek Hakkani-Tür, "The ICSI Summarization System at TAC 2008", Text Analysis Conference, Gaithersburg, MD (USA) - 2008