MultiplyIndexedData  
Documents with multiple topic sets
Updated Jul 27, 2009 by medel...@gmail.com

Multiply indexed data

This page describes three collections available for download that were used in experiments with Maui. See Examples or Medelyan, 2009 (Publications) for examples of topics automatically assigned to these documents by Maui.

Unlike collections with just one topic set per document (some are listed in Resources), these three collections contain several topic sets per document, each assigned by a different person. This makes it possible to measure the agreement between people, which provides a direct point of comparison for the performance of the algorithm.

FAO-30 data set for term assignment

FAO-30 is a test set containing 30 agricultural documents, each indexed with terms from Agrovoc by 6 professional indexers at the Food and Agriculture Organization of the UN.

When using this test set, please cite Medelyan (2009) or Medelyan and Witten (2008), see Publications.

WIKI-20 data set for topic indexing with Wikipedia

WIKI-20 is a test set with 20 computer science technical reports, each indexed with terms from Wikipedia by 15 teams of computer science graduate and undergraduate students. The test set was created in an indexing competition, which ensured high quality of assigned topics.

When using this test set, please cite Medelyan (2009) or Medelyan et al. (2008), see Publications.

CiteULike-180 data set for automatic tagging

CiteULike-180 is the only test set listed here that was created in a natural environment. It was extracted automatically from the large data set of tags assigned on the bookmarking platform CiteULike.

The following restrictions were applied to extract this test set:

  • each tag should be assigned by at least two people
  • each tagger should have at least two co-taggers
  • each tagger should have tagged at least three documents
  • the document should be available for free download

The resulting set contains 180 science articles from HighWire and Nature, with tags assigned by 332 voluntary taggers on CiteULike.

When using this test set, please cite Medelyan (2009) or Medelyan et al. (2009), see Publications.

How to measure consistency

Consistency, or inter-indexer consistency, is computed with a traditional library science measure proposed by Rolling (1981).

The formula is 2C/(A+B), where C is the number of topics two sets have in common and A and B are the total numbers of topics in each set.

Given two topic sets:

  • complex systems, network, small world
  • theoretical, small world, networks, dynamics

The intersection, that is, the set of topics the two sets have in common after stemming, is

  • network, small world

This gives a Rolling consistency of 2×2/(3+4) = 0.57. To compute precision and recall, one of the sets needs to be treated as the gold standard. If the first set was assigned automatically and the second set is the gold standard, the precision is 2/3 = 0.67 and the recall is 2/4 = 0.5. The F-measure is 2×0.67×0.5/(0.67+0.5) = 0.57, the same as the Rolling consistency.
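The worked example above can be reproduced with a short Python sketch. This is only an illustration, not Maui's own code: the function names `stem` and `compare` are hypothetical, and the stemmer is a naive stand-in that merely strips a trailing "s", which happens to be enough for this example. A real experiment would use a proper stemmer.

```python
def stem(topic):
    """Naive stand-in for a real stemmer (an assumption for this sketch):
    lowercase each word and strip a trailing 's' from longer words."""
    words = topic.lower().split()
    return " ".join(w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words)


def compare(assigned, gold):
    """Compare an assigned topic set against a gold-standard topic set."""
    a = {stem(t) for t in assigned}
    b = {stem(t) for t in gold}
    common = len(a & b)  # C: number of topics the two sets share

    # Rolling consistency: 2C / (A + B)
    rolling = 2 * common / (len(a) + len(b)) if (a or b) else 0.0
    precision = common / len(a) if a else 0.0
    recall = common / len(b) if b else 0.0
    f_measure = (2 * precision * recall / (precision + recall)) if common else 0.0

    return {
        "rolling": rolling,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


first = ["complex systems", "network", "small world"]
second = ["theoretical", "small world", "networks", "dynamics"]

# The first set plays the role of the automatic assignment,
# the second set the role of the gold standard.
result = compare(first, second)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

The equality of the F-measure and the Rolling consistency is not a coincidence: with precision C/A and recall C/B, the expression 2PR/(P+R) simplifies to 2C/(A+B).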

Note: To compute indexing consistency of a person or an algorithm, the values need to be computed for each document and co-indexer and then averaged.
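The averaging described in the note might be sketched as follows. The data layout (a dictionary mapping each document to the topic sets of its indexers), the names `rolling` and `indexer_consistency`, and the toy topics are all hypothetical. The sketch also assumes topics have already been stemmed, and it takes a flat mean over all (document, co-indexer) pairs, which is one reading of the note. When every document has the same number of co-indexers, this equals the mean of per-document means.

```python
def rolling(a, b):
    """Rolling consistency 2C / (A + B) for two topic sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def indexer_consistency(assignments, indexer):
    """Average consistency of one indexer (a person or an algorithm)
    with every co-indexer on every document that indexer worked on.

    assignments: {document_id: {indexer_name: set_of_topics}}
    """
    scores = []
    for by_indexer in assignments.values():
        if indexer not in by_indexer:
            continue  # this indexer did not index this document
        own_topics = by_indexer[indexer]
        for other, topics in by_indexer.items():
            if other != indexer:
                scores.append(rolling(own_topics, topics))
    return sum(scores) / len(scores) if scores else 0.0


# Toy data, invented purely for illustration.
assignments = {
    "doc1": {
        "alice": {"a", "b"},
        "bob": {"a", "c"},
        "algorithm": {"a", "b", "c"},
    },
    "doc2": {
        "alice": {"x"},
        "bob": {"x", "y"},
        "algorithm": {"y"},
    },
}

algorithm_score = indexer_consistency(assignments, "algorithm")
alice_score = indexer_consistency(assignments, "alice")
print(f"algorithm: {algorithm_score:.3f}, alice: {alice_score:.3f}")
```

Computing the same score for each human indexer gives the baseline against which the algorithm's score can be compared.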

