
This dataset has three sub-datasets:

Contact, Pizza, Hotel

For each sub-dataset:

The file named "url" lists every downloaded URL with its assigned ID, and indicates whether the saved file was tagged correctly by the rule-based extractor.

format:

<id>
<0/1(tagged correctly/not)> <total number of addresses extracted correctly> <total number of addresses extracted> <total number of addresses in the web page>
<url>
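
The three-line record format above could be read with something like the following Python sketch. It is not part of the dataset. It assumes that records are consecutive groups of three non-empty lines, that the four fields on the second line are whitespace-separated integers, and that blank lines carry no meaning. The names `UrlRecord` and `parse_url_records` are hypothetical, and the 0/1 flag is stored as-is without interpreting which value means "tagged correctly".

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class UrlRecord:
    """One record from a "url" file (hypothetical container, not from the dataset)."""
    page_id: str
    flag: int                  # raw 0/1 value from the file, not interpreted here
    extracted_correctly: int   # addresses extracted correctly
    extracted_total: int       # addresses extracted
    addresses_in_page: int     # addresses present in the web page
    url: str


def parse_url_records(lines: Iterable[str]) -> Iterator[UrlRecord]:
    """Yield one UrlRecord per group of three non-empty lines."""
    # Assumption: blank lines are only separators and can be dropped.
    rows = [ln.strip() for ln in lines if ln.strip()]
    if len(rows) % 3 != 0:
        raise ValueError("expected groups of three non-empty lines")
    for i in range(0, len(rows), 3):
        fields = rows[i + 1].split()
        if len(fields) != 4:
            raise ValueError(
                f"record {rows[i]!r}: expected 4 fields, got {len(fields)}"
            )
        flag, correct, total, in_page = (int(f) for f in fields)
        yield UrlRecord(rows[i], flag, correct, total, in_page, rows[i + 2])
```

To use it on a real file, pass an open file handle, for example `parse_url_records(open("Contact/url", encoding="utf-8"))`. The path `Contact/url` is itself an assumption about how the sub-datasets are laid out on disk.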

The directory named "original" contains all collected web pages, without any tagging.

The directory named "tagged" contains all tagged web pages.

file name format: f0000id
