The main goal of WARC Tools is to promote adoption of the WARC file format for storing web archives among mainstream web developers. To that end, the project provides an open source software library, a set of command line tools, web server plug-ins, and technical documentation for manipulating and managing web archive files (WARC files).

WARC files are produced by web archiving crawlers. Examples include Heritrix, the open-source, extensible, web-scale, archival-quality web crawler developed by the Internet Archive with the Nordic national libraries, and Hanzo's own commercial crawlers.

The project is led by Hanzo Archives, in collaboration with the Internet Archive Web team, and supported by the International Internet Preservation Consortium (IIPC).

WARC Tools will be implemented as a set of core libraries, with the functionality made available to end users as command line tools, extensions to existing tools, and simple web applications for accessing WARC content. In addition, all the libraries will have APIs and dynamic language bindings and will be available as software libraries for developers.

The library and tools will be scriptable (command line tools in shell scripts, dynamic language bindings to the library) and programmable (dynamic language bindings, Java packages, and the C library itself).

Migration from, and interoperability with, legacy tools are important, so this project will implement functionality for these tools and the artefacts they create, to enable rapid adoption of WARC by the mainstream. To this end, integration with or linking to tools such as HTTrack, curl and wget will be part of the Hanzo WARC Tools. These will be released as open source code, together with installers, documentation and man pages.

The library and tools will be implemented in ANSI C and will be highly portable, building and installing on various Linux and Unix distributions as well as Windows. They will ship with Unix man pages, build and installation guides, developer guides, etc.

The project mailing list is here: http://groups.google.com/group/warc-tools

The project proposal is here: http://netpreserve.org/forum/viewtopic.php?t=171&sid=73a8903e04ea2e8fb091e8c7e9e618b1 (IIPC members only)

Hanzo are extending WARC Tools with full text search and an analytics API; take a look at the Search Tools project.








