Updated May 03, 2007 by lars.trieloff
Labels: Phase-Design, Featured
DesignDiscussion  

Tika design discussion

Introduction

This page is intended as a discussion page for outlining the requirements for Tika, a generic document parsing framework. As a project designed to be embedded in many different applications such as search engines, content management systems or information extraction frameworks, the design needs to reflect many requirements and deployment scenarios.

There are a number of headings below which will initially contain more questions than answers.

Apologies if the questions ask the obvious; I'm new here and need to play catch-up. (Mark)

Design

As a wrap-up from the discussion at ApacheCon EU, I've created an ArchitectureSketch.

What content should be returned by parsers?

How is content presented to Tika parsers?

I see the existing Tika API has MIME type + encoding in the parser interface (presumably readily available in Nutch from HTTP headers), but would this work for file-system based scenarios too? I presume Tika could handle this by maintaining a mapping from file extension to MIME type somewhere.
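To make the file-system case concrete, here is a minimal sketch of such an extension-to-MIME-type lookup. The class name `ExtensionMimeTypes`, its methods, and the handful of registered extensions are hypothetical and chosen only for illustration; none of this is taken from the existing Tika API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the Tika API: guesses a MIME type
// from a file name when no HTTP Content-Type header is available.
final class ExtensionMimeTypes {
    // Fallback for unknown or missing extensions.
    private static final String DEFAULT_TYPE = "application/octet-stream";

    private final Map<String, String> byExtension = new HashMap<>();

    ExtensionMimeTypes() {
        // A few illustrative entries; a real table would be loaded from configuration.
        register("txt", "text/plain");
        register("html", "text/html");
        register("xml", "application/xml");
        register("pdf", "application/pdf");
    }

    // Adds or overrides a mapping; extensions are stored in lower case.
    void register(String extension, String mimeType) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), mimeType);
    }

    // Returns the MIME type for the file name's last extension, or the default.
    String guess(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return DEFAULT_TYPE;
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return byExtension.getOrDefault(extension, DEFAULT_TYPE);
    }
}
```

With this sketch, `new ExtensionMimeTypes().guess("Report.PDF")` yields `application/pdf`, while a name without an extension falls back to `application/octet-stream`. An extension alone says nothing about character encoding, so the encoding half of the question remains open.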

Does Tika support post-parser analysis?

Once the content is normalized, a number of "analyzers" could be used to add extra information as "annotation" objects: byte offset and length info identifying an area of text, plus a Map of information detailing the discovery.

Possible examples include:

This all starts to sound a bit like Gate, though. See their Annotation class for examples.
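As a rough illustration of the annotation idea, the sketch below pairs an offset and length with a Map of details and shows one trivial analyzer that produces such objects. The names `Annotation` and `TermAnalyzer` and the `"type"` feature key are hypothetical, and the offsets are character positions in a Java String rather than byte offsets, purely to keep the example short. It is not modelled on Gate's actual class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical annotation: a span of the normalized text plus details about it.
final class Annotation {
    final int offset;                    // start position (characters, by assumption)
    final int length;                    // number of characters covered
    final Map<String, String> features;  // read-only details of the discovery

    Annotation(int offset, int length, Map<String, String> features) {
        this.offset = offset;
        this.length = length;
        this.features = Collections.unmodifiableMap(new HashMap<>(features));
    }

    // Returns the part of the text this annotation points at.
    String coveredText(String text) {
        return text.substring(offset, offset + length);
    }
}

// Hypothetical analyzer: annotates every occurrence of a fixed term.
final class TermAnalyzer {
    private final String term;
    private final String type;

    TermAnalyzer(String term, String type) {
        this.term = term;
        this.type = type;
    }

    List<Annotation> analyze(String text) {
        List<Annotation> found = new ArrayList<>();
        if (term.isEmpty()) {
            return found; // nothing sensible to search for
        }
        int from = text.indexOf(term);
        while (from >= 0) {
            Map<String, String> features = new HashMap<>();
            features.put("type", type);
            found.add(new Annotation(from, term.length(), features));
            // Continue after this match so occurrences do not overlap.
            from = text.indexOf(term, from + term.length());
        }
        return found;
    }
}
```

Running `new TermAnalyzer("Tika", "project").analyze(text)` over a normalized string returns one `Annotation` per occurrence. Because annotations only point into the text, several analyzers could run over the same content independently without modifying it.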

Will Tika provide a Parser Factory/Registry?

Miscellaneous questions


Comment by julianofs81, Aug 05, 2009

Can Tika be used together with Nutch?

