Tika design discussion
This page is intended as a discussion page for outlining the requirements for Tika, a generic document parsing framework. As a project designed to be embedded in many different applications such as search engines, content management systems or information extraction frameworks, the design needs to reflect many requirements and deployment scenarios.
There are a number of headings below which will initially contain more questions than answers.
Apologies if these questions are asking the obvious - I'm new here and need to play catch-up. (Mark)
As a wrap-up from the discussion at ApacheCon EU, I've created an ArchitectureSketch.
What content should be returned by parsers?
- Plain text - plain java.lang.String or stream-based e.g. Reader/Writer?
- I prefer a stream-based approach to avoid potentially unlimited memory requirements. (Jukka)
- Should we return a Reader that a client can consume at leisure or accept a Writer to which the parsed text content is written directly? (Jukka)
- The Writer approach is easier in that it requires less resource coordination and in some cases avoids the need for temporary files. (Jukka)
- The Reader approach gives more control to the client. (Jukka)
- String or Reader/Writer is better than Input/OutputStream since the parser should already handle character encoding issues. (Jukka)
- Document metadata
- Dublin Core-based attributes? As a map or hard-coded?
- Dublin Core is probably the best there is, but might still not be directly applicable in general (it was designed for a specific domain). (Jukka)
- One idea I had was to use an event-based approach for such metadata. It would then be easy to extend the set of reported metadata simply by adding new "metadata events". (Jukka)
- A "normalized" rich text markup? E.g. a basic HTML subset with support for tables/headers - "HTML-lite". May be useful when displaying parsed documents, e.g. in Stellent's viewer.
- It doesn't make sense to require the client to use yet another parser to parse the output of the content parser... (Jukka)
- A SAX-like event interface would probably make sense, as clients that care about formatting could listen for formatting events while other clients could just extract the raw character content. This approach would also play well with the above "metadata event" idea. (Jukka)
- "Annotations"? (see "post-parser analysis" section)
- Multiple output "documents" per input document? e.g. RSS or CSV
- Also packaging formats like tar, zip, etc. (Jukka)
- The above event mechanism could fairly easily be used to add support for such multiple "documents" per input stream. (Jukka)
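To make the event idea above concrete, here is a minimal sketch of what such a SAX-like interface could look like. All interface and class names here are assumptions for illustration, not a proposed final API; the point is that a client that only wants raw text can ignore the structure and metadata events, while memory use stays bounded because content is streamed rather than collected into one large String.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical SAX-like handler interface; all names are illustrative.
interface ContentHandler {
    void characters(char[] ch, int start, int length);
    void metadata(String name, String value);   // "metadata event"
    void startStructure(String type);           // e.g. "paragraph", "table"
    void endStructure(String type);
}

// A trivial parser that streams character content to the handler instead
// of building one potentially huge String.
class PlainTextParser {
    void parse(Reader input, ContentHandler handler) throws IOException {
        handler.metadata("content-type", "text/plain");
        handler.startStructure("body");
        char[] buffer = new char[4096];
        int n;
        while ((n = input.read(buffer)) != -1) {
            handler.characters(buffer, 0, n);
        }
        handler.endStructure("body");
    }
}

public class EventSketch {
    // A client that only wants raw text ignores structure and metadata events.
    static String extractText(String document) throws IOException {
        final StringBuilder text = new StringBuilder();
        ContentHandler textOnly = new ContentHandler() {
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
            public void metadata(String name, String value) { }
            public void startStructure(String type) { }
            public void endStructure(String type) { }
        };
        new PlainTextParser().parse(new StringReader(document), textOnly);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(extractText("Hello Tika"));
    }
}
```

A formatting-aware client would implement the same interface but also react to the structure events, which is how one handler type could serve both kinds of consumers.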
How is content presented to Tika parsers?
- InputStream is probably best as it's the least common denominator of any input mechanisms. (Jukka)
- How to handle if a parser needs to do multiple (partial) passes over the input document? (Jukka)
- Use a temporary file? (Jukka)
- Use a BufferedInputStream overlay and the InputStream.mark() method? (Jukka)
- Use some other input mechanism than InputStream? (Jukka)
- Any extra "context"
- File extension
- Content type header from HTTP/SMTP/etc.
- Resource forks, extended attributes, etc. within the file system
- Content within the input document that could be used for selecting the appropriate parser:
- Byte order marks
- Magic numbers
- Shebang comments in Unix scripts
- emacs/vi style content directives
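As a sketch of how the BufferedInputStream/mark() idea above could combine with magic-number detection: the wrapper peeks at the first few bytes and then rewinds, so the selected parser still sees the full document. The method names are illustrative assumptions, and only two well-known magic numbers (PDF and ZIP) are shown.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch: peek at the first bytes of a stream to detect a magic number,
// then reset so no content is consumed before parsing starts.
public class MagicSniffer {
    static String sniff(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(4);                    // remember position, allow 4-byte lookahead
        byte[] head = new byte[4];
        int n = in.read(head);
        in.reset();                    // rewind to the marked position
        if (n >= 4 && head[0] == '%' && head[1] == 'P'
                && head[2] == 'D' && head[3] == 'F') {
            return "application/pdf";
        }
        if (n >= 2 && (head[0] & 0xFF) == 0x50 && (head[1] & 0xFF) == 0x4B) {
            return "application/zip";  // also covers OOXML/ODF packages
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) throws IOException {
        InputStream pdf = new ByteArrayInputStream("%PDF-1.4".getBytes("US-ASCII"));
        System.out.println(sniff(pdf)); // application/pdf
    }
}
```

A real implementation would hand the still-buffered stream on to the chosen parser rather than discarding it; mark()/reset() only works for bounded lookahead, which is why a temporary file may still be needed for true multi-pass parsers.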
I see the existing Tika API has MIME type + encoding in the parser interface (presumably readily available in Nutch from HTTP headers), but would this work for file-system based scenarios too? I presume Tika could do this by maintaining a file extension to MIME type mapping somewhere.
- I think we need to both extend the Tika API to allow more generic "input metadata" and add some file extension to MIME type mappings. (Jukka)
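A minimal sketch of such a file extension to MIME type mapping could look like the following; the class name and the entries in the table are illustrative only, and a real registry would presumably load its mappings from configuration rather than hard-coding them.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative extension-to-MIME-type lookup with a safe default.
public class ExtensionMimeMap {
    private static final Map<String, String> TYPES = new HashMap<String, String>();
    static {
        TYPES.put("txt", "text/plain");
        TYPES.put("html", "text/html");
        TYPES.put("pdf", "application/pdf");
        TYPES.put("doc", "application/msword");
    }

    static String guessType(String filename) {
        int dot = filename.lastIndexOf('.');
        if (dot < 0 || dot == filename.length() - 1) {
            return "application/octet-stream";   // no usable extension
        }
        String ext = filename.substring(dot + 1).toLowerCase(Locale.ENGLISH);
        String type = TYPES.get(ext);
        return type != null ? type : "application/octet-stream";
    }

    public static void main(String[] args) {
        System.out.println(guessType("report.PDF")); // application/pdf
    }
}
```

Extension-based lookup would be a hint only; content-based detection (byte order marks, magic numbers) should be able to override it when the two disagree.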
Does Tika support post-parser analysis?
Once the content is normalized, a number of "analyzers" could be used to add extra information as "annotation" objects: byte offset and length information identifying an area of text, plus a Map of information detailing the discovery.
- The event mechanism mentioned above would nicely support such annotations. (Jukka)
Possible examples include:
- URL detection (currently done by Nutch?)
- Paragraph detection
- Postcode/phone number/person detection
This all starts to sound a bit like GATE, though. See their Annotation class for examples.
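A rough sketch of what such an annotation object could look like, assuming character offsets and a simple string-valued feature map; all names here are illustrative, not a proposal for the actual class.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Sketch of the annotation idea: a range in the extracted text plus a
// map of features describing what an analyzer found there.
public class Annotation {
    final int offset;                        // start of the annotated span
    final int length;                        // length of the annotated span
    final String type;                       // e.g. "url", "paragraph", "postcode"
    final Map<String, String> features;      // analyzer-specific details

    Annotation(int offset, int length, String type, Map<String, String> features) {
        this.offset = offset;
        this.length = length;
        this.type = type;
        this.features = Collections.unmodifiableMap(features);
    }

    public static void main(String[] args) {
        String text = "See http://tika.apache.org for details.";
        int start = text.indexOf("http://");
        int end = text.indexOf(' ', start);
        Map<String, String> features = new HashMap<String, String>();
        features.put("scheme", "http");
        Annotation url = new Annotation(start, end - start, "url", features);
        System.out.println(text.substring(url.offset, url.offset + url.length));
    }
}
```

Because annotations only reference offsets, they compose cleanly: a URL detector, a paragraph detector, and a person-name detector could all annotate the same text independently.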
Will Tika provide a Parser Factory/Registry?
- How are parsers registered? Is there a configuration system that allows an individual parser's behaviour to be tweaked at factory start-up, e.g. using a generic IoC framework such as Spring?
- IoC would be nice, but requires extra support from the environment. Perhaps we should use the ServiceFactory approach? (Jukka)
- What criteria are used to select a particular parser for a document?
- MIME type
- File extension
- Auto-detection from content
- Client request for parser functionality?
- Ability to extract certain metadata e.g. "author" or ability to return normalized HTML-lite markup.
- Are there "fall-back" options for parsers? Open-source parsers have been known to be unable to parse certain documents. The Parser framework could have some notion of alternative parser implementations to fall back on if any one implementation is unable to deal with a particular "problem" document.
- +1, see the comment below on preferred parsers (Jukka)
- Are parsers expected to be thread-safe?
- We could use a thread-safe Factory mechanism that instantiates a new parser instance for each parsing task. The individual parser instances wouldn't need to be thread-safe, but there could be multiple instances working in parallel. (Jukka)
- Java 1.4 or 1.5?
- Java 5 would be nice, but for a general purpose library I think Java 1.4 is still the best platform to use. (Jukka)
- How to handle license issues behind different parser implementations?
- What are the preferred parsers for each document type?
- Perhaps the framework should support alternative parsers per document type together with some (configurable?) priority and fallback mechanisms. (Jukka)
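Putting the factory/registry, priority, and fallback ideas together, a minimal sketch could look like this. The interface and class names are assumptions, not a proposed API; the point is that parsers are tried in priority order for a given MIME type, falling back to the next one when a "problem" document makes the preferred parser fail.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative parser interface; a real one would stream content and events.
interface Parser {
    String parse(byte[] document) throws Exception;
}

// Registry with per-type parser lists: earlier registrations have priority,
// later ones serve as fallbacks.
class ParserRegistry {
    private final Map<String, List<Parser>> parsers = new HashMap<String, List<Parser>>();

    void register(String mimeType, Parser parser) {
        List<Parser> list = parsers.get(mimeType);
        if (list == null) {
            list = new ArrayList<Parser>();
            parsers.put(mimeType, list);
        }
        list.add(parser);
    }

    String parse(String mimeType, byte[] document) throws Exception {
        Exception lastFailure = null;
        List<Parser> list = parsers.get(mimeType);
        if (list != null) {
            for (Parser parser : list) {     // fall back on failure
                try {
                    return parser.parse(document);
                } catch (Exception e) {
                    lastFailure = e;
                }
            }
        }
        throw new Exception("No parser could handle " + mimeType, lastFailure);
    }
}

public class RegistrySketch {
    public static void main(String[] args) throws Exception {
        ParserRegistry registry = new ParserRegistry();
        registry.register("text/plain", new Parser() {
            public String parse(byte[] doc) throws Exception {
                throw new Exception("preferred parser chokes on this document");
            }
        });
        registry.register("text/plain", new Parser() {
            public String parse(byte[] doc) throws Exception {
                return new String(doc, "UTF-8");
            }
        });
        System.out.println(registry.parse("text/plain", "hello".getBytes("UTF-8")));
    }
}
```

This shape would also address the thread-safety question above: the registry itself can stay immutable after start-up and hand out fresh parser instances per task, so individual parsers need not be thread-safe.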