The Tika project is an attempt to build a generic content analysis toolkit based on ideas and code from Apache Nutch and related projects. It aims to provide a reusable core of content analysis functionality, including:
- MimeType Repository
- Language Identifier
- Content Signature
- Generic Metadata Infrastructure
- Charset Detector
- Parse Plugins Framework
In the long term, the project intends to become an Apache Lucene subproject.