airhead-research - issue #98

Add support for custom document preprocessing


Posted on Jul 17, 2011 by Happy Rhino

What steps will reproduce the problem?
1. Have a corpus with mixed-case or punctuation
2. Run any of the algorithms

What is the expected output? What do you see instead?

The output should have tokens lower-cased as needed and punctuation handled according to user-specified rules.

Ideally, we could support some type of filter that takes in a Document and transforms it according to whatever rules the user wants. Could this be integrated with the token filter and IteratorFactory? Or it could be a separate step that lives entirely in GenericMain.
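
A rough sketch of what such a filter might look like, to make the idea concrete. Everything here is hypothetical: `PreprocessingSketch`, `DocumentPreprocessor`, `then`, and the three example rules are made-up names, not part of the existing code base, and the sketch works on plain document text as a String rather than on the real Document type. The point is only that each rule is a small text-to-text function and that rules can be chained in a user-chosen order.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

/**
 * Hypothetical sketch only: none of these names exist in the project.
 * Documents are modeled as plain Strings to keep the example self-contained.
 */
final class PreprocessingSketch {

    /** A preprocessing rule maps raw document text to transformed text. */
    interface DocumentPreprocessor extends UnaryOperator<String> {

        /** Returns a rule that applies this rule first, then {@code next}. */
        default DocumentPreprocessor then(DocumentPreprocessor next) {
            return text -> next.apply(this.apply(text));
        }
    }

    /** Lower-cases the whole document, independent of the default locale. */
    static final DocumentPreprocessor LOWER_CASE =
        text -> text.toLowerCase(Locale.ROOT);

    /** Pads each ASCII punctuation mark with spaces so it becomes its own token. */
    static final DocumentPreprocessor SEPARATE_PUNCTUATION =
        text -> text.replaceAll("(\\p{Punct})", " $1 ")
                    .replaceAll("\\s+", " ")
                    .trim();

    /** Replaces each ASCII punctuation mark with whitespace. */
    static final DocumentPreprocessor STRIP_PUNCTUATION =
        text -> text.replaceAll("\\p{Punct}", " ")
                    .replaceAll("\\s+", " ")
                    .trim();

    public static void main(String[] args) {
        // The user picks which rules to apply and in which order.
        DocumentPreprocessor pipeline = LOWER_CASE.then(SEPARATE_PUNCTUATION);
        System.out.println(pipeline.apply("Hello, World! It's 2011."));
    }
}
```

If something like this were adopted, the composed rule could be applied to each document before tokenization, either inside the tokenizing code or as a separate pass in the main class. Which of those is the better fit is the open question above.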

Status: Accepted

Labels:
Type-Defect Priority-Low