
templatemaker
Given a list of text files in a similar format, templatemaker creates a template that can extract data from files in that same format.
The library is written in Python, but the underlying longest-common-substring algorithm is implemented in C for performance.
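For readers unfamiliar with the term, here is a minimal pure-Python sketch of a longest-common-substring search using dynamic programming. It is not the library's C implementation, only a textbook version for illustration, and the function name `longest_common_substring` is made up for this example.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous substring shared by a and b.

    Illustrative only: a textbook O(len(a) * len(b)) dynamic-programming
    version, not templatemaker's C code.
    """
    best_len = 0
    best_end = 0  # index in `a` just past the end of the best match
    # prev[j] is the length of the common suffix of a[:i-1] and b[:j]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended one character earlier.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


print(repr(longest_common_substring('this and that', 'alex and sue')))
# prints ' and '
```

The shared text (`' and '` here) is the kind of fixed material a template keeps, while everything around it varies between samples.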
How to download
Go to the "Downloads" page and download the latest version, 0.1.1.
Newer (but not necessarily stable) code is available via Subversion on the "Source" page.
Example usage
Here's a sample Python interactive interpreter session:
```
# Import the Template class.
>>> from templatemaker import Template

# Create a Template instance.
>>> t = Template()

# Learn a Sample String.
>>> t.learn('this and that')

# Output the template so far, using the "!" character to mark holes.
# We've only learned a single string, so the template has no holes.
>>> t.as_text('!')
'this and that'

# Learn another string. The True return value means the template gained
# at least one hole.
>>> t.learn('alex and sue')
True

# Sure enough, the template now has some holes.
>>> t.as_text('!')
'! and !'

# Learn another string. This time, the return value is False, which means
# the template didn't gain any holes.
>>> t.learn('fine and dandy')
False

# The template is the same as before.
>>> t.as_text('!')
'! and !'

# Now that we have a template, let's extract some data.
>>> t.extract('red and green')
('red', 'green')
>>> t.extract('django and stephane')
('django', 'stephane')

# The extract() method is very literal. It doesn't magically trim
# whitespace, nor does it have any knowledge of markup languages such as
# HTML.
>>> t.extract(' spacy  and underlined')
(' spacy ', 'underlined')

# The extract() method will raise the NoMatch exception if the data
# doesn't match the template. In this example, the data doesn't have the
# leading and trailing "" tags.
>>> t.extract('this and that')
Traceback (most recent call last):
...
NoMatch
```
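To make the literal matching behaviour concrete, here is a rough pure-Python sketch of how a template with holes could be applied to a string using the standard `re` module. This is an assumption about one possible approach, not a description of templatemaker's internals. `sketch_extract` and `SketchNoMatch` are hypothetical names, and the template is passed in as marked-up text like the output of `as_text('!')`.

```python
import re


class SketchNoMatch(Exception):
    """Hypothetical stand-in for the library's NoMatch exception."""


def sketch_extract(template_text, data, marker='!'):
    """Extract hole contents from `data` using a marked-up template string."""
    # Escape the literal parts and put a capturing group at each marker.
    literals = template_text.split(marker)
    pattern = '(.*?)'.join(re.escape(part) for part in literals)
    match = re.fullmatch(pattern, data, re.DOTALL)
    if match is None:
        raise SketchNoMatch(data)
    return match.groups()


print(sketch_extract('! and !', 'red and green'))

# Matching is literal, so callers who want trimmed values have to
# strip them themselves.
fields = sketch_extract('! and !', ' spacy  and underlined')
print(tuple(field.strip() for field in fields))
```

The sketch breaks down if the marker character also appears in the literal text, which is one reason to treat it as a toy and not a replacement for `extract()`.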
Documentation
See README.TXT in the distribution for full documentation.
Stability
The library is functional, but this is my first time writing C code since college. As such, it may or may not have buffer-overflow issues. I'm hoping a C expert will step in and audit the code.
Do not use this in a production setting just yet.
Mailing list
The mailing list is hosted by Google Groups.
Academic stuff
Thanks to some kind contributors, I've learned that this sort of technology goes by several names in the academic community:
- Wrapper induction ("wrapper" is a formal term for "screen scraper"). Every paper I've found about wrapper induction takes a "supervised" approach -- that is, it requires human-labeled input. My goal with templatemaker is to be entirely unsupervised.
- Wrapper generation. (This seems to be a synonym for "wrapper induction.")
- Information extraction (IE).
- Template detection.
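The unsupervised goal described above can be illustrated with a small sketch that infers a template from two unlabeled samples, with no human marking of which parts are data. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for the library's own algorithm, so treat it as a toy under that assumption. `sketch_induce` and its `min_length` parameter are hypothetical names.

```python
from difflib import SequenceMatcher


def sketch_induce(first, second, min_length=2, marker='!'):
    """Build a marked-up template from two unlabeled sample strings."""
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    parts = []
    pos_a = pos_b = 0  # how far each sample has been consumed
    for a, b, size in matcher.get_matching_blocks():
        # Very short matches are usually coincidences, so leave them
        # inside the surrounding hole.
        if 0 < size < min_length:
            continue
        if a > pos_a or b > pos_b:
            # Text that differs between the samples becomes a hole.
            parts.append(marker)
        # Text shared by both samples is kept as a literal.
        parts.append(first[a:a + size])
        pos_a, pos_b = a + size, b + size
    return ''.join(parts)


print(sketch_induce('this and that', 'alex and sue'))
# prints ! and !
```

Neither input string carries any labels. The holes come purely from where the two samples disagree.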
See these search results for hours of reading material.
Credits
This code was written by Adrian Holovaty and originally released July 5, 2007.
Project Information
- License: New BSD License
- svn-based source control
- Labels: python, text, parsing, extraction, data