templatemaker

Python library for extracting data from similarly formatted text strings.

Given a list of text files in a similar format, templatemaker creates a template that can extract data from files in that same format.

The library is written in Python, but the underlying longest-common-substring algorithm is implemented in C for performance.
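To make the term concrete, here is a minimal pure-Python sketch of a longest-common-substring search using dynamic programming. It is for illustration only and is not the library's C implementation; the function name `longest_common_substring` is an assumption made for this example.

```python
def longest_common_substring(a, b):
    """Return the longest substring that occurs in both a and b."""
    best_len = 0
    best_end = 0  # end index (exclusive) of the best match inside a
    # prev[j] is the length of the common suffix of a[:i-1] and b[:j]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common suffix that ended one character earlier.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


print(repr(longest_common_substring('this and that', 'alex and sue')))
```

The nested loops do work proportional to `len(a) * len(b)`, which is one reason a compiled implementation can help on large inputs.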

How to download

Go to the "Downloads" page and download the latest version, 0.1.1.

Newer (but not necessarily stable) code is available via Subversion on the "Source" page.

Example usage

Here's a sample Python interactive interpreter session:

```
# Import the Template class.
>>> from templatemaker import Template

# Create a Template instance.
>>> t = Template()

# Learn a sample string.
>>> t.learn('<b>this and that</b>')

# Output the template so far, using the "!" character to mark holes.
# We've only learned a single string, so the template has no holes.
>>> t.as_text('!')
'<b>this and that</b>'

# Learn another string. The True return value means the template gained
# at least one hole.
>>> t.learn('<b>alex and sue</b>')
True

# Sure enough, the template now has some holes.
>>> t.as_text('!')
'<b>! and !</b>'

# Learn another string. This time, the return value is False, which means
# the template didn't gain any holes.
>>> t.learn('<b>fine and dandy</b>')
False

# The template is the same as before.
>>> t.as_text('!')
'<b>! and !</b>'

# Now that we have a template, let's extract some data.
>>> t.extract('<b>red and green</b>')
('red', 'green')
>>> t.extract('<b>django and stephane</b>')
('django', 'stephane')

# The extract() method is very literal. It doesn't magically trim
# whitespace, nor does it have any knowledge of markup languages such as
# HTML.
>>> t.extract('<b>  spacy  and <u>underlined</u></b>')
('  spacy ', '<u>underlined</u>')

# The extract() method will raise the NoMatch exception if the data
# doesn't match the template. In this example, the data doesn't have the
# leading and trailing "<b>" tags.
>>> t.extract('this and that')
Traceback (most recent call last):
...
NoMatch
```
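For readers curious how learning and extraction could work, the following is a rough pure-Python sketch, not templatemaker's actual code. It assumes a template can be built by repeatedly splitting the samples around their longest shared substring, and that extraction can be done by turning the template into a regular expression. All names here (`HOLE`, `shared_substring`, `make_template`, `as_text`, `extract`, and this local `NoMatch`) are hypothetical stand-ins, and the code targets Python 3.

```python
import re

HOLE = None  # marks a variable region in the template


class NoMatch(Exception):
    """Hypothetical stand-in for the library's NoMatch exception."""


def shared_substring(strings):
    """Brute-force search for the longest substring found in every string."""
    shortest = min(strings, key=len)
    for size in range(len(shortest), 0, -1):
        for start in range(len(shortest) - size + 1):
            candidate = shortest[start:start + size]
            if all(candidate in s for s in strings):
                return candidate
    return ""


def make_template(strings):
    """Return a list of literal strings and HOLE markers covering all samples."""
    if all(s == strings[0] for s in strings):
        return [strings[0]] if strings[0] else []
    common = shared_substring(strings)
    if not common:
        return [HOLE]  # nothing shared, so the whole region varies
    # Split every sample around the shared text and recurse on both sides.
    lefts, rights = [], []
    for s in strings:
        left, _, right = s.partition(common)
        lefts.append(left)
        rights.append(right)
    return make_template(lefts) + [common] + make_template(rights)


def as_text(template, marker="!"):
    """Render the template, showing each hole as the marker."""
    return "".join(marker if part is HOLE else part for part in template)


def extract(template, text):
    """Return the text that falls into each hole, or raise NoMatch."""
    pattern = "".join(
        "(.*?)" if part is HOLE else re.escape(part) for part in template
    )
    match = re.fullmatch(pattern, text, flags=re.DOTALL)
    if match is None:
        raise NoMatch(text)
    return match.groups()


template = make_template(['this and that', 'alex and sue', 'fine and dandy'])
print(as_text(template))
print(extract(template, 'red and green'))
```

Like the session above, this sketch is literal: it does no whitespace trimming and knows nothing about markup. Very short coincidental overlaps between samples (a single shared letter, for instance) would be treated as fixed text, which a more careful implementation would likely want to guard against.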

Documentation

See README.TXT in the distribution for full documentation.

Stability

The library is functional, but this is my first time writing C code since college. As such, it may or may not have buffer-overflow issues. I'm hoping a C expert will step in and audit the code.

Do not use this in a production setting just yet.

Mailing list

The mailing list is hosted by Google Groups.

Academic stuff

Thanks to some kind contributors, I've learned that this sort of technology goes by several names in the academic community:

- Wrapper induction ("wrapper" is a formal term for "screen scraper"). Every paper I've found about wrapper induction takes a "supervised" approach -- that is, it requires human-labeled input. My goal with templatemaker is to be entirely unsupervised.
- Wrapper generation. (This seems to be a synonym for "wrapper induction.")
- Information extraction (IE).
- Template detection.

See these search results for hours of reading material.

Credits

This code was written by Adrian Holovaty and originally released July 5, 2007.

Project Information

License: New BSD License
svn-based source control

Labels:
python text parsing extraction data
