
openxmllib
About
openxmllib provides resources to handle OpenXML documents from Python.
OpenXML is the new office document format supported natively by MS Office 2007, and as import/export format by Apple iWork'08 and OpenOffice 2.2.
OpenXML is defined in the ECMA-376 standard.
openxmllib runs on any platform that supports Python 2.4, 2.5 or 2.6 and the lxml Python library.
New with openxmllib 1.0.7
zc.buildout friend
See more details here.
Features
- Extract indexable text from an OpenXML document.
- Extract meta data from an OpenXML document (author, title, ...)
Requirements
openxmllib requires the lxml 1.3.x or later library. See how to install it in lxml site.
Installation (from tarball)
Get the tarball with your browser, wget, curl or whatever, then:
$ tar xfz openxmllib-x.y.y.tar.gz
$ cd openxmllib-x.y.z
$ sudo python setup.py install
Installation from Cheeseshop
Even easier that the tarball:
$ [sudo] easy_install openxmllib
Note that this will install or upgrade a suited lxml
library if you've not already go it.
Usage
``` These examples say all::
import openxmllib doc = openxmllib.openXmlDocument(path='office.docx')
Raises a ValueError on not supported office files.
doc.mimeType 'application/vnd.openxmlformats-officedocument.wordprocessingml.document' doc.coreProperties # Keys may depend on application {'title': u'blah...', u'creator': u'John Doe', ...} doc.extendedProperties # Keys may depend on application {'Words': u'312', 'Application': u'Your favorite word processor', ...} doc.customProperties # May return an empty mapping {'My property': u'My value', ...} doc.allProperties # Merges core+extended+custom properties (see above) {...} doc.indexableText(include_properties=False) u'all the words of that document body' doc.indexableText(include_properties=True) u'all the words of that document body and all properties values'
Standard mimetypes
package extensions ::
import mimetypes mimetypes.guess_type('somedoc.docx') ('application/vnd.openxmlformats-officedocument.wordprocessingml.document', None) mimetypes.guess_type('somecalc.xlsx') ('application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', None) mimetypes.guess_type('someslides.pptx') ('application/vnd.openxmlformats-officedocument.presentationml.presentation', None)
Document factory signatures::
We have the path for the office file
doc = openxmllib.openXmlDocument(path='office.docx')
We have a file object for the office file
fh = open('office.docx', 'rb') doc = openxmllib.openXmlDocument(file_='office.docx')
We have the URL for the office file
doc = openxmllib.openXmlDocument(url='http://domain.tld/office.docx')
Xe have the raw data of the office file
import mimetypes docx_mimetype = mimetypes.guess_type('office.docx') body = open('office.docx', 'rb').read() doc = open(data=body, mime_type=docx_mimetype)
Note that if you're not running a Python application, you may get the indexable
text from a document with the openxmlinfo.py
console utility. Just type::
```
Console tool
openxmllib comes with a shell command for testing of for use from a non Python app:
$ openxmlinfo --help
Planned features
- Transform OpenXML document to (poor) HTML for preview
- Transform OpenXML document to plain unicode text preview
Project Information
- License: GNU GPL v2
- 18 stars
- svn-based source control
Labels:
Python
OpenXML
lxml
Office2007