FAQ  
Design of h5py and supported features
Updated Jun 19, 2011 by andrew.c...@gmail.com

See CommonProblems for performance hints and error messages
See Roadmap for upcoming releases and development strategy
Main documentation is at http://h5py.alfven.org/docs

General design

Obtaining and installing h5py

Please follow the installation instructions in the h5py documentation.

What datatypes are supported?

Below is a complete list of types for which h5py supports reading, writing and creating datasets. Each type is mapped to a native NumPy type.

Fully supported types:

Type | Precision | Notes | Introduced
Integer | 1, 2, 4 or 8 byte, BE or LE, signed/unsigned | - | Original
Float | 4 or 8 byte, BE or LE, IEEE | - | Original
Complex | 8 or 16 byte, BE or LE | Stored as 2-element HDF5 struct | Original
Compound | Arbitrary names and offsets | - | Original
Strings (fixed-length) | Any length | - | Original
Strings (variable-length) | Any length | Read/write as Python "str" | Version 1.2
Opaque (kind 'V') | Any length | - | Original
Boolean | NumPy 1-byte bool | Stored as HDF5 enum | Version 1.2
Array | Any supported type | - | Version 1.2
Enumeration | Any NumPy integer type | Read/write as integers | Version 1.2
References | Region and object | - | Version 1.3

Unsupported types:

Type | Will support? | Notes
HDF5 VLEN (non-string) | Yes, eventually | -
HDF5 "time" type | No target | The NumPy team is working on a time type
NumPy unicode ("U") | No target | No close HDF5 equivalent
NumPy generic object ("O") | No target | -

Please note that the "unsupported" HDF5 type objects can still be manipulated through the low-level interface, although reading/writing data of these types is not currently possible.
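As a rough illustration of the type mapping above, the following sketch creates a few datasets from NumPy dtypes and reads the dtypes back. The dataset names, field names and file name are made up for the example, and it assumes an h5py release recent enough to provide the "core" driver so that nothing is written to disk.

```python
import numpy as np
import h5py

# Hypothetical compound layout: field names are arbitrary, as in the table above.
compound = np.dtype([("time", "<f8"), ("count", "<i4")])
rows = np.zeros(5, dtype=compound)
rows[0] = (1.5, 42)

# In-memory file; backing_store=False means no file is left on disk.
with h5py.File("types_demo.h5", "w", driver="core", backing_store=False) as f:
    records = f.create_dataset("records", data=rows)

    # Big-endian integers and complex numbers use the matching NumPy dtypes.
    f.create_dataset("big_endian", (4,), dtype=">i2")
    f.create_dataset("cplx", (4,), dtype="c16")

    # Each dataset reports its type as a plain NumPy dtype object.
    stored_dtypes = {name: f[name].dtype for name in f}
    first_record = records[0]

for name, dt in sorted(stored_dtypes.items()):
    print(name, dt)
```

The point of the sketch is only that no h5py-specific type objects are needed at this level; everything goes in and comes out as a NumPy dtype.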

What compression/processing filters are supported?

The following filters are available:

Filter | Function | Availability
GZIP | Standard compression | All HDF5 platforms
SHUFFLE | Increases compression ratio | All HDF5 platforms
FLETCHER32 | Error detection | All HDF5 platforms
SZIP | Faster compression, limited types | If HDF5 is compiled with szip support
LZF | Very fast compression, all types | Ships with h5py, standalone C source available

Datasets can be read, written and created with these filters from both the low- and high-level interfaces.

The LZF compression filter is a new component which provides extremely fast compression (many times faster than GZIP) at the cost of a lower compression ratio. This filter is part of h5py and does not need to be installed separately. Benchmark results comparing LZF to GZIP and SZIP are available at the main h5py site.

A standalone C implementation of the LZF filter is also available in the h5py tarball, for those who wish to include it in other HDF5 applications. Like h5py, it is released under the BSD license.
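A minimal sketch of requesting these filters from the high-level interface might look like the following. The dataset names and the GZIP level of 4 are arbitrary choices for the example, and SZIP is left out because it depends on how HDF5 was compiled.

```python
import numpy as np
import h5py

data = np.arange(100000, dtype="i4").reshape(100, 1000)

# In-memory file so the example leaves nothing on disk.
with h5py.File("filters_demo.h5", "w", driver="core", backing_store=False) as f:
    # GZIP combined with the SHUFFLE and FLETCHER32 filters.
    gz = f.create_dataset("gzip_data", data=data, compression="gzip",
                          compression_opts=4, shuffle=True, fletcher32=True)

    # LZF needs no options.
    lzf = f.create_dataset("lzf_data", data=data, compression="lzf")

    # The filter pipeline of each dataset can be inspected afterwards.
    filters_used = {
        "gzip_data": (gz.compression, gz.shuffle, gz.fletcher32),
        "lzf_data": (lzf.compression, lzf.shuffle, lzf.fletcher32),
    }

    # Filters are transparent on read: the data comes back unchanged.
    roundtrip_ok = bool((gz[...] == data).all() and (lzf[...] == data).all())
```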

What file drivers are available?

Starting with h5py 1.2, a number of different low-level HDF5 file drivers are made accessible via the high-level interface. The currently supported drivers are:

Driver | Purpose | Notes
sec2 | Standard optimized driver | Default driver on UNIX
stdio | Driver using functions from stdio.h | -
core | In-memory HDF5 file, optionally backed to disk | Loading an existing file requires HDF5 1.8
family | Multi-file driver | Split an HDF5 file into equal-sized chunks
windows | Windows-specific driver | Default driver on Windows
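For instance, selecting the "core" driver from the high-level interface could be sketched as below. The file name is a placeholder, and backing_store=False is used on the assumption that the in-memory contents should simply be discarded on close.

```python
import numpy as np
import h5py

# The driver is chosen with a keyword argument when the file is opened.
with h5py.File("scratch.h5", "w", driver="core", backing_store=False) as f:
    f["values"] = np.arange(10)

    # The file object reports which driver is in use.
    driver_name = f.driver
    total = int(f["values"][...].sum())

print(driver_name, total)
```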

What's the difference between h5py and PyTables?

The two projects have different design goals. PyTables presents a database-like approach to data storage, providing features like indexing and fast "in-kernel" queries on dataset contents. It also has a custom system to represent data types.

In contrast, h5py is an attempt to map the HDF5 feature set to NumPy as closely as possible. For example, the high-level type system uses NumPy dtype objects exclusively, and method and attribute naming follows Python and NumPy conventions for dictionary and array access (i.e. ".dtype" and ".shape" attributes for datasets, obj[name] indexing syntax for groups, etc).

H5py also provides access to nearly all of the HDF5 C API. The fundamental platform of h5py is a near-complete wrapping of the HDF5 API via Cython code. This layer is object-oriented with respect to HDF5 identifiers, supports reference counting, automatic translation between NumPy and HDF5 type objects, translation between the HDF5 error stack and Python exceptions, and more.

In fact, the "high-level" interface to h5py (i.e. NumPy-array-like objects; what you'll typically be using) is a thin native-Python layer which calls in to this API. This greatly simplifies the design of the complicated high-level interface, by relying on the "Pythonicity" of the C API wrapping.
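To make the two layers concrete, here is a small sketch that reads a dataset's shape once through the high-level interface and once through the low-level identifier object behind it. The group and dataset names are invented for the example.

```python
import numpy as np
import h5py

with h5py.File("layers_demo.h5", "w", driver="core", backing_store=False) as f:
    grp = f.create_group("experiment")
    dset = grp.create_dataset("trace", data=np.zeros((3, 4), dtype="f4"))

    # High-level interface: dictionary-style access, NumPy-style attributes.
    same = f["experiment"]["trace"]
    hl_shape, hl_dtype = same.shape, same.dtype

    # Low-level interface: dset.id wraps the HDF5 dataset identifier, and its
    # methods mirror the C API (here, roughly H5Dget_space followed by
    # H5Sget_simple_extent_dims).
    space = dset.id.get_space()
    ll_shape = space.get_simple_extent_dims()

print(hl_shape, hl_dtype, ll_shape)
```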

There's also a PyTables perspective on this question at the PyTables FAQ.

Parallel HDF5

Currently h5py does not support the parallel version of the HDF5 library, which is based on MPI-IO. However, this doesn't mean you can't use h5py in a multi-process program. The multiprocessing module (new in Python 2.6) provides an excellent way to get process-level parallelism in Python.

The only caveat is that you must be careful to modify any particular file from one process only.

There's a multiprocessing example in the h5py source distribution.
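One way to respect that caveat is to let worker processes do only the computation and have the parent process do all of the writing. The sketch below is not the example from the source distribution; the function names, the fake "simulation" and the in-memory file are all assumptions made for illustration.

```python
import multiprocessing

import numpy as np
import h5py


def simulate(seed):
    """Worker: pure computation, never opens the HDF5 file."""
    rng = np.random.RandomState(seed)
    return rng.standard_normal(1024)


def store(traces):
    """Parent only: the single process that modifies the file."""
    # In-memory file for the sketch; a real program would use a normal path.
    with h5py.File("traces.h5", "w", driver="core", backing_store=False) as f:
        dset = f.create_dataset("traces", data=np.array(traces))
        return dset.shape


def main(n_traces=4):
    pool = multiprocessing.Pool(processes=2)
    try:
        # Results are sent back to the parent as ordinary NumPy arrays.
        traces = pool.map(simulate, range(n_traces))
    finally:
        pool.close()
        pool.join()
    return store(traces)
```

When run as a script, main() should be called from under an if __name__ == "__main__": guard so that worker processes can import the module safely.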

Variable-length (VLEN) data

VLEN strings are supported as of h5py 1.2. However, generic (non-string) VLEN data cannot yet be processed by h5py. Please note that NULL bytes are not allowed in vlen strings.
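A short sketch of storing VLEN strings follows. It assumes the h5py.special_dtype helper is available in the installed release (very old releases used a different spelling), and it normalizes the values read back because some releases return bytes rather than str.

```python
import h5py

# Assumption: special_dtype(vlen=str) is how this h5py release spells
# "variable-length string".
str_type = h5py.special_dtype(vlen=str)

with h5py.File("vlen_demo.h5", "w", driver="core", backing_store=False) as f:
    names = f.create_dataset("names", (2,), dtype=str_type)

    # Elements may have different lengths; none may contain a NULL byte.
    names[0] = "short"
    names[1] = "a considerably longer string"

    raw = [names[i] for i in range(2)]

# Depending on the h5py version, values come back as str or as bytes.
texts = [r.decode("utf-8") if isinstance(r, bytes) else r for r in raw]
print(texts)
```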

Enumerated types

HDF5 enumerated types are supported as of h5py 1.2. As NumPy has no native enum type, they are treated on the Python side as integers.
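As an illustration, an enumerated dataset might be created and read back as follows. The state names are invented, and the sketch assumes the h5py.special_dtype and h5py.check_dtype helpers exist in the installed release.

```python
import h5py

# Hypothetical name-to-value mapping for the enum.
states = {"IDLE": 0, "RUNNING": 1, "DONE": 2}

# Assumption: special_dtype(enum=(base_type, mapping)) builds the enum dtype.
enum_type = h5py.special_dtype(enum=("i1", states))

with h5py.File("enum_demo.h5", "w", driver="core", backing_store=False) as f:
    status = f.create_dataset("status", (4,), dtype=enum_type)

    # On the Python side the data is written and read as plain integers.
    status[...] = [states["IDLE"], states["RUNNING"],
                   states["RUNNING"], states["DONE"]]
    values = status[...]

    # The name-to-value mapping can be recovered from the dataset's dtype.
    mapping = h5py.check_dtype(enum=status.dtype)

print(values.tolist(), mapping)
```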

NumPy object types

Storage of generic objects (NumPy dtype "O") is not implemented at the moment, although Python strings can be stored as native HDF5 vlen strings. In the meantime, consider pickling objects (to ASCII) and storing them as vlen strings.
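The pickling workaround can be sketched with the standard library alone. Note that this version wraps the pickle in base64 rather than relying on an ASCII pickle protocol; that is a substitution made here so the result is guaranteed to be pure ASCII with no NULL bytes. The helper names and the sample object are hypothetical.

```python
import base64
import pickle


def to_ascii(obj):
    """Serialize obj to an ASCII-only string suitable for a vlen string."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def from_ascii(text):
    """Inverse of to_ascii."""
    return pickle.loads(base64.b64decode(text.encode("ascii")))


settings = {"gain": 2.5, "channels": [0, 1, 3], "label": "run-7"}
encoded = to_ascii(settings)   # store this string in a vlen string dataset
restored = from_ascii(encoded)
```

As with any use of pickle, only strings from a trusted source should be loaded this way.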

Appending data to a dataset

The short answer is that h5py is NumPy-like, not database-like. Unlike the HDF5 packet-table interface (and PyTables), there is no concept of appending rows. Rather, you can expand the shape of the dataset to fit your needs. For example, if I have a series of time traces 1024 points long, I can create an extendable dataset to store them:

>>> dset = myfile.create_dataset("MyDataset", (10, 1024), maxshape=(None, 1024))
>>> dset.shape
(10, 1024)

The keyword argument "maxshape" tells HDF5 that the first dimension of the dataset can be expanded to any size, while the second dimension is limited to a maximum size of 1024. We create the dataset with room for an initial ensemble of 10 time traces. If we later want to store 10 more time traces, the dataset can be expanded along the first axis:

>>> dset.resize(20, axis=0)   # or dset.resize((20,1024))
>>> dset.shape
(20, 1024)

Each axis can be resized up to the maximum values in "maxshape". Things to note:

  1. Unlike NumPy arrays, when you resize a dataset the indices of existing data do not change; each axis grows or shrinks independently
  2. The dataset rank (number of dimensions) is fixed when it is created

Unicode

As of h5py 2.0.0, Unicode is supported for file names as well as for objects in the file. When object names are read, they are returned as Unicode by default.

However, HDF5 has no predefined datatype to represent fixed-width UTF-16 or UTF-32 (NumPy format) strings. Therefore, the NumPy 'U' datatype is not currently supported.

Citing h5py

H5py is made available under a permissive license and maintained for free in the spirit of academic cooperation. Authors who use h5py as a substantial part of their work are warmly encouraged to acknowledge its developer via formal citation. The preferred form is:

A. Collette, HDF5 for Python, 2008 (http://h5py.alfven.org)

Development

Building from Mercurial

We now use Mercurial to manage changes at Google Code. Here's how to build h5py from source:

  • Clone the project:
    $ hg clone http://h5py.googlecode.com/hg h5py
  • Generate the Cython files which talk to HDF5:
    $ cd h5py/h5py
    $ python api_gen.py
  • Build the project (this step also auto-compiles the .c files):
    $ cd ..
    $ python setup.py build [--hdf5=/path/to/hdf5]
  • Run the unit tests:
    $ python setup.py test
  • Report any failing tests to the mailing list (h5py at googlegroups), or by filing a bug report at h5py.googlecode.com.
