FAQ
Design of h5py and supported features
General design

Obtaining and installing h5py

Please follow the installation instructions in the h5py documentation.

What datatypes are supported?

Below is a complete list of types for which h5py supports reading, writing and creating datasets. Each type is mapped to a native NumPy type.

Fully supported types:
Unsupported types:
Please note that the "unsupported" HDF5 type objects can still be manipulated through the low-level interface, although reading/writing data of these types is not currently possible.

What compression/processing filters are supported?

The following filters are available:
Datasets can be read, written and created with these filters from both the low- and high-level interfaces.

The LZF compression filter is a new component which provides extremely fast compression (many times faster than GZIP) at the cost of a lower compression ratio. This filter is part of h5py and does not need to be installed separately. Benchmark results comparing LZF to GZIP and SZIP are available at the main h5py site.

A standalone C implementation of the LZF filter is also available in the h5py tarball, for those who wish to include it in other HDF5 applications. Like h5py, it is released under the BSD license.

What file drivers are available?

Starting with h5py 1.2, a number of different low-level HDF5 file drivers are made accessible via the high-level interface. The currently supported drivers are:
What's the difference between h5py and PyTables?

The two projects have different design goals. PyTables presents a database-like approach to data storage, providing features like indexing and fast "in-kernel" queries on dataset contents. It also has a custom system to represent data types.

In contrast, h5py is an attempt to map the HDF5 feature set to NumPy as closely as possible. For example, the high-level type system uses NumPy dtype objects exclusively, and method and attribute naming follows Python and NumPy conventions for dictionary and array access (e.g. ".dtype" and ".shape" attributes for datasets, obj[name] indexing syntax for groups, etc). H5py also provides access to nearly all of the HDF5 C API.

The fundamental platform of h5py is a near-complete wrapping of the HDF5 API via Cython code. This layer is object-oriented with respect to HDF5 identifiers, and supports reference counting, automatic translation between NumPy and HDF5 type objects, translation between the HDF5 error stack and Python exceptions, and more. In fact, the "high-level" interface to h5py (i.e. NumPy-array-like objects; what you'll typically be using) is a thin native-Python layer which calls into this API. This greatly simplifies the design of the complicated high-level interface, by relying on the "Pythonicity" of the C API wrapping.

There's also a PyTables perspective on this question at the PyTables FAQ.

Parallel HDF5

Currently h5py does not support the parallel version of the HDF5 library, which is based on MPI-IO. However, this doesn't mean you can't use h5py in a multi-process program. The multiprocessing module (new in Python 2.6) provides an excellent way to get process-level parallelism in Python. The only caveat is that you must be careful to modify any particular file from one process only. There's a multiprocessing example in the h5py source distribution.

Variable-length (VLEN) data

VLEN strings are supported as of h5py 1.2.
However, generic (non-string) VLEN data cannot yet be processed by h5py. Please note that NULL bytes are not allowed in vlen strings.

Enumerated types

HDF5 enumerated types are supported as of h5py 1.2. As NumPy has no native enum type, they are treated on the Python side as integers.

NumPy object types

Storage of generic objects (NumPy dtype "O") is not implemented at the moment, although Python strings can be stored as native HDF5 vlen strings. In the meantime, consider pickling objects (to ASCII) and storing them as vlen strings.

Appending data to a dataset

The short answer is that h5py is NumPy-like, not database-like. Unlike the HDF5 packet-table interface (and PyTables), there is no concept of appending rows. Rather, you can expand the shape of the dataset to fit your needs. For example, if I have a series of time traces 1024 points long, I can create an extendable dataset to store them:

>>> dset = myfile.create_dataset("MyDataset", (10, 1024), maxshape=(None, 1024))
>>> dset.shape
(10, 1024)

The keyword argument "maxshape" tells HDF5 that the first dimension of the dataset can be expanded to any size, while the second dimension is limited to a maximum size of 1024. We create the dataset with room for an initial ensemble of 10 time traces. If we later want to store 10 more time traces, the dataset can be expanded along the first axis:

>>> dset.resize(20, axis=0)   # or dset.resize((20, 1024))
>>> dset.shape
(20, 1024)

Each axis can be resized up to the maximum values in "maxshape". Things to note:
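
For reference, the create-then-resize steps described in this answer can be combined into one self-contained script. This is only a minimal sketch, assuming a current h5py and NumPy on Python 3; the file name traces.h5, the helper store_traces and the random trace data are illustrative placeholders and not part of h5py.

```python
import os
import tempfile

import h5py
import numpy as np

TRACE_LEN = 1024  # points per time trace, as in the example above


def store_traces(path):
    """Create an extendable dataset, grow it once, return (shape, maxshape).

    Hypothetical helper written for this sketch only.
    """
    with h5py.File(path, "w") as f:
        # maxshape=(None, TRACE_LEN): axis 0 is unlimited, axis 1 is capped.
        dset = f.create_dataset(
            "MyDataset", (10, TRACE_LEN), maxshape=(None, TRACE_LEN), dtype="f8"
        )
        # Placeholder data standing in for the first 10 time traces.
        dset[:10] = np.random.random((10, TRACE_LEN))

        # Make room for 10 more traces, then write them into the new rows.
        dset.resize(20, axis=0)
        dset[10:20] = np.random.random((10, TRACE_LEN))

        # Read the shapes while the file is still open.
        return dset.shape, dset.maxshape


with tempfile.TemporaryDirectory() as tmp:
    shape, maxshape = store_traces(os.path.join(tmp, "traces.h5"))
    print(shape, maxshape)
```

The new rows are written with ordinary slice assignment after the resize, which is the NumPy-like counterpart of "appending" in a database-style interface.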
Unicode

As of h5py 2.0.0, Unicode is supported for file names as well as for objects in the file. When object names are read, they are returned as Unicode by default. However, HDF5 has no predefined datatype to represent fixed-width UTF-16 or UTF-32 (NumPy format) strings. Therefore, the NumPy 'U' datatype is not currently supported.

Citing h5py

H5py is made available under a permissive license and maintained for free in the spirit of academic cooperation. Authors who use h5py as a substantial part of their work are warmly encouraged to acknowledge its developer via formal citation. The following information is preferred:
e.g.: A. Collette, HDF5 for Python, 2008 (http://h5py.alfven.org)

Development

Building from Mercurial

We now use Mercurial to manage changes at Google Code. Here's how to build h5py from source:
$ hg clone http://h5py.googlecode.com/hg h5py
$ cd h5py/h5py
$ python api_gen.py
$ cd ..
$ python setup.py build [--hdf5=/path/to/hdf5]
$ python setup.py test