|
GuideToCompression
How to use chunking and compression and still get good performance
Transparent compression and error detection are some of the most useful components in the HDF5 library. However, a few pitfalls await the casual user of these features. This document is a quick overview of how chunking and compression work in HDF5, and how you can make the best use of them from h5py. If your compressed datasets are unexpectedly slow, you should read this.

The mysteries of chunked data

How compression works

Uncompressed data in HDF5 is (typically) stored in contiguous space; a large block of the file is reserved and elements are written one after another. It's not feasible to compress data stored this way; for one thing, it would make random access impossible, or at least very slow.

HDF5 solves this by adding another storage strategy: chunked storage. Instead of storing data in one continuous lump, it is divided into discrete "chunks" which are then indexed in the file using a B-tree. When a slice is read from the dataset, the relevant chunks are read from the file and the data is extracted from them.

Operations in HDF5 such as compression and error detection are implemented as filters which operate on chunks of data. For example, a dataset compressed with gzip and checksummed using fletcher32 is passed through a filter pipeline which first calls a function to compress the data, and then another to compute the checksum. In this way a number of different transparent "effects" can be applied to a dataset.

Choosing chunk dimensions

The size of the chunks is user-specified and fixed when the dataset is created. However, an additional choice must be made. If the dataset is 1-D, it can simply be sliced up into equally-sized parts. But if the dataset is 2-D or higher, the shape of the chunks must also be decided upon.

Suppose we want to store an array of twenty 1000-point time traces. The dimensions of the dataset are (20, 1000). We decide we want to use a chunk size of 1000 elements. Which of the following chunk shapes is best?
A) (1, 1000)
B) (20, 50)

To illustrate, consider two different applications reading from this dataset. The first reads out the leading 200 points of the first time trace. The second reads the first 10 points of all twenty records. Both applications read a total of 200 points. Which is likely to be faster?

When using chunk shape A, the data requested by the first application lies within a single chunk. HDF5 reads (and decompresses) 1000 elements. The second application needs to access part of the data for each trace; it has to read all 20 chunks and decompress fully 20,000 elements to get the 200 it wants.

But when using chunk shape B, the tables are turned. In order to read the first 200 elements of the first trace, the first application has to read 4 chunks (4000 elements), since each stores only 50 elements along the time axis. By contrast, the second application can read the first 10 elements of all the traces from a single chunk.

So the surprising answer is that the correct chunk shape to use depends on the expected access pattern. There is no such thing as "the best" chunk shape, only the best for your intended application. In particular, chunked storage defeats C-style intuition about which elements are close together and which are far apart.

How h5py handles chunk shape

Chunked storage is required for compressed data. The C interface to HDF5 demands that you explicitly specify a chunk shape when you create such a dataset. In h5py, it's likewise recommended that you explicitly specify the chunk shape via the keyword argument chunks=(shape tuple). However, in the spirit of ease of use, h5py will attempt to guess a chunk shape for you if you don't specify one. Since h5py has no way to know how you intend to access the data, it takes a "middle of the road" approach and tries to make the chunks as square as possible, relative to the size of each dataset dimension.
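The chunk counts for the two applications and the two candidate shapes can be checked with a few lines of plain Python. This is only a sketch of the arithmetic, not anything HDF5 or h5py does internally; the helper names `chunks_touched` and `elements_decompressed` are made up for illustration, and selections are assumed to be simple (start, stop) ranges along each axis.

```python
def chunks_touched(selection, chunk_shape):
    """Count the chunks overlapped by a rectangular selection.

    selection   -- one (start, stop) pair per axis, stop exclusive
    chunk_shape -- chunk extent along each axis
    """
    count = 1
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size          # index of first chunk hit on this axis
        last = (stop - 1) // size      # index of last chunk hit on this axis
        count *= last - first + 1
    return count


def elements_decompressed(selection, chunk_shape):
    """Elements read and decompressed, since whole chunks must be processed."""
    per_chunk = 1
    for size in chunk_shape:
        per_chunk *= size
    return chunks_touched(selection, chunk_shape) * per_chunk


# The two access patterns on the (20, 1000) dataset of time traces.
first_app = ((0, 1), (0, 200))    # leading 200 points of the first trace
second_app = ((0, 20), (0, 10))   # first 10 points of all twenty traces

for label, shape in [("A", (1, 1000)), ("B", (20, 50))]:
    print(label, shape,
          elements_decompressed(first_app, shape),
          elements_decompressed(second_app, shape))
```

Once a shape has been chosen, it is handed to h5py through the chunks keyword described above, for example chunks=(1, 1000) together with compression='gzip' in a create_dataset call.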
In this example, if asked to provide a shape appropriate for a 1000-element chunk on a (20, 1000) dataset, it might guess (4, 250). The two applications described would end up decompressing 1000 and 5000 elements respectively; not perfect, but not the worst-case scenario of 20000.

Manually picking a chunk shape

When manually choosing a chunk shape for your dataset, try to stick to the following guidelines:
Real-world consequences of mis-chunking

If a mismatched chunk shape is used, the effect on I/O performance can be dramatic. For example, consider an application writing image files to disk. The shape of the dataset is (1000, 1024, 1280), 4 bytes per pixel, for a grand total of about 4.9 GB of data. The default chunk shape guessed by h5py is (32, 32, 40). What happens when the application tries to write a batch of images to file by repeatedly calling dset[idx, :, :] = imagedata?

Each image frame requires (1024/32)*(1280/40) = 1024 chunks to be processed. Each chunk is 32*32*40*4 bytes = 160 KB, for a total of about 160 MB of data, which has to be shuffled around to write a single image! Of course this amount of data can't fit in cache, so every time an image is written, 160 MB of data is read from disk, decompressed, modified, recompressed, and written back. In other words, since the chunking dimension on the first axis is 32, we process every single chunk 32 times.

Now imagine a chunk shape of (1, 128, 320). Each frame now spans (1024/128)*(1280/320) = 32 chunks, so the amount of data processed per image is just that of a single image (5 MB). Each chunk is processed exactly once; in fact, it is only written, not read, in contrast to the previous case in which each chunk is written 32 times and read 31 times. Assuming compression and decompression take very roughly equal amounts of time, this is more than a 60x speed-up.

More information

The HDF5 User's Guide has an excellent discussion of chunking and compression, although from a C perspective. |
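The figures in the image-writing example above can be reproduced with a short calculation. The sketch below is plain arithmetic rather than an h5py call; the function name `frame_write_cost` is hypothetical, and it assumes each write covers one full frame, as in dset[idx, :, :] = imagedata.

```python
from math import prod

DATASET_SHAPE = (1000, 1024, 1280)   # (frames, height, width) from the example
ITEMSIZE = 4                         # bytes per pixel


def ceil_div(a, b):
    """Integer ceiling division, so partial edge chunks are counted."""
    return -(-a // b)


def frame_write_cost(chunk_shape):
    """Return (chunks touched, bytes touched) when writing one full frame."""
    _, height, width = DATASET_SHAPE
    # A single frame lies in one chunk along axis 0, so only the
    # height and width axes contribute to the chunk count.
    n_chunks = ceil_div(height, chunk_shape[1]) * ceil_div(width, chunk_shape[2])
    chunk_bytes = prod(chunk_shape) * ITEMSIZE
    return n_chunks, n_chunks * chunk_bytes


guessed = frame_write_cost((32, 32, 40))    # shape guessed in the example
tuned = frame_write_cost((1, 128, 320))     # one frame deep

for name, (n_chunks, n_bytes) in [("guessed", guessed), ("tuned", tuned)]:
    print(f"{name}: {n_chunks} chunks, {n_bytes / 2**20:.0f} MiB per frame")
```

Both shapes give chunks of the same size in bytes; only the depth along the first axis changes how much data a single-frame write has to touch.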