MnistMapReader
Parses the MNIST dataset into subsets, loaded in NumPy arrays
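Introduction

The reference neural net task for the map-reduce architecture is a handwritten digit classifier. Input is taken from the MNIST database and buffered into a NumPy array. A DisCo map reader function returns chunks of the image array as input for each machine's map stage. This is accomplished in three stages:

1. MNIST File Reader: read and check the MNIST file header
2. NumPy File Loader: load the image data into a NumPy array
3. DisCo Map Reader/Generator: split the images into batches for the map tasks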
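Testing on the entire dataset can be performed by omitting the third step, but this keeps the whole dataset in main memory at once, which may be more than a single machine can handle. A description of each stage follows.

MNIST File Reader

    import struct

    def read_mnist_images(file):
        # Open in binary mode; the header and pixel data are raw bytes
        f = open(file, 'rb')
        # Header: four big-endian 4-byte integers
        magic, num_imgs, num_rows, num_cols = struct.unpack('>iiii', f.read(4*4))
        assert magic == 2051, 'MNIST checksum failed'
        ...

The MNIST image file header begins with a hard-coded magic number (2051), used here as a simple checksum, followed by a 4-byte integer for each of:

1. the number of images
2. the number of rows per image
3. the number of columns per image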
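The header layout described above can be exercised without the real data file. The following is a minimal sketch, assuming a synthetic 16-byte header built in memory; `make_header` and `read_header` are hypothetical helper names introduced only for illustration and are not part of the original reader.

```python
import io
import struct

MNIST_IMAGE_MAGIC = 2051  # magic number checked by the reader above


def make_header(num_imgs, num_rows, num_cols):
    """Pack a synthetic MNIST image-file header (hypothetical helper)."""
    return struct.pack('>iiii', MNIST_IMAGE_MAGIC, num_imgs, num_rows, num_cols)


def read_header(f):
    """Read four big-endian 4-byte integers from a binary stream."""
    magic, num_imgs, num_rows, num_cols = struct.unpack('>iiii', f.read(4 * 4))
    assert magic == MNIST_IMAGE_MAGIC, 'MNIST checksum failed'
    return num_imgs, num_rows, num_cols


# Synthetic stand-in for the first 16 bytes of the training image file.
stream = io.BytesIO(make_header(60000, 28, 28))
header = read_header(stream)
print(header)  # (60000, 28, 28)
```

The `'>iiii'` format string means big-endian (`>`) followed by four signed 4-byte integers, which is why exactly 16 bytes are read.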
Standard values for the MNIST training set are 60,000 images, each 28x28 pixels.

NumPy File Loader

    def read_mnist_images(file):
        ...
        # Axes: image, row, column
        shape = (num_imgs, num_rows, num_cols)
        # Requires "import numpy"; reads the remaining bytes after the header
        imgs = numpy.fromfile(file=f, dtype=numpy.uint8).reshape(shape)
        f.close()
        return imgs

The numpy.fromfile function loads the entire dataset into a NumPy array with three axes: the image axis, the row axis, and the column axis.

DisCo Map Reader/Generator

    def mnist_img_reader(fd, size, fname):
        # Disco function must be pure (i.e. no globals, no imports)
        if fd.tell() == 0:
            # Read the header
            magic, num_imgs, rows, cols = struct.unpack('>iiii', fd.read(4*4))
            # Basic checksum
            assert magic == 2051, 'MNIST checksum failed'
        # rows and cols are only defined when the header was read above
        pixels_per_img = cols*rows
        #imgs_in_batch = size / pixels_per_img
        # Hard-coded batch size: 60,000 training images split into 50 batches
        imgs_in_batch = 60000 // 50
        #shape = (imgs_in_batch, num_cols, num_rows)
        shape = (imgs_in_batch, cols*rows)
        read_bytes = imgs_in_batch*pixels_per_img
        # Read the imgs_in_batch*rows*cols pixels from file
        # incrementing the array index for each image vector
        imgs = numpy.fromfile(file=fd, dtype=numpy.uint8, count=read_bytes).reshape(shape).astype(numpy.float32)
        if len(imgs) > 0:
            # The generator has something to return
            yield (imgs, len(imgs), rows, cols)

The DisCo map reader function performs similar steps to stages one and two described above, but produces a batch of images in contiguous form (i.e. one axis for the images and one axis for all the pixels in each image). Each time the function is invoked, it consumes more of the input file, until every image has been allocated in a subset to at least one map task.
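The batching behaviour described above can be tried outside DisCo. The following is a minimal sketch, not the map reader itself: `iter_mnist_batches` is a hypothetical helper that reads the header once and then loops over the file, and it swaps `numpy.fromfile` for `numpy.frombuffer` over `fd.read(...)` so that it also works on in-memory streams. The tiny 5-image file is synthetic.

```python
import io
import struct

import numpy


def iter_mnist_batches(fd, imgs_per_batch):
    """Yield (imgs, count, rows, cols) batches from an MNIST image stream.

    Hypothetical helper for illustration; assumes fd is positioned at the
    start of the file and contains all the pixels promised by the header.
    """
    magic, num_imgs, rows, cols = struct.unpack('>iiii', fd.read(4 * 4))
    assert magic == 2051, 'MNIST checksum failed'
    pixels_per_img = rows * cols
    remaining = num_imgs
    while remaining > 0:
        # The last batch may be smaller than imgs_per_batch.
        count = min(imgs_per_batch, remaining)
        raw = fd.read(count * pixels_per_img)
        # Contiguous form: one row per image, one column per pixel.
        imgs = numpy.frombuffer(raw, dtype=numpy.uint8).reshape(count, pixels_per_img)
        yield imgs.astype(numpy.float32), count, rows, cols
        remaining -= count


# Synthetic file: 5 images of 2x3 pixels with pixel values 0..29.
pixels = bytes(range(5 * 2 * 3))
fd = io.BytesIO(struct.pack('>iiii', 2051, 5, 2, 3) + pixels)
batches = list(iter_mnist_batches(fd, imgs_per_batch=2))
print([b[0].shape for b in batches])  # [(2, 6), (2, 6), (1, 6)]
```

A batch can be turned back into the three-axis layout of stage two with `imgs.reshape(count, rows, cols)`, since the pixels of each image are stored row by row.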