MnistMapReader  
Parses the MNIST dataset into subsets, loaded in NumPy arrays
Updated Feb 4, 2010 by paulrei...@gmail.com

Introduction

The reference neural net task for the map-reduce architecture is a handwritten digit classifier. Input is taken from the MNIST database and buffered into a NumPy array. A DisCo map reader function returns chunks of the image array as input for the map stage on each machine. This is accomplished in three stages:

  1. Read the MNIST binary file header to determine the size of each image and the number of images.
  2. Supply a shape of (num_imgs, rows, cols) to the NumPy file loader.
  3. Use a generator to read a portion of the MNIST file at a time, triggering the NumPy file loader on each portion.

Testing on the entire dataset can be performed by omitting the third step, but most modern hardware cannot comfortably hold such a large dataset entirely in main memory.

A description of each stage follows.

MNIST File Reader

import struct
import numpy

def read_mnist_images(file):
  # Open in binary mode; the header is four big-endian 4-byte integers
  f = open(file, 'rb')
  magic, num_imgs, num_rows, num_cols = struct.unpack('>iiii', f.read(4*4))
  assert magic == 2051, 'MNIST checksum failed'
  ...

The MNIST file header begins with a hard-coded magic number, used here as a simple checksum, followed by a 4-byte big-endian integer for each of:

  1. the number of images in the dataset,
  2. the number of rows of pixels in each image, and
  3. the number of columns of pixels in each image.

Standard values for the MNIST training set are 60,000 images, each 28x28 pixels.
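
As a quick illustration of the header layout, the snippet below packs those standard values into a synthetic 16-byte header and unpacks them with the same '>iiii' format string; no MNIST file is needed, since the header is built in memory.

import struct

# Synthetic header: magic number 2051, 60000 images, 28 rows, 28 columns,
# each stored as a big-endian 4-byte integer
header = struct.pack('>iiii', 2051, 60000, 28, 28)

magic, num_imgs, num_rows, num_cols = struct.unpack('>iiii', header)
assert magic == 2051, 'MNIST checksum failed'
print num_imgs, num_rows, num_cols   # prints: 60000 28 28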

NumPy File Loader

def read_mnist_images(file):
  ...
  # Shape the remaining pixel stream as (images, rows, columns)
  shape = (num_imgs, num_rows, num_cols)

  # One unsigned byte per pixel; read everything after the header
  imgs = numpy.fromfile(file=f, dtype=numpy.uint8).reshape(shape)
  f.close()

  return imgs

The numpy.fromfile call loads the entire dataset into a NumPy array with three axes: the image axis, the row axis, and the column axis.
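
As a usage sketch (the filename train-images-idx3-ubyte is an assumption about where the uncompressed training set lives; adjust the path as needed), the loader can be checked by inspecting the returned array:

imgs = read_mnist_images('train-images-idx3-ubyte')
print imgs.shape   # (60000, 28, 28)
print imgs.dtype   # uint8
print imgs[0]      # pixel intensities (0-255) of the first image, row by row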

DisCo Map Reader/Generator

def mnist_img_reader(fd, size, fname):
  # DisCo functions must be pure (i.e. no globals and no imports);
  # struct and numpy are assumed to be supplied through the job's
  # required_modules setting.
  if fd.tell() == 0:
    # Read the header
    magic, num_imgs, rows, cols = struct.unpack('>iiii', fd.read(4*4))
    # Basic checksum
    assert magic == 2051, 'MNIST checksum failed'

    pixels_per_img = cols*rows

  #imgs_in_batch = size / pixels_per_img
  imgs_in_batch = 60000/50
  #shape = (imgs_in_batch, rows, cols)
  shape = (imgs_in_batch, cols*rows)
  read_bytes = imgs_in_batch*pixels_per_img

  # Read imgs_in_batch*rows*cols pixels from the file as one flattened
  # row vector per image, promoted to float32 for later arithmetic
  imgs = numpy.fromfile(file=fd, dtype=numpy.uint8, count=read_bytes).reshape(shape).astype(numpy.float32)
  if len(imgs) > 0:
    # The generator has something to return
    yield (imgs, len(imgs), rows, cols)

The DisCo map reader function performs steps similar to stages one and two described above, but produces a batch of images in contiguous form (i.e. one axis for the images and one axis for all the pixels in each image). Each time the reader is invoked it consumes more of the input file, until every image has been assigned to a subset for at least one map task.
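
Outside of DisCo the reader can be exercised by hand with an ordinary file object. The sketch below assumes mnist_img_reader is defined in the same script, that struct and numpy are imported normally (there is no DisCo sandbox here), and that the uncompressed training file is named train-images-idx3-ubyte; those names are assumptions for illustration.

import os
import struct
import numpy

fname = 'train-images-idx3-ubyte'
size = os.path.getsize(fname)
fd = open(fname, 'rb')

# Drive the generator directly; each yielded batch is a
# (num_images_in_batch, rows*cols) float32 array of flattened images
for imgs, n, rows, cols in mnist_img_reader(fd, size, fname):
  print n, imgs.shape, imgs.dtype

fd.close()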
