Benchmarks  
Comparing Python vs. C vs. CUDA for the sigmoid kernel
Updated Feb 4, 2010 by paulrei...@gmail.com

Introduction

To gauge the feasibility of using the GPU on a fairly large, rapidly changing dataset, a simple set of benchmark functions was created for three programming-language approaches: pure Python, C (via NumPy), and CUDA. Each test evaluates the sigmoid function on every element of the MNIST dataset.
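For reference, "the sigmoid function" here means the parameterised form used in all of the code samples below:

    sigmoid(x) = 1 / (1 + exp((x - mu) * sigma))

where mu and sigma are fixed shift and scale parameters (self.mu and self.sigma in the code).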

Languages

Python

The following code uses a Python list comprehension to evaluate the sigmoid function for each element of the original images array and collect the results into a new list in memory.

    import math
    import time

    start = time.time()

    # Parameterised sigmoid, called once per pixel at the Python level.
    sigmoid = lambda x, mu, sigma: 1 / (1 + math.exp((x - mu) * sigma))

    # self.imgs.flat iterates over every pixel in the MNIST image array.
    ans = [sigmoid(x, self.mu, self.sigma) for x in self.imgs.flat]
    secs = time.time() - start
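Outside the benchmark class, a minimal standalone version of this test might look as follows (hypothetical imgs, mu, and sigma stand in for the self. attributes above; the array here is random data rather than MNIST):

    import math
    import time

    import numpy

    # Hypothetical stand-ins for self.imgs, self.mu and self.sigma.
    imgs = numpy.random.randint(0, 256, size=(1000, 28, 28)).astype(numpy.uint8)
    mu, sigma = 128.0, 0.05

    sigmoid = lambda x: 1 / (1 + math.exp((x - mu) * sigma))

    start = time.time()
    ans = [sigmoid(x) for x in imgs.flat]   # one interpreted call per pixel
    secs = time.time() - start
    print "pure Python: %.3f s for %d pixels" % (secs, imgs.size)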

Timing results for the reference system[1] are found near the bottom of this page.

C (via NumPy array interface)

NumPy's n-dimensional array stores its data in a typed memory buffer exposed through the Python interpreter's low-level buffer interface. Element-wise operations run as compiled C loops over this strided buffer, giving performance comparable to optimized C programs for datasets of any size; a revised form of this buffer interface has been adopted into the core of Python 3.0.

A key feature is the ability to broadcast an operation over every element of a very large array. The following code broadcasts the sigmoid operation over the MNIST dataset:

    import time
    import numpy

    start = time.time()
    # mu and sigma broadcast over every pixel; the loops run in compiled C.
    the_exp = (self.imgs - self.mu) * self.sigma
    ans = 1 / (1 + numpy.exp(the_exp))
    secs = time.time() - start
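As a quick illustration of the broadcasting relied on above, the scalar parameters are applied to every element of the array in a single compiled loop (a minimal sketch with made-up values, not the benchmark code):

    import numpy

    a = numpy.array([[0.0, 64.0], [128.0, 255.0]], dtype=numpy.float32)
    mu, sigma = 128.0, 0.05

    # (a - mu) and the multiplication by sigma broadcast the scalars across
    # every element; numpy.exp then runs over the whole array in C.
    out = 1 / (1 + numpy.exp((a - mu) * sigma))
    print out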

CUDA (GPU implementation)

The sigmoid kernel discussed in Kernels is executed for each element in the MNIST training database. Because GPU memory is limited on the reference machine[1], the dataset is loaded into GPU global memory in subsets; the default number of splits is 16. The memory-transfer operations are by far the slowest part of the run, so this roughly models the practical case of repeatedly loading and unloading a large dataset.
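For context, an element-wise kernel along the following lines could implement the sigmoid (a minimal PyCUDA sketch; the project's actual kernel is the one defined on the Kernels page and may differ in detail):

    import pycuda.autoinit                      # creates a CUDA context
    from pycuda.elementwise import ElementwiseKernel

    # Hypothetical element-wise sigmoid kernel, matching the call signature
    # sigmoid(gpu_arr, mu, sigma, gpu_out_arr) used in the driver code below.
    sigmoid = ElementwiseKernel(
        "float *in, float mu, float sigma, float *out",
        "out[i] = 1.0f / (1.0f + expf((in[i] - mu) * sigma))",
        "sigmoid_kernel")

The benchmark driver then streams the dataset through the kernel in splits: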

    import numpy
    import pycuda.driver as cuda
    import pycuda.gpuarray

    # Assumes a CUDA context is already active (e.g. via pycuda.autoinit).
    # CUDA events time the transfer + kernel loop on the GPU itself.
    start = cuda.Event()
    end = cuda.Event()

    # Device buffers sized for one subset (1/splits of the dataset).
    gpu_arr = pycuda.gpuarray.GPUArray((len(self.imgs)/self.splits, 28, 28), numpy.float32)
    gpu_out_arr = pycuda.gpuarray.empty_like(gpu_arr)

    start.record()
    for subset in range(0, self.splits):
      # Host-to-device copy of the next subset, converted to float32.
      gpu_arr.set(self.imgs[
        (subset*len(self.imgs)/self.splits):((subset+1)*len(self.imgs)/self.splits)
        ].astype(numpy.float32))

      # Element-wise sigmoid kernel (see the Kernels page).
      sigmoid(gpu_arr, self.mu, self.sigma, gpu_out_arr)

    end.record()
    end.synchronize()

    secs = start.time_till(end) * 1e-3   # time_till() reports milliseconds
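Note that the loop above times only the host-to-device uploads and kernel launches; the results are left in GPU global memory and overwritten on each iteration. If a transformed subset is needed on the host, it can be copied back with the standard PyCUDA call (adding a device-to-host transfer not included in the timing above):

    host_out = gpu_out_arr.get()   # device-to-host copy; returns a numpy array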

Results

Language       Entire Dataset (s)   Subset in Memory (s)
CUDA (GPU)     0.4849               0.002626
NumPy (CPU)    1.963                0.127
Python (CPU)   29.97                814.05

The table above shows timings (in seconds) for evaluating the sigmoid function once for every pixel in the entire MNIST dataset (~47 million pixels). The CPU implementations (NumPy and pure Python) operate on the integer pixel data, while the GPU implementation uses single-precision floating-point numbers. The GPU's whole-dataset time is also dominated by the relatively slow transfers from host (CPU) memory into GPU global memory. The second column of times performs the same operation on a subset of data already loaded into memory.
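For scale, the whole-dataset times correspond to roughly 47e6 / 0.4849 ≈ 97 million sigmoid evaluations per second on the GPU, 47e6 / 1.963 ≈ 24 million per second with NumPy, and 47e6 / 29.97 ≈ 1.6 million per second in pure Python.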

[1] Reference Machine

The machine used to compute the above benchmarks is equipped as follows:

  1. 2.4 GHz MacBook Pro Core 2 Duo
  2. 4 GB RAM
  3. NVIDIA GeForce 8600M GT (256 MB, PCIe)
  4. Python 2.6
  5. NumPy 1.3
  6. CUDA 2.1

Comment by tomfitzr...@gmail.com, Nov 10, 2009

Were the vectorized NumPy functions used?

Comment by tomfitzr...@gmail.com, Nov 10, 2009

Sorry, missed the C (via NumPy) reference above.
