Benchmarks
Comparing Python vs. C vs. CUDA for the sigmoid kernel
Introduction

In order to verify the feasibility of using the GPU for a fairly substantial and rapidly changing dataset, a simple set of benchmark functions was created for three main programming language families. Each test evaluates every element of the MNIST dataset with the sigmoid function.

Languages

Python

The following code uses a Python list comprehension to evaluate the sigmoid function for each element of the original images array, appending the result for each element to a list constructed in memory.

start = time.time()
sigmoid = lambda x, mu, sigma: 1/(1+math.exp((x-mu)*sigma))
ans = [sigmoid(x, self.mu, self.sigma) for x in self.imgs.flat]
secs = time.time() - start

Timing results for the reference system[1] are found near the bottom of this page.

C (via NumPy array interface)

The NumPy n-dimensional array interface uses an internal buffer exposed through the Python interpreter's low-level buffer interface. Its striding approach yields performance comparable to optimized C programs for data sets of any size, and the underlying buffer protocol is integrated into the core of Python 3.0. A key feature is the ability to broadcast an operation to every element of a very large array. The following code broadcasts the sigmoid operation over the MNIST dataset:

start = time.time()
the_exp = (self.imgs - self.mu) * self.sigma
ans = 1/(1+numpy.exp(the_exp))
secs = time.time() - start

CUDA (GPU implementation)

The sigmoid kernel discussed in Kernels is executed for each element of the MNIST training database (a sketch of such a kernel follows the timing code below). Because the GPU is memory bound on the reference machine[1], the dataset is loaded into GPU global memory in subsets; the default number of splits is 16. The memory transfer operations are by far the slowest part of this benchmark, so it roughly models the practical case of loading and unloading a large dataset.

start = cuda.Event()
end = cuda.Event()
chunk = len(self.imgs) // self.splits        # images per subset (integer division)
gpu_arr = pycuda.gpuarray.GPUArray((chunk, 28, 28), numpy.float32)
gpu_out_arr = pycuda.gpuarray.empty_like(gpu_arr)
start.record()
for subset in range(self.splits):
    # Copy the next subset of images into GPU global memory, then run the kernel on it.
    gpu_arr.set(self.imgs[subset*chunk:(subset+1)*chunk].astype(numpy.float32))
    sigmoid(gpu_arr, self.mu, self.sigma, gpu_out_arr)
end.record()
end.synchronize()
secs = start.time_till(end)*1e-3   # time_till() returns milliseconds
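
The kernel itself is described on the Kernels page. For reference, a minimal sketch of an equivalent elementwise kernel, written here with PyCUDA's ElementwiseKernel helper, could look like the following; the kernel name and the use of ElementwiseKernel are illustrative assumptions rather than the exact implementation that produced the timings. The argument order matches the sigmoid(...) call in the loop above.

from pycuda.elementwise import ElementwiseKernel

# Illustrative elementwise sigmoid: out[i] = 1 / (1 + exp((in[i] - mu) * sigma)),
# the same formula used by the Python and NumPy versions above.
sigmoid = ElementwiseKernel(
    "float *in_arr, float mu, float sigma, float *out_arr",
    "out_arr[i] = 1.0f / (1.0f + expf((in_arr[i] - mu) * sigma))",
    "sigmoid_kernel")

With a definition like this, the call in the timing loop applies the operation to every element of gpu_arr in a single kernel launch, writing the results into gpu_out_arr.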
Results

The timings above show, in seconds, the cost of executing the sigmoid function once for each pixel in the entire MNIST dataset (~47 million pixels). The CPU-bound implementations (C, Python) operate on the integer pixel values, while the GPU implementation uses single-precision floating point numbers. The GPU time is also bound by the relatively slow transfers from host (CPU) memory into GPU global memory. The second set of execution times performs the same operation on a subset of data already loaded into memory.
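
For the GPU case, a measurement that excludes the host-to-device transfer can be taken by timing the kernel on data that is already resident in GPU global memory. The following standalone sketch illustrates that idea; it assumes the ElementwiseKernel definition above, uses random placeholder data and arbitrary mu/sigma values, and is not the exact code behind the published numbers.

import numpy
import pycuda.autoinit                       # creates a CUDA context
import pycuda.driver as cuda
import pycuda.gpuarray

# Placeholder data standing in for one subset of the MNIST images.
host_imgs = numpy.random.rand(60000 // 16, 28, 28).astype(numpy.float32)
gpu_in = pycuda.gpuarray.to_gpu(host_imgs)   # transfer happens before timing starts
gpu_out = pycuda.gpuarray.empty_like(gpu_in)

start, end = cuda.Event(), cuda.Event()
start.record()
sigmoid(gpu_in, numpy.float32(0.5), numpy.float32(1.0), gpu_out)   # kernel only, no transfer
end.record()
end.synchronize()
kernel_secs = start.time_till(end) * 1e-3    # time_till() returns milliseconds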
[1] Reference Machine

The machine used to compute the above benchmarks is equipped as follows: