Introduction

Under the SIMD (single-instruction, multiple-data) paradigm, multiple threads execute the same kernel, which should be designed to avoid conflicting, simultaneous data access. A simple kernel performs a single operation on each element of a vector (e.g. a scalar multiplication), but more advanced kernels can use branching logic and operations of any arity. Two essential kernels are described here, the first using a PyCuda ElementwiseKernel and the second using the low-level CUDA driver language.

Sigmoid Function

The sigmoid is a form of the logistic function, with the useful property of mapping an input in the range (-Inf, Inf) to the range (0, 1). In addition to its use in statistics, it is a simple mathematical model for a neuron activation function. This post-processing step is applied after each neuron has processed a weighted input stimulus.
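Before moving to the GPU, a minimal CPU sketch of the same logistic form is given below for reference; it assumes NumPy, and the mu and sigma parameters simply mirror the arguments of the GPU kernel in the next section.

import numpy as np

def sigmoid_cpu(x, mu=0.0, sigma=1.0):
    # Logistic activation: maps inputs in (-Inf, Inf) to outputs in (0, 1).
    # mu shifts the midpoint of the curve; sigma scales its steepness.
    return 1.0 / (1.0 + np.exp(-(x - mu) * sigma))

# Example: activations for a few weighted input sums.
print(sigmoid_cpu(np.array([-4.0, 0.0, 4.0], dtype=np.float32)))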
[Figure: The Sigmoid Function]

Element-wise Sigmoid Kernel

import numpy
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.gpuarray
from pycuda.elementwise import ElementwiseKernel
sigmoid = ElementwiseKernel(
    "float *x, float mu, float sigma, float *y",
    "y[i] = 1 / (1 + expf(-(x[i] - mu) * sigma))",
    "sigmoid")
numpy_arr = numpy.random.randn(1024).astype(numpy.float32)  # the kernel operates on float32 data
gpu_arr = pycuda.gpuarray.to_gpu(numpy_arr)
result = pycuda.gpuarray.empty_like(gpu_arr)
sigmoid(gpu_arr, 0.0, 1.0, result)

Here a hybrid of the CUDA driver API and the PyCuda Python syntax is used to create an efficient kernel that calculates the sigmoid function for each element of an array x. The first argument is a string declaring the C-typed variables used within the kernel. The second argument is the kernel body itself, which operates on the kernel arguments and computes a single-element result. The index i is initialized to identify each thread in the parallel execution, so that each thread accesses only the data designated to it. The first time the kernel is invoked, PyCuda fills a function template with the provided strings and compiles the kernel using the NVIDIA nvcc compiler. When the kernel is called with vectors loaded on the GPU, the compiled kernel is loaded into the current CUDA processing context, and a thread grid is allocated using the largest number of threads available to the GPU.

Matrix Multiplication

The Perceptron algorithm is succinctly described as a matrix multiplication C = A*B, where A is a matrix of input instances, one instance per row. B is a matrix of layer weights, with a height equal to the number of inputs per instance in matrix A and a width equal to the number of neurons in the neural net layer. Each input in an instance row is connected to each neuron in the next hidden layer, with the corresponding weights as the column vectors of B. The result matrix C has a height equal to the number of input instances and a width equal to the number of neurons in the neural net layer for this iteration. It is clear that this matrix can be used as the input for the next hidden layer, reducing or expanding the layer size in each iteration. In general, each result matrix C is normalized by the sigmoid function to maintain stability as the network progresses (a NumPy sketch of this layer chaining appears after the kernel code below). Special care is taken to construct a high-performance kernel for matrix multiplication on the GPU.

Parallel Matrix Multiplication Kernel

A general algorithm for parallel matrix multiplication is provided in the CUDA reference projects. The algorithm takes as operands three float matrices, A, B, and C, along with the explicit widths of A and B (from which the dimensions of C are derived). A conservative thread block size of 16 is used; the algorithm is therefore much more efficient when the matrix dimensions are aligned to blocks of size 16, padded with 0s if necessary. Each thread in each block strides through the rows of A and the columns of B in global memory, and stores these elements into a sub-block held in shared memory. In the second processing phase, each thread computes the dot product for one element of the sub-block matrix. These dot products constitute the results of the matrix multiplication. (A host-side sketch for compiling and launching this kernel with PyCuda follows the listing.)

#define BLOCK_SIZE 16
__global__ void
matrix_mul( float* A, float* B, int wA, int wB, float* C )
{
// Block index
int bx = blockIdx.x;
int by = blockIdx.y;
// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;
// Index of the first sub-matrix of A processed by the block
int aBegin = wA * BLOCK_SIZE * by;
// Index of the last sub-matrix of A processed by the block
int aEnd = aBegin + wA - 1;
// Step size used to iterate through the sub-matrices of A
int aStep = BLOCK_SIZE;
// Index of the first sub-matrix of B processed by the block
int bBegin = BLOCK_SIZE * bx;
// Step size used to iterate through the sub-matrices of B
int bStep = BLOCK_SIZE * wB;
// Csub is used to store the element of the block sub-matrix
// that is computed by the thread
float Csub = 0;
// Loop over all the sub-matrices of A and B
// required to compute the block sub-matrix
for (int a = aBegin, b = bBegin;
a <= aEnd;
a += aStep, b += bStep) {
// Declaration of the shared memory array As used to
// store the sub-matrix of A
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
// Declaration of the shared memory array Bs used to
// store the sub-matrix of B
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
// Load the matrices from device memory
// to shared memory; each thread loads
// one element of each matrix
As[ty][tx] = A[a + wA * ty + tx];
Bs[ty][tx] = B[b + wB * ty + tx];
// Synchronize to make sure the matrices are loaded
__syncthreads();
// Multiply the two matrices together;
// each thread computes one element
// of the block sub-matrix
for (int k = 0; k < BLOCK_SIZE; ++k)
Csub += As[ty][k] * Bs[k][tx];
// Synchronize to make sure that the preceding
// computation is done before loading two new
// sub-matrices of A and B in the next iteration
__syncthreads();
}
// Write the block sub-matrix to device memory;
// each thread writes one element
int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
C[c + wB * ty + tx] = Csub;
}
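To give a sense of how this kernel is driven from Python, here is a minimal, hypothetical host-side sketch using pycuda.compiler.SourceModule; the matrix sizes are illustrative, and kernel_source is assumed to be a string holding the matrix_mul listing above (including the #define).

import numpy
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# kernel_source is assumed to hold the matrix_mul listing above.
mod = SourceModule(kernel_source)
matrix_mul = mod.get_function("matrix_mul")

BLOCK_SIZE = 16
hA, wA, wB = 64, 32, 48  # illustrative sizes, already multiples of BLOCK_SIZE
A = numpy.random.randn(hA, wA).astype(numpy.float32)
B = numpy.random.randn(wA, wB).astype(numpy.float32)
C = numpy.empty((hA, wB), dtype=numpy.float32)

# One 16x16 thread block computes one 16x16 tile of C.
matrix_mul(drv.In(A), drv.In(B), numpy.int32(wA), numpy.int32(wB), drv.Out(C),
           block=(BLOCK_SIZE, BLOCK_SIZE, 1),
           grid=(wB // BLOCK_SIZE, hA // BLOCK_SIZE))

# Sanity check against NumPy on the host.
assert numpy.allclose(C, A.dot(B), atol=1e-3)

Wrapping the host arrays in drv.In and drv.Out lets PyCuda handle the host-to-device and device-to-host copies around the launch.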
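As a CPU-side illustration of the layer chaining described in the Matrix Multiplication section, the hypothetical NumPy sketch below repeats C = sigmoid(A*B) across layers; it is a reference only, not the GPU path, and all names and sizes are made up.

import numpy as np

def sigmoid(x):
    # The same logistic normalization applied after each layer.
    return 1.0 / (1.0 + np.exp(-x))

def forward(instances, layer_weights):
    # instances: (num_instances, num_inputs); each weight matrix has
    # height = inputs feeding the layer, width = neurons in the layer.
    activations = instances
    for W in layer_weights:
        activations = sigmoid(activations.dot(W))  # C = sigmoid(A * B)
    return activations

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4)).astype(np.float32)          # 8 instances, 4 inputs each
weights = [rng.standard_normal((4, 6)).astype(np.float32),  # hidden layer: 6 neurons
           rng.standard_normal((6, 2)).astype(np.float32)]  # output layer: 2 neurons
print(forward(X, weights).shape)  # (8, 2)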