|
RadixSorting
High performance GPU radix sorting in CUDA
News
Sorting OverviewThis project implements a very fast, efficient radix sorting method for CUDA-capable devices. For sorting large sequences of fixed-length keys (and values), we believe our sorting primitive to be the fastest available for any fully-programmable microarchitecture: our stock NVIDIA GTX480 sorting results exceed the 1G keys/sec average sorting rate (i.e., one billion 32-bit keys sorted per second). In addition, one of our design goals for this project is flexibility. We've designed our implementation to adapt itself and perform well on all generations and configurations of programmable NVIDIA GPUs, and for a wide variety of input types. Features:
Download and play around with the implementation yourself! Radix SortingThe radix sorting method works by iterating over the d-bit digit-places of the keys from least-significant to most-significant. For each digit-place, the method performs a stable distribution sort of the keys based upon their digit at that digit-place. Given an n-element sequence of k-bit keys and a radix r = 2d, a radix sort of these keys will require k/d iterations of a distribution sort over all n keys. The distribution sort (a.k.a. counting sort) is the fundamental component of the radix sorting method. In a data-parallel, shared-memory model of computation, each logical processor gathers its key, decodes the specific digit at the given digit-place, and then must cooperate with other processors to determine where the key should be relocated. The relocation offset will be the key's global rank, i.e., the number of keys with lower digits at that digit place plus the number of keys having the same digit, yet occurring earlier in the input sequence. We implement these distribution sorting passes using a very efficient implementation of a generalized parallel prefix scan. Our generalized scan is designed to operate over multiple, concurrent scan problems. For example, with d = 4 bits (r = 16 digits), our multi-scan does sixteen scan operations: one for each digit. For each of the scan operations (e.g., the 0s scan, the 1s scan, the 2s scan, etc.), the input for each key is a 1 if the key's digit place contains that operation's digit, 0 otherwise. When the mulit-scan is done, the logical processor for each key can look up the scan result from the appropriate scan operation to determine where its key should be placed. Authors' RequestIf you use/reference/benchmark this code, please cite our PPL Journal Article: D. Merrill and A. Grimshaw, “High Performance and Scalable Radix Sorting: A case study of implementing dynamic parallelism for GPU computing,” Parallel Processing Letters, vol. 21, no. 2, pp. 245-272, 2011. Bibtex: @article{
title = {High Performance and Scalable Radix Sorting: A case study of implementing dynamic parallelism for {GPU} computing},
volume = {21},
issn = {0129-6264},
shorttitle = {High Performance and Scalable Radix Sorting},
url = {http://www.worldscinet.com/ppl/21/2102/S0129626411000187.html},
doi = {10.1142/S0129626411000187},
number = {02},
journal = {Parallel Processing Letters},
author = {Duane Merrill and Andrew Grimshaw},
year = {2011},
pages = {245--272}
}
PerformanceThe following figures and tables present sorting rates for various CUDA GPUs. These results were measured using a suite of ~3,000 randomly-sized input sequences (sized 32 - 272M elements), each initialized with keys and values whose bits were sampled from a uniformly random distribution. For a discussion of why these results represent a lower-bound on sorting performance (i.e., real-world sorting rates may be higher), see our evaluation of key entropy. Our measurements for elapsed time were taken directly from GPU hardware performance counters. Therefore these results are reflective of in situ sorting problems: they preclude the driver overhead and the overheads of staging data to/from the accelerator, allowing us to directly contrast the individual and cumulative performance of the stream kernels involved. Please note that the following measurements were made using the Cuda 3.0 compiler and driver framework. (In our evaluation, code generated by the 3.0 compiler is ~0.5% faster than that generated by the 3.1 compiler.) For Fermi-class devices, we compiled our device kernels for 32-bit device pointers. (See the section on building and adapting below). We invite the reader to compare our results with the current state-of-the-art for GPUs and CPUs (and for experimental architectures), e.g.:
Selected Performance PlotsSorting rates as a function of problem size for 32-bit uint keys: And for 32-bit uint keys paired with 32-bit values: Average Saturated Sorting RatesUnless otherwise noted, the following tables present average saturated sorting rates for problem sizes greater than or equal to 32M unsigned keys. Our implementation is also capable of sorting signed and floating-point key types (e.g., float, int, double, etc.) at a 0.5% - 1.5% performance overhead.
� 16M+ elements (restricted by global memory size)
Building and AdaptingThe current stable version is: 1.0.256 (SVN revision 256, 11/06/2010). If you are experiencing correctness issues with the current CUDA compiler toolchain, please update to the latest version. You can obtain the sources from either: This subproject can be found in the /trunk/LsbRadixSort subtree. It contains C++/CUDA code for:
Like Thrust, the sources for our implementation are designed to be included and built as part of larger applications (as opposed to linked in as a shared or static library). This prevents kernel bloat and allows the implementation to tailor itself to your data types. The subproject sources require compilation using CUDA Toolkit version 3.0 or higher. The repository includes a Makefile to build the example programs on Linux. This Makefile can be used as a guide for constructing an equivalent VC++ project for evaluation on Windows systems. When performance benchmarking our implementaion, we suggest that you compile the kernels with the following options:
Cheers! | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||