|
Project Information
|
Scalable parallel software for detecting dense subgraphs.

Download

SVN checkout: $ svn checkout http://pclust.googlecode.com/svn/trunk/ pclust-read-only

Download: http://code.google.com/p/pclust/downloads/list

Reference

C. Wu, A. Kalyanaraman. An efficient parallel approach for identifying protein families in large-scale metagenomic data sets. Proc. ACM/IEEE Supercomputing Conference (SC'08), Austin, TX, November 15-21, pp. 1-10, 2008.

Abstract: Metagenomics is the study of environmental microbial communities using state-of-the-art genomic tools. Recent advancements in high-throughput technologies have enabled the accumulation of large volumes of metagenomic data whose generation was deemed impractical until a couple of years ago. A primary bottleneck, however, is the lack of scalable algorithms and open source software for large-scale data processing. In this paper, we present the design and implementation of a novel parallel approach to identify protein families from large-scale metagenomic data. Given a set of peptide sequences, we reduce the problem to one of detecting arbitrarily sized dense subgraphs from bipartite graphs. Our approach efficiently parallelizes this task on a distributed memory machine through a combination of divide-and-conquer and combinatorial pattern matching heuristic techniques. We present performance and quality results of extensively testing our implementation on 160K randomly sampled sequences from the CAMERA environmental sequence database using 512 nodes of a BlueGene/L supercomputer. |
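To make the phrase "dense subgraph of a bipartite graph" concrete, the following is a minimal illustrative sketch, not the pClust implementation. It assumes a common density definition (the fraction of possible left-right pairs that are actually connected). The function name `bipartite_density` and the toy sequence and feature labels are hypothetical and do not come from the paper or the software.

```python
from itertools import product


def bipartite_density(edges, left, right):
    """Return the fraction of possible left-right pairs present in `edges`.

    Assumed definition: |E(L, R)| / (|L| * |R|). The paper may use a
    different density criterion.
    """
    left, right = set(left), set(right)
    if not left or not right:
        return 0.0
    present = sum(1 for u, v in product(left, right) if (u, v) in edges)
    return present / (len(left) * len(right))


# Hypothetical toy graph: sequences on the left, shared features on the right.
edges = {
    ("s1", "a"), ("s1", "b"),
    ("s2", "a"), ("s2", "b"),
    ("s3", "b"),
    ("s4", "c"),
}

# The pair ({s1, s2}, {a, b}) is fully connected, so its density is 1.0.
print(bipartite_density(edges, ["s1", "s2"], ["a", "b"]))

# The whole graph has 6 of 12 possible edges, so its density is 0.5.
print(bipartite_density(edges, ["s1", "s2", "s3", "s4"], ["a", "b", "c"]))
```

Under this assumed definition, a candidate protein family would correspond to a vertex subset whose density exceeds some chosen threshold; how such subsets are found at scale is the subject of the referenced paper.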