pyrohmmsnp


PyroHMMsnp: A realignment-based SNP calling method for 454 and Ion Torrent sequencing data

Content

[Date:2014-9-1] Both PyroHMMsnp and PyroHMMvar will be integrated into the toolkit PyroTools and moved to GitHub.

Introduction

PyroHMMsnp is a Bayesian SNP calling method for 454 and Ion Torrent sequencing data. PyroHMMsnp employs a hidden markov model(HMM) to perform the realignment on the raw mapping results generated by the mapping program, i.e. MegaBlast, SSAHA2, BWA-SW, Mosaik, SMALT and detects SNPs lying on the realignments. PyroHMMsnp obtains high sensitivity of the detection of putative SNPs and the supreme specificity on 454 and Ion Torrent resequencing data.

Homopolymer

A homopolymer is a k-mer strech that is made up of a nucleotide. For example, a homopolymer, of which the length is 3 nt, is TTT. A homopolymer is denoted as a 2-element tuple, one element represents the nucleotide of the homopolymer, the other element is the length of the homopolymer. The above homopolymer, TTT, is represented as .

Homopolymer Sequencing

We use the term "Homopolymer Sequencing" to refer to the 454 and Ion Torrent technologies which share the common underlying characteristics of sequencing a homopolymer in a reagent cycle. Following the run length encoding (RLE) rule, a nucleotide sequence, e.g. ACGGGTT, is converted to a homopolymer sequence, like .

Error Model

We develop an HMM to formulate the sequencing errors in the homopolymer sequencing techniques, i.e. the correct-calls, over-calls, under-calls, insertions and deletions. The HMM inputs a homopolymer sequence transformed from the original nucleotide sequence by the run-length encoding rule and emits a nucleotide sequence which resembles the real sequencing read. The HMM is parameterised by the EM algorithm which is proposed in our article using the resequencing data.

Realignment

We use the HMM to perform the realignments around the variant sites by using the Viterbi algorithm.

SNP Calling

We use the Bayesian calling method to detect the single nucleotide polymorphic sites based on the realignments.

Usage

pyrohmmsnp -b <MAPPING_FILE> -f <REFERENCE_GENOME> -w <WINDOW_WIDTH> -m <HMM_PARAMETERS> -t <THRESHOLD>

Parameter Description

|-b|The input alignment file in BAM format (indexed by Samtools)| |:-|:-----------------------------------------------------------------------------------------------| |-f|The reference file in FASTA format (indexed by Samtools) | |-w|The width of the sliding window | |-m|The configuration file of the HMM | |-t|The filtration threshold of the quality score. It discards a site if P(Ref|Data)>THRESH. | |-d|The switch toggle to open the snp calling for the diploid organism |

Report Format

|Column 1|The genomic coordiante started from '0'| |:-------|:--------------------------------------| |Column 2|The reference allele | |Column 3|The called allele | |Column 4|The quality score | |Column 5|The log posterior probability of the called genotype|

PyroHMM

How-To

To train HMM

PyroHMM accepts a file of the following format as the training data, ```

Read_id start=1 length=400 strand=+ Q= AAACCAGT Reference_id start=900 length=400 AACCAGT ...... and an index file which has two columns, (1) read index, (2) reference index, like, 1 2 3 4 ...... ``` The training file which has the above format can be extracted from the mapping result generated by MegaBlast by using the script blast2seqpair.pl in the download package, or the mapping result generated by other mapping tools, i.e. SSAHA2,BWA-SW,Mosaik, by using the script sam2fasta.py.

The index file can be generated by the script generate_data_index.py in the download package.

To train the hidden Markov model, enter the following command in the command line, ./pyrohmm -d <FASTA_FILE> -i <INDEX_FILE> -o <OUTPUT_FILE> -t -b <BAND_WIDTH> -I <MAX_ITERATION> -T <CONVERGENCE_THRESHOLD> -v -l <LOG_FILE>

To realign

PyroHMM provides the function to re-align the original mapping result generated by a mapping tool. PyroHMM performs the realignment by taking the template sequence in units of homopolymers. PyroHMM can also do the realignment by taking the template sequence in units of nucleotides, by toggling the option -n.

PyroHMM accepts the file of the above described format as the input.

To do realignment, enter the following command in the command line, ./pyrohmm -d <FASTA_FILE> -i <INDEX_FILE> -m <MODEL_FILE> -o <OUTPUT_FILE>

News

  • 2012-9-27 PyroHMMsnp v1.0 and PyroHMM v1.0 first released.
  • 2012-10-3 (1) Update the Help message. (2) Add option '-n' to PyroHMM

Citation

Zeng F, Jiang R, Chen T. PyroHMMsnp: a SNP caller for Ion Torrent and 454 sequencing data. Submitted. 2012

Links

  1. Professor Ting Chen's group
  2. Bioinformatics Division of Tsinghua University

Project Information

The project was created on May 15, 2012.

Labels:
Academic Bioinformatics Algorithm NextGenerationSequencing SNPcalling Realignment HiddenMarkovModel Homopolymer ErrorModel