My favorites | Sign in
Project Home Downloads Wiki Issues Source
Search
for
FAQ  
Frequently Asked Questions
Featured
Updated Dec 2, 2010 by matthew....@gmail.com

Here are the answers to some technical questions that I have been asked about how to use PerM. Excuse me if I haven't been able to arrange all of them systematically.

For questions about PerM's algorithm, you may also want to check the Q & A questions in the 2008 UCS MCB retreat. You are very welcome to email me directly.

Q1: How can I map reads to multiple references?

PerM can accept multiple FASTA files for the reference and each FASTA file can contain multiple sequences. For a collection of short reference sequences such as genes it is probably simplest to concatenate all the sequences into a single FASTA file. But for large references files, for example full chromosomes, it is often more convenient to provide PerM with a single file, e.g. ref.txt which contains a list of the reference file names (one per line). When the reference file is a list of FASTA files it must have the .txt extension because PerM uses the file extension to distinguish this case from an ordinary FASTA reference file.

Q2: How many mismatches should I allow; is it correlated with the --seed option?

It depends on your read length, the sequencing platform error rate, expected biological variation, and your analysis needs. The --seed option determines how many mismatches can be found at full sensitivity though a higher number of mismatches can be found in many cases. The -v option sets the mismatch reporting threshold. For example, --seed F3 -v 5 will report every alignment with up to 3 mismatches and also theoretically report ~80% and ~60% of the alignments with 4 and 5 mismatches respectively. Seeds fully sensitive to larger numbers of mismatch require more time to run.

It is permissible to set the reporting limit to less than the seed full sensitivity limit, e.g. --seed F3 -v 2. This may be convenient if an F3 index has already been built for the reference. However when a large number of reads must be mapped, it would be more efficient to build and use an F2 index, e.g. --seed F2 -v 2, instead.

PerM, like most short read alignment programs, is primarily concerned with the case of only a few mismatches caused by the limitations of the sequencing technology. However, users may also be concerned with biological variation, e.g. due to SNPs or other evolutionary changes. In some cases the reference sequence(s) may only be an approximation to what may appear in the sequence sample, for example be a known strain. In metagenomics even the degree of expected variation may be quite uncertain. For large biological variation, a program like blastx which is geared toward biological similarity may be more appropriate. However the most efficient approach may still be a quick mapping by PerM of what is mappable with limited variation, followed by further attempts to map the remaining reads with other tools.

Q3: Will using the F3 or F4 seed increase running time by a lot?

For transcriptome or small genome with long reads (50+ bp) it is unnecessary to worry a lot about the performance. However, speed is a concern when the reference has a lot of repeat regions and the reads are short such that they randomly map to a lot of places. PerM actually reaches 40M 50 bp reads per hour even to hg18, with the F3 seed. For the F4 seed, it reaches about 2M reads per hours for entire genome, depending on whether the reads are from more repetitive regions. Let me know if the speed is lower than that. The speed of your disk read/write could be the bottleneck.

Q4: Do you have a script to summarize how many reads are mapped to a reference?

You can use SAMTOOLS or another visualization program to see the coverage. To summarize the reads per gene, for example:

gene1 100
gene2 50
gene3 10

try a simple script that I just wrote in 3 minutes.

#!/bin/bash
function count {
	iFileN=$1
	geneN=''
	let mapCount=0
	exec < $iFileN
	while read line
	do
	        if [[ $geneN == $line ]]; then
	                let mapCount=mapCount+1
	        else
	                echo $geneN $mapCount
	                let mapCount=1
	                geneN=$line
	        fi
	done 

}
RNAME=3
cat result.sam | cut -f $RNAME -d ' ' | sort | ./count > result.txt

Note this shell script is far from perfect. You can easily improve it with a Perl or C++ implementation.

Q5: The order of reads in the csfasta file and qual file are not the same. Does the order have to be the same?

Reads and qual must be in the same order but you can sort the reads and quality file according the read name, otherwise, PerM barks. If your read file is in a standard format, you can sort them by

cat reads.csfasta | tr '\n' '\t' | tr '>' '\n' | sort -k 1,1 | tr '\n' '>' | tr '\t' '\n'
cat reads.qual | tr '\n' '\t' | tr '>' '\n' | sort -k 1,1 | tr '\n' '>' | tr '\t' '\n'

Remember to backup your read files before any experiment. Unfortunately, I have seen many quality files that are embedded extra info in the read name.. ex: >readName,blah,blah.. so you need to use sed command to remove the blah blah ... first.

Q6: What if I have reference sequence(s) shorter than the read length?

That is fine. Internally, PerM uses 'N' to separate each reference. No reads will be mapped across an 'N'.

Q7: What happens if I have multiple reference sequences with the same reference name?

PerM barks by listing the error messages to warn you. Internally, PerM still maps the reads to each reference sequence separately. However, the output will print the same reference name, and the index of the reference will be messed up.

Q8: Should I save the index first and then do the mapping?

It's up to you. By default, PerM won't save the index. I can build hg18 index with 16 CPUs for PerM in half hour. For some users rebuilding the an index may be a small trade off compared to storing a 14 GB index file.

Also, since a PerM index is always at least 256 MB, you may not wish keep the index if the reference sequence is quite small because PerM can rebuild it very quickly.

If you do keep the index, be careful if you change the reference. Delete or rebuild the index to avoid confusion.

Q9: If my input reads have different read lengths, can they be mapped it together?

Unfortunately, a PerM index is for a fixed read length. If a pre-built index is not being used and neither the -T or -t options has been specified, PerM builds an index for the length of the first read. PerM will print warning message when it encounters a read with unexpected length and keep mapping the rest of reads. To check how many reads are thrown away, use wc -l command and check the different read length in the final report.

To handle reads with different length, split the reads according to their lengths into different files and map each of them independently. This is fairly easy to do using Perl.

Q10: Why there are more reads mapped to multiple places than the multi mapped stats when -A is specified?

In PerM Multi-mapped reads are defined as reads mapped to multiple places with same and minimum number of mismatches. For example, if a read is mapped to one place as an exact match and two other places with one mismatch, the read is not counted as Multi-Mapped read in the final statistics. However, all the mapped regions within the threshold will still be output if -A is specified.

Q11: Which value will PerM use if I repeat an option on the command line?

It will use the first value. For example, -v 5 -v 3 will limit the maximum number of mismatches reported to 5. But this behavior may change in the future. As with any program, options should not be duplicated unless the program specifically assigns some meaning to the duplication. Why confuse the next person who has to modify your code?

Q12: Why does the SAM output format of PerM has fewer lines than BFAST?

BFAST output includes reads that are NOT MAPPED. PerM can optionally output the unmapped reads to FASTA, CSFASTA or FASTQ format with the -u option so you can easily map the unmapped reads to other references.

Q13: Which seed should I use?

The seed is the pattern that PerM uses to do the indexing. The choice of a seed balances sensitivity and speed. For SOLiD reads, you may be interested in special full sensitivity, ex: one SNP one error (so use S11 seed) or one SNP two errors (so use S12). For read length 50, S12 is faster than F4 by slightly losing some sensitivity, because its weight is 15 or more, not just 13. For reads shorter than 25 bp and very small reference genomes, use the F1 or F0 seed.

Q14: What quality score format does PerM take?

There are two type of FASTQ scores (Phred Fastq and Illumina Fastq). For SOLiD reads, PerM assumes the the format with the highest score represented in ';' For Illumina reads, PerM assumes the quality score with the highest score represented in 'I'. PerM currently unable to recognize the score if you use the Illumina format for SOLiD or vice-versa.

Q15: Do other mapping programs find more exact matches than PerM?

No. PerM is fully sensitive to the threshold selected. One user reported Bowtie found more "exact matches" for SOLiD reads, because Bowtie "corrects" the single mismatches as errors and reports only the "nucleotide" difference. Therefore, the result caused some confusion. Other programs only match the first N nucleotide of read prefix, say first 28 or 32 bp, by default, i.e. mismatched at the tail are ignored. If you choose program options in those programs to match PerM or conversely set the -T PerM option appropriately, the mapping should be the same, provided the the other mapper is also fully sensitive.

Finally, by default, PerM eliminate reads with '.' and 'N'. In PerM, one may encode '.' as '3' and 'N' as 'A' using the --includeReadsWN switch; however, it is misleading and not recommended. If other mappers treat reads with '.' and 'N' differently, the results may be different.

Back to manual

Comment by farhad.h...@gmail.com, Apr 1, 2011

Q16: How I can run the PerM to use only one CPU to the read mapping?


Sign in to add a comment
Powered by Google Project Hosting