FastqMcf  
fastq-mcf: sequence quality filter, clipper, and processor.
Updated Jan 17, 2012 by earone...@gmail.com

Introduction

fastq-mcf attempts to:

  • Detect & remove sequencing adapters and primers
  • Detect limited skewing at the ends of reads and clip
  • Detect poor quality at the ends of reads and clip
  • Detect N's, and remove from ends
  • Remove reads with CASAVA 'Y' flag (purity filtering)
  • Discard sequences that are too short after all of the above
  • Keep multiple mate-reads in sync while doing all of the above
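As a rough illustration of the end-clipping and length-filtering steps listed above, here is a minimal Python sketch. It is not fastq-mcf's actual algorithm (which works from sampled statistics and may differ in many details); the function name `clip_ends` is hypothetical, and the defaults simply mirror the documented defaults for -q (10) and -l (15).

```python
def clip_ends(seq, quals, min_qual=10, min_len=15):
    """Trim N's and low-quality bases from both ends of a read.

    Hypothetical helper for illustration only; not fastq-mcf's code.
    `quals` is a list of integer quality scores, one per base.
    Returns (seq, quals) after clipping, or None if the read is too short.
    """
    start, end = 0, len(seq)
    # Walk inward from the front while the base is an N or below threshold.
    while start < end and (seq[start] == "N" or quals[start] < min_qual):
        start += 1
    # Walk inward from the back in the same way.
    while end > start and (seq[end - 1] == "N" or quals[end - 1] < min_qual):
        end -= 1
    # Discard reads that are too short after clipping.
    if end - start < min_len:
        return None
    return seq[start:end], quals[start:end]
```

A read that comes back as `None` here corresponds to a sequence that would be discarded as too short.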

Usage

usage: fastq-mcf [options] <adapters.fa> <reads.fq> [mates1.fq ...]

Detects levels of adapter presence, computes likelihoods and
locations (start, end) of the adapters.   Removes the adapter
sequences from the fastq file, and sends it to stdout.

Stats go to stderr, unless -o is specified.

If you specify multiple 'paired-end' inputs, then a -o option is
required for each, e.g.: -o read1.clip.fq -o read2.clip.fq

Options:
 -h      This help
 -o FIL  Output file (stats to stdout)
 -s N.N  Log scale for clip pct to threshold (2.5)
 -t N    % occurrence threshold before clipping (0.25)
 -m N    Minimum clip length, overrides scaled auto (1)
 -p N    Maximum adapter difference percentage (20)
 -l N    Minimum remaining sequence length (15)
 -L N    Maximum sequence length (none)
 -k N    sKew percentage causing trimming (2)
 -q N    quality threshold causing trimming (10)
 -f      force output, even if not much will be done
 -0      Set all trimming parameters to zero
 -U|u    Force disable/enable Illumina PF filtering
 -P N    phred-scale (64)
 -x N    'N' (Bad read) percentage causing trimming (10)
 -R      Don't remove N's from the fronts/ends of reads
 -n      Don't clip, just output what would be done
 -C N    Number of reads to use for subsampling (200000)
 -d      Output lots of random debugging stuff

Increasing the scale makes recognition-lengths longer, a scale
of 100 will force full-length recognition.

Set the skew (-k) or N-pct (-x) to 100 or 0 to turn it off

Files whose names end in ".gz" are assumed to be compressed, and can be
read/written as long as "gzip" is in the path.
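To make the -P (phred-scale) option and the ".gz" convention above more concrete, here is a small Python sketch. It is an illustration only: fastq-mcf itself shells out to "gzip", whereas this sketch uses Python's standard gzip module, and the names `open_reads`, `decode_quals`, and `read_fastq` are hypothetical. The only assumption about the format is the usual four-line FASTQ record, with each quality character encoding a score as its ASCII code minus the offset (64 by default here, matching the -P default).

```python
import gzip


def open_reads(path, mode="rt"):
    """Open a FASTQ file in text mode, treating a .gz suffix as compressed."""
    if path.endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


def decode_quals(qual_line, phred=64):
    """Convert a FASTQ quality string to integer scores for a given offset."""
    return [ord(ch) - phred for ch in qual_line]


def read_fastq(path):
    """Yield (name, sequence, quality string) from a four-line-per-record FASTQ."""
    with open_reads(path) as fh:
        while True:
            name = fh.readline().rstrip("\n")
            if not name:
                break  # end of file
            seq = fh.readline().rstrip("\n")
            fh.readline()  # the '+' separator line
            qual = fh.readline().rstrip("\n")
            yield name, seq, qual
```

With an offset of 64 the character "h" decodes to 40; with an offset of 33 the same score is written as "I".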

Notes

The adapter file is in FASTA format. To do skew detection only, pass /dev/null as the adapter file together with "-f".
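The invocation rules from the usage text (one -o per input for paired-end data, and /dev/null plus -f for skew detection only) can be sketched as a small command builder in Python. The helper name `build_fastq_mcf_cmd` and the file names are hypothetical; only the flags and argument order come from the usage line above. The sketch only assembles and prints the command lines; it does not run the tool.

```python
import shlex


def build_fastq_mcf_cmd(adapters, reads, outputs=(), flags=()):
    """Assemble a fastq-mcf argument list following the documented usage line.

    Hypothetical helper: usage is 'fastq-mcf [options] <adapters.fa> <reads.fq> [mates1.fq ...]'.
    """
    reads, outputs = list(reads), list(outputs)
    # The usage text requires one -o per input when several inputs are given.
    if len(reads) > 1 and len(outputs) != len(reads):
        raise ValueError("paired-end input needs one -o output per input file")
    cmd = ["fastq-mcf", *flags]
    for out in outputs:
        cmd += ["-o", out]
    cmd.append(adapters)
    cmd += reads
    return cmd


# Paired-end clipping with one output per input.
paired = build_fastq_mcf_cmd(
    "adapters.fa",
    ["read1.fq", "read2.fq"],
    ["read1.clip.fq", "read2.clip.fq"],
)
# Skew detection only: no adapters, force output.
skew_only = build_fastq_mcf_cmd("/dev/null", ["reads.fq"], flags=["-f"])

print(shlex.join(paired))
print(shlex.join(skew_only))
```

The resulting lists can be handed to `subprocess.run(cmd, check=True)` on a machine where fastq-mcf is installed.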

Todo

  • When one read of a pair is discarded for being "too short", its mate has to be discarded as well. For a sequencing run of normal quality this is not an issue. fastq-mcf should, though, write "un-mated" reads (those whose mate was skipped) to a separate file. Since the mates of these reads were poor quality, the reads are typically not very useful, but they can help with diagnostics. I've seen runs where they provide valuable data.
  • Like any tool that does many things, fastq-mcf can be limited in its flexibility. The biggest missing feature is the ability to read files formatted like its stderr output and use them to guide the process. Given that feature, fastq-mcf would be complete.
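The first Todo item (keeping pairs in sync while routing "un-mated" reads to a separate output) can be sketched in Python as follows. This shows the proposed behavior, not something fastq-mcf is documented to do; the function name `split_pairs` is hypothetical, reads are represented as plain sequence strings, and the length cutoff mirrors the -l default of 15.

```python
def split_pairs(pairs, min_len=15):
    """Separate read pairs into kept pairs and un-mated survivors.

    Hypothetical helper for illustration. `pairs` is an iterable of
    (read1, read2) sequence strings that have already been clipped.
    """
    kept, orphans = [], []
    for r1, r2 in pairs:
        ok1 = len(r1) >= min_len
        ok2 = len(r2) >= min_len
        if ok1 and ok2:
            kept.append((r1, r2))  # both mates survive, stay in sync
        elif ok1:
            orphans.append(r1)  # mate 2 was too short
        elif ok2:
            orphans.append(r2)  # mate 1 was too short
        # if neither survives, the pair is dropped entirely
    return kept, orphans
```

Writing `orphans` to its own file would keep the paired outputs in sync while preserving the diagnostic data described above.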

Notes

  • Default settings are probably too conservative when it comes to skewing and trimming.
  • By default, it won't trim the "insides" of a paired-end read. It also will no longer attempt to quality filter a barcode read. There is no override for either behavior, but I can't think of a reason to need one.

Comment by spoll...@missouri.edu, Sep 3, 2011

Please define skewing as used in fastq-mcf. Thanks

Comment by Nitzan...@gmail.com, Nov 23, 2011

Hi, the -U|u flag doesn't seem to work even in the latest version, 1-237. I may be wrong, but it seems you step on it on line 325 in the .c file.

Comment by daweonline, Dec 16, 2011

What exactly do these options mean?

-s N.N Log scale for clip pct to threshold (2.5)

-t N % occurrence threshold before clipping (0.25)

I don't really understand them :-(

Comment by sir.sven...@gmail.com, Jan 16, 2012

Is this the preferred way to ask for help??? Or is there some contact (which I cannot find) or forum?

Comment by project member earone...@gmail.com, Jan 17, 2012

I created a forum... you can use that, post files to it, etc. Also there is an "issues" list, above.

Comment by sir.sven...@gmail.com, Jan 23, 2012

Hi,

is there a more detailed description of the parameters, results, and operating principle? The software seems to do quite a good job, but it would be helpful to have some more background info to actually understand what's going on.

Can you provide such info?

Comment by sir.sven...@gmail.com, Jan 25, 2012

To be honest, I thought the forum is there to not only post questions but also to get answers ;-) No offense, just a comment.

Comment by Aubombar...@gmail.com, May 21 (4 days ago)

Hi,

I have used Ea-utils to clean some sequences and finally we are going to publish the results. How can I cite Ea-utils, or fastq-mcf ?

Thanks

Comment by project member earone...@gmail.com, May 22 (2 days ago)

Please cite:

Erik Aronesty (2011). ea-utils : "Command-line tools for processing biological sequencing data"; Expression Analysis, Durham, NC http://code.google.com/p/ea-utils

