FastqMultx  
Updated May 17, 2012 by earone...@gmail.com

Introduction

The idea behind this tool is to reduce the amount of "piping" going on in a pipeline. A lot of time, disk space and nail-chewing is spent keeping files in sync, figuring out which barcodes are on which samples, etc. The goal of this program is to make it easier to demultiplex sequences, including paired-end ones, and to allow the "guessing" of barcode sets based on master lists of barcoding protocols (Fluidigm, TruSeq, etc.).

Usage

Usage: fastq-multx [-g|-l] <barcodes.fil> <read1.fq> -o r1.%.fq [mate.fq -o r2.%.fq] ...

Output files must contain a '%' sign which is replaced with the barcode id in the barcodes file.

Barcodes file looks like this:

<id1> <sequence1>
<id2> <sequence2> ...
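
As a rough illustration of the two conventions above (whitespace-separated id/sequence lines, and '%' as the placeholder in output names), here is a minimal Python sketch. The function names `parse_barcodes` and `output_name` are hypothetical and are not part of fastq-multx; the tool's own parsing may differ.

```python
def parse_barcodes(text):
    """Parse whitespace-separated '<id> <sequence>' lines into a dict."""
    barcodes = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines (an assumption)
        barcode_id, sequence = fields[0], fields[1]
        barcodes[barcode_id] = sequence.upper()
    return barcodes


def output_name(template, barcode_id):
    """Replace the '%' placeholder in an output template with a barcode id."""
    return template.replace("%", barcode_id)


barcodes = parse_barcodes("mock_a ACCC\nsalt_a CGTA\n")
names = [output_name("r1.%.fq", barcode_id) for barcode_id in barcodes]
print(names)
```

With the two barcodes shown, the template `r1.%.fq` would expand to one output name per barcode id.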

By default, the program guesses whether the barcodes are at the beginning (-b) or the end (-e) of the read, based on clear statistics.

If -g is used, then its parameter is an index lane, and frequently occurring sequences are used as the barcodes.

If -l is used then all barcodes in the file are tried, and the *group* with the *most* matches is chosen.

Grouped barcodes file looks like this:

<id1> <sequence1> <group1>
<id1> <sequence1> <group1>
<id2> <sequence2> <group2>...
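
The group selection described for -l can be pictured with the following Python sketch. It only illustrates the idea of "the group with the most matches": it assumes exact matches at the start of each read, which may not be how fastq-multx counts them, and `parse_grouped` and `best_group` are hypothetical names.

```python
from collections import Counter


def parse_grouped(text):
    """Parse '<id> <sequence> <group>' lines into (id, sequence, group) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            rows.append((fields[0], fields[1].upper(), fields[2]))
    return rows


def best_group(rows, reads):
    """Return the group whose barcodes match the start of the most reads."""
    counts = Counter()
    for read in reads:
        # Count each read at most once per group.
        matched = {group for _id, seq, group in rows if read.startswith(seq)}
        counts.update(matched)
    return counts.most_common(1)[0][0] if counts else None


rows = parse_grouped(
    "LB1 ATCACG TruSeq\n"
    "LB2 CGATGT TruSeq\n"
    "LB3 TTAGGC TruSeq\n"
    "A01_01 TAGCTTGT Fluidigm\n"
)
reads = ["ATCACGAAAA", "CGATGTCCCC", "TTAGGCGGGG", "TAGCTTGTAA"]
chosen = best_group(rows, reads)
print(chosen)
```

Here three of the four made-up reads start with a TruSeq barcode and one with a Fluidigm barcode, so the TruSeq group is chosen.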

Mated reads, if supplied, are kept in sync.
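
To make "kept in sync" concrete, the sketch below walks two mate lists in lockstep and compares read ids up to a delimiter character, in the spirit of the -v option listed under Options. This is a hedged illustration only: `ids_match` and `paired_records` are hypothetical helpers, the read ids are made up, and records are simplified to (id, sequence) pairs rather than full four-line FASTQ entries.

```python
def ids_match(id1, id2, stop_char="/"):
    """Compare two read ids up to (not including) the first stop_char."""
    return id1.split(stop_char, 1)[0] == id2.split(stop_char, 1)[0]


def paired_records(reads1, reads2, stop_char="/"):
    """Yield mate pairs in lockstep, raising if their ids disagree."""
    for (id1, seq1), (id2, seq2) in zip(reads1, reads2):
        if not ids_match(id1, id2, stop_char):
            raise ValueError(f"mate ids out of sync: {id1} vs {id2}")
        yield (id1, seq1), (id2, seq2)


r1 = [("@read1/1", "ACCCGGGG"), ("@read2/1", "CGTATTTT")]
r2 = [("@read1/2", "TTTTAAAA"), ("@read2/2", "GGGGCCCC")]
pairs = list(paired_records(r1, r2))
print(len(pairs))
```

Because each pair is either written together or rejected together, the two output files cannot drift apart.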

Options:

-o FIL1 [FIL2]  Output files (one per input, required)
-g FIL          Determine barcodes from indexed read FIL
-l FIL          Determine barcodes from any read, using FIL as a master list
-b              Force beginning of line
-e              Force end of line
-x              Don't trim barcodes before writing
-n              Don't execute, just print likely barcode list
-v C            Verify that mated ids match up to character C ('/' for Illumina)
-m N            Allow up to N mismatches, as long as they are unique
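
One plausible reading of the -m option ("up to N mismatches, as long as they are unique") is sketched below in Python: a read is assigned only when exactly one barcode lies within N mismatches of its prefix. This interpretation is an assumption, not a description of the tool's actual algorithm, and `hamming` and `assign_barcode` are hypothetical names. The barcodes are those from Example 1.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(read, barcodes, max_mismatches=0):
    """Return the id of the only barcode within max_mismatches of the
    read's prefix, or None if no barcode or more than one qualifies."""
    hits = []
    for barcode_id, seq in barcodes.items():
        prefix = read[:len(seq)]
        if len(prefix) == len(seq) and hamming(prefix, seq) <= max_mismatches:
            hits.append(barcode_id)
    return hits[0] if len(hits) == 1 else None


barcodes = {"mock_a": "ACCC", "salt_a": "CGTA", "mock_b": "GAGT", "salty_b": "CGGT"}
exact = assign_barcode("ACCCTTTT", barcodes)                         # exact match
rescued = assign_barcode("ACCGTTTT", barcodes, max_mismatches=1)     # one mismatch, one candidate
ambiguous = assign_barcode("CGTTTTTT", barcodes, max_mismatches=1)   # two candidates
print(exact, rescued, ambiguous)
```

The third read is one mismatch away from both CGTA and CGGT, so under this reading it is left unassigned rather than guessed.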

Files whose names end in ".gz" are assumed to be compressed, and can be read/written as long as "gzip" is in the path.

Example 1

# This example reads and writes gzipped files. Because -B is used and only one sequence file is present, it looks for barcodes on the "ends" of the sequence and reports which end it found them on.

fastq-multx -B barcodes.fil seq.fastq.gz -o %.fq.gz

Contents of barcodes.fil:

mock_a ACCC
salt_a CGTA
mock_b  GAGT
salty_b  CGGT
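
The idea of looking at both "ends" and then trimming can be sketched as follows. This is a simplified Python illustration under stated assumptions: exact matches only, no reverse-complementing of end barcodes, and sequences only (a real FASTQ record also has a quality string that would need the same trim). `count_end_matches` and `trim` are hypothetical names, and the reads are made up.

```python
def count_end_matches(reads, barcodes):
    """Count reads that carry any barcode at the start and at the end."""
    seqs = list(barcodes.values())
    at_start = sum(any(read.startswith(s) for s in seqs) for read in reads)
    at_end = sum(any(read.endswith(s) for s in seqs) for read in reads)
    return at_start, at_end


def trim(read, barcode, at_start=True):
    """Remove a barcode of known length from one end of the read."""
    return read[len(barcode):] if at_start else read[:-len(barcode)]


barcodes = {"mock_a": "ACCC", "salt_a": "CGTA", "mock_b": "GAGT", "salty_b": "CGGT"}
reads = ["ACCCGGGGTTAA", "CGTATTTTGGCC", "GAGTAAAACCTT"]
start_hits, end_hits = count_end_matches(reads, barcodes)
trimmed = trim(reads[0], barcodes["mock_a"])
print(start_hits, end_hits, trimmed)
```

All three reads match at the start and none at the end, so the start would be chosen, and trimming a 4-base barcode shortens each read by exactly 4 bases.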

Example 2

# This example first determines which "barcode group" to use, selects the most likely set of barcodes from that file, and then proceeds as if only that set had been specified. This allows a single pipeline to work with multiple technologies.

fastq-multx -l barcodes.grp seq2.fastq.gz seq1.fastq.gz -o n/a -o out%.fq

Contents of barcodes.grp:

id      seq     style
LB1     ATCACG  TruSeq
LB2     CGATGT  TruSeq
LB3     TTAGGC  TruSeq
LB4     TGACCA  TruSeq
LB5     ACAGTG  TruSeq
A01_01  TAGCTTGT        Fluidigm
B01_02  CGATGTTT        Fluidigm
C01_03  GCCAATGT        Fluidigm
D01_04  ACAGTGGT        Fluidigm
E01_05  ATCACGTT        Fluidigm

Standard error will output:

Using Barcode Group: TruSeq on File: seq2.fastq.gz (start), Threshold 0.59%

This indicates that the LB1-LB5 barcodes will be used, that the output files will be named after LB1-LB5, and that the barcode was found at the "start" of the reads in the seq2 file.

Comment by alexgraehl@gmail.com, Jan 11, 2012

Hi! Would it be possible to get an example of a real <barcode.fil> example file, and a successful command? I am having trouble getting the format right, and it seems like the lines in the barcode file need to (for some reason) start with a '@'.

Comment by janu.fl...@gmail.com, Jan 13, 2012

Hello, I used the following command to demultiplex based on barcode.

fastq-multx -b barcode-list seq.fastq.gz -o %.fq

I have the barcodes occurring at the beginning of the sequence in the fastq file. My sample barcode list goes like this:

BC1 CTCC
BC4 CTATTA

I looked at the results for BC1.fq and the sequence length is reduced. My original sequence length is 100; after executing the above command it is reduced to 97, even though the barcode length is 4.

I am quite confused here. Does the command separate sequences based on barcode, or does it strip the barcodes from the sequence and write the rest of the sequences to a file?

Thank you

Comment by project member earone...@gmail.com, Jan 13, 2012

It separates sequences based on the barcode, and it also strips the barcodes from the sequence and writes the remainder. You can turn off "trimming" with the option -x, in which case the barcode will be "left on" the sequence.

It is supposed to trim off all 4 bases.... not 3.

Comment by project member earone...@gmail.com, Jan 13, 2012

If you post a sample file to the "Issues" tab above, I can see why it might only be trimming 3.

Comment by msam...@gmail.com, Jan 26, 2012

Hi, I am using the following command:

fastq-multx -g barcodes.txt -b R1.fq R2.fq -o n/a -o r2%.fq

Can you tell me what I am doing wrong?

Comment by mooncost...@gmail.com, Jan 26, 2012

Get rid of the -g

Comment by msam...@gmail.com, Jan 27, 2012

Thanks for the answer, the following syntax worked:

fastq-multx -b barcodes.txt R1.fq R2.fq -o r1%.fq -o r2%.fq

Thank you!

Comment by kapoor.m...@gmail.com, Feb 3, 2012

Hi, I got these files from the sequencing core. I don't know what the s_G1_L001_I1?_001.fastq file is for. Do I need to specify this file for de-multiplexing? These files are output from MiSeq:

s_G1_L001_I1?_001.fastq
s_G1_L001_R1?_002.fastq
s_G1_L001_R1?_001.fastq

Thanks,

Manav

Comment by bradley....@gmail.com, Apr 18, 2012

Hi, I'm having issues running:

fastq-multx -g 1_2.fastq 1_1.fastq 1_3.fastq -o t1.%.fastq -o t2.%.fastq

1_2.fastq contains index reads, 1_1.fastq and 1_3.fastq are reads 1 and 2. Getting a segmentation fault. gdb gives:

Program received signal SIGSEGV, Segmentation fault.
IO_getdelim (lineptr=0x7fffffffe320, n=0x7fffffffe338, delimiter=10, fp=0x0) at iogetdelim.c:58
58      iogetdelim.c: No such file or directory.
        in iogetdelim.c

Running Ubuntu 11.10. Any thoughts?

Thanks, Brad

Comment by project member earone...@gmail.com, May 17, 2012

Sorry, the -g (self-guiding) mode has been fixed in release >= 370, and tests have been added to ensure it doesn't break again.

