Introduction
The idea behind this tool is to reduce the amount of "piping" going on in a pipeline. A lot of time, disk space and nail-chewing is spent keeping files in sync, figuring out which barcodes are on which samples, etc. The goal of this program is to make it easier to demultiplex possibly paired-end sequences, and also to allow the "guessing" of barcode sets based on master lists of barcoding protocols (Fluidigm, TruSeq, etc.).
Usage
Usage: fastq-multx [-g|-l] <barcodes.fil> <read1.fq> -o r1.%.fq [mate.fq -o r2.%.fq] ...
Output files must contain a '%' sign which is replaced with the barcode id in the barcodes file.
Barcodes file looks like this:
<id1> <sequence1>
<id2> <sequence2> ...
Default is to guess the -bol or -eol based on clear stats.
If -g is used, then its parameter is an index lane, and frequently occurring sequences are used.
If -l is used then all barcodes in the file are tried, and the *group* with the *most* matches is chosen.
Grouped barcodes file looks like this:
<id1> <sequence1> <group1>
<id2> <sequence2> <group1>
<id3> <sequence3> <group2>...
Mated reads, if supplied, are kept in sync.
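The two barcode file layouts described above (plain and grouped) can be read with one small parser. The following Python sketch is only an illustration of the format, not the code fastq-multx uses. The function name parse_barcodes is hypothetical, and skipping any line whose second column is not made of A/C/G/T/N (which also drops a header line such as "id seq style") is an assumption.

```python
from collections import defaultdict

VALID_BASES = set("ACGTN")

def parse_barcodes(text):
    """Parse '<id> <sequence> [<group>]' lines into {group: {id: sequence}}.

    Barcodes without a third column are stored under the group key None.
    """
    groups = defaultdict(dict)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line
        bc_id, seq = fields[0], fields[1].upper()
        if not set(seq) <= VALID_BASES:
            continue  # assumption: non-sequence lines (e.g. a header) are skipped
        group = fields[2] if len(fields) > 2 else None
        groups[group][bc_id] = seq
    return dict(groups)

plain = parse_barcodes("mock_a ACCC\nsalt_a CGTA\n")
grouped = parse_barcodes(
    "id seq style\nLB1 ATCACG TruSeq\nLB2 CGATGT TruSeq\nA01_01 TAGCTTGT Fluidigm\n"
)
print(plain)
print(grouped)
```

Keeping ungrouped barcodes under a None key lets the same structure serve both file layouts.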
Options:
-o FIL1 [FIL2] Output files (one per input, required)
-g FIL Determine barcodes from indexed read FIL
-l FIL Determine barcodes from any read, using FIL as a master list
-b Force beginning of line
-e Force end of line
-x Don't trim barcodes before writing
-n Don't execute, just print likely barcode list
-v C Verify that mated id's match up to character C ('/' for illumina)
-m N Allow up to N mismatches, as long as they are unique
Files named ".gz" are assumed to be compressed, and can be read/written as long as "gzip" is in the path.
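To make the -m option more concrete, here is a minimal Python sketch of mismatch-tolerant matching. It reads "as long as they are unique" as "exactly one barcode gives the best match", which is an assumption about the intended behavior. The function names are hypothetical and only the start of the read is checked.

```python
def hamming(a, b):
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def assign_barcode(read, barcodes, max_mismatches=1):
    """Return the id of the single best barcode at the read start, or None."""
    hits = []
    for bc_id, seq in barcodes.items():
        prefix = read[:len(seq)]
        if len(prefix) < len(seq):
            continue  # read is shorter than the barcode
        dist = hamming(prefix, seq)
        if dist <= max_mismatches:
            hits.append((dist, bc_id))
    if not hits:
        return None
    hits.sort()
    if len(hits) > 1 and hits[0][0] == hits[1][0]:
        return None  # two barcodes tie for best: ambiguous, so reject
    return hits[0][1]

barcodes = {"mock_a": "ACCC", "salt_a": "CGTA", "mock_b": "GAGT", "salty_b": "CGGT"}
print(assign_barcode("CGTAAAAA", barcodes))  # exact match to salt_a
print(assign_barcode("ACCGTTTT", barcodes))  # one mismatch, only mock_a is close
print(assign_barcode("CGTTAAAA", barcodes))  # one mismatch to both salt_a and salty_b
```

The third read is one base away from two different barcodes, so it is left unassigned rather than guessed.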
Example 1
# This example reads and writes gzipped files. Since -B is used and only one sequence file is present, it looks for barcodes on the "ends" of the sequence and tells you which end it found them on.
fastq-multx -B barcodes.fil seq.fastq.gz -o %.fq.gz
Contents of barcodes.fil:
mock_a ACCC
salt_a CGTA
mock_b GAGT
salty_b CGGT
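What Example 1 describes (pick the end with the barcodes, split reads by barcode, trim the barcode, and substitute the barcode id for '%' in the output name) can be sketched in Python as follows. This is a simplified, in-memory illustration with exact matching only, not the tool's implementation. The function names and the "unmatched" bucket name are assumptions.

```python
def find_barcode(seq, barcodes, end):
    """Return (id, barcode) for an exact match at the given end, else (None, None)."""
    for bc_id, bc in barcodes.items():
        if (end == "start" and seq.startswith(bc)) or (end == "end" and seq.endswith(bc)):
            return bc_id, bc
    return None, None

def guess_end(seqs, barcodes):
    """Pick the read end ('start' or 'end') where more reads carry a barcode."""
    counts = {
        end: sum(find_barcode(s, barcodes, end)[0] is not None for s in seqs)
        for end in ("start", "end")
    }
    return max(counts, key=counts.get)

def demultiplex(records, barcodes, pattern="%.fq.gz", trim=True):
    """Group (name, seq, qual) records by output name; '%' becomes the barcode id."""
    end = guess_end([seq for _, seq, _ in records], barcodes)
    out = {}
    for name, seq, qual in records:
        bc_id, bc = find_barcode(seq, barcodes, end)
        if bc_id is None:
            bc_id = "unmatched"  # assumed name for reads with no barcode
        elif trim:
            if end == "start":
                seq, qual = seq[len(bc):], qual[len(bc):]
            else:
                seq, qual = seq[:-len(bc)], qual[:-len(bc)]
        out.setdefault(pattern.replace("%", bc_id), []).append((name, seq, qual))
    return end, out

barcodes = {"mock_a": "ACCC", "salt_a": "CGTA", "mock_b": "GAGT", "salty_b": "CGGT"}
records = [
    ("r1", "ACCCTTTTGGGG", "I" * 12),
    ("r2", "CGTAAAAATTTT", "I" * 12),
    ("r3", "TTTTTTTTTTTT", "I" * 12),
]
end, out = demultiplex(records, barcodes)
print(end, sorted(out))
```

Passing trim=False leaves the barcode on the sequence, which mirrors what the -x option is documented to do.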
Example 2
# This example first determines which "barcode group" to use, selects the most likely set of barcodes from that file, and then proceeds as if only that set was specified. This allows for a single pipeline that works with multiple technologies.
fastq-multx -l barcodes.grp seq2.fastq.gz seq1.fastq.gz -o n/a -o out%.fq
Contents of barcodes.grp:
id seq style
LB1 ATCACG TruSeq
LB2 CGATGT TruSeq
LB3 TTAGGC TruSeq
LB4 TGACCA TruSeq
LB5 ACAGTG TruSeq
A01_01 TAGCTTGT Fluidigm
B01_02 CGATGTTT Fluidigm
C01_03 GCCAATGT Fluidigm
D01_04 ACAGTGGT Fluidigm
E01_05 ATCACGTT Fluidigm
Standard error will output:
Using Barcode Group: TruSeq on File: seq2.fastq.gz (start), Threshold 0.59%
This indicates that the LB1-LB5 barcodes will be used, that the files will be named LB1-LB5, and that the barcode was at the "start" of the reads in the seq2 file.
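The group selection in Example 2 can be pictured as counting, for each group, how many reads carry one of its barcodes and keeping the group with the most hits. The Python sketch below shows only that counting idea, using the barcodes.grp contents above. It checks exact matches at the start of each read and ignores the threshold and the end-of-read check, so it is an assumption-laden simplification with a hypothetical function name, not the tool's algorithm.

```python
from collections import Counter

def guess_group(reads, groups):
    """Return (best_group, counts), counting reads that start with a group's barcode."""
    counts = Counter()
    for read in reads:
        for group, barcodes in groups.items():
            if any(read.startswith(seq) for seq in barcodes.values()):
                counts[group] += 1
    best = counts.most_common(1)[0][0] if counts else None
    return best, counts

groups = {
    "TruSeq": {"LB1": "ATCACG", "LB2": "CGATGT", "LB3": "TTAGGC",
               "LB4": "TGACCA", "LB5": "ACAGTG"},
    "Fluidigm": {"A01_01": "TAGCTTGT", "B01_02": "CGATGTTT", "C01_03": "GCCAATGT",
                 "D01_04": "ACAGTGGT", "E01_05": "ATCACGTT"},
}
reads = [
    "ATCACGAAAAAA",  # TruSeq LB1 only
    "TTAGGCAAAAAA",  # TruSeq LB3 only
    "TGACCAGGGGGG",  # TruSeq LB4 only
    "CGATGTTTAAAA",  # matches TruSeq LB2 and Fluidigm B01_02
    "GGGGGGGGGGGG",  # no barcode
]
best, counts = guess_group(reads, groups)
print(best, dict(counts))
```

Note that several TruSeq barcodes in this list are prefixes of Fluidigm barcodes, so a single read can count toward both groups. That is why the decision is made on totals across many reads.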
Hi! Would it be possible to get an example of a real <barcode.fil> example file, and a successful command? I am having trouble getting the format right, and it seems like the lines in the barcode file need to (for some reason) start with a '@'.
Hello, I used the following command to demultiplex based on barcode.
fastq-multx -b barcode-list seq.fastq.gz -o %.fq
I have the barcodes occurring at the beginning of the sequence in the fastq file. My sample barcode list goes like this: BC1 CTCC BC4 CTATTA
I looked at the results for BC1.fq and the sequence length is reduced. My original sequence length is 100; after executing the above command my sequence length is reduced to 97, though the barcode length is 4.
I am quite confused here. Does the command separate sequences based on barcode, or does it strip the barcodes from the sequence and write the rest of the sequences to a file?
Thank you
It separates sequences based on the barcode, and it also strips the barcodes from the sequence and writes the remainder. You can turn off "trimming" with the option -x, in which case the barcode will be "left on" the sequence.
It is supposed to trim off all 4 bases, not 3.
If you post a sample file to the "Issues" tab above, I can see why it might only be trimming 3.
Hi, I am using the following command: fastq-multx -g barcodes.txt -b R1.fq R2.fq -o n/a -o r2%.fq
Can you tell me what I am doing wrong?
Get rid of the -g
Thanks for the answer, the following syntax worked:
fastq-multx -b barcodes.txt R1.fq R2.fq -o r1%.fq -o r2%.fq
Thank you!
Hi, I got these files from the sequencing core. I don't know what this s_G1_L001_I1?_001.fastq file is for. Do I need to specify this file for de-multiplexing? These files are output from MiSeq: s_G1_L001_I1?_001.fastq s_G1_L001_R1?_002.fastq s_G1_L001_R1?_001.fastq
Thanks,
Manav
Hi, I'm having issues running:
fastq-multx -g 1_2.fastq 1_1.fastq 1_3.fastq -o t1.%.fastq -o t2.%.fastq
1_2.fastq contains index reads, 1_1.fastq and 1_3.fastq are reads 1 and 2. Getting a segmentation fault. gdb gives:
Program received signal SIGSEGV, Segmentation fault.
IO_getdelim (lineptr=0x7fffffffe320, n=0x7fffffffe338, delimiter=10, fp=0x0) at iogetdelim.c:58
58 iogetdelim.c: No such file or directory.
Running Ubuntu 11.10. Any thoughts?
Thanks, Brad
Sorry, the -g (self-guiding) mode has been fixed in releases >= 370, and tests have been added to ensure it doesn't break again.