remove_adaptor
Locates and removes a specified adaptor sequence from sequences in stream.
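Biopiece: remove_adaptor

Description

If you want to remove an adaptor sequence from the sequences in the stream, use remove_adaptor, which locates and removes the adaptor while allowing for a number of mismatches (but no indels). The remove modes available are:

    before - remove the adaptor and the sequence before it (for 5' adaptors).
    after  - remove the adaptor and the sequence after it (default).
    skip   - locate the adaptor but leave the sequence untouched.
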
For records with both a sequence (SEQ) and a quality score string (SCORE), both are trimmed in remove mode 'before' or 'after'. Make sure the SCORE string is ASCII encoded, not semicolon-separated decimals.

NB! Only the first occurrence of the adaptor in any one sequence is located.

Usage

    ... | remove_adaptor [options]

Options

    [-?          | --help]                # Print full usage description.
    [-a <string> | --adaptor=<string>]    # Adaptor sequence to locate and remove.
    [-m <uint>   | --mismatches=<uint>]   # Max number of mismatches - Default=0
    [-o <uint>   | --offset=<uint>]       # Search sequence from offset (1-based) - Default=1
    [-r <string> | --remove=<string>]     # Remove mode: before|after|skip - Default=after
    [-I <file!>  | --stream_in=<file!>]   # Read input from stream file - Default=STDIN
    [-O <file>   | --stream_out=<file>]   # Write output to stream file - Default=STDOUT
    [-v          | --verbose]             # Verbose output.

Examples

Consider the following FASTA entries in test.fna:

    >CE5_ID00000000
    GAGGAAGAAGGAATATTTATCGTATGCCGTCTT
    >CE5_ID00000001
    GAGGAAGAAGGAATATTTTTCGTATGCCGTCTT
    >CE5_ID00000002
    GAATGTAAGGAAGTGTGTGGATTCGTATgCCGT
    >CE5_ID00000003
    GTTGTAAAGCTCTTTTGTCCtggaATCtTaTGc
    >CE5_ID00000004
    GTAGGATGAGTGACTACTCAAaTCGTATGCCGT

To locate the following standard Solexa 3' adaptor

    TCGTATGCCGTCTTCTGCTTG

use remove_adaptor with the first part of the adaptor and allow for two mismatches with the -m switch:

    read_fasta -i test.fna | remove_adaptor -a TCGTATGCC -m 2

The resulting output will have the adaptor sequence removed if it was found. An ADAPTOR_POS key is also added to the records. An ADAPTOR_POS of -1 indicates that no adaptor sequence was found; this value can be used with grab.

    SEQ: GAGGAAGAAGGAATATTTA
    ADAPTOR_POS: 19
    SEQ_NAME: CE5_ID00000000
    SEQ_LEN: 19
    ---
    SEQ: GAGGAAGAAGGAATATTTT
    ADAPTOR_POS: 19
    SEQ_NAME: CE5_ID00000001
    SEQ_LEN: 19
    ---
    SEQ: GAATGTAAGGAAGTGTGTGGAT
    ADAPTOR_POS: 22
    SEQ_NAME: CE5_ID00000002
    SEQ_LEN: 22
    ---
    SEQ: GTAGGATGAGTGACTACTCAAa
    ADAPTOR_POS: 22
    SEQ_NAME: CE5_ID00000004
    SEQ_LEN: 22
    ---

Use the -r before switch to locate 5' adaptors:

    read_fasta -i test.fna | remove_adaptor -a GAAGAAGG -r before

    SEQ: AATATTTATCGTATGCCGTCTT
    ADAPTOR_POS: 3
    SEQ_NAME: CE5_ID00000000
    SEQ_LEN: 22
    ---
    SEQ: AATATTTTTCGTATGCCGTCTT
    ADAPTOR_POS: 3
    SEQ_NAME: CE5_ID00000001
    SEQ_LEN: 22
    ---
    SEQ: GAATGTAAGGAAGTGTGTGGATTCGTATgCCGT
    ADAPTOR_POS: -1
    SEQ_NAME: CE5_ID00000002
    SEQ_LEN: 33
    ---
    SEQ: GTTGTAAAGCTCTTTTGTCCtggaATCtTaTGc
    ADAPTOR_POS: -1
    SEQ_NAME: CE5_ID00000003
    SEQ_LEN: 33
    ---
    SEQ: GTAGGATGAGTGACTACTCAAaTCGTATGCCGT
    ADAPTOR_POS: -1
    SEQ_NAME: CE5_ID00000004
    SEQ_LEN: 33
    ---

Using the -r skip switch will suppress the adaptor removal:

    read_fasta -i test.fna | remove_adaptor -a TCGTATGCC -m 2 -r skip

    SEQ: GAGGAAGAAGGAATATTTATCGTATGCCGTCTT
    ADAPTOR_POS: 19
    SEQ_NAME: CE5_ID00000000
    SEQ_LEN: 33
    ---
    SEQ: GAGGAAGAAGGAATATTTTTCGTATGCCGTCTT
    ADAPTOR_POS: 19
    SEQ_NAME: CE5_ID00000001
    SEQ_LEN: 33
    ---
    SEQ: GAATGTAAGGAAGTGTGTGGATTCGTATgCCGT
    ADAPTOR_POS: 22
    SEQ_NAME: CE5_ID00000002
    SEQ_LEN: 33
    ---
    SEQ: GTAGGATGAGTGACTACTCAAaTCGTATGCCGT
    ADAPTOR_POS: 22
    SEQ_NAME: CE5_ID00000004
    SEQ_LEN: 33
    ---

Author

Martin Asser Hansen & Selene Fernandez - Copyright (C) - All rights reserved.
mail@maasha.dk
August 2008

License

GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html

Help

remove_adaptor is part of the Biopieces framework.
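
The matching behaviour described above (first occurrence only, mismatches but no indels, three remove modes) can be sketched in a few lines of Python. This is a minimal illustration, not the Biopieces implementation: the function names find_adaptor and remove_adaptor, the dict-based record, and the case-insensitive comparison are all assumptions made for the sketch. ADAPTOR_POS is reported 0-based, which is what the example output above appears to use.

```python
def find_adaptor(seq, adaptor, max_mismatches=0, offset=1):
    """Return the 0-based start of the first window that matches `adaptor`
    with at most `max_mismatches` substitutions (no indels), or -1."""
    if offset < 1:
        raise ValueError("offset is 1-based and must be >= 1")
    # Assumption: comparison ignores case; the real tool may differ.
    seq_up, adaptor_up = seq.upper(), adaptor.upper()
    width = len(adaptor_up)
    for start in range(offset - 1, len(seq_up) - width + 1):
        window = seq_up[start:start + width]
        mismatches = sum(a != b for a, b in zip(window, adaptor_up))
        if mismatches <= max_mismatches:
            return start  # only the first occurrence is reported
    return -1


def remove_adaptor(record, adaptor, max_mismatches=0, offset=1, mode="after"):
    """Return a copy of `record` with ADAPTOR_POS added and, unless the
    mode is 'skip', SEQ (and SCORE, if present) trimmed."""
    if mode not in ("before", "after", "skip"):
        raise ValueError("mode must be 'before', 'after' or 'skip'")
    pos = find_adaptor(record["SEQ"], adaptor, max_mismatches, offset)
    out = dict(record, ADAPTOR_POS=pos)
    if pos >= 0 and mode != "skip":
        if mode == "after":
            keep = slice(0, pos)                    # drop adaptor and what follows
        else:
            keep = slice(pos + len(adaptor), None)  # drop adaptor and what precedes
        out["SEQ"] = out["SEQ"][keep]
        if "SCORE" in out:
            # Assumes one ASCII score character per base.
            out["SCORE"] = out["SCORE"][keep]
    out["SEQ_LEN"] = len(out["SEQ"])
    return out


# First entry of test.fna from the examples above.
record = {"SEQ_NAME": "CE5_ID00000000",
          "SEQ": "GAGGAAGAAGGAATATTTATCGTATGCCGTCTT"}
trimmed = remove_adaptor(record, "TCGTATGCC", max_mismatches=2)
print(trimmed["SEQ"], trimmed["ADAPTOR_POS"], trimmed["SEQ_LEN"])
# prints: GAGGAAGAAGGAATATTTA 19 19
```

The sketch scans full-length windows from left to right and stops at the first one within the mismatch budget, so a truncated adaptor at the very end of a read is not detected; whether remove_adaptor itself behaves this way is not stated in this page.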