|
Introduction
Introduction to Biopieces
Featured Biopieces IntroductionIntroductionIn the most simple form Biopieces can be pieced together on the command line like this (using the pipe character '|'): read_data | calculate_something | write_result However, a more comprehensive analysis could be composed: read_data | select_entries | convert_entries | search_database | evaluate_results | plot_diagram | plot_another_diagram | load_to_database In the documentation a long pipe like the above will typically be written on multiple lines to give a better over view like this: read_data | select_entries | convert_entries | search_database | evaluate_results | plot_diagram | plot_another_diagram | load_to_database You can either type these on the command line in a single line, or copy/paste multiple lines. Each Biopiece perform only a single task and may react only to data records (or Biopiece records) containing certain information. E.g. Biopieces that analyze sequence data only react to Biopiece records containing sequences, while all other records simply are passed through, so that they may be analyzed further downstream. Sometimes Biopiece records are not passed through, e.g. if sequences are assembled into longer sequences - then the input sequences are replaced by the output sequences. Biopieces do this by following the principle of least surprise. The data stream is terminated at the end of the last analysis. The data stream that is piped through the Biopieces consists of records of key/value pairs in the same way a hash does in order to keep as simple a structure as possible. An example record can be seen below: REC_TYPE: PATSCAN MATCH: AGATCAAGTG S_BEG: 7 S_END: 16 ALIGN_LEN: 9 S_ID: piR-t6 STRAND: + PATTERN: AGATCAAGTG --- The --- denotes the delimiter of the records, and each key is a word followed by a ':' and a white-space and then the value. By convention the Biopieces only uses upper case keys. Since the records basically are hash structures this mean that the order of the keys in the stream is unordered, and in the above example it is pure coincidence that HIT_BEG is displayed before HIT_END. All of the Biopieces are able to read and write a data stream to and from file as long as the records are in the Biopieces format. This means that if you are undertaking a lengthy analysis where one of the steps is time consuming, you may save the stream after this step, and subsequently start one or more analysis from that last step (see How to write the data stream to file). If you are running a lengthy analysis, it is highly recommended that you create a small test sample of the data and run that through the pipeline --- and once you are satisfied with the result proceed with the full data set. All of the Biopieces can be supplied with long arguments prefixed with switches or single character arguments prefixed with - switches that can be grouped together (e.g. -xok). In this introduction only the long switches are used to emphasize what these switches do. Getting startedFirst you need to be acquainted with some basic Unix commands, here is a tutorial (Tutorial 1-3 should do): http://www.ee.surrey.ac.uk/Teaching/Unix/ And when you are ready to move on - move on: The Biopiece list_biopieces lists all the Biopieces along with a summary: list_biopieces align_seq Align sequences in stream using Muscle. analyze_seq Analysis the residue composition of each sequence in stream. analyze_vals Determine type, count, min, max, sum and mean for values in stream. blast_seq BLAST sequences in stream against a specified database. blat_seq BLAT sequences in stream against a specified genome. complement_seq Complement sequences in stream. count_records Count the number of records in stream. count_seq Count sequences in stream. count_vals Count the number of times values of given keys exists in stream. create_blast_index Create a BLAST database from sequences in stream for use with BLAST. ... To list the Biopieces for writing different formats, you can use UNIX's grep like this: $ list_biopieces | grep write write_align Write aligned sequences in pretty alignment format. write_bed Write records from stream as BED lines. write_blast Write BLAST records from stream in BLAST tabular format (-m8 and 9). write_fasta Write sequences in FASTA format. write_psl Write records from stream in PSL format. write_tab Write records from stream as tab separated table. ... In order to find out how a specific Biopiece works, you just type the program name without any arguments and press return and the usage of the Biopiece will be displayed. E.g. read_fasta <return>: Biopiece: read_fasta Summary Read FASTA entries from one or more files. Usage read_fasta [options] -i <FASTA file(s)> Options [-? | --help] # Print full usage description. [-i <file(s)> | --data_in=<file(s)>] # Comma separated list of files or glob expression to read. [-n <int> | --num=<int>] # Limit number of records to read. [-I <file> | --stream_in=<file>] # Read input stream from file - Default=STDIN [-O <file> | --stream_out=<file>] # Write output stream to file - Default=STDOUT Help read_fasta is part of the Biopieces framework. http://www.biopieces.org Now, this is only a short summary of the usage, to print the full usage use the --help switch to get this: Biopiece: read_fasta Summary Read FASTA entries from one or more files. Description read_fasta read in sequence entries from FASTA files. Each sequence entry consists of a sequence name prefixed by a '>' followed by the sequence name on a line of its own, followed by one or my lines of sequence until the next entry or the end of the file. The resulting biopiece record consists of the following record type: SEQ_NAME: test SEQ_LEN: 10 SEQ: ATCGATCGAC --- For more about the FASTA format: http://en.wikipedia.org/wiki/Fasta_format Usage read_fasta [options] -i <FASTA file(s)> Options [-? | --help] # Print full usage description. [-i <file(s)> | --data_in=<file(s)>] # Comma separated list of files or glob expression to read. [-n <int> | --num=<int>] # Limit number of records to read. [-I <file> | --stream_in=<file>] # Read input stream from file - Default=STDIN [-O <file> | --stream_out=<file>] # Write output stream to file - Default=STDOUT Examples To read all FASTA entries from a file: read_fasta -i test.fna To read in only 10 records from a FASTA file: read_fasta -n 10 -i test.fna To read all FASTA entries from multiple files: read_fasta -i test1.fna,test2.fna To read FASTA entries from multiple files using a glob expression: read_fasta -i '*.fna' See also read_align write_fasta Author Martin Asser Hansen - Copyright (C) - All rights reserved. mail@maasha.dk August 2007 License GNU General Public License version 2 http://www.gnu.org/copyleft/gpl.html Help read_fasta is part of the Biopieces framework. http://www.biopieces.org Or you can go the the appropriate wiki page read_fasta at Which currently points to Google Code: http://code.google.com/p/biopieces/wiki/read_fasta The Data StreamSince Biopieces work by passing data records from Biopiece to Biopiece you always need more than one Biopiece (execpt special cases). Typically, one Biopiece to parse data from some format, and one Biopiece to perform some analysis like counting or plotting - or writing the data in a new format. Data in different formats can be read with the appropriate Biopiece for that format. The Biopieces are typically named read_<data type> such as read_fasta, read_bed, read_tab, etc., and all behave in a similar manner. Data can be read by supplying the --data_in switch and a file name to the file containing the data: <biopiece> --data_in=<file> Consider the following three FASTA entries in the file test.fna: >test1 AGATTGAGATTATACTTGCAGCACCCGTCC >test2 ACGTATTCGCGGGACAACACTTTCTGAACT >test3 ATTGTAACGAACACCAGATCCCTGCTTGTG In order to read these we use read_fasta like this: read_fasta --data_in=test.fna SEQ_NAME: test1 SEQ: AGATTGAGATTATACTTGCAGCACCCGTCC SEQ_LEN: 30 --- SEQ_NAME: test2 SEQ: ACGTATTCGCGGGACAACACTTTCTGAACT SEQ_LEN: 30 --- SEQ_NAME: test3 SEQ: ATTGTAACGAACACCAGATCCCTGCTTGTG SEQ_LEN: 30 --- and et viola, the sequences are emitted in the data stream as Biopiece records - and displayed in the terminal window. Now we perform the most basic analysis - we count the records using count_records: read_fasta --data_in=test.fna | count_records SEQ: AGATTGAGATTATACTTGCAGCACCCGTCC SEQ_LEN: 30 SEQ_NAME: test1 --- SEQ: ACGTATTCGCGGGACAACACTTTCTGAACT SEQ_LEN: 30 SEQ_NAME: test2 --- SEQ: ATTGTAACGAACACCAGATCCCTGCTTGTG SEQ_LEN: 30 SEQ_NAME: test3 --- RECORDS_COUNT: 3 --- Now the sequence records are supplemented with an extra record holding the RECORD_COUNT. But that is not very handy, if you have millions of sequences, that these are all displayed. This is because the data stream never stops unless the user want to save the stream or by applying the --no_stream switch that will terminate the stream. The --no_stream switch only works with those Biopieces where it makes sense that the user might want to terminate the data stream, i.e. after an analysis step where the user wants to output the result, but not the data stream. Thus if we use the --no_stream with count_records we get: read_fasta --data_in=test.fna | count_records --no_stream RECORDS_COUNT: 3 --- Which has the desired effect. Notice that the output from count_records is a Biopiece record, but it was not part of the data stream. If we want to save the sequence data in a new FASTA file we have to use write_fasta: read_fasta --data_in=test.fna | count_records --no_stream | write_fasta RECORDS_COUNT: 3 --- But this is wrong? - there are no FASTA entries. The output from count_records is a Biopiece record with the RECORD_COUNT and that record is passed to write_fasta, and since write_fasta don't detect any sequence type records (records with both SEQ_NAME and SEQ keys) the record is simple passed through write_fasta. Of cause we need to remove the --no_stream switch to count_records: read_fasta --data_in=test.fna | count_records | write_fasta RECORDS_COUNT: 3 --- >test1 AGATTGAGATTATACTTGCAGCACCCGTCC SEQ: AGATTGAGATTATACTTGCAGCACCCGTCC SEQ_NAME: test1 SEQ_LEN: 30 --- >test2 ACGTATTCGCGGGACAACACTTTCTGAACT SEQ: ACGTATTCGCGGGACAACACTTTCTGAACT SEQ_NAME: test2 SEQ_LEN: 30 --- >test3 ATTGTAACGAACACCAGATCCCTGCTTGTG SEQ: ATTGTAACGAACACCAGATCCCTGCTTGTG SEQ_NAME: test3 SEQ_LEN: 30 --- But this is also wrong? We now have a mixture of FASTA type data and Biopiece records including a RECORDS_COUNT record. This is of cause because we forgot to terminate the stream with --no_stream: read_fasta --data_in=test.fna | count_records | write_fasta --no_stream >test1 AGATTGAGATTATACTTGCAGCACCCGTCC >test2 ACGTATTCGCGGGACAACACTTTCTGAACT >test3 ATTGTAACGAACACCAGATCCCTGCTTGTG Now the sequences are output nicely in FASTA format, but where did the RECORD_COUNT go? That was passed to write_fasta, which still does not react to records without SEQ_NAME and SEQ keys - thus this record was simply dropped. The trick is to save the output of count_records to a file: read_fasta --data_in=test.fna | count_records --data_out=count.txt SEQ: AGATTGAGATTATACTTGCAGCACCCGTCC SEQ_LEN: 30 SEQ_NAME: test1 --- SEQ: ACGTATTCGCGGGACAACACTTTCTGAACT SEQ_LEN: 30 SEQ_NAME: test2 --- SEQ: ATTGTAACGAACACCAGATCCCTGCTTGTG SEQ_LEN: 30 SEQ_NAME: test3 --- Now the COUNT_RECORD entry is saved to the file count.txt and the sequence type Biopiece records are passed through. We can verify the content of count.txt using the UNIX command cat: cat count.txt RECORDS_COUNT: 3 --- Of cause we could still use --no_stream with count_records while at the same time saving the output to file: read_fasta --data_in=test.fna | count_records --data_out=count.txt --no_stream Thus the RECORD_COUNT is written to the count.txt file and no Biopiece records are emitted. Now let us save the record count and the FASTA entries: read_fasta --data_in=test.fna | count_records --data_out=count.txt | write_fasta --data_out=test2.fna SEQ: AGATTGAGATTATACTTGCAGCACCCGTCC SEQ_NAME: test1 SEQ_LEN: 30 --- SEQ: ACGTATTCGCGGGACAACACTTTCTGAACT SEQ_NAME: test2 SEQ_LEN: 30 --- SEQ: ATTGTAACGAACACCAGATCCCTGCTTGTG SEQ_NAME: test3 SEQ_LEN: 30 --- Now the COUNT_RECORD is saved to count.txt like before, and the FASTA entries are saved to test2.fna which we can verify with cat: cat test2.fna >test1 AGATTGAGATTATACTTGCAGCACCCGTCC >test2 ACGTATTCGCGGGACAACACTTTCTGAACT >test3 ATTGTAACGAACACCAGATCCCTGCTTGTG The final touch is to use --no_stream with write_fasta to terminate the stream: read_fasta --data_in=test.fna | count_records --data_out=count.txt | write_fasta --data_out=test2.fna --no_stream Which is identical to the below short-hand command using the short switches: read_fasta -i test.fna | count_records -o count.txt | write_fasta -xo test2.fna This is the end of the introduction. Now it is a good time to head over to the HowTo for inspiration. |