ExportSaintOutputScript  
Updated Apr 12, 2011 by rr.weinb...@gmail.com

Description of the Export Saint script.

Export Saint Output command line Script

export_saint.pl is a script that exports a set of experiments from the Prohits database and formats the output into a set of files that can be used to run SAINT (http://www.nature.com/nmeth/journal/v8/n1/abs/nmeth.1541.html), a probabilistic scoring method for mass spectrometry data.

Running the script

./export_saint.pl

Description

  • Export SAINT files from the Prohits database. Sequence lengths are resolved for UniProt proteins using the fasta file (if provided), otherwise using UniProt dump files. Display names for prey and bait proteins are resolved using the Prohits_proteins database, which requires the PROTEIN_ACCESSION table to be pre-computed to store the name mappings.
  • Regarding sequence length calculation for prey.dat files:
  • if the prey protein in Prohits is identified with:
    • a UniProt accession:
      • determine the length from the --fasta file, or
      • determine the length from the UniProt dump file
      • (refer to /home/james/workspace/prohits/Prohits/script/resource/berkleydb/mk_uniprot_to_length_db.sh)
    • a GI:
      • determine the UniProt accession from the GI using the UniProt dump file
      • (refer to /home/james/workspace/prohits/Prohits/script/resource/berkleydb/mk_gi_to_uniprot_db.sh), then
      • determine the length from the --fasta file, or
      • determine the length from the UniProt dump file
  • Regarding determination of PreyName and BaitName for inter.dat and bait.dat files:
  • if Prohits_proteins.PROTEIN_ACCESSION does not exist or --rebuild is on:
    • create table by querying Protein_Accession and Protein_Class from database Prohits_proteins
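
The sequence-length lookup order described above can be sketched in Python as follows. This is only an illustration and not the script's actual Perl code: the function names and the two dictionaries standing in for the UniProt dump lookups (`dump_lengths`, `gi_to_uniprot`) are hypothetical, and the header layout assumed is the `>sp|UNIPROT_ACCESSION|UNIPROT_ID` form documented for --fasta.

```python
import io


def fasta_lengths(handle):
    """Map UniProt accession -> sequence length from '>sp|ACC|ID' headers."""
    lengths = {}
    acc = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split("|")
            # Headers that do not follow the expected layout are skipped.
            acc = parts[1] if len(parts) >= 3 else None
            if acc is not None:
                lengths[acc] = 0
        elif acc is not None:
            lengths[acc] += len(line)
    return lengths


def prey_length(identifier, id_type, fasta, dump_lengths, gi_to_uniprot):
    """Resolve a prey length: fasta file first, then the dump lookup."""
    if id_type == "GI":
        # A GI is first translated to a UniProt accession.
        identifier = gi_to_uniprot.get(identifier)
        if identifier is None:
            return None
    if identifier in fasta:
        return fasta[identifier]
    return dump_lengths.get(identifier)


# Toy record: the sequence below is made up for illustration.
example = io.StringIO(">sp|P12345|AATM_RABIT\nMALLHS\nAAK\n")
fasta = fasta_lengths(example)
```

A `None` result corresponds to a prey whose length could not be resolved, which is the case the log files and the --keep option deal with.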

High-level Overview of How the Script Works

  • The execution of the script can be broken into the following sequence of stages:
  1. Creating a view (named HITS) of TppProtein and Hits.
  2. Creating a mapping from GI/UniProt Accession/RefSeq to UniProt Accession/EntrezGeneID from the idmapping_selected.tab.gz downloaded from UniProt.
  3. Querying the database for bait.dat/inter.dat/prey.dat (see saint_bait/saint_inter/saint_prey).
  4. Filtering and reporting bait.dat/inter.dat/prey.dat records with missing identifiers or missing prey protein sequence lengths.
  5. Outputting the filtered bait.dat/inter.dat/prey.dat records.
  • The main function follows this sequence, so stepping through it (where execution starts) leads to the corresponding blocks of code. The code is documented throughout, so each stage should be easy to trace.

Important Assumptions

Handling Technical Replicates

  • In the Emili lab, a technical replicate is defined as a repeated mass spectrometry run of the same experimental sample or pull-down.
  • In Prohits, technical replicates are represented as multiple bands (as in gel bands) associated with the same experiment.
  • This introduces a problem when exporting data with the export SAINT script, because data is retrieved by experiment rather than by band. Where multiple technical replicates exist, this can add extra experimental runs to the data. If one replicate is significantly poorer than another, it can penalize that experiment. Technical replicates can be treated in different ways: averaged, summed (mass spectrometry is a stochastic process that yields different numbers on each run), reduced to the maximum value, or left as is. This script sums them.
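
A minimal sketch of the summing approach, assuming each hit is available as a hypothetical `(exp_id, band_id, prey, unique_peptides)` tuple. The real script works against the Prohits database, so the row layout and function name here are illustrative only.

```python
from collections import defaultdict


def sum_replicates(band_hits):
    """Sum peptide counts over all bands of the same experiment and prey."""
    totals = defaultdict(int)
    for exp_id, band_id, prey, unique_peptides in band_hits:
        # band_id is deliberately ignored: bands are technical replicates.
        totals[(exp_id, prey)] += unique_peptides
    return dict(totals)


hits = [
    (101, 1, "P1", 4),  # experiment 101, band 1
    (101, 2, "P1", 6),  # experiment 101, band 2 (technical replicate)
    (101, 1, "P2", 3),
    (102, 1, "P1", 2),
]
totals = sum_replicates(hits)
```

Here the two bands of experiment 101 collapse into a single count per prey, so the experiment contributes one run to the export rather than two.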

Handling Identifier conversion

  • The primary identifier in Prohits is GI.
  • The primary identifier in the Emili lab is UniProt.
  • In order to get optimal mapping:
    • The UniProt to GI mapping is downloaded from: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping
    • PROTEIN_ACCESSION, a table in the Prohits_proteins database, is built using the above file.
    • If a bait/prey maps (through PROTEIN_ACCESSION) to more than one protein, the row whose UniProt accession is in the fasta file is preferred (if a fasta file is provided).
    • If a bait/prey maps to multiple genes, the minimum Entrez gene ID is chosen (this seems to give better coverage than the maximum, probably because smaller gene IDs have existed longer and are better annotated).
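
The two tie-breaking rules above could be combined roughly as follows. This is a sketch under assumptions: rows are represented as hypothetical `(uniprot_accession, entrez_gene_id)` tuples, and applying the fasta preference before the minimum gene ID rule is a guess at how the rules interact, not a description of the script's code.

```python
def choose_mapping(rows, fasta_accessions=None):
    """Pick one (uniprot, entrez_gene_id) row for a bait or prey."""
    candidates = list(rows)
    if fasta_accessions:
        # Prefer rows whose UniProt accession appears in the fasta file.
        in_fasta = [r for r in candidates if r[0] in fasta_accessions]
        if in_fasta:
            candidates = in_fasta
    # Among the remaining rows, take the smallest Entrez gene ID.
    return min(candidates, key=lambda r: r[1]) if candidates else None


rows = [("Q1", 500), ("Q2", 100), ("Q3", 300)]
best_without_fasta = choose_mapping(rows)
best_with_fasta = choose_mapping(rows, {"Q1", "Q3"})
```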

Options

  • --experiments, -e
    • REQUIRED
    • The path to a tab-delimited input file, where
      • 1st column = ExpID # corresponding to ID of the Experiment table
      • 2nd column = 'T' OR 'C' # treat bait in experiment as treatment or control
File Options

  • --bait, -b
    • OPTIONAL
    • The path to an output file for bait data in tab-delimited format, where
      • 1st column = SampleID
      • 2nd column = BaitName
      • 3rd column = Control
    • default: bait.dat (in the current working directory)
  • --baitlog
    • OPTIONAL
    • The path to an output file for bait data (as in the format described in --bait) for samples involving bait proteins missing identifiers (EntrezGeneID, UniProt Accession, GeneName).
    • default: bait.log
  • --inter, -i
    • OPTIONAL
    • The path to an output file for inter data in tab-delimited format, where
      • 1st column = SampleID
      • 2nd column = BaitName
      • 3rd column = PreyName
      • 4th column = PreyUniquePeptides
    • default: inter.dat
  • --interlog
    • OPTIONAL
    • The path to an output file for inter data (as in the format described in --inter) for interactions involving prey proteins missing protein sequence length, or bait/prey proteins missing identifiers (EntrezGeneID, UniProt Accession, GeneName).
    • default: inter.log
  • --prey, -p
    • OPTIONAL
    • The path to an output file for prey data in tab-delimited format, where
      • 1st column = PreyName
      • 2nd column = PreyProteinLength
    • default: prey.dat
  • --preylog
    • OPTIONAL
    • The path to an output file containing prey protein identifiers for proteins missing protein sequence length, or identifiers (EntrezGeneID, UniProt Accession, GeneName).
    • default: prey.log
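
As a quick illustration of the three tab-delimited layouts above, the snippet below formats a few invented records. The names and numbers are placeholders, and using 'T'/'C' as the value of the Control column is an assumption carried over from the --experiments file.

```python
def format_rows(rows):
    """Join each row's fields with tabs, one record per line."""
    return "".join("\t".join(str(f) for f in row) + "\n" for row in rows)


# bait.dat: SampleID, BaitName, Control
bait_rows = [(12, "BAIT1", "T"), (20, "CTRL", "C")]
# inter.dat: SampleID, BaitName, PreyName, PreyUniquePeptides
inter_rows = [(12, "BAIT1", "PREY1", 7)]
# prey.dat: PreyName, PreyProteinLength
prey_rows = [("PREY1", 350)]

bait_text = format_rows(bait_rows)
inter_text = format_rows(inter_rows)
prey_text = format_rows(prey_rows)
```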

Database Options

  • --user
    • Prohits database username
    • default: root
  • --password
    • Prohits database password
    • default:
  • --host
    • Prohits database host
    • default: localhost
  • --database
    • Prohits database name
    • default: Prohits
  • --debug, -d
    • Print all output files to standard output.
    • default: off

Advanced Options

  • --filter, -f
    • OPTIONAL
    • Specify criteria on which to filter the inter.dat and prey.dat output using an SQL statement. Filter statements must use the following identifiers in their query, which are aliases for database fields:
      • # Alias = DatabaseField
      • UniquePeptides = HITS.Pep_num_uniqe
      • Peptides = HITS.Pep_num
      • SearchEngine = HITS.SearchEngine
      • Coverage = HITS.Coverage
      • Expect = HITS.Expect
      • Accession = HITS.ProteinAcc
      • AccType = HITS.AccType
      • Examples:
        • ./export_saint.pl --experiments exp.csv --filter "UniquePeptides > 1" - only include hits that have more than 1 unique peptide
        • ./export_saint.pl --experiments exp.csv --filter "Peptides > 10" - only include hits that have more than 10 peptides
        • ./export_saint.pl --experiments exp.csv --filter "AccType not like 'RM'" - exclude all reverse matches
        • ./export_saint.pl --experiments exp.csv --filter "AccType <> 'RM'" - exclude all reverse matches (alternate syntax)
        • ./export_saint.pl --experiments exp.csv --filter "Accession not like 'Q%'" - exclude all hits whose accession starts with the letter Q
    • NOTES:
      • - For SearchEngine, the possible values (as of this writing) are Mascot, GPM, TPP_Mascot, TPP_GPM, and Sequest_Statquest. In general, however, the possible values are all Hits.SearchEngine values, plus all TppProtein.SearchEngine values with a "TPP" prefix (except Sequest_Statquest).
      • - HITS does not refer to the Hits table; it is a VIEW created on the TppProtein and Hits tables. To find out exactly which fields these HITS fields map to for TppProtein and Hits, refer to the create_hits_view function.
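
One way to picture the alias table for --filter is as a textual substitution applied to the filter expression. The sketch below does exactly that; whether export_saint.pl expands aliases this way is an assumption, and `expand_filter` is a hypothetical name.

```python
import re

# Alias -> database field, as listed for --filter.
ALIASES = {
    "UniquePeptides": "HITS.Pep_num_uniqe",
    "Peptides": "HITS.Pep_num",
    "SearchEngine": "HITS.SearchEngine",
    "Coverage": "HITS.Coverage",
    "Expect": "HITS.Expect",
    "Accession": "HITS.ProteinAcc",
    "AccType": "HITS.AccType",
}

# Whole-word match, so "Peptides" is not matched inside "UniquePeptides".
_ALIAS_RE = re.compile(r"\b(" + "|".join(ALIASES) + r")\b")


def expand_filter(expr):
    """Replace each bare alias in a filter expression with its HITS field.

    Limitation of this sketch: alias words inside quoted string literals,
    or already prefixed with "HITS.", would be rewritten too.
    """
    return _ALIAS_RE.sub(lambda m: ALIASES[m.group(1)], expr)


expanded = expand_filter("Peptides > 10 AND AccType <> 'RM'")
```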
  • --fasta
    • OPTIONAL
    • Calculate sequence lengths for prey proteins using a fasta file with headers of the format:
      • >sp|UNIPROT_ACCESSION|UNIPROT_ID
    • NOTE:
      • - Prey proteins must be identified by UniProt accession in the Prohits database (i.e. in TppProtein.ProteinAcc).
  • --keep
    • OPTIONAL
    • Specifies which group of inter.dat/prey.dat/bait.dat records to keep for prey which are missing information. Such information includes prey protein sequence length, and bait/prey identifiers (EntrezGeneID, UniProt Accession, GeneName). The following values are possible, with corresponding effect:
      • none:
        • Remove inter.dat/prey.dat records involving prey proteins for which identifiers or sequence lengths cannot be resolved, or inter.dat/bait.dat records involving bait proteins for which identifiers could not be resolved.
      • genename:
        • Remove inter.dat/prey.dat records involving prey proteins for which sequence lengths cannot be resolved, but keep those inter.dat/prey.dat/bait.dat records for which bait/prey protein identifiers could not be resolved.
      • all:
        • Do not filter out any inter.dat/prey.dat/bait.dat records for which identifiers or protein sequence lengths cannot be resolved.
    • default: genename
  • --rebuild
    • OPTIONAL
    • If this option is set, the Prohits_proteins.PROTEIN_ACCESSION table will be rebuilt by extracting data from the /home/james/workspace/prohits/Prohits/script/resource/idmapping_selected.tab.gz file. This should be performed when /home/james/workspace/prohits/Prohits/script/resource/idmapping_selected.tab.gz is updated (i.e. re-downloaded because it was deleted).
    • default: off
  • --warn, --nowarn
    • OPTIONAL
    • If --warn is set, the user is prompted (e.g. yes/no) before certain actions are performed (e.g. database modification during --rebuild); --nowarn disables the prompts. Prompts occur during the following stages:
      • - yes/no prompt when --rebuild is set. This operation starts by deleting a large table and then rebuilds it from a file, which is time consuming.
    • default: on
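
Returning to the --keep option described above, its three modes reduce to a small decision rule. The sketch below is one reading of that description, with a hypothetical function name and boolean flags; for bait.dat records, which carry no sequence length, `has_length` would simply be passed as True.

```python
def keep_record(mode, has_identifiers, has_length):
    """Decide whether a record survives filtering under a --keep mode."""
    if mode == "all":
        # Nothing is filtered out.
        return True
    if mode == "genename":
        # Missing identifiers are tolerated; a missing length is not.
        return has_length
    if mode == "none":
        # Both identifiers and sequence length must be resolved.
        return has_identifiers and has_length
    raise ValueError(f"unknown --keep value: {mode!r}")


# A prey with a known length but no resolvable identifiers:
kept_by_mode = {
    mode: keep_record(mode, has_identifiers=False, has_length=True)
    for mode in ("none", "genename", "all")
}
```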