Issue 1: Statquest functionality for Prohits
Statquest is a filtering algorithm (similar to the TPP pipeline) used by the Emililab. The output from this method needs to be parsed into the Prohits database. Statquest is run on Sequest output and generates an HTML or text file with scored protein hits and their associated scored peptide hits.

Important fields from the Statquest output file ((*) fields are needed):

Protein information --> likely going to TPPProtein Table
1. (*) Locus = Prey/Hit
2. (*) Confidence = Protein probability or score
3. (*) # of unique peptides
4. (*) Total # of peptides
5. (*) Sequence Coverage
6. (*) Length
7. MolWt - molecular weight
8. pI
9. Description

Peptide information --> likely going to TPPPeptide Table
1. (*) Unique - is this peptide unique in the database? * indicates it is unique
2. (*) Filename - of spectra
3. (*) XCorr - sequest x-correlation
4. (*) DeltaCN - sequest delta CN
5. (*) Confidence - Confidence/Probability of the peptide
6. Rank by Sp
7. Ion proportion
8. (*) Copies
9. (*) Sequence

Additional information we need from the file:
1. database searched
2. search engine used
3. parameters file?
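As a rough illustration of the field list above, the records could be held in two small Python dataclasses before they are written to the database. This is only a sketch: the class names, attribute names, and Python types are assumptions made for illustration, and the optional attributes are simply the fields not marked (*).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PeptideHit:
    """One scored peptide line (hypothetical container, names are assumptions)."""
    unique: str                 # raw marker from the file; '*' means unique
    filename: str               # spectra file name
    xcorr: float                # Sequest cross-correlation
    delta_cn: float             # Sequest delta CN
    confidence: float           # peptide confidence/probability
    copies: int
    sequence: str
    rank_by_sp: Optional[int] = None       # not marked (*)
    ion_proportion: Optional[str] = None   # not marked (*); kept as raw text


@dataclass
class ProteinHit:
    """One scored protein hit with its peptides (hypothetical container)."""
    locus: str                  # Prey/Hit
    confidence: float           # protein probability or score
    unique_peptides: int
    total_peptides: int
    coverage: float             # sequence coverage
    length: int
    mol_wt: Optional[float] = None         # not marked (*)
    pi: Optional[float] = None             # not marked (*)
    description: Optional[str] = None      # not marked (*)
    peptides: List[PeptideHit] = field(default_factory=list)
```

Keeping the non-required fields optional means a record can still be built when the file omits them.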
Jan 27, 2011
The input will only be one file. There shouldn't be a huge difference between the stt and txt file so you could use either. Any info that is missing in the file but is found in the dta file should simply be added to the stt file as part of the header.
Jan 31, 2011
I have a couple of questions/concerns, some of which are general and others specific to protein/peptide info:

General:
1) Many of the fields for newly inserted data will end up being NULL; is this alright?
2) In the Prohits database, many columns are called GeneID; is this always a GI?

Protein Info:
I've mapped the protein info data without much trouble. The mappings for protein_id and gene_id indicate that I will try to use the Prohits_proteins DB to fill in fields in Prohits, such as in Hits (e.g. LocusTag, GeneID).
1) I cannot find where to put the protein length. However, is this really necessary? Can't we find the length by querying the Sequence in Prohits_proteins.Protein_Sequence?

# .stt protein info to TppProtein mapping
source           # Hits.AccType
protein_id       # Protein_Accession.Acc
gene_id          # Protein_AccessionIPI.Acc, Protein_Accession.UniProtID
confidence       # TppProtein.PROBABILITY
unique_peptides  # TppProtein.UNIQUE_NUMBER_PEPTIDES
total_peptides   # TppProtein.TOTAL_NUMBER_PEPTIDES
coverage         # TppProtein.PERCENT_COVERAGE
length           # ?

Peptide Info:
1) I am still at a loss as to where much of the peptide data belongs. Can you clarify where it should go?

# .stt peptide info to TppPeptide mapping
unique        # ?
spectra_file  # TppPeptide.XmlFile
xcorr         # ?
deltacn       # ?
confidence    # TppPeptide.Probability
copies        # ?
sequence      # TppPeptide.Sequence

Thanks, James
Labels:
-Type-Enhancement Type-Other
Feb 4, 2011
Regarding the "Unique" field for peptides in the StatQuest file, how would you like me to encode this in the database? For instance, does this field serve the same purpose as Hits.Pep_num / TppProtein.TOTAL_NUMBER_PEPTIDES? If so, given this list of example "Unique" values, are these mappings to peptide numbers correct?

(Unique => Peptide_num)
*  => 1
+1 => 2
+2 => 3
...

Thanks, James
Cc:
rr.weinberger vinnyvinnyvinny
Feb 4, 2011
Unique can be thought of as a boolean type. If there is a star, then this peptide is unique in the fasta file; if there is anything other than a star, then this peptide is not unique in the fasta file. The '+1', ... just indicates that there are 2 instances of this peptide in the fasta file. I would make the unique field boolean: true if '*' and false otherwise. Vincent, is there a need to track to what degree the peptide is not unique?
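The rule described here could be sketched in Python roughly as follows. The function name `parse_unique` is hypothetical, and returning the instance count alongside the boolean is an assumption made in case the degree of non-uniqueness turns out to be worth tracking; for any marker that is neither '*' nor '+N', the count is left as None instead of being guessed.

```python
import re
from typing import Optional, Tuple

# Matches markers such as '+1' or '+2' (N extra instances in the fasta file).
_EXTRA_INSTANCES = re.compile(r"^\+(\d+)$")


def parse_unique(marker: str) -> Tuple[bool, Optional[int]]:
    """Return (is_unique, instances_in_fasta) for a StatQuest 'Unique' value.

    '*'  -> unique, one instance.
    '+N' -> not unique, N + 1 instances ('+1' means 2 instances).
    Anything else -> not unique, instance count unknown (None).
    """
    marker = marker.strip()
    if marker == "*":
        return True, 1
    match = _EXTRA_INSTANCES.match(marker)
    if match:
        return False, int(match.group(1)) + 1
    return False, None
```

If only the boolean is stored, the second element can simply be ignored.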
Feb 7, 2011
For the .stt file, some protein headers look as follows:

>rm|00019264 99.58% 1 2 17.2% 16659.07 0

What type of identifier is 00019264? (I'm guessing "rm" should tell me what this is, but I'm not sure what that is either.)

Thanks, James
Feb 7, 2011
To help calculate our false positive rate from the MS search engine's perspective, we take the sequence database that we are searching against, reverse all the sequences, and append them to the database. rm|00019264 is an example of a false hit coming up even after Statquest filtering. I think it is still OK to put it into the database, as I have seen some of these records get entered when using TPP as well. It looks like the TPP parser currently just takes the "rm" and enters it as the protein ID.
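A minimal sketch of that behaviour, assuming the locus token looks like `rm|00019264`: keep "rm" as the protein ID, drop the number, and flag the hit as a reverse match. The function name `split_locus` and the returned tuple shape are illustrative assumptions, not the existing TPP parser's code.

```python
from typing import Tuple

REVERSE_PREFIX = "rm"  # prefix used for reversed (decoy) sequences


def split_locus(locus: str) -> Tuple[str, bool]:
    """Return (protein_id, is_reverse_match) for a locus token.

    For 'rm|<number>' the number is discarded and 'rm' is used as the ID.
    Any other locus is returned unchanged and flagged as a normal hit.
    """
    token = locus.strip().lstrip(">")
    prefix, separator, _number = token.partition("|")
    if separator and prefix == REVERSE_PREFIX:
        return REVERSE_PREFIX, True
    return token, False
```

Flagging reverse matches explicitly keeps them easy to count or exclude later, even though they are stored like other hits.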
Feb 7, 2011
Sure, we can keep track of the degree of uniqueness in the database, but currently we cannot use the existing interface to search it. Most people will look at things at the protein level.
Feb 8, 2011
After doing some grep-ing in all the source code, and in the Prohits (and Prohits_proteins) database dumps, I can't find any evidence of 'rm' occurring either in a script (e.g. writing it to a file) or in a database field. Actually, I only found it in the GeneName and GeneAliase fields of Protein_Class:

EntrezGeneID  LocusTag     GeneName  GeneAliase  TaxID  Description  Status  BioFilter
252529        FBgn0003258  rm        -           7227   rimy         <null>  <null>
252532        FBgn0003262  rmp       rm          7227   rumpled      <null>  <null>

If you want, I can simply go ahead and stick "rm" in the ProteinAcc field of TppProtein, but I'd like to know first whether I'm missing something or whether something is amiss. Also, if I do this, what should the AccType (accession type) be? And should I simply ignore 00019264? (I don't understand what this identifier is.)

Thanks, James
Feb 9, 2011
Look at how the existing parser works; we do want to be consistent. Look at the updated database on tin.emililab.edu to see the entry for "rm" proteins; I think they throw out the identifier. The identifier is just an arbitrary number that the reverse script creates, running from 1 to xxxxxx. The first protein in the fasta file matches number 1 in the reverse set, but with its sequence reversed. It is not useful to keep that information, as long as we know it is a reverse protein.

Vincent Fong
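For readers unfamiliar with the reverse database, the numbering described here could be sketched as below. This is not the lab's actual reverse script: the function names are hypothetical, the `rm|` header layout and the 8-digit zero padding are assumptions based only on the example `rm|00019264`, and the input is taken as already-parsed (header, sequence) pairs.

```python
from typing import Iterable, Iterator, List, Tuple

Entry = Tuple[str, str]  # (header, sequence)


def reversed_entries(proteins: Iterable[Entry], width: int = 8) -> Iterator[Entry]:
    """Yield one reversed entry per protein, numbered from 1 in input order."""
    for number, (_header, sequence) in enumerate(proteins, start=1):
        # The original header is dropped; only the running number is kept.
        yield f"rm|{number:0{width}d}", sequence[::-1]


def with_decoys(proteins: Iterable[Entry]) -> List[Entry]:
    """Return the original entries followed by their reversed counterparts."""
    targets = list(proteins)
    return targets + list(reversed_entries(targets))
```

Because the number only records the position in the original fasta file, nothing is lost by discarding it once a hit is known to be a reverse protein.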
Mar 29, 2011
Hi James, Sorry for the late response, but what was the resolution of the reverse match peptides (rm) issue mentioned in comment 8? Thanks, Ruth