Issue 1: Statquest functionality for Prohits
Project Member Reported by rr.weinb...@gmail.com, Jan 24, 2011
Statquest is a filtering algorithm (similar to the TPP pipeline) used by the Emililab.  The output from this method needs to be parsed into the Prohits database.

Statquest is run on Sequest output and generates an html or text file with scored protein hits and their associated scored peptide hits. 

Important fields from the Statquest output file (* fields are needed):
Protein information --> likely going to TPPProtein Table
1. (*) Locus = Prey/Hit
2. (*) Confidence = Protein probability or score
3. (*) # of unique peptides
4. (*) Total # of peptides
5. (*) Sequence Coverage
6. (*) Length
7. MolWt - molecular weight
8. pI
9. Description

Peptide Information --> likely going to TPPPeptide Table
1. (*) Unique - is this peptide unique in the database? * indicates it is unique
2. (*) Filename - of spectra
3. (*) XCorr - sequest x-correlation
4. (*) DeltaCN - sequest delta CN
5. (*) Confidence - Confidence/Probability of the peptide
6. Rank by Sp
7. Ion proportion
8. (*) Copies
9. (*) Sequence

Additional information we need from the file:
1. database searched
2. search engine used
3. parameters file?
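
As a rough sketch of the records a parser might build from the fields listed above (the class names, attribute names, and Python types are illustrative assumptions, not taken from Prohits or Statquest):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PeptideHit:
    # Fields marked (*) in the list above are required.
    unique: str        # raw "Unique" column value, e.g. "*"
    filename: str      # spectra file name
    xcorr: float       # Sequest x-correlation
    delta_cn: float    # Sequest delta CN
    confidence: float  # peptide confidence/probability (numeric type is assumed)
    copies: int
    sequence: str
    # Fields without (*) are optional; kept as raw text because their
    # exact format is not described here.
    rank_by_sp: Optional[str] = None
    ion_proportion: Optional[str] = None


@dataclass
class ProteinHit:
    locus: str             # prey/hit
    confidence: float      # protein probability or score
    unique_peptides: int
    total_peptides: int
    coverage: float        # sequence coverage
    length: int
    mol_wt: Optional[float] = None
    pi: Optional[float] = None
    description: Optional[str] = None
    peptides: List[PeptideHit] = field(default_factory=list)
```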



Jan 27, 2011
Project Member #1 chocomoo...@gmail.com
Regarding the 4 different files Vincent sent me, it seems we only focused on one when we met: the file JO_100317_ATRIP-N-1R_B1_01.mzXML_dta.stt.  All the fields discussed above are fields in that file.  Is this the only file format the script must parse, then?
JO_100317_ATRIP-N-1R_B1_01.mzXML_dta.stt
Jan 27, 2011
Project Member #2 rr.weinb...@gmail.com
The input will only be one file.  There shouldn't be a huge difference between the stt and txt files, so you could use either.  Any info that is missing from the file but is found in the dta file should simply be added to the stt file as part of the header.
Jan 31, 2011
Project Member #3 chocomoo...@gmail.com
I have a couple questions/concerns, some of which are general and others specific to protein/peptide info:
General:
1) Many of the fields for newly inserted data will end up being NULL; is this alright?
2) In the Prohits database, many columns are called GeneID; is this always a GI?

Protein Info:
I've mapped the protein info data without much trouble.  The mappings for protein_id and gene_id indicate that I will try and use the Prohits_proteins DB to fill in fields in Prohits, such as in Hits (e.g. LocusTag, GeneID).
1) I cannot find where to put the protein length.  However, is storing it really necessary?  Can't we assume that the length can be found from the Sequence in Prohits_proteins.Protein_Sequence?
# .stt protein info to TppProtein mapping
source          # Hits.AccType
protein_id      # Protein_Accession.Acc
gene_id         # Protein_AccessionIPI.Acc, Protein_Accession.UniProtID
confidence      # TppProtein.PROBABILITY
unique_peptides # TppProtein.UNIQUE_NUMBER_PEPTIDES
total_peptides  # TppProtein.TOTAL_NUMBER_PEPTIDES
coverage        # TppProtein.PERCENT_COVERAGE
length          # ?


Peptide Info:
1) I am still at a loss as to where much of the data belongs for peptides.  Can you clarify where it belongs?
# .stt peptide info to TppPeptide mapping
unique       # ?
spectra_file # TppPeptide.XmlFile
xcorr        # ?
deltacn      # ?
confidence   # TppPeptide.Probability
copies       # ?
sequence     # TppPeptide.Sequence

Thanks,
James
Labels: -Type-Enhancement Type-Other
Feb 4, 2011
Project Member #4 chocomoo...@gmail.com
Regarding the "Unique" field for peptides in the StatQuest file, how would you like me to encode this in the database?  For instance, does this field serve the same purpose as Hits.Pep_num / TppProtein.TOTAL_NUMBER_PEPTIDES?  If so, given this list of example "Unique" values, are these mappings to peptide numbers correct?
(Unique => Peptide_num)
*  => 1
+1 => 2
+2 => 3
...

Thanks,
James
Cc: rr.weinberger vinnyvinnyvinny
Feb 4, 2011
Project Member #5 rr.weinb...@gmail.com
Unique can be thought of as a boolean type.  If there is a star, then this peptide is unique in the fasta file; if there is anything other than a star, then this peptide is not unique in the fasta file.  The '+1', ... just indicates that there are 2 instances of this peptide in the fasta file.

I would make the field unique boolean, true if '*' and false otherwise.

Vincent, is there a need to track to what degree the peptide is not unique?
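
A minimal sketch of the encoding described in this comment, assuming the raw column holds either "*" or a "+N" string.  The function name is hypothetical, and keeping the instance count alongside the boolean is optional:

```python
def parse_unique(value):
    """Return (is_unique, instances) for a raw "Unique" column value.

    "*"  -> peptide is unique in the fasta file (1 instance).
    "+N" -> peptide is not unique; read as N + 1 instances, following
            the "+1 means 2 instances" explanation above.
    Anything else is treated as not unique with an unknown count.
    """
    value = value.strip()
    if value == "*":
        return True, 1
    if value.startswith("+") and value[1:].isdigit():
        return False, int(value[1:]) + 1
    return False, None
```

Only the boolean would be stored if the degree of non-uniqueness turns out not to be needed.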
Feb 7, 2011
Project Member #6 chocomoo...@gmail.com
For the .stt file, some protein headers look as follows:

>rm|00019264	99.58%	1	2	17.2%		16659.07	0	

What type of identifier is 00019264 (I'm guessing "rm" should tell me what this is, but I'm not sure what that is either).

Thanks,
James
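
For reference, a sketch of how a header line like the one above could be split into named fields.  It assumes the line is tab-separated and that the columns follow the order of the protein field list in the original report (Locus, Confidence, unique peptides, total peptides, coverage, length, MolWt, pI, description); neither assumption has been checked against the Statquest format, and the names are hypothetical:

```python
# Assumed column order, taken from the protein field list in the report.
PROTEIN_COLUMNS = [
    "locus", "confidence", "unique_peptides", "total_peptides",
    "coverage", "length", "mol_wt", "pi", "description",
]


def parse_protein_header(line):
    """Split a '>'-prefixed protein header line into a dict of raw strings."""
    if not line.startswith(">"):
        raise ValueError("not a protein header line")
    values = line[1:].rstrip("\n").split("\t")
    # Pad short lines so every column name gets a value.
    values += [""] * (len(PROTEIN_COLUMNS) - len(values))
    record = dict(zip(PROTEIN_COLUMNS, (v.strip() for v in values)))
    # Empty columns become None, which would end up as NULL in the database.
    return {key: (val if val != "" else None) for key, val in record.items()}
```

Under these assumptions the example header would give locus "rm|00019264", confidence "99.58%", and an empty (None) length.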
Feb 7, 2011
Project Member #7 rr.weinb...@gmail.com
To help calculate our false positive rate from the MS search engine perspective, we take the sequence database that we are searching against, reverse all the sequences, and append them to the database.  rm|00019264 is an example of a false hit coming up even after Statquest filtering.

I think that it is still ok to put it into the database, as using TPP I have seen some of these records get entered as well. It looks like currently the TPP parser just takes the "rm" and enters it as the protein id.
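
To illustrate the reversed-database idea, here is a small sketch, not the lab's actual reverse script.  The function name, the in-memory list of (header, sequence) pairs, numbering from 1, and the 8-digit zero padding (guessed from the rm|00019264 example) are all assumptions:

```python
def append_reversed(entries):
    """Return the entries followed by a reversed copy of each sequence.

    entries: list of (header, sequence) pairs from a fasta database.
    Each reversed entry gets an "rm|<number>" header; the numbering
    scheme used here is assumed, not confirmed.
    """
    decoys = []
    for number, (_header, sequence) in enumerate(entries, start=1):
        decoys.append((f"rm|{number:08d}", sequence[::-1]))
    return list(entries) + decoys
```

Any hit whose locus starts with "rm|" then matched a reversed sequence, and counting such hits gives an estimate of the false positives.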


Feb 7, 2011
Project Member #8 vinnyvin...@gmail.com
Sure, we can keep track of the degree of uniqueness in the database,
but currently we cannot use the existing interface to search it.  Most
people will look at things at the protein level.
Feb 8, 2011
Project Member #9 chocomoo...@gmail.com
After doing some grep-ing in all the source code, and in the Prohits (and Prohits_proteins) database dumps, I can't find any evidence of 'rm' occurring either in a script (e.g. writing it to a file) or in a database field.  Actually, I only found it occurring in the GeneName and GeneAliase fields of Protein_Class:
EntrezGeneID	LocusTag	GeneName	GeneAliase	TaxID	Description	Status	BioFilter
252529	FBgn0003258	rm	-	7227	rimy	<null>	<null>
252532	FBgn0003262	rmp	rm	7227	rumpled	<null>	<null>
 
If you want, I can simply go ahead and stick "rm" in the ProteinAcc field of TppProtein, but I'd like to know if I'm simply missing something first, or if something is amiss.  Also, if I do this, what should the AccType (accession type) be?

Also, should I simply ignore 00019264?  (I don't understand what this identifier is.)

Thanks,
James

Feb 9, 2011
Project Member #10 vinnyvin...@gmail.com
Look at how the existing parser works; we do want to be consistent.
Look at the updated database on tin.emililab.edu to see the entry for
"rm" proteins.  I think they throw out the identifier.  The identifier
is just an arbitrary number, running from 1 to xxxxxx, that the
reverse script creates when we run it.  The first protein in the fasta
file matches 1 in the reverse set, but with its sequence reversed.  It
is not useful to keep that information, as long as we know it is a
reverse protein.

Vincent Fong
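
One way the handling suggested in this comment might look, as a sketch only.  The function name is hypothetical, and the behaviour of the existing TPP parser has not been verified here:

```python
def split_locus(locus):
    """Return (protein_acc, is_reverse) for a locus such as "rm|00019264".

    Reverse-match hits keep only the "rm" marker; the arbitrary number
    after the "|" is dropped.  Other loci are returned unchanged.
    """
    locus = locus.lstrip(">")
    prefix, sep, _rest = locus.partition("|")
    if sep and prefix == "rm":
        return "rm", True
    return locus, False
```

With this, "rm|00019264" would be stored with ProteinAcc "rm" and flagged as a reverse protein; what AccType to use for such rows is still an open question.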
Mar 29, 2011
Project Member #11 rr.weinb...@gmail.com
Hi James, 
Sorry for the late response but what was the resolution of the reverse match peptides (rm) issue mentioned in comment 8?
Thanks, 
Ruth