gencat

Gene's Comprehensive Annotation Tool

Introduction

Deep transcriptome sequencing (RNA-seq) can recover information missed by earlier array-based technologies, thanks to its high sensitivity, its high throughput and, more importantly, the fact that it requires no prior knowledge of transcript sequences. It has been used extensively to discover new genes, novel splicing isoforms and disease-related chimeras (such as gene fusions). Targeted RNA sequencing reveals that the range, depth and complexity of the human transcriptome are far from fully characterized; many novel genes, new isoforms and rare transcripts remain undiscovered. The unprecedented sequencing capacity of next-generation sequencing (NGS) platforms makes it possible to identify this "dark matter". Annotating these "new genes" is even more important, however, and no tool is available for the task. Annotating a new gene involves filtering out false positives, predicting coding potential and integrating other knowledge. genCAT is designed to fulfill these tasks. It is an open platform that lets users incorporate as many datasets (concepts) as they like to annotate the input gene list, as long as these datasets are prepared in BigWig, BED or BAM/SAM format. These file formats are popular and flexible enough to accommodate many kinds of NGS data (RNA-seq, ChIP-seq, DNA methylation, SNPs, etc.).

Features

Automatically recognizes the type of RNA-seq experiment: paired-end or single-end, strand-specific or not. For strand-specific data, automatically determines how paired reads were stranded and calculates strand specificity.
Automatically recognizes SAM or BAM files.
Precisely determines the coding status of newly identified genes or isoforms.
Precisely defines the ORF (Open Reading Frame) region of protein-coding genes.
Quickly associates known concepts (epigenetic markers, SNPs, etc.) with newly identified genes or isoforms.
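
The strand-specificity feature can be pictured with a small sketch. This is not genCAT's actual code: the function name, the input format (one pair of strands per fragment) and the 0.75 cutoff are all assumptions made for illustration. The idea is simply to compare the strand of read 1 with the strand of the annotated gene it overlaps, and to report the larger of the two fractions as the strand specificity.

```python
from collections import Counter


def infer_strandedness(pairs, cutoff=0.75):
    """Guess the library protocol from (read1_strand, gene_strand) pairs.

    `pairs` is an iterable of 2-tuples of '+' / '-'. Each tuple describes one
    fragment whose read 1 overlaps exactly one annotated gene.
    `cutoff` is a hypothetical threshold, not a genCAT parameter.
    Returns (protocol, specificity).
    """
    counts = Counter("same" if read == gene else "opposite"
                     for read, gene in pairs)
    total = counts["same"] + counts["opposite"]
    if total == 0:
        raise ValueError("no informative fragments")
    # float() keeps the division exact under Python 2.7 as well.
    same_fraction = counts["same"] / float(total)
    specificity = max(same_fraction, 1.0 - same_fraction)
    if specificity < cutoff:
        protocol = "unstranded"
    elif same_fraction > 0.5:
        protocol = "read1-sense"
    else:
        protocol = "read1-antisense"
    return protocol, specificity
```

For single-end data the same comparison applies, with the single read taking the place of read 1.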

Installation

Prerequisites:

* gcc
* python2.7
* numpy (pre-installed on some systems)

If your computer cannot connect to the internet, nose >= 0.10.4 and distribute-0.6.10 are also required.

The following is an example installation on a Linux system. Change the '--root' directory, PYTHONPATH and PATH to match your setup.

tar zxf genCAT-VERSION.tar.gz
cd genCAT-VERSION
python setup.py install    # installs genCAT at system level, or
python setup.py install --root=/home/user/gencat    # installs genCAT at user level
export PYTHONPATH=/home/user/gencat/usr/local/lib/python2.7/site-packages:$PYTHONPATH
export PATH=/home/user/gencat/usr/local/bin:$PATH

The following is an example installation on Mac OS X. Change the '--root' directory, PYTHONPATH and PATH to match your setup. NOTE: To install genCAT on Mac OS X, you need to download and install Xcode beforehand.

tar zxf genCAT-VERSION.tar.gz
cd genCAT-VERSION
python setup.py install    # installs genCAT at system level, or
python setup.py install --root=/home/user/gencat    # installs genCAT at user level
export PYTHONPATH=/home/user/gencat/Library/Python/2.7/site-packages/:$PYTHONPATH
export PATH=/home/user/gencat/usr/local/bin:$PATH

Installation on Windows has not been tested.

Configure file Example

The configure file is the only required input of genCAT. It is a plain text file that stores all the concepts (datasets) the user wants to use to annotate the input gene list. Each row in the configure file is a concept. There are two types of concepts: PRIMARY CONCEPT and ASSOCIATE CONCEPT. PRIMARY CONCEPTS define the user's concept (usually a set of genes in BED format, a reference gene model in BED format and a reference genome in FASTA format), while ASSOCIATE CONCEPTS specify which datasets will be used to annotate the PRIMARY CONCEPT. Each concept comprises at least 3 columns (separated by spaces or tabs):

* Key Word: keywords of primary concepts are reserved and can NOT be modified by the user. Users can choose the keywords of associate concepts freely, as long as they are unique in the concept_list file. The keywords of associate concepts are used to label the corresponding concepts in the final output table.
* File type: PRIMARY CONCEPT accepts BED and FASTA formats. ASSOCIATE CONCEPTS accept BED, BigWig and BAM/SAM files. Every dataset should be prepared in one of these formats. For each entry in the "INPUT_GENE" file, the program profiles the signal for all BigWig files, calculates an expression value (RPKM) for all SAM/BAM files, and intersects the entry with all BED files.
* Absolute Path: absolute path of the concept file.
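
As a reminder of what the RPKM value for SAM/BAM concepts means, here is a minimal sketch of the standard formula (reads per kilobase of exon per million mapped reads). The function and argument names are hypothetical, and how genCAT itself counts reads (for example, whether multi-mapped reads are included) is not described here.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Standard RPKM: reads per kilobase of exon per million mapped reads.

    RPKM = read_count * 10^9 / (exon_length_bp * total_mapped_reads)
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# 500 reads on a 2,000 bp transcript in a library of 20 million mapped reads
print(rpkm(500, 2000, 20000000))  # 12.5
```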

Please note that:

* Lines (rows) starting with '#' are ignored.
* For BED files in associate concepts, the user needs to specify 3 additional options (type, up and down):
    * type=0: TSS-up, TSS-down (window centered on TSS)
    * type=1: TSS-up, TES-down (gene body + flanking window)
    * type=2: TES-up, TES-down (window centered on TES)
    * type=3: CDSS-up, CDSS-down (window centered on CDSS)
    * type=4: CDSS-up, CDSE-down (coding region + flanking window)
    * type=5: CDSE-up, CDSE-down (window centered on CDSE)
    * up = upstream distance limit (bp) added to the feature (defined by "type")
    * down = downstream distance limit (bp) added to the feature (defined by "type")
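
The rules above can be sketched as a small parser. This is an illustration only, not genCAT's own reader: it assumes that the extra BED options appear as `key=value` tokens after the third column, which the text suggests but does not state, and it does not handle paths containing spaces. The keyword `my_snps` and both file paths in the example are made up.

```python
def parse_configure(text):
    """Parse configure-file text into a list of concept dictionaries.

    Assumed layout per row: keyword, file type, absolute path, then
    optional key=value options (type, up, down) with integer values.
    """
    concepts = []
    seen_keywords = set()
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and '#' rows are ignored
        fields = line.split()  # columns are separated by spaces or tabs
        if len(fields) < 3:
            raise ValueError("line %d: expected at least 3 columns" % lineno)
        keyword, file_type, path = fields[:3]
        if keyword in seen_keywords:
            raise ValueError("line %d: duplicate keyword %r" % (lineno, keyword))
        seen_keywords.add(keyword)
        options = {}
        for token in fields[3:]:
            key, sep, value = token.partition("=")
            if not sep:
                raise ValueError("line %d: bad option %r" % (lineno, token))
            options[key] = int(value)
        concepts.append({"keyword": keyword, "file_type": file_type,
                         "path": path, "options": options})
    return concepts


EXAMPLE = """
# keyword     file type   absolute path            options
INPUT_GENE    BED         /data/new_genes.bed
my_snps       BED         /data/gwascatalog.bed    type=1 up=1000 down=1000
"""

for concept in parse_configure(EXAMPLE):
    print(concept["keyword"], concept["file_type"], concept["options"])
```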

Explanation for TSS, TES, CDSS and CDSE.
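
To make the six window types concrete, the sketch below computes the genomic interval for one gene. It reads the abbreviations in the usual way (TSS/TES = transcription start/end site, CDSS/CDSE = coding sequence start/end) and assumes that they are strand-aware, so that on the minus strand the TSS is the larger coordinate and "upstream" points toward higher coordinates. Those conventions, and all function names, are assumptions rather than a description of genCAT's internals.

```python
# Anchors of each window type: (site extended by `up`, site extended by `down`)
WINDOW_ANCHORS = {
    0: ("TSS", "TSS"),
    1: ("TSS", "TES"),
    2: ("TES", "TES"),
    3: ("CDSS", "CDSS"),
    4: ("CDSS", "CDSE"),
    5: ("CDSE", "CDSE"),
}


def landmark_sites(chrom_start, chrom_end, thick_start, thick_end, strand):
    """Map BED coordinates to strand-aware landmark positions."""
    if strand == "+":
        return {"TSS": chrom_start, "TES": chrom_end,
                "CDSS": thick_start, "CDSE": thick_end}
    if strand == "-":
        return {"TSS": chrom_end, "TES": chrom_start,
                "CDSS": thick_end, "CDSE": thick_start}
    raise ValueError("strand must be '+' or '-'")


def annotation_window(chrom_start, chrom_end, thick_start, thick_end,
                      strand, win_type, up, down):
    """Return (start, end) of the window for the given type, up and down."""
    first, second = WINDOW_ANCHORS[win_type]
    sites = landmark_sites(chrom_start, chrom_end,
                           thick_start, thick_end, strand)
    if strand == "+":
        start, end = sites[first] - up, sites[second] + down
    else:
        # On the minus strand, upstream lies at higher coordinates.
        start, end = sites[second] - down, sites[first] + up
    return max(0, start), end


# Gene at 1000-5000 with coding region 1200-4800, up=500, down=200
print(annotation_window(1000, 5000, 1200, 4800, "+", 0, 500, 200))  # (500, 1200)
print(annotation_window(1000, 5000, 1200, 4800, "-", 0, 500, 200))  # (4800, 5500)
```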

Example of configure file

Prebuilt Concepts (update 01/19/2012)

Conservation Score

Description: File format is BigWig. The hg19 conservation scores were calculated from alignments of 46 vertebrate genomes. The mm9 conservation scores were calculated from alignments of 30 vertebrate genomes. The original files were downloaded from the UCSC Genome Browser.
hg19 phastCons Score
hg19 PhyloP Score
mm9 phastCons Score
mm9 PhyloP Score
md5sum.txt

Potential etiologic SNPs

Description: File format is BED.
gwascatalog.bed

Somatic Mutations related to human cancer

Description: File format is BED. Compiled from the COSMIC database.
non_coding_variant
coding point and fusion mutations

TO DO list

Support GTF files
Add a logistic regression model to predict whether an input gene is coding or noncoding
Call variants (SNPs) from input SAM/BAM files and associate them with known variants

Contact

Liguo Wang (wangliguo78 AT gmail.com or liguow AT bcm.edu)
Deqiang Sun (deqiangs AT bcm.edu)
Wei Li (wl1 AT bcm.edu)

Project Information

The project was created on Jan 25, 2012.

License: GNU GPL v2

Labels:
Bioinformatics RNA-SEQ Linux Mac Python annotation genCAT
