gencat


Gene's Comprehensive Annotation Tool

Introduction


Deep transcriptome sequencing (RNA-seq) can recover information missed by earlier array-based technologies, thanks to its high sensitivity, its high throughput and, more importantly, the fact that it requires no prior knowledge of transcript sequences. It has been used extensively to discover new genes, novel splicing isoforms and disease-related chimeras (such as gene fusions). Targeted RNA sequencing reveals that the range, depth and complexity of the human transcriptome are far from fully characterized; many novel genes, new isoforms and rare transcripts remain undiscovered. The unprecedented sequencing capacity of next-generation sequencing (NGS) platforms makes it possible to identify this "dark matter". Annotating these "new genes" is even more important, however, and no tool is available for that purpose. Annotating a new gene involves filtering out false positives, predicting coding potential and integrating other knowledge. genCAT is designed to fulfill these tasks. It is an open platform that lets users incorporate as many datasets (concepts) as they wish to annotate the input gene list, as long as these datasets are prepared in BigWig, BED or BAM/SAM format. These file formats are popular and flexible enough to accommodate different kinds of NGS data (RNA-seq, ChIP-seq, DNA methylation, SNPs, etc.).

Features


  • Automatically recognize the RNA-seq experiment type: paired-end or single-end, strand-specific or not. For strand-specific data, automatically determine how paired reads were stranded and calculate strand specificity.
  • Automatically recognize SAM or BAM files.
  • Precisely determine the coding status of newly identified genes or isoforms.
  • Precisely define the ORF (Open Reading Frame) region of protein-coding genes.
  • Quickly associate known concepts (epigenetic markers, SNPs, etc.) with newly identified genes or isoforms.

Installation


Prerequisites:

  • gcc
  • python2.7
  • numpy (pre-installed on some systems)

If your computer cannot connect to the internet, nose >= 0.10.4 and distribute-0.6.10 are also required.

The following is an example installation on a Linux system. Change the '--root' directory, PYTHONPATH and PATH to match your setup.

  1. tar zxf genCAT-VERSION.tar.gz
  2. cd genCAT-VERSION
  3. For a system-level install, run: python setup.py install
  4. Or, for a user-level install, run: python setup.py install --root=/home/user/gencat
  5. export PYTHONPATH=/home/user/gencat/usr/local/lib/python2.7/site-packages:$PYTHONPATH
  6. export PATH=/home/user/gencat/usr/local/bin:$PATH

The following is an example installation on Mac OS X. Change the '--root' directory, PYTHONPATH and PATH to match your setup. NOTE: To install genCAT on Mac OS X, you need to download and install Xcode beforehand.

  1. tar zxf genCAT-VERSION.tar.gz
  2. cd genCAT-VERSION
  3. For a system-level install, run: python setup.py install
  4. Or, for a user-level install, run: python setup.py install --root=/home/user/gencat
  5. export PYTHONPATH=/home/user/gencat/Library/Python/2.7/site-packages/:$PYTHONPATH
  6. export PATH=/home/user/gencat/usr/local/bin:$PATH

Installation on Windows has not been tested.

Configure file Example


The configure file is the only required input of genCAT. It is a plain text file that stores all the concepts (datasets) the user wants to use to annotate the input gene list. Each row in the configure file is a concept. There are two types of concepts: PRIMARY CONCEPT and ASSOCIATE CONCEPT. A PRIMARY CONCEPT defines the user's own data (usually a set of genes in BED format, a reference gene model in BED format and a reference genome in FASTA format), while ASSOCIATE CONCEPTS specify which datasets will be used to annotate the PRIMARY CONCEPT. Each concept consists of at least 3 columns (separated by spaces or tabs):

  • Key Word: key words of primary concepts are reserved and can NOT be modified by the user. Users can choose the key words of associated concepts freely, as long as they are unique in the concept_list file. The key words of associated concepts are used to label the corresponding concepts in the final output table.
  • File type: a PRIMARY CONCEPT accepts BED and FASTA formats. ASSOCIATE CONCEPTS accept BED, BigWig and BAM/SAM files. Every dataset should be prepared in one of these formats. For each entry in the "INPUT_GENE" file, the program profiles the signal for all BigWig files, calculates the expression value (RPKM) for all SAM/BAM files and computes the intersection with all BED files.
  • Absolute Path: the absolute path of the concept file.
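For readers unfamiliar with the expression value mentioned above, RPKM is conventionally defined as reads per kilobase of feature per million mapped reads. The following is a minimal sketch of that standard formula only; the function name and arguments are hypothetical, and it is not taken from genCAT's own code, which may count reads or measure feature length differently.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads Per Kilobase of feature per Million mapped reads.

    Standard definition: count / (length in kb * mapped reads in millions),
    which simplifies to count * 1e9 / (length in bp * mapped reads).
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (float(feature_length_bp) * total_mapped_reads)


# Hypothetical numbers: 500 reads on a 2,000 bp gene in a library
# of 10 million mapped reads.
print(rpkm(500, 2000, 10000000))
```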

Please note that:

  • Lines (rows) starting with '#' are ignored.
  • For BED files in associate concepts, the user needs to specify 3 additional options (type, up and down):
      • type=0: TSS-up, TSS-down (window centered on TSS)
      • type=1: TSS-up, TES-down (gene body + flanking window)
      • type=2: TES-up, TES-down (window centered on TES)
      • type=3: CDSS-up, CDSS-down (window centered on CDSS)
      • type=4: CDSS-up, CDSE-down (coding region + flanking window)
      • type=5: CDSE-up, CDSE-down (window centered on CDSE)
      • up = upstream distance limit (bp) added to the feature (defined by "type")
      • down = downstream distance limit (bp) added to the feature (defined by "type")

Explanation of TSS, TES, CDSS and CDSE:
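The row format described above can be illustrated with a small parser. This is a minimal sketch and not genCAT's own code: the names `Concept` and `parse_concepts` are hypothetical, and it assumes that the extra BED options are written as whitespace-separated key=value tokens (for example type=0 up=1000 down=1000), that blank lines may be skipped, and that paths contain no spaces. The key words and paths in the example are placeholders.

```python
from collections import namedtuple

# One row of the configure file (hypothetical record type).
Concept = namedtuple("Concept", ["keyword", "file_type", "path", "options"])


def parse_concepts(lines):
    """Parse configure-file rows into Concept records."""
    concepts = []
    seen = set()
    for lineno, raw in enumerate(lines, 1):
        line = raw.strip()
        # Rows starting with '#' are ignored; skipping blank rows is an assumption.
        if not line or line.startswith("#"):
            continue
        fields = line.split()  # columns are separated by spaces or tabs
        if len(fields) < 3:
            raise ValueError("line %d: expected at least 3 columns" % lineno)
        keyword, file_type, path = fields[:3]
        # Key words label the output columns, so they must be unique.
        if keyword in seen:
            raise ValueError("line %d: duplicate key word %r" % (lineno, keyword))
        seen.add(keyword)
        # Assumption: extra columns look like type=0, up=1000, down=1000.
        options = {}
        for item in fields[3:]:
            key, sep, value = item.partition("=")
            if not sep:
                raise ValueError("line %d: bad option %r" % (lineno, item))
            options[key] = int(value)
        concepts.append(Concept(keyword, file_type, path, options))
    return concepts


example = """\
# primary concept (placeholder key word and path)
INPUT_GENE  BED  /path/to/new_genes.bed
# associate concept with the three BED options
H3K4me3  BED  /path/to/h3k4me3_peaks.bed  type=0  up=1000  down=1000
""".splitlines()

for concept in parse_concepts(example):
    print(concept)
```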

https://sites.google.com/site/liguowangspublicsite/home/gene_model.png
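To make the six window types concrete, here is a hedged sketch of how a window could be derived from a gene record. It is an illustration, not genCAT's implementation. It assumes that TSS/TES are the transcript start/end and CDSS/CDSE the coding-region start/end, that these landmarks swap ends on the minus strand, and that "up" always extends toward the 5' side of the gene. The names `Gene`, `landmarks` and `window` are hypothetical, and BED off-by-one conventions are ignored.

```python
from collections import namedtuple

# Genomic coordinates of one gene (hypothetical record type).
Gene = namedtuple(
    "Gene", ["chrom", "tx_start", "tx_end", "cds_start", "cds_end", "strand"]
)

# type -> (landmark extended upstream, landmark extended downstream)
WINDOW_TYPES = {
    0: ("TSS", "TSS"),
    1: ("TSS", "TES"),
    2: ("TES", "TES"),
    3: ("CDSS", "CDSS"),
    4: ("CDSS", "CDSE"),
    5: ("CDSE", "CDSE"),
}


def landmarks(gene):
    """Map landmark names to coordinates, swapping ends on the minus strand."""
    if gene.strand == "-":
        return {"TSS": gene.tx_end, "TES": gene.tx_start,
                "CDSS": gene.cds_end, "CDSE": gene.cds_start}
    return {"TSS": gene.tx_start, "TES": gene.tx_end,
            "CDSS": gene.cds_start, "CDSE": gene.cds_end}


def window(gene, win_type, up, down):
    """Return (start, end) of the annotation window in genomic coordinates."""
    first, last = WINDOW_TYPES[win_type]
    pos = landmarks(gene)
    if gene.strand == "-":
        # Upstream is toward larger coordinates on the minus strand.
        start, end = pos[last] - down, pos[first] + up
    else:
        start, end = pos[first] - up, pos[last] + down
    return max(start, 0), end  # clip at the chromosome start


gene = Gene("chr1", 1000, 5000, 1200, 4800, "+")
print(window(gene, 0, 500, 200))  # window around the TSS
print(window(gene, 1, 500, 200))  # gene body plus flanks
```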

Example of configure file

Prebuilt Concepts (update 01/19/2012)


Conservation Score

TO DO list


  • Support GTF file
  • Add a logistic regression model to predict whether an input gene is coding or noncoding
  • Call variants (SNPs) from input SAM/BAM files and associate them with known variants

Contact


  • Liguo Wang (wangliguo78 AT gmail.com or liguow AT bcm.edu)
  • Deqiang Sun (deqiangs AT bcm.edu)
  • Wei Li (wl1 AT bcm.edu)

Project Information

The project was created on Jan 25, 2012.

Labels:
Bioinformatics RNA-SEQ Linux Mac Python annotation genCAT