README  

Updated Feb 4, 2010 by varuzza

INTRODUCTION

This package implements two significance tests for comparing digital gene expression profiles, described in the article:

Varuzza et al. "Significance tests for comparing digital gene expression profiles"

The KempBasu package comprises two programs, Kemp for the frequentist test and Basu for the Bayesian test, plus some auxiliary scripts.

SYSTEM REQUIREMENTS

KempBasu depends on GSL, GLib, pkg-config and Judy.

GSL, GLib and pkg-config are packaged by most Linux distributions. They are also available for Mac OS X and Windows through MacPorts and Cygwin, respectively.

Judy source code is distributed with KempBasu source code.

The auxiliary scripts, including the wrapper script, require the Ruby runtime.

COMPILING

Download and unpack the KempBasu source code. Then unpack and compile the Judy library:

cd <kempbasu directory>
cd ext-libs/
tar zxvf Judy-1.0.4.tar.gz
cd Judy-1.0.4
./configure
make

You may need administrator (root) privileges to install it. Use one of the following commands:

sudo make install

or

su -c "make install"

Alternatively, you can install Judy under your home directory by running configure with the --prefix option:

./configure --prefix=$HOME

Then install it with the command:

make install

If you install Judy under your HOME, remember to type the commands below before running the KempBasu configure script, so that the compiler and linker can find Judy:

export CFLAGS=-I$HOME/include
export LDFLAGS=-L$HOME/lib

After installing Judy, use the same procedure to compile and install KempBasu:

cd <kempbasu-directory>
./configure
make

...and then

make install

BINARIES

The package provides two binaries: kemp.bin and basu.bin. These programs are intended to be integrated with other programs that provide a friendlier interface to the end user. KempBasu's input and output formats are very simple, so other programs can parse them easily.

A set of scripts is provided to facilitate the use of KempBasu. For those interested in embedding KempBasu in other programs, the input and output files of kemp.bin and basu.bin are described in the "RUNNING LOW LEVEL" section.

RUNNING

To run either program, format the input file as a table with tab-separated fields, as follows:

ID lib_1 lib_2 ... lib_k
SUM S_1 S_2 ... S_k
tag_1 c_11 c_12 ... c_1k
... ... ... ... ...
tag_M c_M1 c_M2 ... c_Mk

...where S_j is the sum of library j and c_ij is the count of tag i in library j. For example, the file examples/example.dat in the distribution has the following content:

TEST  T1 T2
SUM   10000	10000
tag1  1		3
tag2  7		21
tag3  10	30
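
To make the layout concrete, the following Python sketch reads such a table into memory. It is not part of KempBasu: the function name parse_table and the returned tuple are illustrative, and fields are split on any whitespace (an assumption, so that a table whose tabs were turned into spaces still parses).

```python
def parse_table(text):
    """Parse the high-level KempBasu table into (libraries, sums, counts)."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, sum_row, tag_rows = rows[0], rows[1], rows[2:]
    if sum_row[0] != "SUM":
        raise ValueError("second row must start with SUM")
    libraries = header[1:]                  # lib_1 ... lib_k
    sums = [int(s) for s in sum_row[1:]]    # S_1 ... S_k
    if len(sums) != len(libraries):
        raise ValueError("SUM row does not match the header")
    counts = {}
    for row in tag_rows:
        if len(row) != len(libraries) + 1:
            raise ValueError("wrong number of fields for tag " + row[0])
        counts[row[0]] = [int(c) for c in row[1:]]   # c_i1 ... c_ik
    return libraries, sums, counts


# Same content as examples/example.dat, written with explicit tabs.
EXAMPLE = "TEST\tT1\tT2\nSUM\t10000\t10000\ntag1\t1\t3\ntag2\t7\t21\ntag3\t10\t30\n"
libraries, sums, counts = parse_table(EXAMPLE)
print(libraries, sums, counts["tag2"])
```

A check like this can catch a row with a missing field before the file is handed to kemp or basu.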

To run Kemp, type:

kemp <filename>

or to run Basu, type:

basu <filename>

These commands invoke a wrapper Ruby script that converts the input file to the format needed by the underlying C program. On Linux, the script also determines the number of available processor cores and runs the C program on all of them.
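
As a rough illustration of the core-count step, the Python sketch below builds a command line for one of the low-level binaries using the --nprocs option listed in the "RUNNING LOW LEVEL" section. The shipped wrapper is written in Ruby and its actual logic is not shown here; build_command and the file name example.mat are hypothetical.

```python
import os


def build_command(binary, matrix_file, nprocs=None):
    """Return the argument list for kemp.bin or basu.bin using nprocs workers."""
    if nprocs is None:
        # os.cpu_count() may return None when the count cannot be determined.
        nprocs = os.cpu_count() or 1
    return [binary, "--nprocs=%d" % nprocs, matrix_file]


# The list can be handed to subprocess.run() once the binaries are installed.
print(build_command("kemp.bin", "example.mat"))
```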

The output is a file named <filename>-kemp.txt or <filename>-basu.txt. The aforementioned example will generate the following output when executed with kemp:

TEST	T1	T2	pvalue	        alpha	        score	category
tag1	1	3	0.625886	0.0337644	0	U
tag2	7	21	0.012574	0.014972	1.60165	D
tag3	10	30	0.002223	0.0130005	8.29007	D

The output reproduces input data, plus some extra columns:

pvalue: The significance level.
alpha: The critical level (the cutoff of the pvalue).
score: The score of the tag: 10 * (1 - pvalue/alpha) for differentiated tags and zero for the others.
category: U for undifferentiated and D for differentiated, decided by comparing pvalue with alpha (in the example above, a tag is differentiated when its pvalue is below alpha).
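
The relation between these columns can be checked with a few lines of Python. This is a sketch based only on the definitions and the example output in this README, not the Kemp implementation; score_and_category is an illustrative name, and the behavior when pvalue equals alpha exactly is an assumption.

```python
def score_and_category(pvalue, alpha):
    """Recompute the score and category columns from pvalue and alpha."""
    # Assumption: a tag is differentiated only when pvalue is strictly below alpha.
    if pvalue < alpha:
        return 10.0 * (1.0 - pvalue / alpha), "D"
    return 0.0, "U"


# pvalue and alpha columns copied from the example output above.
EXAMPLE_ROWS = [
    ("tag1", 0.625886, 0.0337644),
    ("tag2", 0.012574, 0.014972),
    ("tag3", 0.002223, 0.0130005),
]
for tag, pvalue, alpha in EXAMPLE_ROWS:
    score, category = score_and_category(pvalue, alpha)
    print(tag, round(score, 4), category)
```

Because the printed pvalue and alpha are rounded, the recomputed scores agree with the score column only to about four significant digits.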

The output of the Basu program for the same example is:

TEST	T1	T2	evalue	        ev ie
tag1	1	3	0.61952	        2.3368e-05
tag2	7	21	0.033387	4.5486e-06
tag3	10	30	0.0077514	3.1912e-06

Again, the first columns correspond to the original data and the extra columns are:

evalue: The Bayesian significance level.
ev ie: The error due to the numerical integration.

EXAMPLES

The examples directory contains a test file, GSE6677-clean.dat.gz, compressed with gzip. To test Kemp and Basu with this file, type:

gunzip examples/GSE6677-clean.dat.gz
kemp examples/GSE6677-clean.dat

The other files, with a .mat extension, are formatted for the low-level programs kemp.bin and basu.bin (described below).

AUXILIARY SCRIPTS

The package ships with three Ruby scripts:

kempbasu.rb: The aforementioned wrapper script that runs kemp.bin and basu.bin. It is executed through the symbolic links kemp and basu, respectively. The script decides which program to run based on Ruby's __FILE__ value, that is, the name under which it was invoked.
cutoff.rb: Removes very low count tags. Sometimes you want to remove tags with counts below some threshold. This script calculates the sum of each tag, normalized by the size of the smallest library, and then removes the tags below the threshold.
convert_table.rb: Converts the high-level input format to the low-level input format used by kemp.bin and basu.bin.
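
The filtering idea behind cutoff.rb can be sketched in Python as follows. This is one possible reading of the description above, not the script's actual code: it assumes that each count is rescaled to the size of the smallest library before the counts of a tag are added up, and that a tag is kept when this total reaches the threshold. The function name filter_low_counts and the sample numbers are made up for illustration.

```python
def filter_low_counts(sums, counts, threshold):
    """Keep tags whose normalized total count reaches the threshold."""
    smallest = min(sums)
    kept = {}
    for tag, row in counts.items():
        # Assumption: rescale each count to the smallest library, then add up.
        normalized_total = sum(c * smallest / s for c, s in zip(row, sums))
        if normalized_total >= threshold:
            kept[tag] = row
    return kept


# Hypothetical data: two libraries of different sizes and two tags.
sums = [10000, 20000]
counts = {"rare": [1, 2], "common": [10, 40]}
print(filter_low_counts(sums, counts, threshold=5))
```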

RUNNING LOW LEVEL

The command line options of kemp.bin are:

  kemp.bin [OPTION...]  <matrix name>

Help Options:
  -?, --help                 Show help options

Application Options:
  --save-temp                Save per thread temporary results (for debug)
  -c, --cutoff-pars=file     Parameters of cutoff function
  -n, --nprocs=N             Number of processors

The command line options of basu.bin are:

  basu.bin [OPTION...]  <matrix name>

Help Options:
  -?, --help         Show help options

Application Options:
  --save-temp        Save per thread temporary results (for debug)
  -n, --nprocs=N     Number of processors

The input matrix is formatted as:

M+1 k
S1	S2	...	Sk
X11	X12	...	X1k
...	...	...	...
XM1	XM2	...	XMk

M+1 is the number of rows that follow the first line (one row of library sums plus M tag rows), and k is the number of columns. For example, the content of file examples/test4.mat is:

6 3
5929	7460	592   
144	221	14
397	404	40
200	250	20
2000	2500	200
20	100	2
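
A short Python sketch of how such a matrix file could be produced from library sums and tag counts is shown below. It is not the convert_table.rb implementation; to_matrix is an illustrative name, and the first line is written with a single space between the two numbers, as in test4.mat.

```python
def to_matrix(sums, rows):
    """Render library sums and per-tag count rows in the low-level format."""
    k = len(sums)
    for row in rows:
        if len(row) != k:
            raise ValueError("every row needs one count per library")
    lines = ["%d %d" % (len(rows) + 1, k)]             # first line: M+1 and k
    lines.append("\t".join(str(s) for s in sums))      # S1 ... Sk
    for row in rows:
        lines.append("\t".join(str(x) for x in row))   # Xi1 ... Xik
    return "\n".join(lines) + "\n"


# Sums and counts taken from examples/example.dat (three tags, two libraries).
print(to_matrix([10000, 10000], [[1, 3], [7, 21], [10, 30]]), end="")
```

Tag identifiers are not part of this format, so a caller has to remember the row order to map results back to tags.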

The output of kemp.bin is stored in the file <filename>-kemp. It contains only the program's results, in the same tag order as the input file:

pvalue	   alpha	score	category
0.157513   0.00155825	0	U
0.008553   0.000720078	0	U
0.996594   0.00126509	0	U
0.974522   0.000136022	0	U
0	   0.00467107	10	D

The output file of basu.bin is <filename>-basu. Its content is:

evalue	 ev ie
0.25438	 5.4679e-05
0.016736 5.8991e-05
0.99991  3.268e-08
0.99289	 1.8689e-06
0	 6.2679e-05
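
Since the low-level result files carry no tag identifiers, a caller has to re-attach them by position. The Python sketch below shows one way to do this, assuming the first line of the result file is a header and each following line belongs to the tag at the same position in the input; label_results and the tag names are hypothetical.

```python
def label_results(tag_ids, result_text):
    """Pair each tag ID with its row from a kemp.bin or basu.bin result file."""
    lines = [line for line in result_text.splitlines() if line.strip()]
    rows = [line.split() for line in lines[1:]]   # skip the header line
    if len(rows) != len(tag_ids):
        raise ValueError("expected one result row per tag")
    return list(zip(tag_ids, rows))


# First and last result rows of the kemp.bin output above, with made-up tag names.
RESULT = "pvalue\talpha\tscore\tcategory\n0.157513\t0.00155825\t0\tU\n0\t0.00467107\t10\tD\n"
for tag, fields in label_results(["tagA", "tagB"], RESULT):
    print(tag, fields)
```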

KEMP CUTOFF FUNCTION PARAMETERS

Two sets of cutoff function parameters are provided. The file kemp.pars contains the values calculated for weights (a=4,b=1), whereas the file kemp11.pars contains the values for weights (a=1,b=1). If no parameter file is given on the command line, Kemp searches for the file in the following locations:

$HOME/.kempbasu/kemp.pars
$PWD/kemp.pars
/etc/kemp.pars
/usr/local/etc/kemp.pars
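
The lookup can be pictured with the Python sketch below. The real search happens inside Kemp itself; this example only mirrors the list above, and it assumes that the locations are tried in the order shown and that the first existing file wins. The function names are illustrative.

```python
import os


def candidate_paths(home, cwd, name="kemp.pars"):
    """List the locations from this README, in the order they appear."""
    return [
        os.path.join(home, ".kempbasu", name),
        os.path.join(cwd, name),
        os.path.join("/etc", name),
        os.path.join("/usr/local/etc", name),
    ]


def find_parameter_file(home, cwd):
    """Return the first existing candidate, or None if there is none."""
    for path in candidate_paths(home, cwd):
        if os.path.isfile(path):
            return path
    return None


print(find_parameter_file(os.path.expanduser("~"), os.getcwd()))
```

Passing the file explicitly with -c (--cutoff-pars) avoids depending on this search.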

KEMPBASU LIBRARY

All the code for calculating the significance levels is in the library kempbasu.so. Other programs can link against this library to call the Kemp and Basu functions directly. A binding to a scripting language is also possible. However, the API still needs a cleanup and will change in the future. Documentation of the KempBasu API will be provided after this refactoring.

