biclustering


Improving performances of suboptimal greedy iterative biclustering heuristics via localization


This Project web page includes experimental data and the code implementations that accompany our work published in the Bioinformatics Journal Improving performances of suboptimal greedy iterative biclustering heuristics via localization.


PAPER ABSTRACT


Motivation:

Biclustering gene expression data is the problem of extracting submatrices of genes and conditions exhibiting significant correlation across both the rows and the columns of a data matrix of expression values. Even the simplest versions of the problem are computationally hard. Most of the proposed solutions therefore employ greedy iterative heuristics that locally optimize a suitably assigned scoring function. Methods: We provide a fast and simple pre-processing algorithm called localization that reorders the rows and columns of the input data matrix in such a way as to group correlated entries in small local neighborhoods within the matrix. The proposed localization algorithm takes its roots from effective use of graph-theoretical methods applied to problems exhibiting a similar structure to that of biclustering. In order to evaluate the effectivenesss of the localization pre-processing algorithm, we focus on three representative greedy iterative heuristic methods. We show how the localization preprocessing can be incorporated into each representative algorithm to improve biclustering performance. Furthermore, we propose a simple biclustering algorithm, Random Extraction After Localization (REAL) that randomly extracts submatrices from the localization reprocessed data matrix, eliminates those with low similarity scores, and provides the rest as correlated structures representing biclusters.


Results:

We compare the proposed localization pre-processing with another pre-processing alternative, non-negative matrix factorization. We show that our fast and simple localization procedure provides similar or even better results than the computationally heavy matrix factorization pre-processing with regards to H-value tests. We next demonstrate that the performances of the three representative greedy iterative heuristic methods improve with localization preprocessing when biological correlations in the form of functional enrichment and PPI verification constitute the main performance criteria. The fact that the random extraction method based on localization REAL performs better than the representative greedy heuristic methods under same criteria also confirms the effectiveness of the suggested pre-processing method.


Availability

Supplementary document (Appendix to the paper) is available. Other supplementary material including code implementations in LEDA C++ library, experimental data, and the results are also available at http://hacivat.khas.edu.tr/~cesim/bicluster.html.


SUPPLEMENTARY FILES AND RESULTS


Experimental Results:

The paper includes a detailed discussion of several experimental evaluations. Here we present the source data of all the implemented experiments.

| Expression Matrix Analysis | Bicluster Structure Analysis | Enrichment Evalutation(see Table 2 of the paper) | PPI Network Evaluation| |:----------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------|


Implementation:

The implementation of each experiment is published as a separate file. The source code for the REAL algorithm as well as the binary files are downloadable. In order to recompile from source code, you will need LEDA C++ library 5.1 and over.

License Agreement

  1. You need to be an academic user to download and use this software. You may use this software for only academic research/teaching purposes.

  2. You may not use this software for commercial purposes.

  3. You may not distribute, modify or sell this software.

  4. If you agree the terms above, click on the File links below.

| Expression Matrix Analysis | Bicluster Structure Analysis | Enrichment Evalutation | PPI Network Evaluation| |:-----------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------|

The REAL algorithm source code package includes binary executables both for Linux(Compiled in Pardus) and WinXP. You can also recompile if you obtain LEDA C++ library 5.1 and Visual Studio 2005 or the gcc compiler.


Localized Datasets

Localized versions of two datasets widely used in biclustering studies are included. We also provide a comparison of our localization method with the nsNMF method which can be considered as an alternative way of preprocessing the expression data. nsNMF versions of the dartasets may include differences as compared to the original datasets due to its preprocessing.

| Yeast(2993x173) | Thaliana(734x69) | |:-----------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------|


Project Information

Labels:
Bioinformatics Biclustering Academic Localization GeneExpression REAL Ordering