Updated Dec 10, 2009 by marecki
MegatestSetup  
How to set everything up to be able to run megatests.

Introduction

Megatests are a special class of tests performed on Pygr code. They differ from regular tests in that they require significant amounts of input data, disc space and/or CPU time. As such, they are not run automatically by the test suite.

The aim of most existing megatests is to ensure adequate performance of Pygr under heavy load. Running such tests is particularly important during active development of code, as it allows one to quickly detect any changes negatively impacting performance; on the other hand, they can also be used to benchmark Pygr performance or, indirectly, observe performance-degrading system problems on a certain machine or machines.

The purpose of this page is to provide complete instructions for everyone wishing to run megatests on their own system - from obtaining the necessary data, through actually running the megatests, to automatic periodic running and reporting.

Details

Requirements

You will need the following to be able to run megatests:

Downloading and preparing data

Data files needed by Pygr megatests can be divided into three categories: sequence data in Pygr's seqdb.BlastDB format, NLMSA files for different tests, and miscellaneous input/output files. The latter two are installed differently from the first; both procedures are described here.

Last but not least, since NLMSA-building megatests are run for both file and SQL storage back-ends it is necessary to import data from the last category above into a MySQL database. For your convenience we have provided MySQL dump-files which can be used for this purpose.

Presently there are two distinct classes of megatests, differing in what the primary genome used by each class is and therefore named after the genome in question: dm2 (Drosophila melanogaster, or common fruit fly) and hg18 (Homo sapiens, or human). Each class uses its own set of input and output data; it is recommended to keep them in separate directories.

SequenceFileDB sequence files

The easiest way of obtaining SequenceFileDB sequence-data files is to fetch them using Pygr itself, from the UCLA XML-RPC server - that way the downloaded files are automatically registered in the local Pygr resource database. Information on how to do this can be found on the PygrResourceDownloader page; for your convenience, the lists below provide data-set names in the format understood by Pygr.

The following sequences must be obtained:

  1. For dm2 megatests
    • Bio.Seq.Genome.ANOGA.anoGam1
    • Bio.Seq.Genome.APIME.apiMel2
    • Bio.Seq.Genome.DROAN.droAna3
    • Bio.Seq.Genome.DROER.droEre2
    • Bio.Seq.Genome.DROGR.droGri2
    • Bio.Seq.Genome.DROME.dm2
    • Bio.Seq.Genome.DROMO.droMoj3
    • Bio.Seq.Genome.DROPE.droPer1
    • Bio.Seq.Genome.DROPS.dp4
    • Bio.Seq.Genome.DROSE.droSec1
    • Bio.Seq.Genome.DROSI.droSim1
    • Bio.Seq.Genome.DROVI.droVir3
    • Bio.Seq.Genome.DROWI.droWil1
    • Bio.Seq.Genome.DROYA.droYak2
    • Bio.Seq.Genome.TRICA.triCas2
  2. For hg18 megatests
    • Bio.Seq.Genome.ANOCA.anoCar1
    • Bio.Seq.Genome.BOVIN.bosTau3
    • Bio.Seq.Genome.CANFA.canFam2
    • Bio.Seq.Genome.CAVPO.cavPor2
    • Bio.Seq.Genome.CHICK.galGal3
    • Bio.Seq.Genome.DANRE.danRer4
    • Bio.Seq.Genome.DASNO.dasNov1
    • Bio.Seq.Genome.ECHTE.echTel1
    • Bio.Seq.Genome.ERIEU.eriEur1
    • Bio.Seq.Genome.FELCA.felCat3
    • Bio.Seq.Genome.FUGRU.fr2
    • Bio.Seq.Genome.GASAC.gasAcu1
    • Bio.Seq.Genome.HORSE.equCab1
    • Bio.Seq.Genome.HUMAN.hg18
    • Bio.Seq.Genome.LOXAF.loxAfr1
    • Bio.Seq.Genome.MACMU.rheMac2
    • Bio.Seq.Genome.MONDO.monDom4
    • Bio.Seq.Genome.MOUSE.mm8
    • Bio.Seq.Genome.ORNAN.ornAna1
    • Bio.Seq.Genome.ORYLA.oryLat1
    • Bio.Seq.Genome.OTOGA.otoGar1
    • Bio.Seq.Genome.PANTR.panTro2
    • Bio.Seq.Genome.RABIT.oryCun1
    • Bio.Seq.Genome.RAT.rn4
    • Bio.Seq.Genome.SORAR.sorAra1
    • Bio.Seq.Genome.TETNG.tetNig1
    • Bio.Seq.Genome.TUPGB.tupBel1
    • Bio.Seq.Genome.XENTR.xenTro2
  3. For the restartIterator megatest (note significant overlap with dm2 megatests; also see the comment in the next section)
    • Bio.Seq.Genome.ANOGA.anoGam1
    • Bio.Seq.Genome.APIME.apiMel3
    • Bio.Seq.Genome.DROAN.droAna3
    • Bio.Seq.Genome.DROER.droEre2
    • Bio.Seq.Genome.DROGR.droGri2
    • Bio.Seq.Genome.DROME.dm3
    • Bio.Seq.Genome.DROMO.droMoj3
    • Bio.Seq.Genome.DROPE.droPer1
    • Bio.Seq.Genome.DROPS.dp4
    • Bio.Seq.Genome.DROSE.droSec1
    • Bio.Seq.Genome.DROSI.droSim1
    • Bio.Seq.Genome.DROVI.droVir3
    • Bio.Seq.Genome.DROWI.droWil1
    • Bio.Seq.Genome.DROYA.droYak2
    • Bio.Seq.Genome.TRICA.triCas2

Once the files have been downloaded they require no further attention.

NLMSA and other files

Pygr megatests can be divided into two classes depending on whether or not they require NLMSA to be pre-built in a controlled environment. The first class consists of all dm2 and hg18 megatests; the second consists of the restartIterator megatest only.

If pre-built NLMSA are required

The necessary files are available (as tar archives) on the Web, at http://biodb.bioinformatics.ucla.edu/MEGATEST/ . Download the archives and unpack them into directories of your choice. You need the following files:

  1. NLMSA for dm2 megatests
    • maf_data.tar
    • maf_test.tar
  2. NLMSA for hg18 megatests
    • axt_data3.tar
    • maf_data3.tar
    • maf_test3.tar
  3. Miscellaneous files, needed by both classes
    • input_and_results.tar

This time some post-installation steps are necessary before the data can be used: the files dm2_multiz15way.seqDictP (from maf_test.tar) and hg18_multiz28way.seqDictP (from maf_test3.tar) contain hardcoded paths which will need to be changed to reflect your directory structure. Assuming the final path components are to stay the same (i.e. you keep the data in the directories in which they came in the archives), simply open the files in question using an ordinary text editor and replace all the occurrences of result/pygr_data and /result/pygr_megatest with the path(s) of your choice.
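The search-and-replace step above can also be scripted. The following is only a minimal sketch: the function name `rewrite_paths` is hypothetical, and the target directories in the usage note are placeholders to be replaced with your own paths.

```python
from pathlib import Path


def rewrite_paths(filename, replacements):
    """Replace hardcoded path fragments in a file, in place.

    `replacements` is a sequence of (old, new) string pairs.
    Returns the total number of substitutions made.
    """
    path = Path(filename)
    # Work on raw bytes so that no assumption about the file's
    # text encoding is needed.
    data = path.read_bytes()
    count = 0
    for old, new in replacements:
        old_b, new_b = old.encode(), new.encode()
        count += data.count(old_b)
        data = data.replace(old_b, new_b)
    path.write_bytes(data)
    return count
```

For example, `rewrite_paths("dm2_multiz15way.seqDictP", [("/result/pygr_megatest", "/srv/megatest"), ("result/pygr_data", "/srv/pygr_data")])` would rewrite both fragments in one pass. Keeping a backup copy of the original file before running it is advisable.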

If pre-built NLMSA are not required

Simply download the Bio.MSA.UCSC.dm3_multiz15way alignment using Pygr, the same way you have downloaded all the sequence files. This has the added benefit of Pygr being able to resolve sequence dependencies of the alignment - in other words, should any required sequences be missing from the local resource database they shall be downloaded automatically.

The download test

Since version 0.8.1 Pygr uses a new version of the download megatest, which serves the desired file from a local HTTP server and thus reduces the test's dependence on a fast and stable network connection. This means you first have to download the file to be served, i.e. a text dump of an NLMSA. We recommend http://biodb.bioinformatics.ucla.edu/PYGRDATA/dm2_multiz9way.txt.gz: it is the same file that older versions of this test used, it is large but not too large, and building it can take advantage of sequence data required by other megatests.
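To illustrate the idea of serving a file from a local HTTP server and downloading it over the loopback interface, here is a small sketch using only the Python 3 standard library. It is not the code used by the download megatest itself; the helper names `serve_directory` and `fetch` are hypothetical.

```python
import functools
import http.server
import threading
import urllib.request


def serve_directory(directory, port=0):
    """Start a background HTTP server on 127.0.0.1 serving `directory`.

    Port 0 asks the operating system for any free port.
    Returns the server object; call shutdown() on it when done.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory)
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(server, filename):
    """Download `filename` from the local server and return its bytes."""
    port = server.server_address[1]
    url = "http://127.0.0.1:%d/%s" % (port, filename)
    # An empty ProxyHandler makes sure no system-wide proxy is used
    # for what is a purely local request.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url) as response:
        return response.read()
```

Because both ends of the transfer run on the same machine, the outcome no longer depends on the speed or availability of an external network.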

MySQL data

You can find gzip-compressed MySQL dump files (produced with version 5) at http://biodb.bioinformatics.ucla.edu/MEGATEST/. Simply create a new database on your server, download all the .sql.gz files and import them into the said database using e.g. the standard MySQL client (mysql).
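If there are many dump files, the import can be scripted. The sketch below assumes that the `mysql` client is on your PATH, that the target database already exists, and that connection settings come from your MySQL option file; the function name `import_dumps` and its parameters are hypothetical.

```python
import glob
import gzip
import os
import shutil
import subprocess


def import_dumps(dump_dir, database, client=("mysql",)):
    """Feed every *.sql.gz file in `dump_dir` to the MySQL client.

    Each dump is decompressed on the fly and streamed to the client's
    standard input. Returns the list of files imported.
    """
    imported = []
    for dump in sorted(glob.glob(os.path.join(dump_dir, "*.sql.gz"))):
        proc = subprocess.Popen(list(client) + [database],
                                stdin=subprocess.PIPE)
        with gzip.open(dump, "rb") as sql:
            shutil.copyfileobj(sql, proc.stdin)
        proc.stdin.close()
        if proc.wait() != 0:
            raise RuntimeError("import of %s failed" % dump)
        imported.append(dump)
    return imported
```

Calling `import_dumps("/path/to/dumps", "your_database")` is then roughly equivalent to running `mysql your_database` once per file with the decompressed dump on standard input.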

Configuration

MySQL access

Megatests assume the database they use is located on the default MySQL server and accessed using the default user name and password. If your system-wide defaults do not match the desired values of these parameters, you'll need to override them using a standard MySQL option file. Under Linux/Unix you will most likely use the per-user option file $HOME/.my.cnf; have it contain something like this:

[client]
port=3306
host=your_database_server
user=your_account
password=your_password

For more information on the subject of MySQL option files, see http://dev.mysql.com/doc/refman/5.1/en/option-files.html.

The config file

Database access aside, configuration of Pygr megatests is performed entirely by setting appropriate keywords in appropriate files. At present, megatests and the associated tools search for their configuration in the following files:

  1. .pygrrc in the user's home directory;
  2. pygr.cfg in the user's home directory;
  3. .pygrrc in the current directory;
  4. pygr.cfg in the current directory.

All of the keywords listed below may appear in any of these files. The files are read in the order listed here; should a keyword appear in more than one of them, the value read last overrides the earlier ones.

The config files follow standard syntax understood by Python's ConfigParser module, i.e. very similar to that of Windows INI files. Among other things this means keywords in a file are divided into sections. Megatests use keywords from four sections: megatests for general configuration, megatests_dm2 and megatests_hg18 for settings pertaining to specific input data sets and megatests_download for downloader-specific options.
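The lookup order and overriding behaviour described above can be reproduced with a few lines of Python. This is a sketch rather than the code the megatests actually use: it relies on `configparser`, the Python 3 spelling of the ConfigParser module named above, and the function name `load_megatest_config` is hypothetical.

```python
import configparser
import os


def load_megatest_config(home=None, cwd=None):
    """Read the four candidate files in the documented order.

    Files read later override values from files read earlier.
    Returns the parser and the list of files that were found.
    """
    home = home or os.path.expanduser("~")
    cwd = cwd or os.getcwd()
    candidates = [
        os.path.join(home, ".pygrrc"),
        os.path.join(home, "pygr.cfg"),
        os.path.join(cwd, ".pygrrc"),
        os.path.join(cwd, "pygr.cfg"),
    ]
    config = configparser.ConfigParser()
    # read() silently skips files that do not exist.
    found = config.read(candidates)
    return config, found
```

After `config, found = load_megatest_config()`, a call such as `config.get("megatests", "testOutputBaseDir")` returns the value from the last file that defines it, and `found` shows which files contributed.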

The download test

The version of the download megatest available since Pygr 0.8.1 requires you to specify where the test's built-in HTTP server is to find the NLMSA file to serve for downloading. This is done by setting the httpdServedFile keyword in the megatests_download section to the path and name of that file. Optionally, you can also set httpdPort to override the default TCP port (28145) used by the built-in HTTP server.

Note: the download megatest in 0.8.1 has a bug in parsing httpdPort which prevents the test from running. To work around that problem, set httpdPort in the config file and change line 38 of tests/downloadNLMSA_megatest.py from

server_addr = ('127.0.0.1', httpdPort)

to

server_addr = ('127.0.0.1', int(httpdPort))
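The fix above suggests that the port number reaches the test as a string, which is how Python's config parser returns values unless asked otherwise. The following sketch shows the difference using the Python 3 `configparser` module; the section name megatests_download is taken from the list of sections given earlier, and the file path in the sample is a placeholder.

```python
import configparser

# Sample configuration; the served-file path is only a placeholder.
SAMPLE = """
[megatests_download]
httpdServedFile = /path/to/dm2_multiz9way.txt.gz
httpdPort = 28145
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)

# get() always returns text, which a socket address cannot use as a port.
raw_port = config.get("megatests_download", "httpdPort")

# getint() (or an explicit int(...)) yields a usable integer port.
port = config.getint("megatests_download", "httpdPort")
server_addr = ("127.0.0.1", port)
```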

Choosing the variant

Both data sets used by megatests are quite large, which makes running tests over them in their entirety quite time-consuming - for example, on a machine with a 2.8 GHz dual-core Opteron CPU and a SATA-2 RAID disc a single such run takes approximately 30 hours! Therefore, it may be desirable to run megatests only on subsets of the two data sets. In order to do this, specify appropriate subsets using the smallSampleKey keyword in the data set-specific sections. For example, to only use chrYh in the annotation_dm2 megatest, chr4h in nlmsa_dm2 and chrY in hg18-based ones, specify:

[megatests_dm2]
smallSampleKey = chrYh
smallSampleKey_nlmsa = chr4h

[megatests_hg18]
smallSampleKey = chrY

On the aforementioned machine this reduces the running time of megatests to approximately 12 minutes per run.

In principle, any valid subsets could be used to have "quick" megatests. Then again, we only provide reference output files for the configuration shown above.

Location of input

Use the following keys to specify directories containing input data:

  1. In the megatests section:
  2. In the megatests_SET sections:

By definition, all of these keywords must be set for megatests to run.

Location of output

All files produced in the course of running megatests will be written in randomly-generated subdirectories of the directory pointed to by the testOutputBaseDir keyword in the megatests section; this keyword must be set for megatests to run. You will of course need write access there, along with enough free space. Note that all the files produced there are temporary and can safely be deleted after the end of a test run, if they are not deleted automatically (which they should).

In addition, log files from all stages of a run are written to the directory pointed to by the logDir keyword in the megatests section. These are not deleted automatically.

Timing

While running megatests on changing code it is helpful to keep an eye on how much time their execution takes. Our runner script (see below) makes it possible to automate this process by setting two keywords in the megatests section:

Reporting

If you use our scripts (see below) for running megatests, each run should end with an e-mail being sent notifying its recipients of the outcome. The scripts attempt to determine the outcome of the test run and select appropriate recipients.

Presently, a test run is considered to have failed if one or more of the following statements are true at its end:

The following keywords, all in the megatests section, are used to control the reporting process:

Scripts

The directory tests/tools in the Pygr source tree contains two scripts which can be used to facilitate the running of megatests:

run_megatests downloads Pygr sources using Git, builds Pygr and runs both standard tests and megatests, storing the output. At the end an e-mail is sent using send_megatest_email.py from the downloaded sources, after which everything except output logs is deleted. The script has been designed for running via cron - if $HOME is not set, as is often the case for cron jobs, it looks for MySQL and Pygr configuration files in a directory of the user's choice.

Note: at present the script doesn't use the Pygr configuration file, as it is not trivial to parse such files in shell scripts. You'll need to specify appropriate settings in the file itself, near the beginning.

