sig2biopax

Sig2BioPAX is a command-line Java program that converts structured text files describing molecular interactions into the BioPAX Level 3 standard format.

Sig2BioPAX: Java tool for converting flat text files to BioPAX Level 3 format

Abstract

The World Wide Web plays a critical role in enabling researchers to exchange, search, process, visualize, integrate and analyze experimental data. Such efforts can be further enhanced through the development of the semantic web. The idea of the semantic web is to enable machines to understand data through protocol-free data exchange formats such as the Resource Description Framework (RDF) and the Web Ontology Language (OWL). These standards provide formal descriptors of objects, object properties and their relationships within a specific knowledge domain. However, the overhead of converting datasets typically stored in data tables, such as Excel files or other types of spreadsheets, into RDF or OWL formats is not trivial for non-specialists, and as such creates a barrier to seamless data exchange between researchers, databases and analysis tools. This problem is of particular importance in the field of network systems biology, where biochemical interactions between genes and their products are abstracted to networks. To convert biochemical interactions into BioPAX, the leading standard developed by the computational systems biology community, we developed an open-source command-line tool that takes as input tabular data describing different types of biochemical interactions and converts them into the BioPAX Level 3 OWL format. We used the tool to convert several existing and novel mammalian networks of protein interactions, signaling pathways and transcriptional regulatory networks into BioPAX and deposited these into PathwayCommons, a repository for consolidating and organizing biochemical networks. Our command-line tool, sig2biopax, is a useful resource that enables experimental and computational systems biologists to contribute their identified networks for integration and reuse by the research community.

Running Sig2BioPAXv4

Sig2BioPAXv4 is packaged as an executable JAR file. You must have a Java Virtual Machine (JVM) installed on your computer; one is available from http://www.java.com/getjava. To run the GUI (graphical user interface) version of Sig2BioPAXv4, simply double-click sig2biopaxv4.jar, which is the distribution file. To use the command-line version of the program, open a command prompt and navigate to the folder containing sig2biopaxv4.jar. Enter the command:
java -jar sig2biopaxv4.jar -cmd args
where args are the arguments you wish to supply, as described below. For example, to use input file foo.txt and output file bar.owl with the overwrite option, the command is:
java -jar sig2biopaxv4.jar -cmd -in:foo.txt -out:bar.owl -o

If no arguments are supplied, the program uses the default input file input.txt, the default output file output.owl, the sig input template, and the non-overwriting option.
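The flags and defaults can be illustrated with a small parsing sketch. This is not the tool's actual source code: the class name Sig2BioPaxOptions and its fields are hypothetical, and rejecting unrecognized arguments is an assumption. Only the flag syntax (-in:, -out:, -o, and the -t: template option described in the options list below) and the default values are taken from this document.

```java
// Hypothetical holder for the four command-line options (not the tool's real class).
final class Sig2BioPaxOptions {
    String input = "input.txt";    // default input file
    String output = "output.owl";  // default output file
    String template = "sig";       // default input template
    boolean overwrite = false;     // default: do not overwrite

    // Parses arguments of the form -in:file, -out:file, -t:type and -o.
    static Sig2BioPaxOptions parse(String[] args) {
        Sig2BioPaxOptions opts = new Sig2BioPaxOptions();
        for (String arg : args) {
            if (arg.startsWith("-in:")) {
                opts.input = arg.substring("-in:".length());
            } else if (arg.startsWith("-out:")) {
                opts.output = arg.substring("-out:".length());
            } else if (arg.startsWith("-t:")) {
                opts.template = arg.substring("-t:".length());
            } else if (arg.equals("-o")) {
                opts.overwrite = true;
            } else if (!arg.equals("-cmd")) {
                // Assumption: unknown arguments are treated as errors.
                throw new IllegalArgumentException("Unrecognized argument: " + arg);
            }
        }
        return opts;
    }
}
```

Calling Sig2BioPaxOptions.parse with only -cmd leaves every field at its default, which mirrors the no-argument behavior described above.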

Input file types

The command-line tool is accessed with the command-line argument -cmd. In command-line mode, four options may be passed to the program as arguments separated by spaces. The four options are:
1. Input File name. This is specified by the syntax -in:filename, where filename is the path to the input file. If no input file is specified, the program will try the default, input.txt. The file may be specified by name only, or by directory path plus name. If only a name is given, the program will search for the file in the same directory as the JAR file. IMPORTANT: if the directory has spaces in its name, this argument must be surrounded by double quotation marks ("").
2. Output File name. This is specified by the syntax -out:filename, where filename is the path to the output file. If no output file is specified, the default, output.owl, will be used. The file may be specified by name only, or by directory path plus name. If only a name is given, the program will create the output file in the same directory as the JAR file. If a directory is given, you must create the directory yourself first or an exception will be thrown. IMPORTANT: if the directory has spaces in its name, this argument must be surrounded by double quotation marks ("").
3. Overwrite Output: -o. This switch, if given, causes the program to overwrite the output file, whether it is the default output file or one specified with -out:. The default is OFF, i.e., don't overwrite; in that case, a number is appended to the end of the output file name.
4. Input Template Name. This is specified by the syntax -t:type, where type is the name of the desired input template. Currently three templates are supported. The first, and the default, is sig. This option parses files in which each line has the following syntax:
SN SHA SMA ST SL TN THA TMA TT TL E TOI PID
KEY:
SN = SourceName: Name of source molecule
SHA = SourceHumanAccession: Source Swiss-Prot human accession number
SMA = SourceMouseAccession: Source Swiss-Prot mouse accession number
ST = SourceType: Type of source molecule
SL = SourceLocation: Location of source molecule in the cell
TN = TargetName: Name of target molecule
THA = TargetHumanAccession: Target Swiss-Prot human accession number
TMA = TargetMouseAccession: Target Swiss-Prot mouse accession number
TT = TargetType: Type of target molecule
TL = TargetLocation: Location of target molecule in the cell
E = Effect: Effect of source on target. + (activating), _ (deactivating), or 0 (neutral)
TOI = TypeOfInteraction: Reaction type definition
PID = PubMedID: ID of article that identified this reaction
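
As an illustration of the sig line syntax, the following sketch splits one line into the 13 fields listed in the KEY. It is not taken from the Sig2BioPAX source: the class name SigRow is hypothetical, and it assumes that fields are separated by whitespace and contain no embedded spaces.

```java
// Hypothetical value class for one line of a sig file; field names follow the KEY above.
final class SigRow {
    final String sourceName, sourceHumanAccession, sourceMouseAccession;
    final String sourceType, sourceLocation;
    final String targetName, targetHumanAccession, targetMouseAccession;
    final String targetType, targetLocation;
    final String effect, typeOfInteraction, pubMedId;

    private SigRow(String[] f) {
        sourceName = f[0];
        sourceHumanAccession = f[1];
        sourceMouseAccession = f[2];
        sourceType = f[3];
        sourceLocation = f[4];
        targetName = f[5];
        targetHumanAccession = f[6];
        targetMouseAccession = f[7];
        targetType = f[8];
        targetLocation = f[9];
        effect = f[10];            // "+", "_" or "0"
        typeOfInteraction = f[11];
        pubMedId = f[12];
    }

    // Assumption: columns are whitespace-delimited and a valid line has exactly 13 of them.
    static SigRow parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 13) {
            throw new IllegalArgumentException(
                "Expected 13 fields but found " + fields.length);
        }
        return new SigRow(fields);
    }
}
```

Rejecting lines with the wrong number of columns is a design choice of this sketch; it makes malformed input visible early instead of silently shifting fields.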

The second format is selected with the template name source_target. This option tells the program to parse the input file as having only seven columns:
SN SL TN TL E TOI PID
The third format is tf_target. This format is for converting transcription factor/target gene interaction pairs. The columns are SourceName, TargetName and PubMedID. In some files of this format, the PubMedID is appended to the SourceName, like this: SourceName-PubMedID. In that case, Sig2BioPAX strips the PubMedID from the SourceName.
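
The SourceName-PubMedID handling can be sketched as follows. This is one possible approach, not necessarily how Sig2BioPAX does it: the class and method names are hypothetical, and the sketch assumes that a PubMedID is a run of digits following the last hyphen.

```java
// Hypothetical helper illustrating the SourceName-PubMedID convention.
final class TfTargetNames {
    // Returns the source name without a trailing "-<digits>" suffix, if one is present.
    static String stripPubMedId(String sourceName) {
        int dash = sourceName.lastIndexOf('-');
        // No hyphen, a leading hyphen, or a trailing hyphen: nothing to strip.
        if (dash <= 0 || dash == sourceName.length() - 1) {
            return sourceName;
        }
        String suffix = sourceName.substring(dash + 1);
        for (int i = 0; i < suffix.length(); i++) {
            if (!Character.isDigit(suffix.charAt(i))) {
                return sourceName; // suffix is not purely numeric
            }
        }
        return sourceName.substring(0, dash);
    }
}
```

A limitation of this heuristic is that a molecule name that itself ends in a hyphen followed by digits would also be shortened, so a real implementation may need a stricter rule.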

Real example

The focal adhesome network sig file from the adhesome.org web site (http://www.adhesome.org) can be converted into BioPAX Level 3 by typing on the command line:
java -jar sig2biopaxv4.jar -cmd -in:fa.sig -out:fa.owl -o
The input and output files can be viewed here: fa.sig (http://sig2biopax.googlecode.com/files/fa.sig) and fa.owl (http://sig2biopax.googlecode.com/files/fa.owl)
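
For batch conversions, the same command can be assembled from another Java program. The sketch below only builds the argument list shown in this section; the class name Sig2BioPaxCommand is hypothetical, and it assumes that java is on the PATH and that the JAR path is supplied by the caller.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that assembles the command line used in this section.
final class Sig2BioPaxCommand {
    static List<String> build(String jarPath, String input, String output, boolean overwrite) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("-cmd");
        cmd.add("-in:" + input);    // input file, e.g. fa.sig
        cmd.add("-out:" + output);  // output file, e.g. fa.owl
        if (overwrite) {
            cmd.add("-o");          // overwrite an existing output file
        }
        return cmd;
    }
}
```

The resulting list can be passed to new ProcessBuilder(cmd).inheritIO().start(). Because each argument is a separate list element, paths containing spaces should not need the extra quotation marks that a command prompt requires.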

Contact

avi dot maayan at mssm dot edu and
ryan dot webb at mssm dot edu

Project Information

License: GNU GPL v3
svn-based source control

Labels:
SystemsBiology BioPAX Protein-proteinInteractions NetworkBiology SignalingNetworks GeneRegulatryNetworks InteroperabilityTool PathwayCommons SBCNY
