My favorites | Sign in
Project Logo
                
Search
for
Updated Mar 03, 2009 by zh.charlie
Labels: Phase-Design, performance
ChemicalSubstructureSearchinSQL  

#Paper by Adel Golovin and Kim Henrick, chemical substructure search on a relational database by use of a standard SQL query.

Published as Chemical Substructure Search in SQL

Representation of molecule structure in relational database

SMILES is spanning tree. Spanning tree, nodes and connections between nodes. Direction of the connection is not necessary. Repesentation of the span tree with user defined tables

Query structures

The basic idea to query compounds with structure information specified should be

select *
from
    node as n1,
    node as n2,
    ...
    node as nm,
    bond as b1,
    ...
    bond as bk
where
    n1.compoundid = n1.compoundid,  -- Nodes and Bonds in same compound
    ...
    nm.compoundid = n1.compoundid,
    b1.compoundid = n1.compoundid,
    ...
    bk.compoundid = n1.compoundid,

    n1.nodetype = 'C',              -- Type of nodes and bonds
    n2.nodetype = 'O',
    ...
    b1.bondtype = '-',
    b2.bondtype = '=',

    (b1, n1, n2) = True,            -- Connections
    (b4, n2, n5) = True,

Optimazed query structures

Using the data structure descripted in the paper, SQLMOL procedure

exec search_by_smi 'N(=O)(=O)ccc'

processing the query to a dynamic SQL command as below. Please note that the only table involved in the query is sqlmol_compound_bonds and the number of joined table is the number of bonds in the query structure.

select distinct b4_1.compoundid 
from  
    /* bonds included in the query */
    sqlmol_compound_bonds b4_1,  sqlmol_compound_bonds b8_1,
    sqlmol_compound_bonds b10_1,  sqlmol_compound_bonds b11_10,
    sqlmol_compound_bonds b12_11 
where 1=1  
    /* bonds are in same compound */
    and b4_1.compoundid=b8_1.compoundid and
    b4_1.compoundid=b10_1.compoundid and
    b4_1.compoundid=b11_10.compoundid and
    b4_1.compoundid=b12_11.compoundid 

    /* bonds type, atom type is included in the bond type */
    and b11_10.btid=5 and b12_11.btid=5 and b10_1.btid=10 and
    b4_1.btid=13 and b8_1.btid=13 

    /* bonds connection */
    and b4_1.ceid1=b10_1.ceid2 and b8_1.ceid1=b10_1.ceid2 and
    b8_1.ceid1=b4_1.ceid1 and b11_10.ceid2=b10_1.ceid1 and
    b12_11.ceid2=b11_10.ceid1 

    /* trading symmetic bonds */
    and b11_10.ceid2<>b12_11.ceid2 and b12_11.ceid2<>b12_11.ceid1

Performance issues

The currently testing is only carried on my laptop and on SQL Server 2008 express edition, the data set is small as 1,000 molecule records. The query is fast, less then 1 second. Using Openbabel in the same machine takes much longer (about 5 sec) and holding full work of CPU processor.

Further testing work is on schedule. The follow picture shows the execution plan generated by SQL Server. It is composed by index seeks, nested loops and hash matches.

Query by SMARTS

Only wild card feature in SMARTS (* and ~) is supported.


Sign in to add a comment
Hosted by Google Code