|
ChemicalSubstructureSearchinSQL
#Paper by Adel Golovin and Kim Henrick, chemical substructure search on a relational database by use of a standard SQL query. Published as Chemical Substructure Search in SQL Representation of molecule structure in relational databaseSMILES is spanning tree. Spanning tree, nodes and connections between nodes. Direction of the connection is not necessary. Repesentation of the span tree with user defined tables
Query structuresThe basic idea to query compounds with structure information specified should be select *
from
node as n1,
node as n2,
...
node as nm,
bond as b1,
...
bond as bk
where
n1.compoundid = n1.compoundid, -- Nodes and Bonds in same compound
...
nm.compoundid = n1.compoundid,
b1.compoundid = n1.compoundid,
...
bk.compoundid = n1.compoundid,
n1.nodetype = 'C', -- Type of nodes and bonds
n2.nodetype = 'O',
...
b1.bondtype = '-',
b2.bondtype = '=',
(b1, n1, n2) = True, -- Connections
(b4, n2, n5) = True,
Optimazed query structuresUsing the data structure descripted in the paper, SQLMOL procedure exec search_by_smi 'N(=O)(=O)ccc' processing the query to a dynamic SQL command as below. Please note that the only table involved in the query is sqlmol_compound_bonds and the number of joined table is the number of bonds in the query structure. select distinct b4_1.compoundid
from
/* bonds included in the query */
sqlmol_compound_bonds b4_1, sqlmol_compound_bonds b8_1,
sqlmol_compound_bonds b10_1, sqlmol_compound_bonds b11_10,
sqlmol_compound_bonds b12_11
where 1=1
/* bonds are in same compound */
and b4_1.compoundid=b8_1.compoundid and
b4_1.compoundid=b10_1.compoundid and
b4_1.compoundid=b11_10.compoundid and
b4_1.compoundid=b12_11.compoundid
/* bonds type, atom type is included in the bond type */
and b11_10.btid=5 and b12_11.btid=5 and b10_1.btid=10 and
b4_1.btid=13 and b8_1.btid=13
/* bonds connection */
and b4_1.ceid1=b10_1.ceid2 and b8_1.ceid1=b10_1.ceid2 and
b8_1.ceid1=b4_1.ceid1 and b11_10.ceid2=b10_1.ceid1 and
b12_11.ceid2=b11_10.ceid1
/* trading symmetic bonds */
and b11_10.ceid2<>b12_11.ceid2 and b12_11.ceid2<>b12_11.ceid1Performance issuesThe currently testing is only carried on my laptop and on SQL Server 2008 express edition, the data set is small as 1,000 molecule records. The query is fast, less then 1 second. Using Openbabel in the same machine takes much longer (about 5 sec) and holding full work of CPU processor. Further testing work is on schedule. The follow picture shows the execution plan generated by SQL Server. It is composed by index seeks, nested loops and hash matches.
Query by SMARTSOnly wild card feature in SMARTS (* and ~) is supported. |
Sign in to add a comment