Heuristic


Surface chemical features

In MED-SuMo, the notion of chemical features is fundamental. Every macromolecular structure is first converted into a set of chemical features. Only the features that are available for interacting with ligands are retained; these are called Surface Chemical Features (SCFs). Several types of chemical features are defined by default, and these definitions can be modified. Each type represents a given property. Only chemical features of the same type can be compared and possibly considered equivalent. The default dictionary of SCFs, which covers H-bonds, formal charges, hydrophobic groups and aromatic groups, is shown below in Fig.1.

Fig.1: Chemical Functions are: imidazole, positive (Positive formal charge), delta_plus (H-bond donor), glycine (Glycine residue), amide (Side chain amide group), structural water (Any water molecule annotated as SWT in PDB file), aromatic (aromatic ring), proline (Proline residue), guanidinium, acyl, hydrophobic, thioether, hydroxyl, other (metals), negative (negative formal charge), delta_minus (H-bond acceptor) and thiol.

Macromolecule surfaces are described, compared and superposed as graphs of SCF triplets

The comparison heuristic designed for MED-SuMo is based on the construction of triplets of chemical groups, which are considered the minimal unit of a biological function. These triplets of chemical features carry different kinds of information and are the vertices of a graph, the main data structure used for comparing 3D structures in MED-SuMo. The result of a comparison consists of one or more matching triplets. MED-SuMo accounts for flexibility in this comparison.

Fig.2: MED-SuMo comparison procedure. (1) Graph construction. (a) Surface Chemical Features (SCFs) are displayed on the protein structure through a lexicographic analysis of the PDB files. (b) Their positions and orientations are checked to discard SCFs potentially involved in internal interactions or associated with buried atoms. (c) SCFs are gathered in triangles. (d) The triangle network is then stored as a graph data structure with the triangles as vertices and with edges connecting adjacent triangles. (2) Graph comparison. (e) The query graph (in green) is compared to the database graphs (in pink); compatible triangles, i.e. triangles formed by compatible SCFs, are selected. (f) Multiple corresponding graphs are found.
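The graph described in steps (d) and (e) can be illustrated with a minimal Python sketch. It is not MED-SuMo's implementation. It assumes that two triangles are "adjacent" when they share an edge (two SCFs) and that two triangles are "compatible" when they carry the same three SCF types; the function names and the feature identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_triangle_graph(triangles):
    """Return an adjacency map {triangle index: set of neighbour indices}.

    `triangles` is a list of 3-tuples of feature ids.
    Assumption: triangles are adjacent when they share two features (one edge).
    """
    edge_to_triangles = defaultdict(list)
    for index, triangle in enumerate(triangles):
        for a, b in combinations(sorted(triangle), 2):
            edge_to_triangles[(a, b)].append(index)

    adjacency = {index: set() for index in range(len(triangles))}
    for sharing in edge_to_triangles.values():
        for i, j in combinations(sharing, 2):
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


def triangle_signature(triangle, feature_types):
    """Order-independent key: the sorted SCF types of the three features."""
    return tuple(sorted(feature_types[f] for f in triangle))


def compatible_pairs(query_tris, query_types, target_tris, target_types):
    """List (query index, target index) pairs whose SCF types match.

    Assumption: compatibility means identical types only; geometry and
    shape checks are deliberately left out of this sketch.
    """
    by_signature = defaultdict(list)
    for j, triangle in enumerate(target_tris):
        by_signature[triangle_signature(triangle, target_types)].append(j)
    return [
        (i, j)
        for i, triangle in enumerate(query_tris)
        for j in by_signature.get(triangle_signature(triangle, query_types), [])
    ]
```

Indexing the database triangles by type signature keeps the lookup cheap for each query triangle. A real comparison would also have to check distances and orientations, which are omitted here.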

Two kinds of MED-SuMo databases: site databases and full surface databases

In drug design applications, MED-SuMo is designed to take advantage of the ongoing exponential growth of the public Protein Data Bank [3], where experimentally determined 3D structures of protein-ligand complexes are stored. It provides a fast way to query and mine the largest available database of macromolecular 3D structures.

The reference database, the Protein Data Bank, is freely and publicly available and contains (as of April 2008) the 3D atomic coordinates of:

136,000 ligands bound to macromolecules (8,000 are distinct)

50,000 macromolecules (47,000 proteins)


A MED-SuMo site database contains the graphs of SCF triplets located in the environment of a ligand. This environment is defined by a maximum distance between the atoms of the ligand and the chemical features, 4.5 Å or 6.0 Å in most cases (user defined). 6.0 Å corresponds to a broader binding site definition around a ligand and is a better choice for site detection and functional annotation. 4.5 Å is more suited to drug design applications. A ligand is defined as a set of heteroatoms or small peptides. A full surface database corresponds to the whole surface, i.e. all features except those that are buried or involved in intra-protein H-bonds. The corresponding MED-SuMo databases encode the whole PDB:

Site database containing 136,000 ligand site descriptions

Full surface database containing 50,000 full surface descriptions (proteins, RNA and DNA can be described)
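The distance-based site definition can be sketched in a few lines of Python. This is only an illustration of the cutoff rule described above, under the assumption that a feature belongs to the site when it lies within the cutoff of at least one ligand atom. The function name and the input layout are hypothetical and do not reflect the MED-SuMo API.

```python
import math


def select_site_features(features, ligand_atoms, cutoff=4.5):
    """Keep the features lying within `cutoff` (in Å) of any ligand atom.

    `features` is a list of (name, (x, y, z)) pairs and `ligand_atoms`
    a list of (x, y, z) coordinates; both layouts are assumptions.
    """
    return [
        name
        for name, position in features
        if any(math.dist(position, atom) <= cutoff for atom in ligand_atoms)
    ]
```

Calling the function with `cutoff=6.0` instead of the default 4.5 returns a superset of the features, which mirrors the broader binding site definition mentioned in the text.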


MED-SuMo server parameters

The triangle network of chemical features can be tuned to be more or less dense. The maximum length of an edge and the maximum sum of the three edges are tunable parameters when the database is generated:

High density triangle network (BEST): parameters 20-60

Default density triangle network (FAST): parameters 13-39
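To make the effect of these two parameters concrete, here is a brute-force Python sketch of triangle enumeration. It assumes that the first number of each pair is the maximum edge length and the second the maximum sum of the three edges, both in Å; the source does not state the order or the unit explicitly. The function name is hypothetical and the exhaustive loop is not meant to reflect how MED-SuMo builds its network.

```python
import math
from itertools import combinations


def enumerate_triangles(coords, max_edge=13.0, max_perimeter=39.0):
    """Return index triplets whose triangle satisfies both limits.

    `coords` is a list of (x, y, z) feature positions. The defaults mirror
    the "13-39" setting under the assumption stated in the text above.
    """
    triangles = []
    for i, j, k in combinations(range(len(coords)), 3):
        edges = (
            math.dist(coords[i], coords[j]),
            math.dist(coords[j], coords[k]),
            math.dist(coords[i], coords[k]),
        )
        if max(edges) <= max_edge and sum(edges) <= max_perimeter:
            triangles.append((i, j, k))
    return triangles
```

Under this reading, a "20-60" network accepts every triangle that a "13-39" network accepts plus larger ones, hence the higher density. Note that when the perimeter limit is exactly three times the edge limit, as in both settings listed, the edge limit alone already implies the perimeter limit.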

When surface chemical features match, they are then tested for a similar shape environment. The shape threshold used in this test is a tunable parameter when the comparison is run:

Default shape threshold: ST=65%

Lower shape threshold to allow more tolerance in the shape comparison: ST=45%

Four comparison modes can be used

References


[1] Jambon M, Imberty A, Deléage G, Geourjon C. "A new bioinformatic approach to detect common 3D sites in protein structures". Proteins: Struct., Funct., and Gen. 52:137-145 (2003)
[2] Jambon M, Andrieu O, Combet C, Deléage G, Delfaud F, Geourjon C. "The SuMo server: 3D search for protein functional sites". Bioinformatics 21(20):3929-3930 (2005)
[3] Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. "The Protein Data Bank". Nucleic Acids Research 28:235-242 (2000)
