Chemoinformatics Lecture 2 Storing and Searching Chemical Data

Chemical Data • Chemical data is special • Chemical names are important (but inconvenient) • A molecule can be thought of as a group of objects (atoms) that are connected together in a particular way (bonds)

Handling chemical data What do we want to do with our chemical information? • Display chemical compounds – 2D – 3D • Search for – Structures – Substructures – Similar structures (2D or 3D) – Chemical reactions • Name to structure conversion (and vice versa) The tenth collective index of Chemical Abstracts consisted of 75 volumes and weighed 170 kg. It contained nearly 24 million entries.

Chemical file formats • Every man and his dog has developed their own chemical structure format… • Conversion between formats can be performed by programs like babel (http://openbabel.org)
> babel -L formats
abinit -- ABINIT Output Format [Read-only]
acr -- ACR format [Read-only]
adf -- ADF cartesian input format [Write-only]
adfout -- ADF output format [Read-only]
alc -- Alchemy format
arc -- Accelrys/MSI Biosym/Insight II CAR format [Read-only]
axsf -- XCrySDen Structure Format [Read-only]
bgf -- MSI BGF format
box -- Dock 3.5 Box format
bs -- Ball and Stick format
c3d1 -- Chem3D Cartesian 1 format
c3d2 -- Chem3D Cartesian 2 format
cac -- CAChe MolStruct format [Write-only]
caccrt -- Cacao Cartesian format
cache -- CAChe MolStruct format [Write-only]
cacint -- Cacao Internal format [Write-only]
can -- Canonical SMILES format
car -- Accelrys/MSI Biosym/Insight II CAR format [Read-only]
castep -- CASTEP format [Read-only]
ccc -- CCC format [Read-only]
cdx -- ChemDraw binary format [Read-only]
cdxml -- ChemDraw CDXML format
cht -- Chemtool format [Write-only]
cif -- Crystallographic Information File
ck -- ChemKin format
cml -- Chemical Markup Language
cmlr -- CML Reaction format
com -- Gaussian 98/03 Input [Write-only]
CONFIG -- DL-POLY CONFIG
CONTCAR -- VASP format [Read-only]
copy -- Copy raw text [Write-only]
crk2d -- Chemical Resource Kit diagram(2D)
crk3d -- Chemical Resource Kit 3D format
csr -- Accelrys/MSI Quanta CSR format [Write-only]
cssr -- CSD CSSR format [Write-only]
ct -- ChemDraw Connection Table format

Typical information in molecular structure files Information about the whole molecule: • molecule name • journal article (for crystal structures) • creator or author(s) Information about each atom: • atomic element (H, He, C, N, O, F, etc.) • atom name (e.g. in an amino acid: N, CA, CB, C, O, etc.) • Cartesian coordinates (X, Y, Z) or Z-matrix • atom number • atom charge (formal and/or partial) • residue name (e.g. for a protein: Ala, Pro, etc.) • temperature factor and occupancy for crystal structures Bonding information: • usually stored as a connection table, which describes which atoms are bonded together • information about bond orders (single, double, aromatic, etc.) is important, but some file formats (e.g. PDB) do not store it

PDB file format • The Protein Data Bank (PDB) file format was developed by the Brookhaven National Laboratory to store protein crystal structure information • Used by many molecular modelling programs • The PDB format has limitations: – Columns are of fixed size – It does not contain information about bond orders (these are recorded in a separate database) • The Protein Data Bank has developed new formats to replace the PDB format, e.g. the mmCIF format (Macromolecular Crystallographic Information File) • Ref: http://www.rcsb.org/pdb/info.html
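The fixed-column limitation is easiest to see in code. The following is a minimal sketch, assuming the standard column layout of ATOM records in the PDB format description; the sample line, its coordinates and the helper name `parse_atom_line` are hypothetical and are built field by field so the widths are visible.

```python
# Column ranges of a PDB ATOM record as 0-based Python slices
# (the PDB format description counts columns from 1).
ATOM_FIELDS = {
    "record": (0, 6),
    "serial": (6, 11),
    "name": (12, 16),
    "resname": (17, 20),
    "chain": (21, 22),
    "resseq": (22, 26),
    "x": (30, 38),
    "y": (38, 46),
    "z": (46, 54),
    "occupancy": (54, 60),
    "bfactor": (60, 66),
    "element": (76, 78),
}


def parse_atom_line(line):
    """Cut one ATOM line into named fields purely by column position."""
    fields = {key: line[a:b].strip() for key, (a, b) in ATOM_FIELDS.items()}
    for key in ("serial", "resseq"):
        fields[key] = int(fields[key])
    for key in ("x", "y", "z", "occupancy", "bfactor"):
        fields[key] = float(fields[key])
    return fields


# Hypothetical sample record, assembled from fixed-width pieces.
SAMPLE_LINE = (
    "ATOM  " "    1" " " " N  " " " "ALA" " " "A" "   1" "    "
    "  11.104" "   6.134" "  -6.504" "  1.00" "  0.00" + " " * 10 + " N"
)

atom = parse_atom_line(SAMPLE_LINE)
print(atom["name"], atom["resname"], atom["x"], atom["y"], atom["z"])
```

Because every value is located by position rather than by a separator, a coordinate that needs more than eight characters simply does not fit, and nothing in the record says whether a bond is single, double or aromatic.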

SD file format • Developed by Molecular Design Limited (MDL) • Can store 2D or 3D structures • Can contain query structures with variable atom and bond types, e.g. an atom may be either nitrogen or carbon, or a bond could be either double or aromatic • Can store additional information such as biological activity data associated with the molecule
Example record (abridged on the slide; it consists of an atom block, a bond block, data fields and the record end marker $$$$):
-ISIS- 10229913002D
13 13 0 0 0 0999 V2000
-0.0586 -1.1517 0.0000 C 0 0 1 0 0 0 0 0
-1.7103 -0.5379 0.0000 C 0 0 0
. .
0.6069 1.4103 0.0000 C 0 0 2 0 0 0 0 0
2.8138 1.3828 0.0000 C 0 0 0
3.9207 -0.5379 0.0000 C 0 0 0
-3.9207 0.7414 0.0000 C 0 0 0
1.2724 2.1586 0.0000 O 0 0 0
3.9207 0.7414 0.0000 C 0 0 0
1 2 1 0 0
1 3 1 0 0
1 4 1 1 0 0 0
. .
1 5 1 6 0 0 0
6 10 1 0 0
10 13 1 0 0
12 16 1 0 0
15 18 2 0 0
M END
> <Isis_internal_number> (2)
2
> <chemical_name> (2)
Minaprine dihydrochloride
> <smiles_code> (2)
c1(c2ccccc2)(cc(c(NCCN3CCOCC3)nn1)C).Cl
> <Plate position> (2)
66
$$$$
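To illustrate how the data fields of an SD file can be read, here is a minimal sketch in plain Python. It assumes the usual SD conventions (a header line of the form `> <tag>`, one or more value lines, a blank line, and `$$$$` closing each record); the function name `parse_sd_records` and the sample text, which carries an empty connection table, are hypothetical.

```python
def parse_sd_records(text):
    """Split SD-file text on '$$$$' and collect the data fields of each record."""
    records = []
    for chunk in text.split("$$$$"):
        lines = chunk.strip("\n").splitlines()
        if not any(line.strip() for line in lines):
            continue  # nothing but whitespace after the last terminator
        data = {}
        tag = None
        for line in lines:
            if line.startswith(">") and "<" in line and ">" in line[1:]:
                # Data header such as "> <chemical_name> (2)"
                tag = line[line.index("<") + 1 : line.index(">", 1)]
                data[tag] = []
            elif tag is not None:
                if line.strip():
                    data[tag].append(line.strip())
                else:
                    tag = None  # a blank line ends the current data item
        records.append({name: "\n".join(values) for name, values in data.items()})
    return records


# Hypothetical one-record sample with an empty connection table.
SAMPLE = """\
minaprine
  hypothetical header line

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <chemical_name> (2)
Minaprine dihydrochloride

> <Plate position> (2)
66

$$$$
"""

records = parse_sd_records(SAMPLE)
print(records[0]["chemical_name"], records[0]["Plate position"])
```

The sketch ignores the atom and bond blocks entirely; a real toolkit would parse those as well and validate the counts line.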

Chemical Line Notations • Representations of molecules that fit on a single line, e.g. standard structural formulas such as CH3CH2COOH. These work well for linear compounds, but less well for rings… Line notations are: • Compact • Generally human readable/understandable

There are many line notations Example compound: 6-dimethylamino-4-phenylamino-naphthalene-2-sulfonic acid
• Wiswesser line notation: 1 L 66 J BMR& DSWQ IN 1&1 – An early line notation (1949) that describes molecules as fragments. Used for databases but fell out of use because it is not very computer friendly
• ROSDAL: 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,822N-23,22-24 – A linear representation of a connection table developed by Beilstein
• SMILES: CN(C)C1=CC=C2C(C(NC3=CC=CC=C3)=CC(S(=O)(O)=O)=C2)=C1 – Developed by David Weininger and Daylight Chemical Information Systems
• InChI: InChI=1S/C18H18N2O3S/c1-20(2)15-9-8-13-10-16(24(21,22)23)12-18(17(13)11-15)19-14-6-4-3-5-7-14/h3-12,19H,1-2H3,(H,21,22,23) – A compact chemical representation developed by IUPAC
• Sybyl Line Notation (SLN, Tripos)

SMILES • SMILES is the most widely used and most useful chemical line notation • Can be used as input/output in many programs • Simple SMILES strings resemble standard chemical nomenclature. The atoms commonly found in organic molecules (B, C, N, O, P, S, F, Cl, Br, I) are represented by the atomic element symbol. • Single bonds are implied between each atom. • Hydrogen atoms are not usually shown but can be included in square brackets

Example SMILES
Ethanol: CCO
Acetic acid: CC(=O)O
Cyclohexane: C1CCCCC1
Pyridine: c1cnccc1
trans-2-Butene: C/C=C/C
L-Alanine: N[C@@H](C)C(=O)O
Sodium chloride: [Na+].[Cl-]

Generate SMILES using ChemDraw: Edit – Copy As – SMILES

SMILES – simple examples
SMILES string: Compound
C: Methane
CC: Ethane
N: Ammonia
[NH3]: Ammonia
CCCCCCO: 1-Hexanol

SMILES – inorganic elements • Non-organic elements must be enclosed in square brackets. • Charged atoms are also enclosed in square brackets.
[Na+]: Sodium ion
[Al+3]: Aluminium ion
[NH4+]: Ammonium ion

SMILES - bonds • Single bonds are implied. They can optionally be represented using a dash ‘-’ • Double and triple bonds are represented using ‘=’ and ‘#’ • Disconnected parts of a structure, such as the ions of a salt, are separated by a full stop ‘.’ (no covalent bond between them)

SMILES - branches • Branches are represented by enclosing the side chain in parentheses ‘()’
CC(=O)O: Acetic acid
OC(C)(C)C: t-Butyl alcohol

SMILES - rings • Rings are specified by using numbers to create ‘ring closures’. The number follows the atom, and the two atoms carrying the same number are bonded to each other. • Lower case characters are used to specify aromatic rings
C1CCCCC1: cyclohexane
c1ccccc1: benzene
n1ccccc1: pyridine
c1cccc2c1cccc2: naphthalene
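The bond, branch and ring rules above can be checked mechanically. The following is a minimal sketch of a SMILES tokenizer that also tests whether every ring-closure number is used in a pair; the regular expression covers only bracket atoms, the organic-subset symbols and the punctuation mentioned in this lecture, and the function names are hypothetical.

```python
import re

# Bracket atoms first, then two-letter halogens, then single-letter atoms,
# two-digit ring closures (%nn), single-digit ring closures, bonds and branches.
TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[-=#/\\.()]"
)


def tokenize(smiles):
    """Split a SMILES string into tokens; fail on anything unrecognised."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in {smiles!r}")
    return tokens


def ring_closures_paired(smiles):
    """Return True if every ring-closure label is opened and closed again."""
    open_rings = set()
    for token in tokenize(smiles):
        if token.isdigit() or token.startswith("%"):
            label = token.lstrip("%")
            if label in open_rings:
                open_rings.remove(label)  # second use closes the ring
            else:
                open_rings.add(label)  # first use opens the ring
    return not open_rings


print(tokenize("CC(=O)O"))
print(ring_closures_paired("c1ccccc1"), ring_closures_paired("c1cccc2"))
```

This is a syntax check only; it does not build the molecular graph or verify valences, which a chemistry toolkit would do.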

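SMILES – rings To write a ring such as cyclohexane (C1CCCCC1) we need to break one bond in the ring and give the two atoms it joined the same number. Heterocycles work the same way: O1CCCCC1 (tetrahydropyran), N1CCCCC1 (piperidine)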

SMILES – aromaticity
Furan: C1=COC=C1 or c1cocc1
Pyridine: N1=CC=CC=C1 or n1ccccc1
Benzene: C1=CC=CC=C1, but more usually c1ccccc1

SMILES – you have a go • Write a SMILES for phenol: Oc1ccccc1 • Write a SMILES for this molecule (nicotine, drawn on the slide): CN2CCCC2c1cnccc1

SMILES - Additional notations • SMILES contains additional features which can be used to describe chirality, double bond isomers (E, Z) and metal complexes. • These are described in more detail at http://www.daylight.com/learn/ and http://www.daylight.com/dayhtml/doc/theory.smiles.html#RTFToCX1

SMILES - benefits • SMILES is essentially a language with simple letters, bonds and rules • SMILES strings are extremely compact and use little storage space • But I can write ethanol two ways • CCO • OCC • The two ‘words’ are different, although they describe the same molecule. What can we do about this if we want to search databases?
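One way to see that CCO and OCC are the same molecule is to compare the graphs they describe rather than the strings. The sketch below is a toy, not how chemistry toolkits work: it assumes SMILES built only from C, N, O and parentheses (single bonds, no rings) and tries every atom relabelling, which is only feasible for very small molecules. In practice a toolkit writes a canonical SMILES, such as the `can` format in the babel list, so that each molecule has exactly one string. The function names are hypothetical.

```python
from itertools import permutations


def parse(smiles):
    """Parse a tiny SMILES subset (C, N, O, branches) into atoms and bonds."""
    atoms, bonds, stack = [], set(), []
    previous = None
    for char in smiles:
        if char == "(":
            stack.append(previous)  # remember the branch point
        elif char == ")":
            previous = stack.pop()  # return to the branch point
        elif char in "CNO":
            atoms.append(char)
            index = len(atoms) - 1
            if previous is not None:
                bonds.add(frozenset((previous, index)))
            previous = index
        else:
            raise ValueError(f"unsupported character {char!r}")
    return atoms, bonds


def same_molecule(smiles_a, smiles_b):
    """Brute-force graph comparison: try every relabelling of the atoms."""
    atoms_a, bonds_a = parse(smiles_a)
    atoms_b, bonds_b = parse(smiles_b)
    if sorted(atoms_a) != sorted(atoms_b) or len(bonds_a) != len(bonds_b):
        return False
    count = len(atoms_a)
    for perm in permutations(range(count)):
        if all(atoms_a[i] == atoms_b[perm[i]] for i in range(count)):
            mapped = {frozenset(perm[i] for i in bond) for bond in bonds_a}
            if mapped == bonds_b:
                return True
    return False


print(same_molecule("CCO", "OCC"))  # ethanol written two ways
print(same_molecule("CCO", "COC"))  # ethanol vs dimethyl ether
```

The brute-force search grows factorially with the number of atoms, which is one reason databases store a canonical string or a fingerprint instead of comparing graphs directly.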

Chemical Databases • Chemical databases are important in all stages of medicinal chemistry • Databases may contain: – Chemical structure, reaction and synthetic data (e.g. Beilstein, Chemical Abstracts, the Merck Index) – Compound structure and synthesis information (e.g. an in-house compound registry) – Biological activity data such as in-house testing data or the MDL Drug Data Report (MDDR) CAS registers 89 million compounds and 39 million patent and journal articles

ZINC The ZINC database (http://zinc.docking.org/) collects together commercially available compounds, converts them to 3D structures and creates a number of useful subsets for drug design (druglike, leadlike, etc.).

Chemical structures are special • The important distinction between chemical database software and other database programs used for holding text or images is that a chemical database must be able to interpret chemical structures • In a chemical database it is desirable to be able to search for: – Individual exact compounds – Compounds containing a particular substructure – Compounds similar to a given structure

Chemical Features • We can extend our chemical structure searching to pharmacophores by defining chemical features such as hydrophobic groups, hydrogen bond donors and acceptors, aromatic rings, etc. • These features are arranged in space to form a pharmacophore query • Structures hit if they match the query structure • Different molecular conformations should be taken into account • Not all database systems can provide this capability [Figures: pharmacophore query; query + hit structure]

Comparing molecules • In order to be able to search a database for chemical structures or substructures we must be able to efficiently compare chemical structures. This allows us to perform searches for – specific compounds – compounds containing a particular substructure

Comparing molecules • Direct comparison of molecular structures is reasonably straightforward for a trained chemist, but more difficult for a computer • The simplest approach, comparing every atom in one molecule to every atom in a second molecule, is slow, particularly if we wish to search a database containing thousands or even millions of entries • We therefore need a quick (even if approximate) way to compare two molecules • We can illustrate the idea of molecular fragments using SMILES strings.

Comparing molecules with fingerprints • A common approach to molecular comparisons makes use of molecular fragments. • A molecule is broken down into smaller structures of various sizes • We then create a molecular fingerprint by considering the fragments present in the molecule • In the table below, ‘1’ indicates the presence of the fragment and ‘0’ denotes its absence
Fragment:  N  C  O  S  P  NH  Phenyl
Bit:       1  1  1  0  0  1   1

Fragments of phenol Oc1ccccc1 (written as SMILES)
• One atom fragments: c, O
• Two atom fragments: cc, cO
• Three atom fragments: ccc, ccO
• Four atom fragments: cccc, cccO
• Five atom fragments: ccccc, ccccO
• Six atom fragments: c1ccccc1, cccccO
• Seven atom fragments: Oc1ccccc1

Fragments of 2-hydroxypyridine n1c(O)cccc1
• One atom fragments: c, n, O
• Two atom fragments: cn, cc, cO
• Three atom fragments: ccn, cnc, ccc, ncO, ccO
• Four atom fragments: cccn, ccnc, nccc, cccO, cccc, cncO
• Five atom fragments: ncccc, cnccc, ccncc, ccccO, cncO
• Six atom fragments: n1ccccc1, Ocnccc, Occccn
• Seven atom fragments: n1c(O)cccc1

Molecular fingerprints • A set of molecular fragments for a particular molecule can be assembled to form a molecular fingerprint. • A fingerprint is a binary number made up of the digits 1 and 0. • Each position (bit) in the string denotes a possible molecular fragment. • The digit is set to 1 to denote that a particular fragment is present in the molecule

Comparing fingerprints • For example, bit strings for phenol (Ph) and 2-hydroxypyridine (2HPy) might look like this:
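As a minimal sketch of how such bit strings can be produced, the code below uses the one- to three-atom fragments listed for the two molecules on the preceding slides. The order of the keys is an arbitrary choice made for this illustration, and the function name `fingerprint` is hypothetical.

```python
# One bit per fragment key; the ordering is an assumption for this example.
KEYS = ["c", "n", "O", "cc", "cn", "cO", "ccc", "ccn", "cnc", "ccO", "ncO"]

# Fragments of up to three atoms, taken from the fragment lists of the slides.
PHENOL_FRAGMENTS = {"c", "O", "cc", "cO", "ccc", "ccO"}
HPY_FRAGMENTS = {
    "c", "n", "O", "cn", "cc", "cO", "ccn", "cnc", "ccc", "ncO", "ccO",
}


def fingerprint(fragments, keys=KEYS):
    """Return a bit string with '1' where the key fragment is present."""
    return "".join("1" if key in fragments else "0" for key in keys)


print("Ph  ", fingerprint(PHENOL_FRAGMENTS))
print("2HPy", fingerprint(HPY_FRAGMENTS))
```

Every nitrogen-containing key is 0 for phenol and 1 for 2-hydroxypyridine, while the carbon and oxygen keys are set in both.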

Molecular fingerprints • These bit strings only contain fragments up to three atoms in size. In principle you can include fragments of any size. • The larger the fragments, the more bits the fingerprint needs. • A typical example is the set of structural keys defined by MDL. – For structure searching they have a set of 166 keys – For similarity searching a set of 960 keys is used • The advantage of molecular fingerprints is that they are very rapid to compare using Boolean arithmetic (AND, OR, NOT). • Two identical molecules will have the same fingerprints – regardless of orientation.

Substructure fingerprints • Molecular databases use fingerprints to rapidly compare molecules, or to identify a substructure within the database. • If we wish to search for a substructure, we first need to calculate the fingerprint

Matching fingerprints • Then we need to compare the fragment fingerprint with the fingerprints of the two molecules in our small database. • We have a substructure match when all bits set in the substructure are also set in the search molecule. • Here we can see that the substructure matches 2-hydroxypyridine but not phenol. • Fingerprints provide a method for rapidly searching large databases • An important aspect of molecular fingerprints, the process of compressing them, will not be discussed here.
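The matching rule can be written as a single Boolean operation. The sketch below stores each fingerprint as an integer and uses pyridine as the query substructure; that query is an assumption made for illustration, since the substructure drawn on the slide is not reproduced here. The key order and function names are likewise hypothetical.

```python
# Fragment keys of up to three atoms; bit i of a fingerprint refers to KEYS[i].
KEYS = ["c", "n", "O", "cc", "cn", "cO", "ccc", "ccn", "cnc", "ccO", "ncO"]


def to_bits(fragments):
    """Pack a set of fragments into an integer, one bit per key."""
    value = 0
    for position, key in enumerate(KEYS):
        if key in fragments:
            value |= 1 << position
    return value


def may_contain(query_fp, molecule_fp):
    """True if every bit set in the query is also set in the molecule."""
    return (query_fp & molecule_fp) == query_fp


phenol = to_bits({"c", "O", "cc", "cO", "ccc", "ccO"})
hydroxypyridine = to_bits(
    {"c", "n", "O", "cn", "cc", "cO", "ccn", "cnc", "ccc", "ncO", "ccO"}
)
# Assumed query: pyridine, fragments of up to three atoms.
pyridine_query = to_bits({"c", "n", "cc", "cn", "ccc", "ccn", "cnc"})

print(may_contain(pyridine_query, hydroxypyridine))
print(may_contain(pyridine_query, phenol))
```

A fingerprint match is a fast screen rather than a proof: a molecule that passes still has to be confirmed by a full atom-by-atom substructure comparison, but every molecule that fails can be discarded immediately.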

How similar is similar? Discuss

Similar? • Automobile • Made in Bavaria • 200 hp • 1250 kg weight (empty)

Similar? • Six-cylinder normally aspirated engine • Four-cylinder turbocharged engine • Four-speed gearbox • Six-speed gearbox • Built in 1973 • Built in 2007

Similar? • 3.2 litre engine • 2.0 litre engine • Normally aspirated • Turbocharged • Four-wheel drive • Front-wheel drive

Fingerprints [Table: binary fingerprints of the two cars, with 1 for a feature that is present and 0 for one that is absent. Features: 200 PS, 250 PS, four-speed, six-speed, front-wheel drive, four-wheel drive, rear-wheel drive, normally aspirated, turbocharged, built in 1973, built in 2007, made in Bavaria, 2.0 litre displacement, 3.0 litre, 3.2 litre, passenger car]

Chemical Diversity and Similarity • Molecular similarity (or dissimilarity) means different things in different contexts. What molecular properties do we wish to compare? • Common structures – Common scaffolds – Common functional groups • Common physicochemical properties • Etc

Chemical Diversity and Similarity • It is often useful to be able to find molecules that are similar or dissimilar in some respects. • Two important applications are: – Lead hopping – the idea of finding lead molecules with similar biological activity to a known compound, but with a fundamentally different chemical structure – Generating structurally diverse screening libraries – we usually wish to make efficient use of high-throughput screening (HTS) and make sure that we are not testing large numbers of very similar compounds (e.g. compounds substituted with different halogens: F, Cl, Br, etc.)

Chemical Diversity and Similarity • Molecular (dis)similarity can be measured in a large number of different ways. • One important method, the Tanimoto coefficient, makes use of molecular fragments. • This is a value between 0 and 1 that describes the fraction of structural fragments that two molecules have in common: $T = N_c / (N_1 + N_2 - N_c)$ • $N_1$: number of fragments in molecule 1 • $N_2$: number of fragments in molecule 2 • $N_c$: number of fragments common to both molecules 1 and 2
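As a worked example, the Tanimoto coefficient is the number of shared fragments divided by the number of distinct fragments in either molecule, Nc / (N1 + N2 - Nc). The sketch below applies this to the one- to three-atom fragments of phenol and 2-hydroxypyridine from the earlier slides; the function name and the value returned for two empty fragment sets are choices made for this illustration.

```python
def tanimoto(fragments_1, fragments_2):
    """Tanimoto coefficient Nc / (N1 + N2 - Nc) for two fragment sets."""
    n_common = len(fragments_1 & fragments_2)
    denominator = len(fragments_1) + len(fragments_2) - n_common
    if denominator == 0:
        return 0.0  # arbitrary convention for two empty sets
    return n_common / denominator


phenol = {"c", "O", "cc", "cO", "ccc", "ccO"}
hydroxypyridine = {
    "c", "n", "O", "cn", "cc", "cO", "ccn", "cnc", "ccc", "ncO", "ccO",
}

# N1 = 6, N2 = 11, Nc = 6, so T = 6 / (6 + 11 - 6) = 6 / 11
print(round(tanimoto(phenol, hydroxypyridine), 3))
```

With only short fragments the two molecules look fairly similar; including larger fragments, or using a different key set, changes the value, which is why Tanimoto coefficients are only comparable when computed with the same fingerprint.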

Example: matrix of Tanimoto coefficients for the compounds naphthalene, clozapine and piperidine
                 1        2        3
1 naphthalene    1.000    0.088    0.000
2 clozapine      0.088    1.000    0.132
3 piperidine     0.000    0.132    1.000

Lead Hopping • Ideas of molecular similarity are interesting to drug designers because they provide a way to find molecules that are functionally similar to an active compound but might avoid other problems such as low activity, poor ADME properties or competitors' patents. ($$$$) • The idea of generating new biologically active molecules that contain an active pharmacophore but are built on a different molecular scaffold is known as lead hopping. This can be thought of as jumping from one region of chemical space to another.

The statins • Perhaps the best known example of lead hopping is the ‘statins’. This class of drugs inhibits cholesterol biosynthesis by blocking the enzyme HMG-CoA reductase, which performs a key step in the cholesterol biosynthetic pathway

Statins Bound conformations of (left) atorvastatin and mevastatin and (right) atorvastatin and cerivastatin. Note the close overlap of the portion of the drug that mimics the glutaryl portion of HMG-CoA. Note also the overlap of the isopropyl and fluorophenyl groups in fluvastatin and cerivastatin.

Shape-based lead hopping • In another example, Rush et al. (J. Med. Chem. 2005, 48, 1489-1495) describe using a shape-based similarity method to identify a new class of antibacterial compounds. • In this case, the target is the ZipA-FtsZ protein-protein interaction, which has an essential role in the formation of the septal ring, a membrane-associated organelle that drives constriction and formation of new cell walls during cell division.

Shape-based lead hopping • Wyeth identified a lead compound by high-throughput screening (HTS) and crystallised it bound to the target protein

Shape-based lead hopping Bound structure of the lead compound in ZipA

Shape-based lead hopping • The researchers then used a shape-based comparison (Rapid Overlay of Chemical Structures, or ROCS) to find related compounds that were available for testing. • The database structures were scored against the query molecule using a Tanimoto comparison.

ROCS

A ROCS similarity of 0.9 (i.e. mostly shape similarity) does not correspond too closely to the similarity calculated using a fragment-based Tanimoto coefficient (about 0.35) in this case.

ROCS • A number of active compounds were identified. They report the crystal structure of one of the compounds (ROCS_PART_18) bound to the target protein

Conclusion • This lecture has provided some background to chemical databases, similarity methods and some of the processes involved in identifying new lead compounds for drug discovery. • Areas for revision: – representations of molecules and data (e.g. file formats, SMILES) – molecular databases and the importance of being able to search molecular structures – molecular fingerprints – prediction of molecular properties – lead hopping