Empirical Validation of the Effectiveness of Chemical Descriptors

Empirical Validation of the Effectiveness of Chemical Descriptors in Data Mining
Kirk Simmons
DuPont Crop Protection
Stine-Haskell Research Center
1090 Elkton Road
Newark, DE 19711
kirk.a.simmons@usa.dupont.com

The Study
• Purpose
• Strategy
  – Methods
  – Metrics
• Results
• Practical Application
• Conclusions

Purpose
• Chemical Structure Conference (1996), Holland
  – Data mining/similarity methodologies reported
  – Used numerous descriptor sets
  – No standard datasets
  – Comparisons difficult
• Comparative study of chemical descriptors across varied biology

Strategy
• Systematically evaluate descriptors within a compound dataset across multiple biological endpoints
• All compounds have experimentally measured endpoints
• Diversity of biological endpoints
  – In vitro (receptor affinity, enzyme inhibition)
  – In vivo (insect mortality)
• Explored nine common descriptor sets
• Train a model, then use it to forecast a validation set

Methods
• Four in vitro assays
  – 48K-compound dataset for training
  – Corporate database for validation
• Two in vivo assays
  – 75-100K-compound datasets
  – Randomly divided into training and validation subsets
• Recursive partitioning as the analytic method
  – Appropriate method for HTS data
  – Selected statistically conservative inputs (p-tail < 0.01)
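The random division of the in vivo datasets into training and validation subsets can be sketched as follows. This is only an illustration: the slides do not state the split ratio or the tooling used, so the function name `split_dataset`, the `validation_fraction` parameter, and the seed are all assumptions.

```python
import random


def split_dataset(compound_ids, validation_fraction=0.5, seed=0):
    """Randomly divide compound IDs into (training, validation) lists.

    validation_fraction and seed are hypothetical defaults; the slides
    only say the datasets were "randomly divided".
    """
    rng = random.Random(seed)          # seeded so the split is reproducible
    shuffled = list(compound_ids)      # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    validation = shuffled[:n_validation]
    training = shuffled[n_validation:]
    return training, validation


# Example with made-up compound IDs 0..99
training, validation = split_dataset(range(100), validation_fraction=0.25, seed=1)
print(len(training), len(validation))
```

Every compound lands in exactly one subset, so a model fitted on the training subset is forecasting compounds it has not seen.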

Metrics
• 4-way interaction
  – Analytic method, compound set, biology, and descriptors
• Efficiency of analysis (lift chart)
  – Fraction of actives found / fraction of dataset tested
  – Rewards efficiency only
• Effectiveness of analysis (composite score)
  – Fraction of actives found × efficiency
  – Rewards efficiency as well as completeness
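A minimal sketch of the two metrics as defined on this slide. The function and parameter names are illustrative, and the compound counts in the example are invented to make the arithmetic easy to follow; they are not results from the study.

```python
def lift(actives_found, total_actives, n_tested, n_total):
    """Efficiency: fraction of actives found / fraction of dataset tested."""
    fraction_actives_found = actives_found / total_actives
    fraction_tested = n_tested / n_total
    return fraction_actives_found / fraction_tested


def composite_score(actives_found, total_actives, n_tested, n_total):
    """Effectiveness: fraction of actives found x efficiency (lift)."""
    fraction_actives_found = actives_found / total_actives
    return fraction_actives_found * lift(
        actives_found, total_actives, n_tested, n_total
    )


# Hypothetical dataset: 1024 compounds, 64 of them active.
# Model A selects 128 compounds and recovers 32 actives.
print(lift(32, 64, 128, 1024))             # 4.0
print(composite_score(32, 64, 128, 1024))  # 2.0

# Model B selects only 16 compounds and recovers 8 actives.
print(lift(8, 64, 16, 1024))               # 8.0
print(composite_score(8, 64, 16, 1024))    # 1.0
```

In this made-up comparison, Model B has the higher lift but finds far fewer of the actives, so its composite score is lower. That is the sense in which the composite score rewards completeness as well as efficiency.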

Results - Training

Results - Forecasting

Averaged Results - Training

Averaged Results - Forecasting

Practical Application
• RP-based models using screening data on 3 targets
  – Activity treated as active/inactive
  – DiverseSolutions® BCUT descriptors
• RP models used to forecast vendor compounds (1M)
• Selected compounds purchased/screened
  – Hit rates improved 530% over training sets
  – New structures and improved activity

Historical Screening Results

RP-based Screening Results

Results Comparison

Conclusions
• Not all chemical descriptors equally effective
  – Whole-molecule property-based descriptors less effective
  – Chemical feature-based descriptors appear more effective
• Training model effectiveness
  – Averaged 28% of theory
  – Room for 4-fold improvement
• Validation model effectiveness
  – Averaged 16% of theory
  – Room for 6-fold improvement
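The "room for improvement" figures appear to follow from the "% of theory" averages by simple division. The sketch below assumes that "of theory" means the fraction of the theoretical maximum score achieved; the slides do not define the term, and the function name is illustrative.

```python
def fold_improvement_available(fraction_of_theory):
    """Factor by which a score could grow before reaching the
    theoretical maximum (assumed interpretation of "of theory")."""
    return 1.0 / fraction_of_theory


print(fold_improvement_available(0.28))  # about 3.6, i.e. roughly 4-fold (training)
print(fold_improvement_available(0.16))  # 6.25, i.e. roughly 6-fold (validation)
```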

Acknowledgements
• Dr. Linrong Yang, FMC Corporation
  – Completed the work
• FMC Corporation
  – Release of the results
• Prof. Peter Willett, University of Sheffield
• Prof. Alex Tropsha, University of North Carolina
• Prof. Doug Hawkins, University of Minnesota
• DuPont Corporation