ChemSpider as a Platform for Crowd Participation

ChemSpider as a Platform for Crowd Participation in Curating Chemistry
Antony Williams
IDCC, Chicago, December 2010

WARNING: Chemistry is Dangerous

Di-Hydrogen Monoxide

Di-Hydrogen Monoxide 2 H

Di-Hydrogen Monoxide 2 H + 1 O

Di-Hydrogen Monoxide H2O

Di-Hydrogen Monoxide H2O Water

It’s all on Wikipedia…

Chemistry on the Internet – Not All Bad
§ 100s of websites hosting chemistry-related data
§ Chemistry information is generally “compound-based”
§ Chemical “structures”
§ Identifiers, names and synonyms
§ Properties
§ Analytical data
§ How to synthesize
§ Articles, patents, safety information
§ Chemistry “language and dialects”

Dialects describing chemicals

A Pragmatic Vision: “Build a Structure Centric Community”
§ Integrate chemistry across the internet based on “chemical structure”
§ A “structure-based hub” to information and data
§ Let chemists contribute their own data
§ Allow the community to curate & annotate data
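
As a rough illustration of what a “structure-based hub” could look like, the sketch below files records from different sources under a single structure key and lets names and synonyms resolve to that key. Using the Standard InChIKey as the key, and the names `StructureHub`, `add_record` and `lookup_by_name`, are assumptions made for this example; the slides do not describe ChemSpider's internal data model.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class StructureHub:
    """Hypothetical hub: every record hangs off one structure key."""
    # structure key -> source name -> data contributed by that source
    records: dict = field(default_factory=lambda: defaultdict(dict))
    # lower-cased name or synonym -> set of structure keys
    names: dict = field(default_factory=lambda: defaultdict(set))

    def add_record(self, key, source, data, synonyms=()):
        """Attach one source's data and synonyms to a structure key."""
        self.records[key][source] = data
        for name in synonyms:
            self.names[name.lower()].add(key)

    def lookup_by_name(self, name):
        """Return {structure key: {source: data}} for a name, case-insensitively."""
        keys = self.names.get(name.lower(), set())
        return {key: self.records[key] for key in keys}


WATER = "XLYOFNOQVPJJNP-UHFFFAOYSA-N"  # Standard InChIKey for water

hub = StructureHub()
# Source names and data values below are illustrative placeholders.
hub.add_record(WATER, "encyclopedia", {"formula": "H2O"},
               synonyms=["Water", "Dihydrogen monoxide"])
hub.add_record(WATER, "vendor catalogue", {"safety": "placeholder entry"},
               synonyms=["water"])

hits = hub.lookup_by_name("dihydrogen monoxide")
```

Because both deposits share one key, a search by either name returns the data from every source, which is the point of centring the hub on structure rather than on names.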

www.chemspider.com

Answering Questions for Chemists
§ Questions a chemist might ask…
§ What is the melting point of n-heptanol?
§ What is the chemical structure of Xanax?
§ Chemically, what is phenolphthalein?
§ What are the stereocenters of cholesterol?
§ Where can I find publications about xylene?
§ What are the different trade names for Aspirin?
§ What is the NMR spectrum of benzoic acid?
§ What are the safety handling issues for toluene?

Search for a Chemical…by name

Available Information…
§ Linked to chemical vendors, safety data, toxicity, metabolism…

Available Information…

ChemSpider Today
§ Almost 25 million unique chemicals
§ Over 400 data sources
§ Grows daily – community and RSC depositions
§ Community annotation and curation
§ We curate, edit, change, enhance data daily

Three Years of Experience
§ Internet-based chemistry is a mess!
§ Public compound databases are contaminated
§ The annotation/curation of data online is difficult
§ Most database hosts are non-responsive to feedback – “We are a host/repository of data”
§ Who cares?

Linked Data on the Web

Where is chemistry online?
§ Encyclopedic articles (Wikipedia)
§ Chemical vendor databases
§ Metabolic pathway databases
§ Property databases
§ Patents with chemical structures
§ Drug discovery data
§ Scientific publications
§ Compound aggregators
§ Blogs/Wikis and Open Notebook Science

What is the Structure of Vitamin K?

MeSH – Medical Subject Headings
§ Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione).

What is the Structure of Vitamin K 1?

Chemical Abstracts “Common Chemistry” Database

Wikipedia WRONG

WRONG

Incorrect Structures WRONG

Lack of Stereochemistry WRONG

Does stereochemistry matter?
§ Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon, Thalidomide

WRONG

PubChem

WRONG

What’s Methane?

What ELSE is Methane???

Internet-Based Chemistry is a Mess
§ Algorithms can only get you so far
§ Human curation is necessary
§ Only the crowds can help with big data… ChemSpider is approaching 25 million compounds

Search “Vitamin H”

“Curate” Identifiers

Crowd-sourcing Chemistry Curation
§ Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate

“Curate” Identifiers
§ General curation activities
§ Remove incorrect names
§ Correct spellings
§ Add multilingual names
§ Add alternative names
§ In 3 years over 1 million structure-identifier relationships have been validated – robotically and manually
§ 130 people have participated in validation or annotation. “Crowds” can be quite small!
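
The curation activities listed above can be pictured as operations on structure-identifier relationships, each carrying a review status and a history of who did what. The following is a minimal sketch under that reading; `Status`, `NameLink`, `CompoundRecord`, the curator labels and the record key are all hypothetical and are not taken from ChemSpider itself.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    UNREVIEWED = "unreviewed"
    VALIDATED = "validated"
    REMOVED = "removed"


@dataclass
class NameLink:
    """One structure-identifier relationship plus its curation history."""
    name: str
    language: str = "en"
    status: Status = Status.UNREVIEWED
    history: list = field(default_factory=list)  # (curator, action) pairs

    def curate(self, curator, status):
        """Record a robotic or manual decision about this name."""
        self.history.append((curator, status.value))
        self.status = status


@dataclass
class CompoundRecord:
    structure_key: str
    links: dict = field(default_factory=dict)  # name -> NameLink

    def add_name(self, name, curator, language="en"):
        """Add an alternative or multilingual name; it starts unreviewed."""
        link = self.links.setdefault(name, NameLink(name, language))
        link.history.append((curator, "added"))
        return link

    def visible_names(self):
        """Names still shown to users, i.e. everything not removed."""
        return sorted(name for name, link in self.links.items()
                      if link.status is not Status.REMOVED)


biotin = CompoundRecord("biotin-record")  # placeholder key, not a real identifier
biotin.add_name("Biotin", "robot").curate("robot", Status.VALIDATED)
biotin.add_name("Vitamin H", "depositor").curate("curator_a", Status.VALIDATED)
biotin.add_name("Biotine", "curator_b", language="fr")  # multilingual name
biotin.add_name("Vitamin K", "depositor").curate("curator_a", Status.REMOVED)
```

Marking a name as removed rather than deleting it keeps the history available, so one curator can later check another's decision.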

Crowdsourcing Works
§ The “crowd” has deposited data (structures, spectra, etc.) and participated in data curation
§ Curators at different levels check each other's work
§ Wikipedia is the prime modern example
§ Some curators are “madmen”…
§ The Oxford English Dictionary

Vancomycin – Curate This!!!

Vancomycin on ChemSpider
1 compound – 3 days

Crowdsourced “Annotations”
§ Users can add
§ Descriptions/Syntheses/Commentaries
§ Links to articles
§ Spectral data
§ Photos
§ MP3 files
§ Videos

Multimedia Content Holder

Gaming for Curation of Spectra

ChemSpider Everywhere Crowdsourced Curation of Spectra

Data Curation

True Curation of Data

ChemSpider SyntheticPages

[Table: drug structure lookups across databases. Columns: Drug Name, Generic Name, CAS Com, ChEBI, ChemSpider, ChemIDPlus, DailyMed, DrugBank, PubChem, Wikipedia. Rows: Spiriva (Tiotropium Bromide); Depakote (Valproate semisodium); Basen (Voglibose); Symbicort (1: Budesonide, 2: Formoterol); Vytorin (1: Ezetimibe, 2: Simvastatin); Taxol (Paclitaxel); Thalidomide; Zocor (Simvastatin); Crestor (Rosuvastatin). Cell entries: ✓, ✗, “No Hits”, “No Structure”, “WRONG”, and counts such as 8/1 and 44/1.]

Sharing Our Activities
§ Presently defining approaches with other public compound databases to share results of curation activities
§ Member of large European project to link data from the Life Sciences. Sharing results of curation is essential
§ Making curation and contribution interfaces mobile

Mobile ChemSpider

First request to Database Hosts!
§ Every public compound database host should add ONE feature – “Leave Comments”

Second request to Database Hosts! Show Comments

Question Quality

Thank you
Email: williamsa@rsc.org
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams