KNOWLEDGE-BASED METHOD FOR DETERMINING THE MEANING OF AMBIGUOUS BIOMEDICAL TERMS USING INFORMATION CONTENT MEASURES OF SIMILARITY
Bridget McInnes, Ted Pedersen, Ying Liu, Genevieve B. Melton, Serguei Pakhomov
OBJECTIVE OF THIS WORK
Develop and evaluate a method that can disambiguate terms in biomedical text by exploiting similarity information extrapolated from the Unified Medical Language System (UMLS).
Evaluate the efficacy of Information Content (IC)-based similarity measures over path-based similarity measures for Word Sense Disambiguation (WSD).
WORD SENSE DISAMBIGUATION
Word sense disambiguation is the task of determining the appropriate sense of a term given the context in which it is used.
Term: tolerance
Example: Buspirone attenuates tolerance to morphine in mice with skin cancer
Possible senses: Drug Tolerance, Immune Tolerance
SENSE INVENTORY: UNIFIED MEDICAL LANGUAGE SYSTEM
Unified Medical Language System (UMLS)
  Semantic Network
  Metathesaurus
    ~1.7 million biomedical and clinical concepts, integrated semi-automatically
    CUIs (Concept Unique Identifiers), linked by relations:
      Hierarchical: PAR/CHD and RB/RN
      Non-hierarchical: SIB, RO
    Sources viewed together or independently
      Medical Subject Headings (MSH)
  SPECIALIST Lexicon
    Biomedical and clinical terms, including variants
WORD SENSE DISAMBIGUATION
Buspirone attenuates tolerance to morphine in mice with skin cancer
Senses as Concept Unique Identifiers (CUIs):
  Drug Tolerance: C0013220
  Immune Tolerance: C0020963
SENSERELATE ALGORITHM
Each possible sense of a target word is assigned a score: the sum of the similarities between it and its surrounding terms.
The target word is assigned the sense with the highest score.
Proposed by Patwardhan and Pedersen (2003) using WordNet.
UMLS::SenseRelate is a modification of this algorithm using information from the UMLS.
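The scoring rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the UMLS::SenseRelate implementation: the function names, the concept labels, and every similarity value in `TOY_SIM` are hypothetical and chosen only to make the example runnable.

```python
from typing import Callable, Dict, Iterable


def sense_relate(
    senses: Iterable[str],
    context: Iterable[str],
    similarity: Callable[[str, str], float],
) -> Dict[str, float]:
    """Score each candidate sense by summing its similarity to every context concept."""
    context = list(context)
    return {s: sum(similarity(s, c) for c in context) for s in senses}


def best_sense(scores: Dict[str, float]) -> str:
    """Return the sense with the highest score."""
    return max(scores, key=scores.get)


# Hypothetical pairwise similarities, invented for illustration only.
TOY_SIM = {
    ("drug_tolerance", "morphine"): 0.20,
    ("drug_tolerance", "mice"): 0.10,
    ("immune_tolerance", "morphine"): 0.05,
    ("immune_tolerance", "mice"): 0.10,
}


def toy_similarity(sense: str, concept: str) -> float:
    # Pairs missing from the table get similarity 0.0 (an assumption of this sketch).
    return TOY_SIM.get((sense, concept), 0.0)


scores = sense_relate(
    ["drug_tolerance", "immune_tolerance"], ["morphine", "mice"], toy_similarity
)
print(scores, best_sense(scores))
```

Passing the similarity function as a parameter keeps the scoring loop independent of whichever path-based or IC-based measure is plugged in.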
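SENSERELATE EXAMPLE
Buspirone attenuates tolerance to morphine in mice with skin cancer
Candidate senses of "tolerance": Drug Tolerance (C0013220), Immune Tolerance (C0020963)
Surrounding concepts: Buspirone (C0006462), Morphine (C0026549), Mice (C0026809), Skin cancer (C0007114)
Each score is the sum of the similarities between a sense and the surrounding concepts:
  Drug Tolerance score = 0.45 (similarities shown: 0.09, 0.16, 0.11)
  Immune Tolerance score = 0.27 (similarities shown: 0.09, 0.05, 0.04)
Drug Tolerance has the higher score, so it is the sense assigned to "tolerance".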
SENSERELATE ASSUMPTION
An ambiguous word is often used in the sense that is most similar to the senses of the terms that surround it.
SENSERELATE COMPONENTS
  Identifying the concepts of surrounding terms
  Calculating semantic similarity
IDENTIFYING THE CONCEPTS OF THE SURROUNDING TERMS
Use the SPECIALIST Lexicon to identify the terms, then map each term to a concept by string match against the MRCONSO table in the UMLS.
Example: Buspirone attenuates tolerance to morphine in mice with skin cancer
  SPECIALIST Lexicon   MRCONSO
  ...                  ...
  skin cancer          C0007114
  skin grafting        C0037297
  skin disease         C0037274
  ...                  ...
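A rough sketch of the term-to-CUI lookup follows. The slide only states that terms are matched by string against MRCONSO; the greedy longest-match strategy, the `map_terms` function, the `max_len` parameter, and the three-entry `TOY_MRCONSO` table are assumptions made for illustration, and a real system would read the MRCONSO table itself.

```python
from typing import Dict, List, Tuple

# Toy stand-in for MRCONSO string -> CUI rows. The three entries are the ones
# shown on the slide; the dictionary layout is an assumption of this sketch.
TOY_MRCONSO: Dict[str, str] = {
    "skin cancer": "C0007114",
    "skin grafting": "C0037297",
    "skin disease": "C0037274",
}


def map_terms(
    tokens: List[str], lexicon: Dict[str, str], max_len: int = 3
) -> List[Tuple[str, str]]:
    """Greedy longest-match lookup of token spans against the lexicon."""
    found: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        # Try the longest span starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in lexicon:
                found.append((candidate, lexicon[candidate]))
                i += n
                break
        else:
            i += 1  # no known term starts here; move to the next token
    return found


sentence = "Buspirone attenuates tolerance to morphine in mice with skin cancer"
mapped = map_terms(sentence.split(), TOY_MRCONSO)
print(mapped)
```

Tokens with no entry in the table are simply skipped, which mirrors the point made later in the deck that not every surrounding term maps to a CUI.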
SEMANTIC SIMILARITY MEASURES
Path-based measures
  Path
  Wu and Palmer
  Leacock and Chodorow
  Nguyen and Al-Mubaid
Information content (IC)-based measures
  Resnik
  Lin
  Jiang and Conrath
PATH-BASED SIMILARITY MEASURES
Use only the path information obtained from a taxonomy.
Path measure
  sim(c1, c2) = 1 / minpath(c1, c2)
  where minpath is the shortest path between the two concepts
Wu and Palmer, 1994
  sim(c1, c2) = (2 * depth(LCS(c1, c2))) / (depth(c1) + depth(c2))
  where LCS is the least common subsumer of the two concepts
Leacock and Chodorow, 1998
  sim(c1, c2) = -log(minpath(c1, c2) / (2D))
  where D is the total depth of the taxonomy
Nguyen and Al-Mubaid, 2006
  sim(c1, c2) = log(2 + (minpath(c1, c2) - 1) * (D - depth(LCS(c1, c2))))
PATH-BASED SIMILARITY MEASURES
Use only the path information obtained from a taxonomy.
[Figure: taxonomy fragment containing Disease (C0012634), Drug Related Disorder (C0277579), Drug Tolerance (C0013220), Neoplasm (C1302761), Neoplastic Disease (C1882062), Malignant Neoplasm (C0006826) and Skin cancer (C0007114)]
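The four path-based measures can be tried out on a small taxonomy. The sketch below is illustrative only: the concept names are taken from the slide, but the parent/child edges in `PARENT` are an assumption, as are the conventions that path length and depth are counted in nodes (so a concept is at distance 1 from itself and the root has depth 1) and that logarithms are natural.

```python
import math
from typing import Dict, List, Optional

# Toy taxonomy (child -> parent). Concept names come from the slide; the edges
# are an assumption made for illustration.
PARENT: Dict[str, Optional[str]] = {
    "Disease": None,
    "Drug Related Disorder": "Disease",
    "Drug Tolerance": "Drug Related Disorder",
    "Neoplasm": "Disease",
    "Neoplastic Disease": "Neoplasm",
    "Malignant Neoplasm": "Neoplastic Disease",
    "Skin cancer": "Malignant Neoplasm",
}


def ancestors(c: str) -> List[str]:
    """Return the concept followed by its ancestors up to the root."""
    chain = [c]
    while PARENT[chain[-1]] is not None:
        chain.append(PARENT[chain[-1]])
    return chain


def depth(c: str) -> int:
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(c))


def lcs(c1: str, c2: str) -> str:
    """Least common subsumer: the deepest ancestor shared by both concepts."""
    seen = set(ancestors(c1))
    return next(a for a in ancestors(c2) if a in seen)


def minpath(c1: str, c2: str) -> int:
    """Number of nodes on the shortest path, so minpath(c, c) == 1."""
    return depth(c1) + depth(c2) - 2 * depth(lcs(c1, c2)) + 1


D = max(depth(c) for c in PARENT)  # total depth of the toy taxonomy


def sim_path(c1: str, c2: str) -> float:
    return 1 / minpath(c1, c2)


def sim_wup(c1: str, c2: str) -> float:
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))


def sim_lch(c1: str, c2: str) -> float:
    return -math.log(minpath(c1, c2) / (2 * D))


def sim_nam(c1: str, c2: str) -> float:
    return math.log(2 + (minpath(c1, c2) - 1) * (D - depth(lcs(c1, c2))))


pair = ("Drug Tolerance", "Skin cancer")
for name, fn in [("path", sim_path), ("wup", sim_wup), ("lch", sim_lch), ("nam", sim_nam)]:
    print(name, round(fn(*pair), 4))
```

In this toy tree the two concepts only meet at the root, so the path and Wu and Palmer scores are low; none of the four functions uses anything beyond node positions in the taxonomy.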
INFORMATION CONTENT-BASED MEASURES
Incorporate the probability of the concepts.
  IC(concept) = -log(P(concept))
P(concept) is calculated by summing the probability of the concept and the probabilities of its descendants.
Probabilities are obtained from an external corpus.
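As a small worked illustration of the IC definition, the sketch below propagates corpus counts up a toy hierarchy. The `CHILDREN` taxonomy and the `FREQ` counts are hypothetical, the natural logarithm is an assumption, and the recursion assumes a tree (in a hierarchy where a concept has several parents, shared descendants would need to be counted once).

```python
import math
from typing import Dict, List

# Hypothetical taxonomy (parent -> children) and corpus counts for illustration.
CHILDREN: Dict[str, List[str]] = {
    "Disease": ["Drug Related Disorder", "Neoplasm"],
    "Drug Related Disorder": ["Drug Tolerance"],
    "Neoplasm": ["Skin cancer"],
    "Drug Tolerance": [],
    "Skin cancer": [],
}
FREQ: Dict[str, int] = {
    "Disease": 10,
    "Drug Related Disorder": 5,
    "Drug Tolerance": 15,
    "Neoplasm": 30,
    "Skin cancer": 40,
}
TOTAL = sum(FREQ.values())


def subtree_count(concept: str) -> int:
    """Count of the concept plus the counts of all of its descendants."""
    return FREQ[concept] + sum(subtree_count(child) for child in CHILDREN[concept])


def probability(concept: str) -> float:
    return subtree_count(concept) / TOTAL


def information_content(concept: str) -> float:
    return -math.log(probability(concept))


for c in CHILDREN:
    print(f"{c}: P={probability(c):.2f} IC={information_content(c):.3f}")
```

Because every occurrence of a descendant also counts toward its ancestors, the root gets probability 1 and IC 0, and IC grows as concepts become more specific.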
INFORMATION CONTENT-BASED MEASURES
Incorporate the probability of the concepts: IC(concept) = -log(P(concept))
Resnik, 1995
  sim(c1, c2) = IC(LCS(c1, c2))
Jiang and Conrath, 1997
  sim(c1, c2) = 1 / (IC(c1) + IC(c2) - 2 * IC(LCS(c1, c2)))
Lin, 1998
  sim(c1, c2) = (2 * IC(LCS(c1, c2))) / (IC(c1) + IC(c2))
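The three IC-based formulas can be written directly as functions of the IC values involved. This is a sketch under stated assumptions: the IC numbers in the usage lines are invented, and the handling of the degenerate cases (zero distance for Jiang and Conrath, zero denominator for Lin) is a choice made here, not something taken from the slides.

```python
import math


def sim_res(ic_lcs: float) -> float:
    """Resnik: the IC of the least common subsumer."""
    return ic_lcs


def sim_jcn(ic_c1: float, ic_c2: float, ic_lcs: float) -> float:
    """Jiang and Conrath: inverse of the IC distance between the two concepts."""
    distance = ic_c1 + ic_c2 - 2 * ic_lcs
    if distance == 0:
        # Assumption of this sketch: zero distance is reported as infinite similarity.
        return math.inf
    return 1 / distance


def sim_lin(ic_c1: float, ic_c2: float, ic_lcs: float) -> float:
    """Lin: shared IC relative to the total IC of the two concepts."""
    total = ic_c1 + ic_c2
    if total == 0:
        # Assumption of this sketch: two zero-IC concepts get similarity 0.0.
        return 0.0
    return 2 * ic_lcs / total


# Hypothetical IC values: a concept (3.0), its parent (2.0), and their LCS,
# which is the parent itself (2.0).
ic_child, ic_parent, ic_shared = 3.0, 2.0, 2.0
print(sim_res(ic_shared), sim_jcn(ic_child, ic_parent, ic_shared), sim_lin(ic_child, ic_parent, ic_shared))
```

Resnik depends only on the shared ancestor, whereas Jiang and Conrath and Lin also take the IC of the two concepts themselves into account.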
IC-BASED SIMILARITY MEASURES
Path information + probability of concepts (from an external corpus)
[Figure: taxonomy fragment containing Disease (C0012634), Drug Related Disorder (C0277579), Drug Tolerance (C0013220), Neoplasm (C1302761), Neoplastic Disease (C1882062), Malignant Neoplasm (C0006826) and Skin cancer (C0007114), combined with an external corpus]
EXPERIMENTAL FRAMEWORK
Use the open-source UMLS::Similarity package to obtain the similarity between the terms and the possible senses in the SenseRelate algorithm.
Path information: parent/child relations in the MSH source.
Information content: calculated using the UMLSonMedline dataset created by NLM.
  Consists of concepts from the 2009AB UMLS and the frequency with which they occurred in Medline, obtained using the Essie search engine (Ide et al., 2007)
  Medline: database of biomedical and clinical citations
EVALUATION DATA: MSH-WSD
MSH-WSD dataset (Jimeno-Yepes et al., 2011)
203 target words (ambiguous words) from Medline:
  Type       Count  Example
  Terms      106    tolerance
  Acronyms   88     CA (calcium, California)
  Mixtures   9      bat (brown adipose tissue)
Each target word has ~187 instances (Medline abstracts); one abstract is ~500 words.
Each target word in the instances is assigned a concept from MSH by exploiting the MSH concepts manually assigned to the abstract.
Average of 2.08 possible senses per target word.
Majority sense over all the target words: 54.5%
RESULTS
[Figure: bar chart of accuracy for the baseline, the path-based measures (path, lch, wup, nam) and the IC-based measures (res, jcn, lin). The baseline is at 0.55; the measures range from 0.69 to 0.74.]
COMPARISON ACROSS SUBSETS OF MSH-WSD
[Figure: bar chart of accuracy for the baseline and SenseRelate on the Terms, Acronyms and Mixture subsets and overall. Baseline values lie between 0.53 and 0.55; the other plotted values range from 0.67 to 0.93.]
WINDOW SIZES
Use the terms surrounding the target word within a specified window: 1, 2, 5, 10, 25, 50, 60, 70
Example with window size = 2:
  Buspirone attenuates tolerance to morphine in mice with skin_cancer
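Selecting the context for a given window size can be sketched as below. The slide does not spell out how the window is counted; this sketch assumes the window is the number of terms taken on each side of the target word, and the `context_window` helper is hypothetical.

```python
from typing import List


def context_window(terms: List[str], target_index: int, size: int) -> List[str]:
    """Return up to `size` terms on each side of the target, excluding the target."""
    left = terms[max(0, target_index - size):target_index]
    right = terms[target_index + 1:target_index + 1 + size]
    return left + right


terms = ["Buspirone", "attenuates", "tolerance", "to", "morphine",
         "in", "mice", "with", "skin_cancer"]
window = context_window(terms, terms.index("tolerance"), 2)
print(window)
```

Only the terms in the returned window that map to a CUI can contribute to the sense scores.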
COMPARISON OF WINDOW SIZES FOR LIN
[Figure: accuracy of the lin measure by window size (0, 1, 2, 5, 10, 25, 50, 60, 70). Plotted values include 0.53, 0.65, 0.69, 0.71 and 0.74.]
SURROUNDING TERMS
Not all terms have a concept in the UMLS; therefore, not all surrounding terms in the window are mapped to CUIs.
WINDOW SIZES VERSUS MAPPED TERMS
[Figure: number of mappings by window size for lin. Plotted values, in increasing order of window size (0, 1, 2, 5, 10, 25, 50, 60, 70): 0, 0.27, 0.79, 1.85, 3.49, 7.6, 12.96, 14.28, 15.64.]
FUTURE WORK: MAPPING TERMS
Currently looking at mapping the terms to CUIs using information from the concept mapping system MetaMap:
  Obtain the terms from MetaMap and do a dictionary look-up in MRCONSO
    Hypothesis: the terms obtained by MetaMap are more accurate than those obtained using the SPECIALIST Lexicon
  Obtain the CUIs from MetaMap
    Hypothesis: the CUIs obtained by MetaMap will be more accurate than the dictionary look-up
OBJECTIVE #1
Develop and evaluate a method that can disambiguate terms in biomedical text by exploiting similarity information extrapolated from the UMLS.
  UMLS::SenseRelate obtains statistically significantly higher disambiguation accuracy than the baseline
  On par with previous unsupervised methods for terms
OBJECTIVE #2
Evaluate the efficacy of IC-based similarity measures over path-based measures on a secondary task.
  There is no statistically significant difference between the accuracies obtained by the IC-based measures
  There is a statistically significant difference between the IC-based measures and the path-based measures
TAKE HOME MESSAGE
An ambiguous word is often used in the sense that is most similar to the senses of the concepts of the terms that surround it.
RESOURCES
Software:
  UMLS::SenseRelate  http://search.cpan.org/dist/UMLS-SenseRelate/
  UMLS::Similarity   http://search.cpan.org/dist/UMLS-Similarity/
Data:
  MSH-WSD            http://wsd.nlm.nih.gov/collaboration.shtml