Finding High-frequent Synonyms of a Domain-specific Verb in English Sub-language of MEDLINE Abstracts Using WordNet
Chun Xiao and Dietmar Rösner
Institut für Wissens- und Sprachverarbeitung (IWS), Faculty of Computer Science, University of Magdeburg, 39016 Magdeburg, Germany
Introduction: MEDLINE Abstract
• MEDLINE®:
– Domain: clinical medicine, biological and physical sciences;
– Source: articles from over 4,600 journals published throughout the world;
– Coverage: abstracts are included for about 52% of the articles.
• PubMed®, an application of UMLS (Unified Medical Language System), provides links within MEDLINE® to the full text of 15 clinical medical journals.
– Available at: http://www.ncbi.nlm.nih.gov/PubMed/
Available Resources in the Experiment
• The test corpus consists of 800 MEDLINE abstracts extracted from the GENIA Corpus V3.0p and V3.01.
– Available at: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/
• WordNet 1.7.1
Extraction of a Specific Relation
• Inhibitory relation
• Example: Secreted from activated T cells and macrophages, bone marrow-derived MIP-1 alpha/GOS19 inhibits primitive hematopoietic stem cells and appears to be involved in the homeostatic control of stem cell proliferation.
• Semantic annotations in the GENIA corpus:
– protein_molecule
– cell_type
High-frequent Verbs in the Test Corpus
Synonym Sets (Synsets) of Verb inhibit
• Synsets in WordNet
– Sense 1: suppress, stamp down, inhibit, subdue, conquer, curb
=> control, hold in, hold, contain, check, curb, moderate
– Sense 2: inhibit
=> restrict, restrain, trammel, limit, bound, confine, throttle
• Synset in the test corpus of MEDLINE abstracts
– inhibit, block, prevent, etc.
Problem
• Occurrences in the test corpus of MEDLINE abstracts of verbs from the two synsets:
– WN-synonyms: suppress (69), limit (16), restrict (5)
– non-WN-synonyms: block (124), reduce (119), prevent (53)
• How can WordNet synsets and information from the corpus be combined to create domain-specific verb synsets?
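To make the gap concrete, the counts listed on this slide can be summed directly. The following is only a small arithmetic sketch over the six verbs named above; the variable names are illustrative, and the remaining verbs in the corpus are not covered.

```python
# Occurrence counts exactly as listed on the slide (six verbs only).
wn_synonyms = {"suppress": 69, "limit": 16, "restrict": 5}
non_wn_synonyms = {"block": 124, "reduce": 119, "prevent": 53}

wn_total = sum(wn_synonyms.values())
non_wn_total = sum(non_wn_synonyms.values())

# Share of the listed occurrences that WordNet's synset of "inhibit" misses.
non_wn_share = non_wn_total / (wn_total + non_wn_total)

print(f"WN-synonyms: {wn_total}, non-WN-synonyms: {non_wn_total}, "
      f"non-WN share: {non_wn_share:.1%}")
```

Among these six verbs, the ones outside the WordNet synset account for roughly three quarters of the occurrences, which is the motivation for combining both resources.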
Three Definitions
• Language unit: a text segment (a sentence, several sentences, a paragraph, etc.) that expresses one semantic topic.
• Core word: the verb whose synset in the test corpus is to be found. E.g., in this test inhibit is the core word.
• Keyword: a word whose corresponding verb base form is the core word. E.g., in this test inhibitor, inhibiting, and so on are keywords.
Example
We performed an analysis of the mechanisms by which two PKC inhibitors, Calphostin C and Staurosporine, prevent the FN-induced IL-1 beta response. Both inhibitors blocked the secretion of IL-1 beta protein into the media of peripheral blood mononuclear cells exposed to FN.
• Language unit: two sentences
• Core word: inhibit
• Keyword: inhibitor (2 times)
• Local context: searching window size >= 3
• Verbs around the first keyword: perform, prevent, block, expose
• Verbs around the second keyword: prevent, perform, block, expose
In the following test, the language unit is selected to be the whole abstract.
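The collection step in this example can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the text is already POS-tagged and lemmatized into (token, tag, lemma) triples, it uses a simplified tag "V" for verbs, and it interprets the window size as the number of verbs taken on each side of a keyword, nearest first on the left. The function name and the reduced token list are hypothetical.

```python
def verbs_around_keywords(tagged, keyword_lemmas, window):
    """For each keyword, return up to `window` verb lemmas on each side."""
    verb_positions = [i for i, (_, tag, _) in enumerate(tagged) if tag == "V"]
    contexts = []
    for i, (_, _, lemma) in enumerate(tagged):
        if lemma not in keyword_lemmas:
            continue
        # Verbs to the left, nearest first; verbs to the right, in text order.
        left = [p for p in reversed(verb_positions) if p < i][:window]
        right = [p for p in verb_positions if p > i][:window]
        contexts.append([tagged[p][2] for p in left + right])
    return contexts


# Content words of the two example sentences, hand-tagged for illustration.
tagged = [
    ("performed", "V", "perform"), ("analysis", "N", "analysis"),
    ("inhibitors", "N", "inhibitor"), ("prevent", "V", "prevent"),
    ("response", "N", "response"), ("inhibitors", "N", "inhibitor"),
    ("blocked", "V", "block"), ("secretion", "N", "secretion"),
    ("exposed", "V", "expose"),
]

contexts = verbs_around_keywords(tagged, {"inhibitor"}, window=3)
for verbs in contexts:
    print(verbs)
```

Under these assumptions the two lists reproduce the verb orderings given on the slide for the first and second keyword.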
Idea Description
• Assumption: The synonyms of a verb co-occur much more frequently with the keywords of the verb than with other words in the language unit.
• Method: The verb chunks around the keywords are collected; the synonyms of the core word are then selected from them and filtered using WordNet synset information.
– One resource: WordNet synset information
– The other resource: local context information in the test corpus
Distribution of Keywords of inhibit in the Test Corpus
Verbs around the Keywords in the Test Corpus
Method Description I
• Expansion of WordNet synsets (Si)
– S1: the verb collection of synonyms of all synonyms of the core word;
– S2: the verb collection of synonyms of all verbs in S1;
– …
• Expansion of stoplist (STOPk)
– STOP0: 15 stop-verbs manually selected from the high-frequent verbs in the test corpus (e.g., suggest, indicate, and also the high-frequent antonyms of the core word);
– STOP1: the verb collection of synonyms of all verbs in STOP0;
– …
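The iterative expansion can be sketched as repeated synonym lookup. The sketch below makes several assumptions that the slides do not state: the synonym table is a tiny made-up dictionary rather than real WordNet data, the expansion is cumulative (each level keeps the verbs of the previous one), and the starting set called S0 here is the direct synonym list of the core word. The same hypothetical function would serve for STOPk, starting from the manually chosen STOP0.

```python
# Toy synonym table for illustration only; NOT actual WordNet content.
TOY_SYNONYMS = {
    "inhibit": {"suppress", "subdue", "conquer", "curb"},
    "suppress": {"repress"},
    "curb": {"restrain"},
    "restrain": {"limit"},
}


def expand(seed, synonyms, levels):
    """Add the synonyms of every verb in the set, `levels` times over."""
    current = set(seed)
    for _ in range(levels):
        current |= {s for verb in current for s in synonyms.get(verb, ())}
    return current


s0 = TOY_SYNONYMS["inhibit"]           # direct synonyms of the core word
s1 = expand(s0, TOY_SYNONYMS, levels=1)  # synonyms of all synonyms
s2 = expand(s0, TOY_SYNONYMS, levels=2)  # synonyms of all verbs in S1

print(sorted(s1))
print(sorted(s2))
```

Each extra level can only grow the set, which is why the amount of expansion has to be traded off against precision.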
Method Description II
• Verb list from the corpus (Vj): verbs around the keywords in a local context with a searching window size of j are collected.
• Synonym candidate list (Sg): if a verb is in Vj and also in Si, but not in STOPk, then it is added to Sg.
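The selection rule for Sg amounts to a set intersection followed by a set difference. The sketch below shows that rule on made-up inputs; the function name and the three example sets are hypothetical and are not taken from the experiment.

```python
def candidate_list(v_j, s_i, stop_k):
    """Keep verbs that are in Vj and in Si but not in STOPk."""
    return (set(v_j) & set(s_i)) - set(stop_k)


# Hypothetical inputs, chosen only to exercise each branch of the rule.
v_j = {"perform", "prevent", "block", "expose", "suggest"}     # from the corpus
s_i = {"suppress", "block", "prevent", "limit", "suggest"}     # expanded synsets
stop_k = {"suggest", "indicate"}                                # expanded stoplist

s_g = candidate_list(v_j, s_i, stop_k)
print(sorted(s_g))
```

Here "perform" and "expose" are dropped because they are not in Si, "suppress" and "limit" because they were not observed near a keyword, and "suggest" because it is a stop-verb.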
Evaluation
• Gold standard list (SG)
– A manually created synonym list, extracted from the test corpus.
– Consists of the 10 most frequently occurring verbs: 3 come directly from the WordNet synset of "inhibit"; the remaining 7 come from its hypernym set or the expanded list of its synonyms.
• Recall & Precision
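The formulas for recall and precision did not survive on this slide. A plausible reading, assuming the standard set-based definitions, is recall = |Sg ∩ SG| / |SG| and precision = |Sg ∩ SG| / |Sg|. The sketch below applies those definitions to made-up lists; neither list reflects the actual gold standard or the reported results.

```python
def recall_precision(candidates, gold):
    """Standard set-based recall and precision of candidates against gold."""
    candidates, gold = set(candidates), set(gold)
    hits = len(candidates & gold)
    recall = hits / len(gold) if gold else 0.0
    precision = hits / len(candidates) if candidates else 0.0
    return recall, precision


# Hypothetical lists for illustration only.
gold = {"block", "reduce", "suppress", "prevent"}
candidates = {"block", "suppress", "prevent", "expose", "limit"}

recall, precision = recall_precision(candidates, gold)
print(f"recall = {recall:.2f}, precision = {precision:.2f}")
```

With these toy lists, three of the four gold verbs are found among five candidates.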
Result
• 60% recall of SG <=> 93.05% of occurrences in the test corpus
Conclusions and Future Work
• Conclusions
– English sublanguage of MEDLINE abstracts;
– The core word and its keywords were high-frequent;
– Multiword verb structures were not yet considered;
– Balance between recall and precision: expansion of Si and STOPk should be limited.
• Future work
– Consideration of other WordNet information besides synsets;
– Automatic creation of stoplists;
– Extraction of multiword verb structures;
– Utilization of syntactic information.
Thanks!
Looking forward to your questions!
Possible Errors
• POS tagging errors: adjectives <=> past participles
• Errors in the manual selection of stop-verbs
Question or Hope
Can WordNet provide access to multiword expressions?