WordNet, WSD

WordNet
• What is WordNet?
• Miller 95: “WordNet is an online lexical database designed for use under program control. English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms, each representing a lexicalized concept. Semantic relations link the synonym sets.”

WordNet
• Go to the main WordNet site: http://wordnet.princeton.edu/
• Open the wordnet folder on pongo: ~/dropbox/570/wordnet/dict

WordNet Vocabulary
• See glossary at: http://wordnet.princeton.edu/gloss
• synset: A synonym set; a set of words that are interchangeable in some context
• lemma: lower case ASCII text of a word as found in the WordNet database index files
• lexical pointer: A lexical pointer indicates a relation between words in synsets

Navigating WordNet files
• data.* files – the actual network files (synsets)
• index.* files – contain lower case instances of all words in WordNet, with pointers to the synset entries in the network

WordNet data file
• Example entries:
  00045430 04 n 01 performance 3 003 @ 00033580 n 0000 ~ 00045680 n 0000 ~ 00045874 n 0000 | any recognized accomplishment; "they admired his performance under stress"
  00045680 04 n 01 overachievement 0 003 @ 00045430 n 0000 + 02537922 v 0101 ! 00045874 n 0101 | better than expected performance (better than might have been predicted from intelligence tests)
• Fields in the first entry: synset file offset (00045430), file number (04), synset type (n), # words in synset (01), word (performance), # pointers to other synsets (003)
• Each pointer, e.g. "@ 00033580 n 0000": type of pointer (@), synset file offset of the target (00033580), POS (n)
• See: wndb
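The field layout of a data file entry can be read mechanically. The following is a minimal Python sketch, assuming the layout documented in wndb: a two-digit hexadecimal word count, a three-digit decimal pointer count, and four tokens per pointer. The function name `parse_data_line` and the dictionary keys are illustrative choices, not part of WordNet, and any verb frame fields after the pointers are ignored.

```python
def parse_data_line(line):
    """Split one line of a WordNet data.* file into its fields (sketch)."""
    fields, _, gloss = line.partition("|")
    tokens = fields.split()

    offset, lex_filenum, ss_type = tokens[0], tokens[1], tokens[2]
    w_cnt = int(tokens[3], 16)  # word count is written in hexadecimal

    i = 4
    words = []
    for _ in range(w_cnt):
        # each word is followed by a one-digit hexadecimal lex_id
        words.append((tokens[i], int(tokens[i + 1], 16)))
        i += 2

    p_cnt = int(tokens[i])  # pointer count is written in decimal
    i += 1
    pointers = []
    for _ in range(p_cnt):
        symbol, target_offset, target_pos, source_target = tokens[i:i + 4]
        pointers.append({
            "symbol": symbol,
            "target_offset": target_offset,
            "target_pos": target_pos,
            "source_target": source_target,
        })
        i += 4

    return {
        "offset": offset,
        "lex_filenum": lex_filenum,
        "ss_type": ss_type,
        "words": words,
        "pointers": pointers,
        "gloss": gloss.strip(),
    }


# The "performance" entry from the slide.
line = ('00045430 04 n 01 performance 3 003 @ 00033580 n 0000 '
        '~ 00045680 n 0000 ~ 00045874 n 0000 | any recognized accomplishment; '
        '"they admired his performance under stress"')
synset = parse_data_line(line)
print(synset["words"], [p["symbol"] for p in synset["pointers"]])
```

Since the synset file offset is a byte offset into the data file, the same parser can be applied to a line obtained by seeking to that offset and reading one line.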

Pointer symbols
• For nouns:
  !   Antonym
  @   Hypernym
  ~   Hyponym
  #m  Member holonym
  #s  Substance holonym
  #p  Part holonym
  %m  Member meronym
  %s  Substance meronym
  %p  Part meronym
  =   Attribute
  +   Derivationally related form
• See: wninput
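For programmatic use, the noun pointer symbols can be kept in a small lookup table. This is a sketch covering only the symbols listed on this slide; wninput documents further symbols that are not included here. The names `NOUN_POINTER_SYMBOLS` and `relation_name` are illustrative.

```python
# Noun pointer symbols as listed on the slide (not the complete wninput set).
NOUN_POINTER_SYMBOLS = {
    "!": "antonym",
    "@": "hypernym",
    "~": "hyponym",
    "#m": "member holonym",
    "#s": "substance holonym",
    "#p": "part holonym",
    "%m": "member meronym",
    "%s": "substance meronym",
    "%p": "part meronym",
    "=": "attribute",
    "+": "derivationally related form",
}


def relation_name(symbol):
    """Return a readable relation name, or flag a symbol not in the table."""
    return NOUN_POINTER_SYMBOLS.get(symbol, "unknown (" + symbol + ")")


# Pointer symbols from the "overachievement" data file entry.
print([relation_name(s) for s in ["@", "+", "!"]])
```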

WordNet index file
• Example entry:
  abomination n 3 2 @ + 3 0 09613960 07401317 00734041
• Fields: lemma (word) (abomination), POS (n), # synsets (3), # pointers (2, followed by the pointer symbols @ and +), synset file offsets (09613960 07401317 00734041)
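An index entry can be split the same way. The sketch below assumes the wndb layout for index files: lemma, POS, synset count, pointer count, that many pointer symbols, a sense count, a tagged-sense count, then one offset per synset. The function name `parse_index_line` and the dictionary keys are illustrative.

```python
def parse_index_line(line):
    """Split one line of a WordNet index.* file into its fields (sketch)."""
    tokens = line.split()
    lemma, pos = tokens[0], tokens[1]
    synset_cnt = int(tokens[2])
    p_cnt = int(tokens[3])

    # p_cnt pointer symbols follow the count
    ptr_symbols = tokens[4:4 + p_cnt]
    i = 4 + p_cnt

    sense_cnt = int(tokens[i])
    tagsense_cnt = int(tokens[i + 1])

    # one synset file offset per synset the lemma belongs to
    offsets = tokens[i + 2:i + 2 + synset_cnt]

    return {
        "lemma": lemma,
        "pos": pos,
        "synset_cnt": synset_cnt,
        "ptr_symbols": ptr_symbols,
        "sense_cnt": sense_cnt,
        "tagsense_cnt": tagsense_cnt,
        "offsets": offsets,
    }


entry = parse_index_line("abomination n 3 2 @ + 3 0 09613960 07401317 00734041")
print(entry["lemma"], entry["offsets"])
```

Each offset in the result identifies a synset entry in the data file for the same part of speech.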

WordNet tools
• Many, many tools
• General documentation: http://wordnet.princeton.edu/doc
• Online query and lookup: http://wordnet.princeton.edu/perl/webwn
• APIs and tools: http://wordnet.princeton.edu/links
• WordNet::Similarity: http://wn-similarity.sourceforge.net/
• WordNet::Similarity web interface: http://marimba.d.umn.edu/cgi-bin/similarity.cgi

WordNet and WSD
• Mihalcea 2002 describes a system to sense-tag text using WordNet (and related tools and resources)

Mihalcea 2002
• Some tools and resources described:
  – Senseval
    • http://www.senseval.org/
    • Evaluation exercises for Word Sense Disambiguation
    • Senseval-1 through Senseval-3 held in the last several years, with workshops at ACL
    • Senseval-4 coming up
    • Data and materials from Senseval-3 can be downloaded
    • Some useful materials for multiple languages
    • Materials and test data for English, Italian, Basque, Catalan, Chinese, Romanian, and Spanish

Mihalcea 2002
• Some tools and resources described:
  – SemCor
    • Sense-tagged Brown corpus
    • Created at Princeton
    • Used for training WSD systems
    • Can be downloaded from Mihalcea’s web site: http://www.cs.unt.edu/~rada/downloads.html
    • We’re also planning on installing it on Pongo

McCarthy et al 2004
• Task: find the predominant word senses in untagged text
• Unlike Mihalcea 2002, did not rely on a supervised method using SemCor
• Built a thesaurus from raw text and WordNet
• Intuition: the predominant sense of a word is affected by genre, domain, or text type, so it is better determined from context in an untagged corpus
• Rather than relying on SemCor’s 250,000 words, where the word senses are rather limited

McCarthy et al
• Thesaurus development relies on dependencies between “neighbors”
• Look at distributional similarities between a word and its neighbors
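The slides do not spell out how neighbor similarities become a sense ranking. The sketch below assumes a prevalence-style score: each neighbor contributes its distributional similarity to the target word, shared among the word's senses in proportion to a WordNet-based similarity between sense and neighbor. The function name, the sense labels, and all numbers are invented toy values for illustration, not results from the paper, and `sense_sim` stands in for whatever WordNet similarity measure is chosen.

```python
def prevalence_scores(senses, neighbors, sense_sim):
    """Rank senses of a word from its distributional neighbors (sketch).

    senses:    list of sense labels for the target word
    neighbors: dict mapping neighbor word -> distributional similarity
    sense_sim: function (sense, neighbor) -> non-negative similarity
    """
    scores = {s: 0.0 for s in senses}
    for neighbor, dss in neighbors.items():
        total = sum(sense_sim(s, neighbor) for s in senses)
        if total == 0:
            continue  # this neighbor says nothing about any sense
        for s in senses:
            # the neighbor's weight is shared among senses by normalized similarity
            scores[s] += dss * sense_sim(s, neighbor) / total
    return scores


# Toy data (hypothetical): two senses of "star" and three neighbors.
TOY_SIM = {
    ("star/celestial", "planet"): 0.9, ("star/celebrity", "planet"): 0.1,
    ("star/celestial", "actor"): 0.1, ("star/celebrity", "actor"): 0.8,
    ("star/celestial", "galaxy"): 0.8, ("star/celebrity", "galaxy"): 0.2,
}
toy_neighbors = {"planet": 0.30, "actor": 0.20, "galaxy": 0.25}

scores = prevalence_scores(
    ["star/celestial", "star/celebrity"],
    toy_neighbors,
    lambda sense, neighbor: TOY_SIM.get((sense, neighbor), 0.0),
)
predominant = max(scores, key=scores.get)
print(predominant, scores)
```

With a different neighbor list, for example one drawn from a film-review corpus, the same code could rank the other sense first, which is the point of deriving the ranking from the corpus at hand.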

McCarthy et al
• Experimented with several similarity measures available in WordNet::Similarity
• First experiment used SemCor to see how well the unsupervised system worked
• 2595 polysemous nouns in SemCor

McCarthy et al
• Experiment #2 against SENSEVAL-2 English All Words Data
• Comparison between the precision and recall for SemCor vs. their automatic data (and the SENSEVAL ceiling)
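For reference, precision and recall in this setting can be computed as in the sketch below, which assumes the usual Senseval-style definitions: precision is the fraction of attempted instances tagged correctly, and recall is the fraction of all gold instances tagged correctly. The instance identifiers and sense labels are made-up toy values, not data from the experiment.

```python
def precision_recall(gold, predicted):
    """Score sense predictions against gold tags (sketch).

    gold:      dict mapping instance id -> correct sense
    predicted: dict mapping instance id -> predicted sense
               (instances the system skipped are simply absent)
    """
    attempted = [k for k in predicted if k in gold]
    correct = sum(1 for k in attempted if predicted[k] == gold[k])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy example: four gold instances, three attempted, two correct.
gold = {"w1": "s1", "w2": "s2", "w3": "s1", "w4": "s3"}
predicted = {"w1": "s1", "w2": "s1", "w3": "s1"}
precision, recall = precision_recall(gold, predicted)
print(precision, recall)
```

A system that always outputs the first-ranked sense for every word it covers, and skips words it has no ranking for, will show precision higher than recall under these definitions.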

McCarthy et al
• Some experiments with domain-specific corpora gave these results: