CS 430 INFO 430 Information Retrieval Lecture 25

CS 430 / INFO 430 Information Retrieval
Lecture 25
Thesauruses and Cluster Analysis 1

Course Administration
CS 490 and CS 790 Independent Research Projects
• Web Research Infrastructure: build a system to bring complete crawls of the Web from the Internet Archive to the Cornell Theory Center and make them available for researchers through a standard API. (Continues planning work carried out this semester.) [Strong programming skills required.]
• There will not be an independent research project in information retrieval.

Course Administration
Final Examination
• The final examination is on Monday, December 13, between 12:00 and 1:30.
• A make-up examination will be available on another date, which has not yet been chosen. The proposed date is December 9. If you would like to take the make-up examination, send an email message to Anat Nidar-Levi (anat@cs.cornell.edu).

Lexicon and Thesaurus
A lexicon contains information about words, their morphological variants, and their grammatical usage.
A thesaurus relates words by meaning:
  ship, vessel, sail; craft, navy, marine, fleet, flotilla
  book, writing, work, volume, tome, tract, codex
  search, discovery, detection, find, revelation
(From Roget's Thesaurus, 1911)

Thesaurus in Information Retrieval
Use of a thesaurus in indexing (precoordination)
A. Manual
Used to guide a human indexer to assign standard terms and associations.
computer-aided instruction
  see also education
  UF teaching machines
  BT educational computing
  TT computer applications
  RT education
  RT teaching
From: INSPEC Thesaurus

Thesaurus in Information Retrieval
Use of a thesaurus in indexing (precoordination)
B. Automatic
Divide terms into thesaurus classes. Replace similar terms by a thesaurus class.
408  dislocation, junction, minority-carrier, n-p-n, p-n-p, point-contact, recombine, transition, unijunction
409  blast-cooled, heat-flow, heat-transfer
410  anneal, strain
From: Salton and McGill
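The replacement step in automatic indexing can be sketched as a simple lookup. This is a minimal illustration, not the Salton and McGill implementation: the class membership below follows one reading of the flattened slide layout, and the names `THESAURUS_CLASSES` and `index_terms` are hypothetical.

```python
# Thesaurus classes keyed by class number.
# Membership is an assumption based on one reading of the slide layout.
THESAURUS_CLASSES = {
    408: ["dislocation", "junction", "minority-carrier", "n-p-n", "p-n-p",
          "point-contact", "recombine", "transition", "unijunction"],
    409: ["blast-cooled", "heat-flow", "heat-transfer"],
    410: ["anneal", "strain"],
}

# Invert the mapping so that each term points to its class number.
TERM_TO_CLASS = {
    term: class_id
    for class_id, terms in THESAURUS_CLASSES.items()
    for term in terms
}


def index_terms(terms):
    """Replace each term by its thesaurus class number.

    Terms that belong to no class are kept unchanged.
    """
    return [TERM_TO_CLASS.get(term.lower(), term) for term in terms]


indexed = index_terms(["Heat-flow", "junction", "laser"])
```

With this lookup, two documents that use "heat-flow" and "heat-transfer" are indexed under the same class, 409, which is the point of replacing similar terms by a class.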

Desirable Properties for Information Retrieval
• Thesaurus is specific to a subject area. Contains only terms of interest for identification within that subject area.
• Ambiguous terms are coded only for the senses important for that field.
• Target is that each thesaurus class should include terms of moderate frequency. Ideally the classes should have similar frequency.

Alexandria Thesaurus: Example
canals
A feature type category for places such as the Erie Canal.
Used for: The category canals is used instead of any of the following.
  canal bends
  canalized streams
  ditch mouths
  ditches
  drainage canals
  drainage ditches
Broader Terms: Canals is a sub-type of hydrographic structures.
... more ...

Alexandria Thesaurus: Example (continued)
canals (continued)
Related Terms: The following is a list of other categories related to canals (nonhierarchical relationships).
  channels
  locks
  transportation features
  tunnels
Scope Note: Manmade waterway used by watercraft or for drainage, irrigation, mining, or water power. (Definition of canals.)

Art and Architecture Thesaurus
• Controlled vocabulary for describing and retrieving information: fine art, architecture, decorative art, and material culture.
• Almost 120,000 terms for objects, textual materials, images, architecture and culture from all periods and all cultures.
• Used by archives, museums, and libraries to describe items in their collections.
• Used to search for materials.
• Used by computer programs, for information retrieval, and natural language processing.
A project of the J. Paul Getty Trust

Art and Architecture Thesaurus
Provides the terminology for objects, and the vocabulary necessary to describe them, such as style, period, shape, color, construction, or use, and scholarly concepts, such as theories, or criticism.
Concept: a cluster of terms, one of which is established as the preferred term, or descriptor.
Categories: associated concepts, physical attributes, styles and periods, agents, activities, materials, and objects.

Art and Architecture Thesaurus: Sample Record
ID: 198841
Descriptor: rhyta
Note: Refers to vessels from Ancient Greece, eastern Europe, or the Middle East that typically have a closed form with two openings, one at the top for filling and one at the base so that liquid could stream out. They are often in the shape of a horn or an animal's head, and were typically used as a drinking cup or for pouring wine into another vessel.
Hierarchy:
  Containers [TQ]
    <containers by function or context>
      <culinary containers>
        <containers for serving and consuming food>

Art and Architecture Thesaurus: Sample Record (continued)
Terms:
  rhyta
  rhyton (alternate, singular)
  protomai
  protome
  rhea
  rheons
Related concepts:
  stirrup cups
  sturzbechers
  drinking vessels
  ceremonial vessels

Automatic Thesaurus Construction
Approach
• Select a subject domain.
• Choose a corpus of documents that cover the domain.
• Create vocabulary by extracting terms, normalization, precoordination of phrases, etc.
• Devise a measure of similarity between terms and thesaurus classes.
• Cluster terms into thesaurus classes, using complete linkage or another clustering method that generates compact clusters.

Normalization of vocabulary
Normalization rules map variant forms into base expressions. Typical normalization rules for manual thesaurus construction are:
(a) Nouns only, or nouns and noun phrases.
(b) Singular nouns only.
(c) Spelling (e.g., U.S.).
(d) Capitalization, punctuation (e.g., hyphens), initials (e.g., IBM), abbreviations (e.g., Mr.).
Usually, many possible decisions can be made, but they should be followed consistently.
Which of these can be carried out automatically with reasonable accuracy?
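As a rough illustration of which rules are easy to automate, the sketch below applies rules (b) to (d) with plain string operations. It is a minimal sketch under stated assumptions: the function name `normalize` is hypothetical, the singularization rule is deliberately naive, and rule (a) is left out because it would need a part-of-speech tagger.

```python
import re


def normalize(term: str) -> str:
    """Map a variant form to a base expression using a few simple rules."""
    t = term.strip().lower()       # rule (d): ignore capitalization
    t = t.replace(".", "")         # rules (c), (d): "U.S." -> "us", "Mr." -> "mr"
    t = re.sub(r"[-\s]+", " ", t)  # rule (d): hyphens and runs of spaces -> one space
    # Rule (b): naive singularization, an assumption for illustration only.
    # It will mishandle irregular plurals and words that merely end in "s".
    if t.endswith("ies") and len(t) > 4:
        t = t[:-3] + "y"
    elif t.endswith("s") and not t.endswith("ss") and len(t) > 3:
        t = t[:-1]
    return t


examples = ["U.S.", "Heat-Flow", "Canals", "libraries", "glass"]
normalized = [normalize(e) for e in examples]
```

Capitalization and punctuation are handled reliably by rules like these. Singularization and the restriction to nouns are where automatic accuracy starts to drop, which is the question the slide raises.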

Terms to include
• Only terms that are likely to be of interest for content identification.
• High-frequency terms should be ignored (large stop-list).
• Ambiguous terms should be coded for the senses likely to be important in the document collection.
• Each thesaurus class should have approximately the same frequency of occurrence.
• Terms of negative discrimination should be eliminated.
after Salton and McGill

Discriminant value
The discriminant value of term k is the degree to which the term is able to discriminate between the documents of a collection:
discriminant value = (average document similarity without term k) - (average document similarity with term k)
Good discriminators decrease the average document similarity: the average is lower when the term is included, so the discriminant value is positive.
Note that this definition uses the document similarity.

Incidence array
Each document D1 to D4 is represented by the terms it contains (1 = the term occurs in the document). The last column is the number of distinct terms in the document.
      alpha  bravo  charlie  delta  echo  foxtrot  golf   terms
D1      1      1       1       1     1       1      1       7
D2      1      0       0       1     0       0      1       3
D3      0      1       1       0     1       1      0       4
D4      1      0       0       1     0       1      1       4

Document similarity matrix
       D1     D2     D3     D4
D1           0.65   0.76   0.76
D2    0.65          0.00   0.87
D3    0.76   0.00          0.25
D4    0.76   0.87   0.25
Average similarity = 0.55
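The matrix can be reproduced from the incidence array. The slide does not name the similarity measure; the sketch below assumes cosine similarity on the binary incidence rows, which gives the values shown. The names `INCIDENCE`, `cosine`, and `average_similarity` are illustrative.

```python
import math
from itertools import combinations

TERMS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]

# One row per document, one column per term, in the order of TERMS.
INCIDENCE = {
    "D1": [1, 1, 1, 1, 1, 1, 1],
    "D2": [1, 0, 0, 1, 0, 0, 1],
    "D3": [0, 1, 1, 0, 1, 1, 0],
    "D4": [1, 0, 0, 1, 0, 1, 1],
}


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_similarity(rows):
    """Mean cosine similarity over all distinct pairs of rows."""
    pairs = list(combinations(rows, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Similarity for each distinct pair of documents.
sims = {
    (a, b): cosine(INCIDENCE[a], INCIDENCE[b])
    for a, b in combinations(INCIDENCE, 2)
}

for (a, b), s in sims.items():
    print(f"{a}-{b}: {s:.2f}")
print(f"average: {average_similarity(list(INCIDENCE.values())):.2f}")
```

For example, D1 and D2 share three terms and contain 7 and 3 terms respectively, so their similarity is 3 / sqrt(7 × 3), about 0.65.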

Discriminant value
Repeat the calculation, excluding one term at a time.
without     average similarity     DV
alpha              0.53           -0.02
bravo              0.56           +0.01
charlie            0.56           +0.01
delta              0.53           -0.02
echo               0.56           +0.01
foxtrot            0.52           -0.03
golf               0.53           -0.02
Average similarity with all terms = 0.55
alpha, delta, foxtrot, golf are bad discriminators.
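The leave-one-term-out calculation can be sketched as follows. As before, cosine similarity on binary incidence rows is an assumption, and the function names are illustrative. The slide's DV column appears to be the difference of the rounded averages, so unrounded values from this sketch can differ in the second decimal place; the signs agree.

```python
import math
from itertools import combinations

TERMS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]

# Rows are documents D1..D4; columns follow the order of TERMS.
DOCS = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 1, 1],
]


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_similarity(docs):
    """Mean cosine similarity over all distinct document pairs."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def averages_without_each_term(docs, terms):
    """Average document similarity with each term's column removed in turn."""
    return {
        term: average_similarity([row[:k] + row[k + 1:] for row in docs])
        for k, term in enumerate(terms)
    }


def discriminant_values(docs, terms):
    """DV(k) = (average similarity without term k) - (average with term k)."""
    with_all = average_similarity(docs)
    without = averages_without_each_term(docs, terms)
    return {term: without[term] - with_all for term in terms}


without = averages_without_each_term(DOCS, TERMS)
dv = discriminant_values(DOCS, TERMS)
good = [t for t in TERMS if dv[t] > 0]
bad = [t for t in TERMS if dv[t] < 0]
```

A term such as bravo occurs in only some of the documents that otherwise look alike, so removing it makes the collection more uniform and the average similarity rises; that is what a positive discriminant value records.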

Phrase construction
In a precoordinated thesaurus, term classes may contain phrases.
Informal definitions:
• pair-frequency (i, j) is the frequency with which a given pair of words occurs in context (e.g., in succession within a sentence).
• A phrase is a pair of words, i and j, that occur in context with a higher frequency than would be expected from their overall frequency.
$$\text{cohesion}(i, j) = \frac{\text{observed pair-frequency}(i, j)}{\text{expected pair-frequency if } i, j \text{ independent}}$$

Phrase construction: simple case
Example: corpus of n terms
• $p_{ij}$ is the observed frequency with which a given pair of terms occurs in succession.
• $f_i$ is the number of occurrences of term i in the corpus.
There are n-1 pairs. If the terms are independent, the probability that a given pair begins with term i and ends with term j is $(f_i/n)(f_j/n)$, so the expected pair-frequency is $(n-1) f_i f_j / n^2$.
$$\text{cohesion}(i, j) = \frac{n^2 \, p_{ij}}{(n-1) \, f_i \, f_j}$$

Phrase construction
Salton and McGill algorithm
1. Compute pair-frequency for all terms.
2. Reject all pairs that fall below a certain frequency threshold.
3. Calculate cohesion values.
4. If cohesion is above a threshold value, consider the word pair as a phrase.
Automatic phrase construction by statistical methods is rarely used in practice. There is promising research on phrase identification using methods of computational linguistics.
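The four steps can be sketched for the simple case in which "context" means two terms in succession. This is an illustration under stated assumptions, not the original implementation: the function name `find_phrases`, the two threshold values, and the toy corpus are all hypothetical, and the corpus is treated as one token sequence with no sentence boundaries.

```python
from collections import Counter


def find_phrases(tokens, min_pair_freq=2, min_cohesion=2.0):
    """Return {(i, j): cohesion} for adjacent word pairs that pass both thresholds.

    cohesion(i, j) = n^2 * p_ij / ((n - 1) * f_i * f_j)
    where n is the corpus length, p_ij the observed count of the pair (i, j)
    in succession, and f_i, f_j the term counts.
    """
    n = len(tokens)
    if n < 2:
        return {}
    term_freq = Counter(tokens)                   # f_i
    pair_freq = Counter(zip(tokens, tokens[1:]))  # step 1: p_ij over the n-1 pairs
    phrases = {}
    for (i, j), p in pair_freq.items():
        if p < min_pair_freq:                     # step 2: frequency threshold
            continue
        cohesion = (n * n * p) / ((n - 1) * term_freq[i] * term_freq[j])  # step 3
        if cohesion > min_cohesion:               # step 4: cohesion threshold
            phrases[(i, j)] = cohesion
    return phrases


corpus = ("information retrieval systems use information retrieval methods "
          "and information retrieval models").split()
phrases = find_phrases(corpus)
```

In this toy corpus n = 11, "information" and "retrieval" each occur 3 times, and the pair occurs 3 times in succession, so its cohesion is 121 × 3 / (10 × 3 × 3), about 4.03. Every other pair occurs only once and is rejected at step 2.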

Similarities
The vocabulary consists of a set of elements, each of which can be a single term or a phrase.
The next step is to calculate a measure of similarity between elements.
One measure of similarity is the number of documents that have terms j and k in common:
$$S(t_j, t_k) = \sum_{i=1}^{n} t_{ij} \, t_{ik}$$
where $t_{ij} = 1$ if document i contains term j and 0 otherwise.

Similarities: Incidence array
The row n gives the number of documents that contain each term.
      alpha  bravo  charlie  delta  echo  foxtrot  golf
D1      1      1       1       1     1       1      1
D2      1      0       0       1     0       0      1
D3      0      1       1       0     1       1      0
D4      1      0       0       1     0       1      1
n       3      2       2       3     2       3      3

Term similarity matrix
          alpha  bravo  charlie  delta  echo  foxtrot  golf
alpha              1       1       3     1       2      3
bravo       1              2       1     2       2      1
charlie     1      2               1     2       2      1
delta       3      1       1             1       2      3
echo        1      2       2       1             2      1
foxtrot     2      2       2       2     2              2
golf        3      1       1       3     1       2
Using count of documents that have two terms in common
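The co-occurrence counts follow directly from the incidence array: entry (j, k) is the sum over documents of the product of columns j and k. A minimal sketch, with illustrative names:

```python
TERMS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]

# Rows are documents D1..D4; columns follow the order of TERMS.
INCIDENCE = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 1, 1],
]


def term_similarity(incidence):
    """S[j][k] = number of documents that contain both term j and term k."""
    n_terms = len(incidence[0])
    return [
        [sum(row[j] * row[k] for row in incidence) for k in range(n_terms)]
        for j in range(n_terms)
    ]


S = term_similarity(INCIDENCE)

# The diagonal S[j][j] is the number of documents containing term j.
doc_counts = [S[j][j] for j in range(len(TERMS))]
```

The matrix is symmetric, and its diagonal (left blank on the slide) is the row n of the incidence array.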

Similarity measures
Improved similarity measures can be generated by:
• Using term frequency matrix instead of incidence matrix
• Weighting terms by frequency:
cosine measure
$$S(t_j, t_k) = \frac{\sum_{i=1}^{n} t_{ij} \, t_{ik}}{|t_j| \, |t_k|}$$
dice measure
$$S(t_j, t_k) = \frac{\sum_{i=1}^{n} t_{ij} \, t_{ik}}{\sum_{i=1}^{n} t_{ij} + \sum_{i=1}^{n} t_{ik}}$$

Term similarity matrix
          alpha  bravo  charlie  delta  echo  foxtrot  golf
alpha             0.2     0.2     0.5   0.2    0.33    0.5
bravo      0.2            0.5     0.2   0.5    0.4     0.2
charlie    0.2    0.5             0.2   0.5    0.4     0.2
delta      0.5    0.2     0.2           0.2    0.33    0.5
echo       0.2    0.5     0.5     0.2          0.4     0.2
foxtrot    0.33   0.4     0.4     0.33  0.4            0.33
golf       0.5    0.2     0.2     0.5   0.2    0.33
Using incidence matrix and dice weighting
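Both weighted measures can be computed from the columns of the incidence array. The sketch below implements the dice measure as it is written in this lecture, that is, without the factor of 2 in the numerator that the usual Dice coefficient carries; this form gives the values in the matrix above. Function names are illustrative.

```python
import math

TERMS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]

# Rows are documents D1..D4; columns follow the order of TERMS.
INCIDENCE = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 1, 1],
]


def column(incidence, j):
    """Vector of values for term j across all documents."""
    return [row[j] for row in incidence]


def dice_as_in_lecture(incidence, j, k):
    """sum_i t_ij * t_ik / (sum_i t_ij + sum_i t_ik), the form used in the lecture."""
    tj, tk = column(incidence, j), column(incidence, k)
    common = sum(a * b for a, b in zip(tj, tk))
    total = sum(tj) + sum(tk)
    return common / total if total else 0.0


def cosine_terms(incidence, j, k):
    """sum_i t_ij * t_ik / (|t_j| * |t_k|)."""
    tj, tk = column(incidence, j), column(incidence, k)
    common = sum(a * b for a, b in zip(tj, tk))
    norm = math.sqrt(sum(a * a for a in tj)) * math.sqrt(sum(b * b for b in tk))
    return common / norm if norm else 0.0


alpha, bravo, delta, foxtrot = 0, 1, 3, 5
alpha_delta_dice = dice_as_in_lecture(INCIDENCE, alpha, delta)
alpha_delta_cosine = cosine_terms(INCIDENCE, alpha, delta)
```

Alpha and delta occur in exactly the same three documents, so this dice form gives 3 / (3 + 3) = 0.5, while the cosine measure gives its maximum value for the same pair.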

Clustering terms to form concepts
The final stage is to group similar terms together into concepts. This is done by cluster analysis.
Cluster analysis is the topic of the next lecture.