
Outline
1. Introduction
   1) Definition of a thesaurus
   2) Thesaurus structure
   3) INSPEC thesaurus
   4) Explanation of thesaurus cross-reference entries
2. Features of thesauri
   1) Coordination level
   2) Term relationships
   3) Number of entries for each term
   4) Specificity of vocabulary
   5) Control on term frequency of class members
   6) Normalization of vocabulary
3. Thesaurus Construction
   1) Manual Thesaurus Construction
   2) Automatic Thesaurus Construction
      – From a collection of document items
      – By merging existing thesauri
      – User-generated thesaurus
      3.2.1 Thesaurus Construction from Text
         – Construction of vocabulary
      3.2.2 Merging existing thesauri
4. Conclusion

1.3 INSPEC thesaurus
1. This thesaurus is designed for the INSPEC domain, which covers physics, electrical engineering, electronics, as well as computers and control.
2. The thesaurus is logically organized as a set of hierarchies.
3. It includes an alphabetical listing of thesaural terms.
4. Each hierarchy is built from a root term representing a high-level concept in the domain.

A short extract from the 1979 INSPEC thesaurus

Cesium
    USE caesium
Computer-aided instruction
    see also education
    UF teaching machines
    BT educational computing
    TT computer applications
    RT education
       teaching
    CC C7810C
    FC c7810cf

1.4 Explanation of thesaurus cross-reference entries
• USE: the entry term is replaced by the preferred term that follows.
• see also: leads to cross-referenced thesaural terms.
• NT (narrower term): suggests a more specific thesaural term.
• BT (broader term): provides a more general thesaural term.
• TT (top term): the broadest term, at the top of the BT chain.
• RT (related term): signifies a related term.
• UF (used for): indicates that the entry term is the chosen form from a set of alternatives.
• CC: classification codes.
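As a rough illustration of how these cross-reference codes fit together, the sketch below stores the two entries from the INSPEC extract in a small record type and resolves USE references. The names `Entry`, `thesaurus`, and `preferred` are hypothetical and chosen only for this sketch, and the bare "caesium" entry is an assumption added so that the USE link has a target.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entry:
    """One thesaurus entry; field names mirror the reference codes."""
    term: str
    use: Optional[str] = None                          # USE: preferred term
    see_also: List[str] = field(default_factory=list)  # see also
    uf: List[str] = field(default_factory=list)        # UF: replaced alternatives
    bt: List[str] = field(default_factory=list)        # BT: broader terms
    nt: List[str] = field(default_factory=list)        # NT: narrower terms
    tt: Optional[str] = None                           # TT: top term
    rt: List[str] = field(default_factory=list)        # RT: related terms


thesaurus: Dict[str, Entry] = {
    "cesium": Entry("cesium", use="caesium"),
    "caesium": Entry("caesium"),  # assumed entry, not shown in the extract
    "computer-aided instruction": Entry(
        "computer-aided instruction",
        see_also=["education"],
        uf=["teaching machines"],
        bt=["educational computing"],
        tt="computer applications",
        rt=["education", "teaching"],
    ),
}


def preferred(term: str) -> str:
    """Follow USE links until a preferred term is reached."""
    seen = set()
    while term in thesaurus and thesaurus[term].use and term not in seen:
        seen.add(term)
        term = thesaurus[term].use
    return term
```

Calling `preferred("cesium")` follows the USE link to "caesium", while a term that is already preferred is returned unchanged.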

2. Features of thesauri
1) Coordination level
2) Term relationships
3) Number of entries for each term
4) Specificity of vocabulary
5) Control on term frequency of class members
6) Normalization of vocabulary

2.1 Coordination level
1) Coordination level refers to the construction of phrases from individual terms.
2) There are two coordination options: precoordination and postcoordination.
3) A precoordinated thesaurus can contain phrases.
4) The advantage is that the vocabulary is very precise.
5) The disadvantage is that the searcher has to be aware of the phrase construction rules employed.
6) Precoordination is more common in manually constructed thesauri.

7) A postcoordinated thesaurus does not allow phrases. Instead, phrases are constructed while searching.
8) The advantage is that the user need not worry about the exact ordering of the words in a phrase.
9) The disadvantage is that search precision may fall.
10) Automatic thesaurus construction usually implies postcoordination.

Coordination level
• Precoordination: the descriptor concept is treated as a single compound noun, for example "diesel locomotive".
• Postcoordination: two or more existing terms are combined instead, for example diesel engines AND locomotive.
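The difference between the two options can be sketched with a toy inverted index. All document ids and postings below are invented for illustration, and the function names are hypothetical. Document 6 is meant to be one that mentions both single terms without being about diesel locomotives, which is how postcoordination can lose precision.

```python
# Toy inverted indexes: index term -> set of document ids (invented data).
precoordinated_index = {
    "diesel locomotive": {1, 4},
}
postcoordinated_index = {
    "diesel engines": {1, 2, 4, 6},
    "locomotive": {1, 3, 4, 6},
}


def search_precoordinated(phrase):
    """Look the phrase up directly; the searcher must know its exact form."""
    return precoordinated_index.get(phrase, set())


def search_postcoordinated(*terms):
    """Combine single terms at search time with Boolean AND."""
    postings = [postcoordinated_index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()
```

With this data, the precoordinated lookup fails for "locomotive diesel" because the word order is wrong, while the postcoordinated search accepts the terms in any order but also returns document 6.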

2.2 Term relationships
Three categories of term relationships:
(a) Equivalence relationships
(b) Hierarchical relationships
(c) Nonhierarchical relationships

2.2 a) Equivalence relationships
Equivalence relations include both synonymy and quasi-synonymy. Examples of quasi-synonyms: "genetics" and "heredity", which overlap in meaning; "harshness" and "tenderness", which lie on the same property continuum.

2.2 b) Hierarchical relationships
A typical example of a hierarchical relation is genus-species, such as "dog" and "German shepherd".

2.2 c) Nonhierarchical relationships
Nonhierarchical relationships also identify conceptually related terms. There are many examples, including:
• thing-part, such as "bus" and "seat"
• thing-attribute, such as "rose" and "fragrance"

Associative relationships
Associative relationships recognize a connection between descriptors and are generally expressed with the RT reference code, for example:
• A thing and its parts: windows RT houses
• A thing and the operation performed with it: skates RT skating
• A thing and its application: railway construction RT railway
• A thing and its properties: seawater RT corrosion

Wang, Vandendorpe, and Evens (1985) provide an alternative classification of term relationships consisting of:
(1) parts-wholes
(2) collocation relations
(3) paradigmatic relations
(4) taxonomy and synonymy
(5) antonymy relations

(1) Parts-wholes
Parts and wholes include examples such as set-element and count-mass.

(2) Collocation relations
Collocation relates words that frequently co-occur in the same phrase or sentence.

(3) Paradigmatic relations
Paradigmatic relations relate words that have the same semantic core, like "moon" and "lunar", and are somewhat similar to Aitchison and Gilchrist's quasi-synonymy relationship.

(4) Taxonomy and synonymy
Taxonomy and synonymy are self-explanatory and refer to the classical relations between terms.

(5) Antonymy relations

2.3 Number of entries for each term
1. It is in general preferable to have a single entry for each thesaurus term. However, this is seldom achieved due to the presence of homographs, that is, words with multiple meanings.
2. In a manually constructed thesaurus such as INSPEC, this problem is resolved by the use of parenthetical qualifiers, as in the pair of homographs bonds (chemical) and bonds (adhesive).
3. However, this is hard to achieve automatically.

2.4 Specificity of vocabulary
1. The specificity of a vocabulary is a function of the precision associated with its component terms.
2. A highly specific vocabulary is able to express the subject in great depth and detail. This promotes precision in retrieval.
3. The disadvantage is that the size of the vocabulary grows.
4. Also, specific terms tend to change more rapidly than general terms. Therefore, such vocabularies tend to require more regular maintenance.
5. High specificity implies a high coordination level, and the user has to be more concerned with the rules for phrase construction.

2.5 Control on term frequency of class members
1. Salton and McGill have stated that in order to maintain a good match between documents and queries, it is necessary to ensure that terms included in the same thesaurus class have roughly equal frequencies.
2. The total frequency in each class should also be roughly similar.
3. These constraints are imposed to ensure that the probability of a match between a query and a document is the same across classes.
4. Terms within the same class should be equally specific, and the specificity across classes should also be the same.

2.6 Normalization of vocabulary
1. Terms should preferably be expressed as nouns.
2. Noun phrases should avoid acronyms unless they are widely known.
3. Adjectives may be used.
4. There are other rules to direct issues such as the singularity of terms, the ordering of terms within phrases, spelling, capitalization, transliteration, abbreviations, initials, acronyms, and punctuation.
5. The advantage is that variant forms are mapped into base expressions, thereby bringing consistency to the vocabulary.
6. The disadvantage is that, in order to use the thesaurus effectively, the user has to be well aware of the normalization rules used.

3.1 Manual Thesaurus Construction
• Define the boundaries of the subject area
  – Identify central subject areas and peripheral ones
  – Partition the domain into divisions or subareas
• Identify desired characteristics
• Collect terms for each subarea
  – Sources: indexes, encyclopedias, handbooks, textbooks, journals, abstracts, catalogs, existing thesauri or vocabulary systems
  – Including: subject experts and potential users

3.1 Manual Thesaurus Construction (continued)
• Analyze each term for its related vocabulary
  – Including synonyms, broader and narrower terms, definitions, and scope notes
• Organize terms and relationships into a hierarchical structure
• Review and refine for consistency
• Invert the structured thesaurus to produce an alphabetical arrangement of entries
• Test the thesaurus

3.1 Manual Thesaurus Construction (continued)
• Conclusion:
  – Involves a group of individuals and a variety of resources
  – Needs to be maintained to ensure viability and effectiveness
  – Must reflect any changes in the terminology of the area
=> An art and a science

3.2 Automatic Thesaurus Construction
• From document collections
  – Use a collection of documents as the source for thesaurus construction
  – Apply statistical procedures to identify important terms as well as relationships
  – Use computationally simpler methods to identify the more important semantic knowledge

3.2 Automatic Thesaurus Construction (continued)
• Merge existing thesauri
  – Merge two or more thesauri into a single unit
  – The merger should not violate the integrity of any component thesaurus
  – e.g. augment MeSH from SNOMED

3.2 Automatic Thesaurus Construction (continued)
• User-generated thesaurus
  – Uses the term relationships in search strategies
  – Captures knowledge from the user's search
  – e.g. TEGEN (Thesaurus Generating system)
     • The types of Boolean operators between terms
     • The type of query modification
     • User feedback included

3.2.1 Thesaurus Construction from Texts
• Process
  – 1 Construction of vocabulary
     • Normalization
     • Selection of terms
     • Phrase construction
  – 2 Similarity computations
     • Identify the statistical associations between terms
  – 3 Organization of vocabulary
     • Organize the selected vocabulary into a hierarchy

1 Construction of Vocabulary
• Objective: identify the most informative terms (words, phrases)
  – Identify an appropriate document collection, which should be sizable and representative of the subject area
  – Determine the required specificity
  – Normalize the vocabulary
     • Eliminate trivial words by constructing a stoplist
     • Stem the vocabulary
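The two normalization steps can be sketched as follows. This is only a minimal illustration: the stoplist is a tiny invented sample, and `crude_stem` is a hypothetical stand-in that strips a plural "s", whereas a real system would use a proper stemming algorithm.

```python
# Tiny illustrative stoplist; a real one would be much longer.
STOPLIST = {"the", "of", "and", "a", "in"}


def crude_stem(word):
    """Strip a trailing plural 's' (placeholder for a real stemmer)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(text):
    """Lowercase the text, drop stoplist words, and stem what remains."""
    words = [w.lower() for w in text.split()]
    return [crude_stem(w) for w in words if w not in STOPLIST]
```

For example, `normalize("The terms of the documents")` keeps only the two content words, reduced to their singular forms.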

Selection by Frequency of Occurrence
• Each term is placed in one frequency category: high, medium, or low
  – Medium: best for indexing and abstracting
  – Low: minimal impact on retrieval
  – High: too general; negatively impacts search precision
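A minimal sketch of this selection step is shown below. The two cutoffs are assumptions that would have to be tuned per collection, since the slide does not say where the category boundaries lie, and the function name is hypothetical.

```python
from collections import Counter


def frequency_categories(documents, low_cutoff, high_cutoff):
    """Label each word 'low', 'medium', or 'high' by its collection frequency."""
    counts = Counter(word for doc in documents for word in doc.split())
    categories = {}
    for word, n in counts.items():
        if n <= low_cutoff:
            categories[word] = "low"
        elif n >= high_cutoff:
            categories[word] = "high"
        else:
            categories[word] = "medium"
    return categories


docs = ["the cat sat", "the dog sat", "the bird flew", "the cat"]
labels = frequency_categories(docs, low_cutoff=1, high_cutoff=4)
# Medium-frequency words are the candidates for the vocabulary.
selected = sorted(w for w, c in labels.items() if c == "medium")
```

Here "the" lands in the high category, the words that occur once land in the low category, and only "cat" and "sat" are selected.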

Selection by Discrimination Value (DV)
• DV measures the degree to which a term is able to discriminate or distinguish between the documents
• The more discriminating a term, the higher its value as an index term
• A similarity function is used to compute the average inter-document similarity in the collection
• DV(k) = (Average similarity without k) - (Average similarity with k)

Selection by Discrimination Value (DV) (continued)
• Good discriminators are those that decrease the average similarity by their presence; their DV is positive
• Poor discriminators have a negative DV
• Neutral discriminators have no effect on the average similarity
• Terms that are positive discriminators can be included in the vocabulary and the rest rejected
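The DV formula can be sketched directly. The slides leave the similarity function open, so the cosine measure over term-weight dictionaries used here is an assumption, and the three-document collection is invented.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two term -> weight dictionaries."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_similarity(docs):
    """Mean similarity over all pairs of documents."""
    pairs = list(combinations(docs, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def discrimination_value(docs, term):
    """DV(k) = (average similarity without k) - (average similarity with k)."""
    without = [{t: w for t, w in d.items() if t != term} for d in docs]
    return average_similarity(without) - average_similarity(docs)


docs = [
    {"the": 1, "cat": 1},
    {"the": 1, "dog": 1},
    {"the": 1, "bird": 1},
]
dv_the = discrimination_value(docs, "the")  # shared by every document
dv_cat = discrimination_value(docs, "cat")  # specific to one document
```

In this toy collection "the" makes all documents look alike, so its DV is negative, while "cat" sets one document apart and gets a positive DV.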

Selection by Poisson Method
• The Poisson distribution is a discrete random distribution that can be used to model a variety of random phenomena
• Trivial words have a single Poisson distribution
• The distribution of nontrivial words deviates significantly from a Poisson distribution
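The slide does not say how the deviation is measured. One simple possibility, used here purely as an assumption, is to compare the number of documents a word actually appears in with the number a single Poisson model would predict from the word's average rate per document. The function name is hypothetical.

```python
import math


def poisson_deviation(counts_per_document):
    """Ratio of Poisson-expected to observed document frequency.

    counts_per_document holds the word's occurrence count in each document.
    Values near 1 suggest the word fits a single Poisson distribution;
    values well above 1 mean the word clusters in fewer documents than
    the model predicts.
    """
    n_docs = len(counts_per_document)
    observed_df = sum(1 for c in counts_per_document if c > 0)
    if n_docs == 0 or observed_df == 0:
        return 0.0
    rate = sum(counts_per_document) / n_docs
    # Under Poisson, P(count >= 1) = 1 - exp(-rate).
    expected_df = n_docs * (1 - math.exp(-rate))
    return expected_df / observed_df


evenly_spread = poisson_deviation([2, 2, 2, 2])  # behaves like a trivial word
clustered = poisson_deviation([8, 0, 0, 0])      # behaves like a content word
```

Both words occur eight times in total, but the clustered one deviates much more from the Poisson prediction, which marks it as the more informative term under this assumption.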

Phrase Construction
• Phrase construction decreases the frequency of high-frequency terms and increases their value for retrieval
• Salton and McGill procedure: a statistical alternative to syntactic and/or semantic methods for identifying and constructing phrases
  – The component words of a phrase should occur frequently in a common context
  – The component words should represent broad concepts, and their frequency of occurrence should be sufficiently high

Phrase Construction (continued)
– Criteria
  1. Compute pairwise co-occurrence for high-frequency words
  2. If this co-occurrence is lower than a threshold, then do not consider the pair any further
  3. For pairs that qualify, compute the cohesion value
  4. If cohesion is above a second threshold, retain the phrase as a valid vocabulary phrase

Two alternative cohesion formulas:
  cohesion(ti, tj) = co-occurrence-frequency / sqrt(frequency(ti) * frequency(tj))
  cohesion(ti, tj) = size-factor * (co-occurrence-frequency / (total-frequency(ti) * total-frequency(tj)))
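The four criteria can be sketched with the first cohesion formula. Several choices here are assumptions made for illustration: co-occurrence is counted once per document, "high-frequency" is reduced to a minimum count `high_freq`, and the function and parameter names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def candidate_phrases(documents, high_freq, cooccur_threshold, cohesion_threshold):
    """Return {(word_a, word_b): cohesion} for pairs passing both thresholds."""
    freq = Counter()
    cooccur = Counter()
    for doc in documents:
        words = doc.split()
        freq.update(words)
        # Count each unordered pair once per document in which both occur.
        cooccur.update(combinations(sorted(set(words)), 2))

    phrases = {}
    for (a, b), n in cooccur.items():
        if freq[a] < high_freq or freq[b] < high_freq:
            continue  # step 1: only high-frequency words are paired
        if n < cooccur_threshold:
            continue  # step 2: too few co-occurrences
        cohesion = n / math.sqrt(freq[a] * freq[b])  # step 3
        if cohesion > cohesion_threshold:  # step 4
            phrases[(a, b)] = cohesion
    return phrases


docs = [
    "information retrieval system",
    "information retrieval",
    "information theory",
    "retrieval system",
]
phrases = candidate_phrases(docs, high_freq=2, cooccur_threshold=2,
                            cohesion_threshold=0.6)
```

The pairs are unordered, so a later step would still have to decide the word order of each retained phrase.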

Choueka Procedure
• Identifies collocational expressions, that is, phrases whose meaning cannot be derived in a simple way from that of the component words (e.g. artificial intelligence)
  1. Select the range of lengths allowed for each collocational expression
  2. Build a list of all potential expressions of the prescribed length from the collection that have a minimum frequency
  3. Delete sequences that begin or end with a trivial word

Choueka Procedure (continued)
  4. Delete expressions that contain high-frequency nontrivial words
  5. Given an expression such as a b c d, evaluate any potential subexpressions for relevance. Discard any that are not sufficiently relevant
  6. Try to merge small expressions into larger and more meaningful ones
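Steps 1 to 3 of the procedure can be sketched as a simple n-gram count. Steps 4 to 6 depend on relevance judgments the slides do not define, so they are left out. The function name, its parameters, and the sample text are hypothetical.

```python
from collections import Counter


def candidate_expressions(tokens, min_len, max_len, min_freq, stoplist):
    """Collect word sequences of the allowed lengths that occur at least
    min_freq times and neither begin nor end with a trivial word."""
    counts = Counter()
    for n in range(min_len, max_len + 1):        # step 1: allowed lengths
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1  # step 2: count sequences
    return {
        gram: c
        for gram, c in counts.items()
        if c >= min_freq                         # step 2: minimum frequency
        and gram[0] not in stoplist              # step 3: trivial first word
        and gram[-1] not in stoplist             # step 3: trivial last word
    }


text = ("the field of artificial intelligence and "
        "the practice of artificial intelligence")
expressions = candidate_expressions(
    text.split(), min_len=2, max_len=3, min_freq=2,
    stoplist={"the", "of", "and"},
)
```

In this sample, "of artificial intelligence" also occurs twice but is dropped because it begins with a trivial word, leaving only "artificial intelligence".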

4.2 Similarity computations between terms
• The goal is to determine the statistical similarity between pairs of terms.
• Dice coefficient: returns 0 if l1 = 0 or l2 = 0
• Cosine coefficient: returns 0 if l1 = 0 or l2 = 0
• l1: number of terms associated with document 1
• l2: number of terms associated with document 2
• common: number of terms in common between them
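The formulas themselves did not survive in this transcript. The sketch below uses the standard set-based forms of the two coefficients, which is an assumption about what the slide showed: Dice as 2 * common / (l1 + l2) and cosine as common / sqrt(l1 * l2).

```python
import math


def dice(items1, items2):
    """Standard Dice coefficient between two sets."""
    l1, l2 = len(items1), len(items2)
    if l1 == 0 or l2 == 0:
        return 0.0
    common = len(items1 & items2)
    return 2 * common / (l1 + l2)


def cosine(items1, items2):
    """Standard set-based cosine coefficient between two sets."""
    l1, l2 = len(items1), len(items2)
    if l1 == 0 or l2 == 0:
        return 0.0
    common = len(items1 & items2)
    return common / math.sqrt(l1 * l2)
```

Both measures range from 0 (nothing in common) to 1 (identical sets), and the zero checks avoid a division by zero for empty sets.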

4.3 Organization of vocabulary
• Two assumptions
  – High-frequency words have broad meaning
  – If the density functions of two terms, p and q (of varying frequencies), have the same shape, then the two words have similar meaning
• Given these two assumptions, if p is the term with the higher frequency, then q becomes a child of p.

Organization of vocabulary (continued)
• Identify a set of frequency ranges.
• Group the vocabulary terms into different classes based on their frequencies.
• The highest frequency class is assigned level 0, the next level 1, and so on.
• Parent-child links are determined.
• Create "dummy" terms.
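A rough sketch of the level assignment and linking steps follows. The slides do not state how parents are chosen, so this sketch assumes each term is linked to the most similar term one level above it. Dummy terms are not implemented. All names, frequencies, and similarity scores are invented for illustration.

```python
def assign_levels(frequencies, boundaries):
    """Map each term to a level; boundaries are lower bounds in descending
    order, e.g. [100, 10] gives level 0 for f >= 100, level 1 for
    10 <= f < 100, and level 2 for f < 10."""
    return {
        term: sum(1 for b in boundaries if f < b)
        for term, f in frequencies.items()
    }


def link_parents(levels, similarity):
    """Assumed rule: the parent is the most similar term one level up."""
    parents = {}
    for term, level in levels.items():
        candidates = [t for t, lv in levels.items() if lv == level - 1]
        if candidates:
            parents[term] = max(candidates, key=lambda t: similarity(term, t))
    return parents


frequencies = {"computer": 150, "science": 120, "software": 50,
               "biology": 40, "compiler": 5}
scores = {
    frozenset({"software", "computer"}): 0.8,
    frozenset({"software", "science"}): 0.3,
    frozenset({"biology", "science"}): 0.7,
    frozenset({"biology", "computer"}): 0.1,
    frozenset({"compiler", "software"}): 0.9,
}


def similarity(a, b):
    """Look up an invented pairwise score; unknown pairs score 0."""
    return scores.get(frozenset({a, b}), 0.0)


levels = assign_levels(frequencies, boundaries=[100, 10])
parents = link_parents(levels, similarity)
```

Level 0 terms have no level above them, so they stay without a parent and act as the roots of the hierarchy.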

Merging existing thesauri
• Simple-merge
  – Two terms in different hierarchies are merged if they are identical.
• Complex-merge
  – Any two terms in different hierarchies are merged if they have "similar" parents and children.
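Simple-merge can be sketched with hierarchies stored as a mapping from each term to its set of children. That representation and the sample terms are assumptions made for illustration. Complex-merge needs a definition of "similar" that the slides do not give, so it is not attempted here.

```python
def simple_merge(hierarchy1, hierarchy2):
    """Merge two term -> children mappings; identical terms are merged
    by taking the union of their children."""
    merged = {term: set(children) for term, children in hierarchy1.items()}
    for term, children in hierarchy2.items():
        merged.setdefault(term, set()).update(children)
    return merged


h1 = {"computers": {"hardware", "software"}, "software": {"compilers"}}
h2 = {"software": {"operating systems"}, "information systems": {"software"}}
merged = simple_merge(h1, h2)
```

The inputs are copied rather than modified, so each component thesaurus stays intact after the merge.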

Conclusion
• This chapter began with an introduction to thesauri.
• Two major automatic thesaurus construction methods have been detailed.
• A few issues related to thesauri have not been considered here:
  – Evaluation of thesauri.
  – Maintenance of thesauri.
  – How to automate the usage of thesauri.