Chapter 7 Text Operations

Logical View of a Document
[Figure: processing pipeline from document to index terms. Labels: document; text + structure; structure recognition; structure; accents, spacing, etc.; text; stopwords; full text; noun groups; stemming; automatic or manual indexing; index terms.]

Text Operations
- Lexical analysis of the text
- Elimination of stopwords
- Stemming
- Selection of index terms
- Construction of term categorization structures

Lexical Analysis of the Text
- Word separators
  - space
  - digits
  - hyphens
  - punctuation marks
  - the case of the letters

Lexical Analysis for Automatic Indexing
- Lexical analysis: convert an input stream of characters into a stream of words or tokens.
- What is a word or a token? Tokens consist of letters.
  - digits: Most numbers are not good index terms. Counterexamples: case numbers in a legal database, “B6” and “B12” in a vitamin database.

Lexical Analysis for Automatic Indexing (Continued)
- hyphens
  - break hyphenated words: state-of-the-art --> state of the art
  - keep hyphenated words as a token: “Jean-Claude”, “F-16”
- punctuation marks: often used as parts of terms, e.g., OS/2, 510 B.C.
- case: usually not significant in index terms
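
The rules for digits, hyphens, punctuation, and case can be sketched as a small tokenizer. This is a minimal illustration, not a reference implementation: the `KEEP_WHOLE` exception list and the function name `tokenize` are hypothetical, and a real system would need a domain-specific policy (for example, to keep “B6” in a vitamin database).

```python
import re

# Hypothetical exception list: terms whose hyphen or slash is part of the term.
KEEP_WHOLE = {"f-16", "jean-claude", "os/2"}

def tokenize(text, keep_whole=KEEP_WHOLE):
    """Convert a character stream into lowercase candidate index terms."""
    tokens = []
    for raw in text.lower().split():          # case folding, split on spaces
        word = raw.strip(".,;:!?\"'()")       # trim surrounding punctuation
        if word in keep_whole:                # keep listed terms as one token
            tokens.append(word)
            continue
        # Break on hyphens, digits, and other non-letters;
        # pieces consisting only of digits disappear.
        for piece in re.split(r"[^a-z]+", word):
            if piece:
                tokens.append(piece)
    return tokens

print(tokenize("The F-16 is state-of-the-art, built in 1974."))
```

Dropping every digit is the simplest policy; the counterexamples on the previous slide show why it cannot be applied blindly.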

Elimination of Stopwords
- A list of stopwords
  - words that are too frequent among the documents
  - articles, prepositions, conjunctions, etc.
- Can reduce the size of the indexing structure considerably
- Problem
  - Search for “to be or not to be”?

Stopword
- Avoid retrieving almost every item in a database regardless of its relevance.
- Examples
  - conservative approach (ORBIT Search Service): and, an, by, from, of, the, with
  - (derived from Brown corpus): 425 words
    a, about, above, across, after, against, all, almost, alone, along, already, also, although, always, among, and, another, anybody, anyone, anything, anywhere, area, areas, around, ask, asked, asking, asks, at, away, b, backed, backing, backs, because, became, ...
- Articles, prepositions, conjunctions, ...
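
Stopword elimination amounts to a set-membership filter. The sketch below uses the seven-word ORBIT list quoted above; the names `ORBIT_STOPWORDS` and `remove_stopwords` are chosen here for illustration only.

```python
# The conservative ORBIT Search Service list from the slide.
ORBIT_STOPWORDS = frozenset({"and", "an", "by", "from", "of", "the", "with"})

def remove_stopwords(tokens, stopwords=ORBIT_STOPWORDS):
    """Return the tokens that are not in the stopword list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]

print(remove_stopwords("the state of the art".split()))

# With a more aggressive (here: made-up) list, a whole query can vanish,
# which is the "to be or not to be" problem.
aggressive = {"to", "be", "or", "not"}
print(remove_stopwords("to be or not to be".split(), aggressive))
```

The second call returns an empty list, so nothing is left to search for.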

Stemming
- Example
  - connect, connected, connecting, connections
  - effectiveness --> effective --> effect
  - picnicking --> picnic
- Removing strategies
  - affix removal: intuitive, simple
  - table lookup
  - successor variety
  - n-gram

Stemmers
- programs that relate morphologically similar indexing and search terms
- stem at indexing time
  - advantage: efficiency and index file compression
  - disadvantage: information about the full terms is lost
- example (CATALOG system), stem at search time

    Look for: system users
    Search Term: users

    Term        Occurrences
    1. user     15
    2. users    1
    3. used     3
    4. using    2

  The user selects the terms he wants by numbers

Conflation Methods
- manual
- automatic (stemmers)
  - affix removal: longest match vs. simple removal
  - successor variety
  - table lookup
  - n-gram
- Term evaluation
  - correctness
  - retrieval effectiveness
  - compression performance
- Example: engineering, engineered, engineer --> stem: engineer

Successor Variety
- Definition: the successor variety of a string is the number of different characters that follow it in words in some body of text
- Example
  - a body of text: able, axle, accident, ape, about
  - successor variety of apple
    - 1st letter (a): 4 (b, x, c, p)
    - 2nd letter (ap): 1 (e)

Successor Variety (Continued)
- Idea: the successor variety of substrings of a term will decrease as more characters are added until a segment boundary is reached, at which point the successor variety will sharply increase.
- Example
  Test word: READABLE
  Corpus: ABLE, BEATABLE, FIXABLE, READS, READABLE, READING, RED, ROPE, RIPE

    Prefix      Successor Variety   Letters
    R           3                   E, O, I
    RE          2                   A, D
    REA         1                   D
    READ        3                   A, I, S
    READA       1                   B
    READAB      1                   L
    READABL     1                   E
    READABLE    1                   blank
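
The table for READABLE can be recomputed with a few lines of code. This is a sketch under one assumption: a corpus word that ends exactly at the prefix contributes a single "blank" successor, represented here by the empty string. The function name `successors` is illustrative.

```python
def successors(prefix, corpus):
    """Set of characters that follow `prefix` in the corpus words."""
    found = set()
    for word in corpus:
        if word.startswith(prefix):
            # Empty string stands for "blank" (the word ends at the prefix).
            found.add(word[len(prefix)] if len(word) > len(prefix) else "")
    return found

CORPUS = ["ABLE", "BEATABLE", "FIXABLE", "READS", "READABLE",
          "READING", "RED", "ROPE", "RIPE"]

test_word = "READABLE"
for i in range(1, len(test_word) + 1):
    prefix = test_word[:i]
    letters = successors(prefix, CORPUS)
    print(f"{prefix:<10}{len(letters):<4}{sorted(letters)}")
```

The variety falls from 3 to 1 and then jumps back to 3 at READ, which marks the segment boundary READ|ABLE.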

n-gram stemmers
- digram: a pair of consecutive letters
- shared digram method: association measures are calculated between pairs of terms

    S = 2C / (A + B)

  where
    A: the number of unique digrams in the first word,
    B: the number of unique digrams in the second,
    C: the number of unique digrams shared by the two words.

n-gram stemmers (Continued)
- Example
  statistics      => st ta at ti is st ti ic cs
  unique digrams  => at cs ic is st ta ti
  statistical     => st ta at ti is st ti ic ca al
  unique digrams  => al at ca ic is st ta ti
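
The example can be carried through to a similarity value. The sketch assumes the association measure is Dice's coefficient, S = 2C / (A + B), the measure usually paired with the shared digram method; the function names are illustrative.

```python
def unique_digrams(word):
    """Set of distinct pairs of consecutive letters in `word`."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice_similarity(word1, word2):
    """Dice's coefficient S = 2C / (A + B) over unique digrams."""
    a = unique_digrams(word1)
    b = unique_digrams(word2)
    if not a and not b:          # both words shorter than two letters
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

a = unique_digrams("statistics")     # A = 7
b = unique_digrams("statistical")    # B = 8
print(sorted(a & b))                 # the C = 6 shared digrams
print(dice_similarity("statistics", "statistical"))   # 2*6 / (7+8)
```

For this pair A = 7, B = 8, and C = 6, so S = 12/15 = 0.80.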

n-gram stemmers (Continued)
- similarity matrix: determine the similarity measures for all pairs of terms in the database

              word1   word2   word3   ...   wordn-1
    word1
    word2     S21
    word3     S31     S32
    ...
    wordn     Sn1     Sn2     Sn3     ...   Sn(n-1)

- terms are clustered using a single link clustering method
- more a term clustering procedure than a stemming one

Q-gram stemmers
- Example: define Q = 3; a word of length L yields L + Q - 1 Q-grams
  statistics  => 10 + 3 - 1 = 12
    { ##s, #st, sta, tat, ati, tis, ist, sti, tic, ics, cs#, s## }
  statistical => 11 + 3 - 1 = 13
    { ##s, #st, sta, tat, ati, tis, ist, sti, tic, ica, cal, al#, l## }
- Q-gram Distance = 4
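
The Q-gram sets above can be generated by padding each word with Q - 1 `#` characters on both sides. The sketch below reproduces the counts 12 and 13. The slide does not define the distance; one reading that yields 4, offered here only as an assumption, is the number of Q-grams of the longer word that do not occur in the shorter one.

```python
def qgrams(word, q=3, pad="#"):
    """List the Q-grams of `word` padded with q-1 pad characters per side."""
    padded = pad * (q - 1) + word + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

g1 = qgrams("statistics")     # L + Q - 1 = 12 Q-grams
g2 = qgrams("statistical")    # L + Q - 1 = 13 Q-grams
shared = set(g1) & set(g2)

print(len(g1), len(g2), len(shared))
# Assumed reading of "Q-gram Distance": Q-grams of the longer word
# that are missing from the shorter word.
print(sorted(set(g2) - set(g1)))
```

Other definitions exist (for example, counting the unshared Q-grams of both words), and they give a different value for the same pair.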

Affix Removal Stemmers
- procedure: remove suffixes and/or prefixes from terms leaving a stem, and transform the resultant stem.
- example: plural forms
  - If a word ends in “ies” but not “eies” or “aies” then “ies” --> “y”
  - If a word ends in “es” but not “aes”, “ees”, or “oes” then “es” --> “e”
  - If a word ends in “s”, but not “us” or “ss” then “s” --> NULL
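
The three plural rules translate directly into a rule table. The sketch makes two assumptions that the slide does not state: the rules are tried in the order given and only the first rule that applies is used, and a rule blocked by one of its exceptions lets the later rules be tried. The names `RULES` and `s_stem` are illustrative.

```python
# (suffix, exception endings, replacement), tried in this order.
RULES = [
    ("ies", ("eies", "aies"), "y"),
    ("es", ("aes", "ees", "oes"), "e"),
    ("s", ("us", "ss"), ""),
]

def s_stem(word):
    """Apply the first plural rule whose suffix matches and whose
    exceptions do not; return the word unchanged if none applies."""
    w = word.lower()
    for suffix, exceptions, replacement in RULES:
        if w.endswith(suffix) and not w.endswith(exceptions):
            return w[:len(w) - len(suffix)] + replacement
    return w

for term in ["queries", "horses", "cats", "glass", "corpus"]:
    print(term, "-->", s_stem(term))
```

Words ending in “ss” or “us”, such as glass and corpus, pass through unchanged because the last rule's exceptions block it.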

Index Terms Selection
- Motivation
  - A sentence is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives.
  - Most of the semantics is carried by the nouns.
- Identification of noun groups
  - A noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold

Index Terms Selection
- Indexing by single words
  - single words are often ambiguous and not specific enough for accurate discrimination of documents
  - bank terminology vs. terminology bank
- Indexing by phrases
  - Syntactic phrases are almost always more specific than single words
- Indexing by single words and phrases

Thesauri
- Peter Roget, 1988
- Example
  cowardly adj. Ignobly lacking in courage: cowardly turncoats
  Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang).
- A controlled vocabulary for indexing and searching

The Purpose of a Thesaurus
- To provide a standard vocabulary for indexing and searching
- To assist users with locating terms for proper query formulation
- To provide classified hierarchies that allow the broadening and narrowing of the current query request

Functions of thesauri
- Provide a standard vocabulary for indexing and searching
- Assist users with locating terms for proper query formulation
- Provide classified hierarchies that allow the broadening and narrowing of the current query request

Usage
- Indexing: select the most appropriate thesaurus entries for representing the document.
- Searching: design the most appropriate search strategy.
  - If the search does not retrieve enough documents, the thesaurus can be used to expand the query.
  - If the search retrieves too many items, the thesaurus can suggest more specific search vocabulary.

Document Clustering
- Global clustering
  - The grouping of documents according to their occurrence in the whole collection
- Local clustering
  - The grouping of the local set of documents retrieved by a query

Typical Clustered File Organization
[Figure: hierarchy of the clustered file. Labels: hypercentroid, supercentroids, centroids, documents; clusters, superclusters, complete space.]

Search Strategy for Clustered Documents
[Figure: typical search path through the hierarchy. Labels: highest-level centroid, supercentroids, centroids, documents.]

Text Compression
- Finding ways to represent the text in fewer bits or bytes
- Statistical methods
  - Huffman coding
- Dictionary-based
  - Ziv-Lempel
- Inverted file compression
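
As an illustration of the statistical approach, here is a minimal character-level Huffman coder. It is a sketch only: the slide does not say whether symbols are characters or words, and the function name `huffman_codes` is illustrative. Frequent symbols receive shorter bit strings.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix-free code {symbol: bit string} from symbol counts."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:                       # a single symbol still needs one bit
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tiebreak, {symbol: code so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

text = "to be or not to be"
codes = huffman_codes(text)
encoded_bits = sum(len(codes[ch]) for ch in text)
print(codes)
print(encoded_bits, "bits versus", 8 * len(text), "bits at one byte per character")
```

The code table itself must also be stored or transmitted, which this comparison leaves out.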