Discussion Class 3: Stemming Algorithms

Discussion Classes

Format:
  Question
  Ask a member of the class to answer
  Provide opportunity for others to comment

When answering:
  Give your name. Make sure that the TA hears it.
  Stand up
  Speak clearly so that all the class can hear

Question 1: Conflation methods

(a) Define the terms: stem, suffix, prefix, conflation, morpheme

(b) Define the terms in the following diagram:

Conflation methods
  Manual
  Automatic (stemmers)
    Affix removal
      Longest match
      Simple removal
    Successor variety
    Table lookup
    n-gram

Question 2: CATALOG System

Search Term: users

Term         Occurrences
1. user      15
2. users     1
3. used      3
4. using     2

Which term (0 = none, CR = all):

(a) The CATALOG stemmer differs in a fundamental way from other tools that we have seen in this course. What is it?

(b) What impact does this have on measurements of precision and recall?

Question 3: Successor variety methods

Test word: FINDABLE

Corpus: ABLE, APE, DABBLE, FIND, FINDABLE, FINDER, FOUND, FINISH, FIXABLE

(a) Fill in this table

Prefix       Successor Variety    Letters
F
FI
FINDAB
FINDABLE

Question 3 (continued): Successor variety methods

(b) Segment FINDABLE using the complete word segmentation method.

(c) Segment FINDABLE using the peak and plateau method.
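The successor varieties asked for in Question 3 could be tabulated with a short script. This is only a sketch under one stated assumption: the successor variety of a prefix is taken to be the number of distinct letters that follow it in the corpus words, and end-of-word is not counted as a successor. Some textbook presentations count a blank at the end of a word, so adjust the convention to match the one used in class. The names `successors` and `successor_table` are illustrative.

```python
# Corpus and test word from Question 3.
CORPUS = ["ABLE", "APE", "DABBLE", "FIND", "FINDABLE",
          "FINDER", "FOUND", "FINISH", "FIXABLE"]


def successors(prefix, corpus=CORPUS):
    """Distinct letters that follow `prefix` in the corpus words.

    Assumption: end-of-word is NOT counted as a successor.
    """
    return {
        word[len(prefix)]
        for word in corpus
        if word.startswith(prefix) and len(word) > len(prefix)
    }


def successor_table(word, corpus=CORPUS):
    """One row (prefix, successor variety, sorted letters) per prefix of `word`."""
    rows = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        letters = sorted(successors(prefix, corpus))
        rows.append((prefix, len(letters), letters))
    return rows


for prefix, variety, letters in successor_table("FINDABLE"):
    print(f"{prefix:<10}{variety:<3}{' '.join(letters)}")
```

The segmentation methods in parts (b) and (c) can then be applied by hand to the printed table.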

Question 4: n-gram methods

(a) Explain the following notation:

statistics => st ta at ti is st ti ic cs
unique digrams => at cs ic is st ta ti

statistical => st ta at ti is st ti ic ca al
unique digrams => al at ca ic is st ta ti

(b) Calculate the similarity using Dice's coefficient:

S = 2C / (A + B)

A is the number of unique digrams in the first term
B is the number of unique digrams in the second term
C is the number of shared unique digrams

(c) Explain the statement that it is a bit confusing to call this a "stemming method".
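The digram notation and Dice's coefficient from Question 4 can be checked with a few lines of code. This is a minimal sketch; the function names `unique_digrams` and `dice` are illustrative, and returning 0.0 when neither term has any digrams is an assumption made here to avoid division by zero.

```python
def unique_digrams(term):
    """Set of distinct adjacent letter pairs (digrams) in `term`."""
    return {term[i:i + 2] for i in range(len(term) - 1)}


def dice(first, second):
    """Dice's coefficient S = 2C / (A + B) over unique digrams."""
    a = unique_digrams(first)
    b = unique_digrams(second)
    if not a and not b:
        # Assumption: two terms with no digrams get similarity 0.0.
        return 0.0
    shared = a & b  # C: digrams present in both terms
    return 2 * len(shared) / (len(a) + len(b))


print(sorted(unique_digrams("statistics")))
print(sorted(unique_digrams("statistical")))
print(dice("statistics", "statistical"))
```

In a full n-gram method, terms whose pairwise similarity exceeds some threshold are clustered together, which is why no stem is ever produced.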

Question 5: Porter's algorithm

(a) What is an iterative, longest match stemmer?

(b) How is longest match achieved in the Porter algorithm?

Question 6: Porter's algorithm

Conditions   Suffix   Replacement   Examples
(m > 0)      eed      ee            feed -> feed
                                    agreed -> agree
(*v*)        ed       null          plastered -> plaster
                                    bled -> bled
(*v*)        ing      null          motoring -> motor
                                    sing -> sing

(a) Explain this table

(b) How does this table apply to: "exceeding", "ringed"?
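The three rules in the Question 6 table can be sketched in code to see how the conditions interact. This is not a full Porter stemmer: it covers only these three rules and omits the follow-up clean-up that the complete algorithm performs after removing "ed" or "ing". The helper names are illustrative. The sketch assumes the usual Porter conventions: m counts vowel-consonant sequences in the stem, (*v*) means the stem contains a vowel, "y" counts as a vowel when it follows a consonant, and the first (longest) suffix that matches selects the rule even if its condition then fails.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a consonant at the start of a word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """m: the number of vowel-to-consonant transitions in the stem."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_vowel:
            m += 1
        prev_vowel = not consonant
    return m


def contains_vowel(stem):
    """(*v*): the stem contains at least one vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


# (suffix, replacement, condition on the stem), longest suffix first.
RULES = [
    ("eed", "ee", lambda stem: measure(stem) > 0),
    ("ed", "", contains_vowel),
    ("ing", "", contains_vowel),
]


def apply_rules(word):
    """Apply the first rule whose suffix matches; later rules are not tried."""
    for suffix, replacement, condition in RULES:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem + replacement if condition(stem) else word
    return word


for w in ["feed", "agreed", "plastered", "bled", "motoring", "sing"]:
    print(w, "->", apply_rules(w))
```

Calling `apply_rules` on "exceeding" and "ringed" is one way to check an answer to part (b).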

Question 7: Evaluation

(a) In Web search engines, the tendency is not to use stemming. Why? (There are at least three answers.)

(b) Does your answer to part (a) mean that stemming is no longer useful?