Discussion Class 5: Stemming Algorithms

Discussion Classes

Format:
  • Question
  • Ask a member of the class to answer
  • Provide opportunity for others to comment

When answering:
  • Give your name. Make sure that the TA hears it.
  • Stand up
  • Speak clearly so that all the class can hear

Question 1: Conflation methods

(a) Define the terms: stem, suffix, prefix, conflation, morpheme

(b) Define the terms in the following diagram:

Conflation methods
    Manual
    Automatic (stemmers)
        Affix removal
            Longest match
            Simple removal
        Successor variety
        Table lookup
        n-gram

Question 2: Table look-up

(a) What are the advantages and disadvantages of table look-up methods?

(b) When would you use table look-up?
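As a starting point for the discussion, table look-up can be sketched as a plain dictionary from terms to stems. The table contents and the name `lookup_stem` below are hypothetical, chosen only for illustration; a real table would have to be built and maintained for the whole vocabulary.

```python
# Hypothetical hand-built table mapping each known term to its stem.
STEM_TABLE = {
    "engineer": "engineer",
    "engineered": "engineer",
    "engineering": "engineer",
}


def lookup_stem(term, table=STEM_TABLE):
    """Return the stored stem, or the term unchanged if it is not in the table."""
    return table.get(term, term)


print(lookup_stem("engineering"))  # engineer
print(lookup_stem("stemming"))     # stemming (not in the table, so unchanged)
```

The second call shows the behaviour for a term that is missing from the table, which is worth keeping in mind when weighing the method.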

Question 3: Successor variety methods

Hafer and Weiss defined their technique as:

Let $\alpha$ be a word of length $n$; $\alpha_i$ is the length-$i$ prefix of $\alpha$. Let $D$ be the corpus of words. $D_{\alpha_i}$ is defined as the subset of $D$ containing the terms whose first $i$ letters match $\alpha_i$ exactly. The successor variety of $\alpha_i$, denoted by $S_{\alpha_i}$, is then defined as the number of distinct letters that occupy the $(i+1)$st position of words in $D_{\alpha_i}$. A test word of length $n$ has $n$ successor varieties $S_{\alpha_1}, S_{\alpha_2}, \ldots, S_{\alpha_n}$.

Explain this definition, using the word "computation" as an example.
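The definition can be tried out with a short sketch. The corpus below is a hypothetical toy word list, not data from the course, and the function name `successor_varieties` is an illustrative choice. The sketch assumes that the end of a word is not counted as a successor; some formulations count it.

```python
def successor_varieties(word, corpus):
    """Return the list [S_1, ..., S_n] for `word`, measured against `corpus`."""
    varieties = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        # Distinct letters in position i+1 of corpus words whose first i letters
        # match the prefix (w[i] is the (i+1)st letter with 0-based indexing).
        successors = {w[i] for w in corpus if w.startswith(prefix) and len(w) > i}
        varieties.append(len(successors))
    return varieties


# Hypothetical toy corpus, for illustration only.
TOY_CORPUS = ["compute", "computer", "computing", "computation",
              "compare", "cold", "cat"]

print(successor_varieties("computation", TOY_CORPUS))
# [2, 2, 1, 2, 1, 3, 1, 1, 1, 1, 0]
```

With this toy corpus the largest value falls after "comput", where the corpus continues with "e", "i" or "a".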

Question 4: Successor variety methods

With successor variety methods, how do the following methods of segmentation work?

(a) cutoff method
(b) peak and plateau method
(c) complete word method
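One possible reading of the three methods is sketched below, as a basis for discussion rather than a definitive statement of them. It assumes that the cutoff method breaks wherever the successor variety reaches a chosen threshold, that peak and plateau breaks after a position whose variety exceeds both neighbours, and that the complete word method breaks after any prefix that is itself a corpus word. The function names and the variety values are hypothetical.

```python
def cutoff_breaks(varieties, threshold):
    """Break after position i (1-based) when S_i reaches the threshold."""
    return [i for i, s in enumerate(varieties, start=1) if s >= threshold]


def peak_breaks(varieties):
    """Break after position i when S_i exceeds both S_(i-1) and S_(i+1)."""
    breaks = []
    for k in range(1, len(varieties) - 1):
        if varieties[k] > varieties[k - 1] and varieties[k] > varieties[k + 1]:
            breaks.append(k + 1)  # convert 0-based index to 1-based position
    return breaks


def complete_word_breaks(word, corpus):
    """Break after any proper prefix of `word` that is a complete corpus word."""
    words = set(corpus)
    return [i for i in range(1, len(word)) if word[:i] in words]


def segment(word, breaks):
    """Split `word` after each of the given 1-based positions."""
    pieces, start = [], 0
    for b in breaks:
        pieces.append(word[start:b])
        start = b
    pieces.append(word[start:])
    return pieces


# Hypothetical successor varieties for "computation" in some toy corpus.
VARIETIES = [2, 2, 1, 2, 1, 3, 1, 1, 1, 1, 0]

print(segment("computation", cutoff_breaks(VARIETIES, 3)))  # ['comput', 'ation']
print(segment("computation", peak_breaks(VARIETIES)))       # ['comp', 'ut', 'ation']
print(complete_word_breaks("computer", ["compute", "computer"]))  # [7]
```

The two outputs for "computation" differ, which illustrates how sensitive the segmentation is to the chosen method and threshold.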

Question 5: n-gram methods

(a) Explain the following notation:

statistics => st ta at ti is st ti ic cs
unique digrams => at cs ic is st ta ti

statistical => st ta at ti is st ti ic ca al
unique digrams => al at ca ic is st ta ti

(b) Calculate the similarity using Dice's coefficient:

$$S = \frac{2C}{A + B}$$

A is the number of unique digrams in the first term
B is the number of unique digrams in the second term
C is the number of shared unique digrams

(c) How would you use this approach for stemming?
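The calculation in part (b) can be checked with a minimal sketch. The function names `unique_digrams` and `dice` are illustrative, and returning 0.0 when neither term has any digrams is an assumption made here to avoid division by zero.

```python
def unique_digrams(term):
    """Return the set of distinct adjacent letter pairs in `term`."""
    return {term[i:i + 2] for i in range(len(term) - 1)}


def dice(first, second):
    """Dice's coefficient S = 2C / (A + B) over unique digrams."""
    a = unique_digrams(first)
    b = unique_digrams(second)
    if not a and not b:
        return 0.0  # assumption: terms too short to have digrams share nothing
    shared = a & b
    return 2 * len(shared) / (len(a) + len(b))


print(sorted(unique_digrams("statistics")))
# ['at', 'cs', 'ic', 'is', 'st', 'ta', 'ti']
print(dice("statistics", "statistical"))
# 0.8
```

Here A = 7, B = 8 and C = 6, so S = 12 / 15 = 0.8.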

Question 6: Porter's algorithm

(a) What is an iterative, longest match stemmer?

(b) How is longest match achieved in the Porter algorithm?

Question 7: Porter's algorithm

Conditions   Suffix   Replacement   Examples
(m > 0)      eed      ee            feed -> feed
                                    agreed -> agree
(*v*)        ed       null          plastered -> plaster
                                    bled -> bled
(*v*)        ing      null          motoring -> motor
                                    sing -> sing

(a) Explain this table

(b) How does this table apply to: "exceeding", "ringed"?
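The three rules in the table can be sketched as follows. This is a simplified illustration, not a full Porter stemmer: it covers only these three rules and omits the clean-up rules that the published algorithm applies after removing "ed" or "ing". It assumes Porter's usual definitions, where m counts vowel-consonant sequences in the stem and (*v*) means the stem contains a vowel. All function names are illustrative.

```python
def is_consonant(word, i):
    """Porter's definition: 'y' is a vowel only when preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return m, the number of vowel-to-consonant transitions in the stem."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def has_vowel(stem):
    """The (*v*) condition: the stem contains at least one vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


# (suffix, replacement, condition on the stem), one entry per table row.
RULES = [
    ("eed", "ee", lambda stem: measure(stem) > 0),
    ("ed", "", has_vowel),
    ("ing", "", has_vowel),
]


def apply_table(word):
    """Pick the longest matching suffix, then apply its rule if the condition holds."""
    matching = [rule for rule in RULES if word.endswith(rule[0])]
    if not matching:
        return word
    suffix, replacement, condition = max(matching, key=lambda rule: len(rule[0]))
    stem = word[:-len(suffix)]
    return stem + replacement if condition(stem) else word


for w in ["feed", "agreed", "plastered", "bled", "motoring", "sing"]:
    print(w, "->", apply_table(w))
```

The loop reproduces the examples in the table. Note that "feed" matches both "eed" and "ed"; the longer suffix is chosen, its condition fails, and no other rule is tried. Calling `apply_table` on "exceeding" and "ringed" is one way to check an answer to part (b).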

Question 8: Evaluation

(a) What is the overall effectiveness of stemming?

(b) Give a possible reason why Stemmer A might be better than Stemmer B on Collection X but worse on Collection Y.