CSA 3050: NLP Algorithms
Conflation Algorithms
November 2003

Acknowledgements
• John Repici (2002), http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm
• Porter, M. F., 1980, An algorithm for suffix stripping, reprinted in Sparck Jones, Karen, and Peter Willett, 1997, Readings in Information Retrieval, San Francisco: Morgan Kaufmann, ISBN 1-55860-454-4. [Vince has a copy of this]
• Jurafsky & Martin, Appendix B, pp. 833-836.

Word Conflation Algorithms
• Morphological analysis versus conflation
• Notion of word class is application dependent
  – Genealogy: phonetic similarity
  – Information Retrieval: semantic similarity
• Soundex
• Porter

Problems with Names
• Names can be misspelt: Rossner
• The same name can be spelt in different ways: Kirkop; Chircop
• The same name appears differently in different cultures: Tchaikovsky; Chaicowski
• To solve this problem, we need phonetically oriented algorithms which can find similar-sounding terms and names.
• Just such a family of algorithms exists; they are called Soundexes, after the first patented version.

The Soundex Algorithm
• A Soundex algorithm takes a word as input and produces a character string which identifies a set of words that are (roughly) phonetically alike.
• It is very handy for searching large databases of names when the exact spelling is uncertain.
• Originally developed by Margaret K. Odell and Robert C. Russell [cf. U.S. Patents 1261167 (1918), 1435663 (1922)], of the US Bureau of Archives, to simplify census-taking.
• With Don Knuth's implementation in his book "The Art of Computer Programming, vol. 3: Sorting and Searching", the algorithm enjoyed a new popularity.

Soundex Algorithm 1
The Soundex algorithm uses the following steps to encode a word:
1. The first character of the word is retained as the first character of the Soundex code.
2. Of the remaining letters, the following are discarded: a, e, i, o, u, h, w, and y.
3. Each remaining consonant is replaced by its code number. If consonants having the same code number appear consecutively, the number is coded only once (e.g. "B233" becomes "B23").

Code Numbers
b, p, f, v                  1
c, s, k, g, j, q, x, z      2
d, t                        3
l                           4
m, n                        5
r                           6

Soundex Algorithm 2
The resulting code is modified so that it becomes exactly four characters long:
– If it is less than 4 characters, zeroes are added to the end (e.g. "B2" becomes "B200")
– If it is more than 4 characters, the code is truncated (e.g. "B2435" becomes "B243")
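
The encoding steps and code numbers on these slides can be sketched in Python roughly as follows. This is a minimal illustration of the simplified algorithm as stated here, not a reference implementation: the function name soundex is our own, and we assume that the code number of the first letter also takes part in the "consecutive codes" check, which the slides do not spell out.

```python
# Letter-to-digit table taken from the "Code Numbers" slide.
CODES = {}
for letters, digit in [("bpfv", "1"), ("cskgjqxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the four-character code for word (simplified algorithm)."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    # Step 2: discard a, e, i, o, u, h, w, y (anything without a code number).
    kept = [ch for ch in letters[1:] if ch in CODES]
    # Step 3: code each consonant, skipping repeats of the previous code.
    # Assumption: the first letter's own code counts as the "previous" code.
    digits = []
    previous = CODES.get(first)
    for ch in kept:
        digit = CODES[ch]
        if digit != previous:
            digits.append(digit)
        previous = digit
    # Pad with zeroes or truncate to exactly four characters.
    return (first.upper() + "".join(digits) + "000")[:4]
```

With this sketch, soundex("Rossner") and soundex("Rosner") both give "R256", so the misspelling is conflated. Kirkop and Chircop give "K621" and "C621": the digits agree, but the retained first letter still keeps them apart.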

Uses for the Soundex Code
• Airline reservations - The Soundex code for a passenger's surname is often recorded to avoid confusion when trying to pronounce it.
• U.S. Census - As noted above, the U.S. Census Department was a frequent user of the Soundex algorithm while trying to compile a listing of families around the turn of the century.
• Genealogy - In genealogy, the Soundex code is most often used to avoid obstacles when dealing with names that might have alternate spellings.

Improvements
• Preprocessing before applying the basic algorithm, e.g. replace
  – DG with G
  – GH with H
  – GN with N (not 'ng')
  – KN with N
  – PH with F
• Question: where to stop?
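
The preprocessing idea can be sketched as a small rewrite table applied before encoding. The names REWRITES and preprocess are illustrative, and applying the rewrites in the listed order, everywhere in the word, is our assumption; the slide only gives the letter pairs.

```python
# Digraph rewrites from the slide, applied in the order listed (an assumption).
REWRITES = [("dg", "g"), ("gh", "h"), ("gn", "n"), ("kn", "n"), ("ph", "f")]


def preprocess(word):
    """Normalise common digraphs before the basic Soundex encoding."""
    word = word.lower()
    for old, new in REWRITES:
        # "gn" is rewritten but "ng" is left alone, as the slide requires.
        word = word.replace(old, new)
    return word
```

For example, preprocess("Phillips") gives "fillips", so Phillips and Fillips would then share a first letter and a code. "sing" is left unchanged because it contains "ng", not "gn".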

IR Applications
• Information Retrieval: Query → Relevant Documents
• “Bag of Terms” document model
• What is a single term?

Why Stemming is Necessary
• Frequently we get collections of words of the following kind in the same document: compute, computer, computing, computation, computability, ...
• Performance of an IR system will be improved if all of these terms are conflated.
  – Fewer terms to worry about
  – More accurate statistics

Issues
• Is a dictionary available?
  – Stems
  – Affixes
• Motivation: linguistic credibility or engineering performance?
• When to remove an affix versus when to leave it alone
• Porter (1980): W1 and W2 should be conflated if there appears to be no difference between the statements "this document is about W1" and "this document is about W2"
  – relate/relativity vs. radioactive/radioactivity

Consonants and Vowels
• A consonant is a letter other than a, e, i, o, u, and other than y preceded by a consonant. So y is a vowel in sky but a consonant in toy.
• If a letter is not a consonant it is a vowel.
• A sequence of consonants (cc..c) or vowels (vv..v) will be represented by C or V respectively.
• For example, the word troubles maps to CVCVC.
• Any word or part of a word therefore has one of the following forms:
  CVCV...C
  CVCV...V
  VCVC...C
  VCVC...V

Measure
• All the above patterns can be replaced by the following regular expression: [C](VC)^m[V], where the square brackets mark optional parts and (VC)^m means VC repeated m times.
• m is called the measure of any word or word part.
  – m=0: tr, ee, tree, y, by
  – m=1: trouble, oats, trees, ivy
  – m=2: troubles, private
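
The consonant/vowel mapping and the measure m can be computed as in the sketch below. The function names is_consonant, cv_form and measure are our own; the logic follows the definitions on these two slides, treating a word-initial y as a consonant since nothing precedes it.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a vowel only when the preceding letter is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cv_form(word):
    """Collapse runs of consonants/vowels into single C/V symbols."""
    word = word.lower()
    form = ""
    for i in range(len(word)):
        symbol = "C" if is_consonant(word, i) else "V"
        if not form or form[-1] != symbol:
            form += symbol
    return form


def measure(word):
    """Number of VC pairs in the collapsed form, i.e. m in [C](VC)^m[V]."""
    return cv_form(word).count("VC")
```

For instance, cv_form("troubles") is "CVCVC", which contains two VC pairs, so measure("troubles") is 2.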

Rules
• Rules for removing a suffix are given in the form (condition) S1 → S2
• If a word ends with suffix S1, and the stem before S1 satisfies the condition, then S1 is replaced with S2.
• Example: (m > 1) EMENT → ε, i.e. the suffix is deleted when the stem has measure greater than 1 (replacement → replac)

Conditions
• *S - stem ends with s
• *Z - stem ends with z
• *T - stem ends with t
• *v* - stem contains a vowel
• *d - stem ends with a double consonant
• *o - stem ends cvc, where the second c is not w, x or y, e.g. -wil, -hop
• In conditions, Boolean operators are possible, e.g. (m>1 and (*S or *T))
• Sets of rules are applied in 7 steps. Within each step, the rule matching the longest suffix applies.
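
One way to express the conditions and the rule format (condition) S1 → S2 in code is sketched below. All names (contains_vowel, ends_double_consonant, ends_cvc, apply_rule) are illustrative, and passing the condition as a function of the stem is a design choice of this sketch, not something the slides prescribe.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """m in [C](VC)^m[V]: count VC pairs in the collapsed C/V form."""
    form = ""
    for i in range(len(stem)):
        symbol = "C" if is_consonant(stem, i) else "V"
        if not form or form[-1] != symbol:
            form += symbol
    return form.count("VC")


def contains_vowel(stem):              # *v*
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):       # *d
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))


def ends_cvc(stem):                    # *o
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")


def apply_rule(word, suffix, replacement, condition):
    """Apply (condition) suffix -> replacement; return word unchanged otherwise."""
    if not word.endswith(suffix):
        return word
    stem = word[:len(word) - len(suffix)]
    return stem + replacement if condition(stem) else word
```

The rule (m > 1) EMENT → ε then reads apply_rule(word, "ement", "", lambda stem: measure(stem) > 1). It turns "replacement" into "replac" but leaves "cement" alone, because the stem "c" has measure 0.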

Organisation
• Step 1: Plurals and Third Person Singular Verbs (-s)
• Step 2: Verbal Past Tense and Progressive (-ed, -ing)
• Step 3: Y to I, Noun Inflections (fly/flies)
• Steps 4 and 5: Derivational Morphology, Multiple Suffixes (visualisation → visualise)
• Step 6: Derivational Morphology, Single Suffixes
• Step 7: Cleanup

Step 1: Plural Nouns and 3rd Person Singular Verbs
condition   rewrite       example
            SSES → SS     caresses → caress
            IES → I       ponies → poni
            SS → SS       caress → caress
            S → ε         cats → cat
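
Step 1 has no conditions, so a sketch only needs the suffix table and the longest-match convention. The names STEP1_RULES and step1 are our own. The rules are listed longest suffix first, so the first match is also the longest.

```python
# (suffix, replacement) pairs from the Step 1 table, longest suffix first.
STEP1_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def step1(word):
    """Strip plural / third-person -s endings; only the longest match applies."""
    for suffix, replacement in STEP1_RULES:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word
```

The apparently useless rule SS → SS matters here: it matches "caress" before the bare S rule can, which is what stops "caress" from being cut down to "cares".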

Step 2a: Verbal Past Tense and Progressive Forms
condition   rewrite       example
(m>0)       EED → EE      feed → feed; agreed → agree
(*v*)       ED → ε        plastered → plaster; bled → bled
(*v*)       ING → ε       killing → kill; sing → sing

Step 2b: Cleanup (applied if the 2nd or 3rd rule of Step 2a succeeds)
condition                           rewrite            example
                                    AT → ATE           generat → generate
                                    BL → BLE           troubl → trouble
                                    IZ → IZE           capsiz → capsize
(*d and not (*L or *S or *Z))       → single letter    hopping → hop; hissing → hiss
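
Steps 2a and 2b can be sketched together, since 2b only runs when an ED or ING rule has fired. The function names are illustrative, and the sketch covers only the rules shown on these two slides. Note that once the longest matching suffix has been found, no other rule in the step is tried, even if its condition fails.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    form = ""
    for i in range(len(stem)):
        symbol = "C" if is_consonant(stem, i) else "V"
        if not form or form[-1] != symbol:
            form += symbol
    return form.count("VC")


def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def step2b(stem):
    """Cleanup after -ed / -ing removal."""
    for suffix in ("at", "bl", "iz"):
        if stem.endswith(suffix):
            return stem + "e"            # AT -> ATE, BL -> BLE, IZ -> IZE
    double = (len(stem) >= 2 and stem[-1] == stem[-2]
              and is_consonant(stem, len(stem) - 1))
    if double and stem[-1] not in "lsz":
        return stem[:-1]                 # double consonant -> single letter
    return stem


def step2(word):
    """Step 2a, followed by Step 2b when ED or ING was removed."""
    if word.endswith("eed"):             # longest suffix is checked first
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return step2b(stem) if contains_vowel(stem) else word
    return word
```

This reproduces the examples in the tables: "feed" stays "feed" because EED matches but the stem "f" has measure 0, and ED is not tried afterwards. "hopping" becomes "hop", while "hissing" and "killing" keep their double letters.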

Step 3: Y to I
condition   rewrite   example
(*v*)       Y → I     happy → happi; cry → cry
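
A sketch of this single rule, with step3 as an illustrative name. The only subtlety is that the vowel test is applied to the stem before the final y, not to the whole word.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def step3(word):
    """(*v*) Y -> I: rewrite a final y when the stem before it contains a vowel."""
    if word.endswith("y"):
        stem = word[:-1]
        if any(not is_consonant(stem, i) for i in range(len(stem))):
            return stem + "i"
    return word
```

So "happy" becomes "happi", but "cry" is unchanged because the stem "cr" contains no vowel.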

Step 4: Derivational Morphology I: Multiple Suffixes (excerpt)
Condition   Rewrite            Example
(m > 0)     ATIONAL → ATE      relational → relate
(m > 0)     TIONAL → TION      conditional → condition
(m > 0)     ENCI → ENCE        valenci → valence
(m > 0)     ABLI → ABLE        comfortabli → comfortable
(m > 0)     OUSLI → OUS        analogousli → analogous
(m > 0)     IZATION → IZE      vietnamization → vietnamize
(m > 0)     ATION → ATE        generation → generate
(m > 0)     ATOR → ATE         operator → operate
(m > 0)     ALISM → AL         formalism → formal
(m > 0)     IVENESS → IVE      pensiveness → pensive
(m > 0)     FULNESS → FUL      hopefulness → hopeful
(m > 0)     OUSNESS → OUS      callousness → callous
(m > 0)     ALITI → AL         formaliti → formal
(m > 0)     BILITI → BLE       possibiliti → possible
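
Because every rule in this step shares the condition (m > 0), the step lends itself to a table-driven sketch. STEP4_RULES and step4 are our own names, and the table holds only the excerpt shown on the slide, not Porter's full rule list. Input words are assumed to have been through Step 3 already, so a final y has become i.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    form = ""
    for i in range(len(stem)):
        symbol = "C" if is_consonant(stem, i) else "V"
        if not form or form[-1] != symbol:
            form += symbol
    return form.count("VC")


# Suffix -> replacement, transcribed from the excerpt on the slide.
STEP4_RULES = {
    "ational": "ate", "tional": "tion", "enci": "ence", "abli": "able",
    "ousli": "ous", "ization": "ize", "ation": "ate", "ator": "ate",
    "alism": "al", "iveness": "ive", "fulness": "ful", "ousness": "ous",
    "aliti": "al", "biliti": "ble",
}


def step4(word):
    """Apply the longest matching suffix rule, provided the stem has m > 0."""
    for suffix in sorted(STEP4_RULES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if measure(stem) > 0:
                return stem + STEP4_RULES[suffix]
            return word   # longest match failed its condition: stop here
    return word
```

For "rational" the longest match is ATIONAL, whose stem "r" has measure 0, so the word is left alone rather than falling through to TIONAL.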

Step 6: Derivational Morphology III: Single Suffixes
Condition                     Rewrite       Example
(m > 1)                       AL → ε        revival → reviv
(m > 1)                       ANCE → ε      allowance → allow
(m > 1)                       ENCE → ε      inference → infer
(m > 1)                       ER → ε        airliner → airlin
(m > 1)                       IC → ε        gyroscopic → gyroscop
(m > 1)                       ABLE → ε      adjustable → adjust
(m > 1)                       ANT → ε       irritant → irrit
(m > 1)                       EMENT → ε     replacement → replac
(m > 1)                       MENT → ε      adjustment → adjust
(m > 1)                       ENT → ε       dependent → depend
(m > 1 and (*S or *T))        ION → ε       adoption → adopt
(m > 1)                       OU → ε        homologou → homolog
(m > 1)                       ISM → ε       formalism → formal
(m > 1)                       ATE → ε       activate → activ
(m > 1)                       ITI → ε
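
The single-suffix step can be sketched in the same table-driven way. STEP6_SUFFIXES and step6 are illustrative names, and the suffix list covers only the rows on this slide. For ION the sketch follows the condition as published by Porter, (m > 1 and (*S or *T)).

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    form = ""
    for i in range(len(stem)):
        symbol = "C" if is_consonant(stem, i) else "V"
        if not form or form[-1] != symbol:
            form += symbol
    return form.count("VC")


# Suffixes that are deleted outright when the stem has measure > 1.
STEP6_SUFFIXES = ["al", "ance", "ence", "er", "ic", "able", "ant", "ement",
                  "ment", "ent", "ion", "ou", "ism", "ate", "iti"]


def step6(word):
    """Delete the longest matching suffix if the remaining stem has m > 1."""
    for suffix in sorted(STEP6_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if measure(stem) <= 1:
                return word
            # ION has the extra condition that the stem ends in s or t.
            if suffix == "ion" and not stem.endswith(("s", "t")):
                return word
            return stem
    return word
```

Trying the longest suffix first is what makes "replacement" lose EMENT (giving "replac") rather than just MENT or ENT. "accordion" is left untouched because its stem "accord" ends in d.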

Porter Example
• INPUT: in the first focus area, integrated projects shall help develop, principally, common open platforms for software and services supporting a distributed information and decision systems for risk and crisis management

Porter Output
Original Word    Stemmed Word
first            first
focus            focu
area             area
integrated       integr
projects         project
help             help
develop          develop
principally      princip
common           common
open             open
platforms        platform
software         softwar
services         servic
supporting       support
distributed      distribut
information      inform
decision         decis
systems          system
risk             risk
crisis           crisi
management       manag