Natural Language Processing Vasile Rus http www cs

  • Slides: 49
Download presentation
Natural Language Processing Vasile Rus http: //www. cs. memphis. edu/~vrus/teaching/nlp

Natural Language Processing Vasile Rus http: //www. cs. memphis. edu/~vrus/teaching/nlp

Outline • Announcements • Word-level Processing • Stemming

Outline • Announcements • Word-level Processing • Stemming

Announcements • Paper presentations (Ph. D students) • Project

Announcements • Paper presentations (Ph. D students) • Project

Language • Language = words grouped according to some rules called a grammar Language

Language • Language = words grouped according to some rules called a grammar Language = words + rules Rules are too flexible for system developers Rules are not flexible enough for poets 4

Language • Dictionary/Lexicon – set of words defined in the language – open (dynamic)

Language • Dictionary/Lexicon – set of words defined in the language – open (dynamic) – entries in lexicon are called lemma • Grammar – set of rules which describe what is allowable in a language – Classic Grammars • meant for humans who know the language • definitions and rules are mainly supported by examples • no (or almost no) formal description tools; cannot be programmed – Explicit Grammar (CFG, Dependency Grammars, . . . ) • formal description • can be programmed & tested on data (texts)

Levels of (Formal) Description • Speech/Character Recognition – Speech: Phonetics and Phonology handle words

Levels of (Formal) Description • Speech/Character Recognition – Speech: Phonetics and Phonology handle words • Morphology • • Syntax Semantics Pragmatics Discourse handle sounds/ graphics handle rules for grouping words in legal language constructs 6

NLP Pipeline speech text Phonetic Analysis Character Recognition Morphological analysis Syntactic analysis Semantic Interpretation

NLP Pipeline speech text Phonetic Analysis Character Recognition Morphological analysis Syntactic analysis Semantic Interpretation Discourse Processing This Lecture

Words and their Internal Affairs: Morphology • Words are grouped into classes/ grammatical categories/

Words and their Internal Affairs: Morphology • Words are grouped into classes/ grammatical categories/ syntactic categories/parts-of-speech (POS) based – on their syntactic and morphological behavior • Noun: words that occur with determiners, take possessives, occur (most but not all) in plural form – and less on their typical semantic type • Luckily the classes are semantically coherent at some extent • A word belongs to a category if it passes the substitution test – The sad/intelligent/green/fat bug submerged in my soup. They all belong to the same class: ADJ

Words and their Internal Affairs: Morphology • Word categories are of two types: –

Words and their Internal Affairs: Morphology • Word categories are of two types: – Open categories: accept new members • • Nouns Verbs Adjectives Adverbs Any known human language has nouns and verbs! – Closed or functional categories • Almost fixed membership • Few members • Determiners, prepositions, pronouns, conjunctions, auxiliary verbs? , particles, numerals, etc. • Play an important role in grammar

Phonology and Morphology • Phonology: how sounds are realized in different contexts • Phoneme:

Phonology and Morphology • Phonology: how sounds are realized in different contexts • Phoneme: a set of closely related speech sounds (phones) regarded as a single sound. For example, the sound of "r" in red, bring, or round is a phoneme. • Historical spelling – night, nite – attention, mission, fish • Script Limitations – Spoken English has 14 vowels • heed hid hayed head hoed hood who’d hide how’d taught Tut toy enough – English Alphabet has 5 • Use vowel combinations: far fair fare • Consonantal doubling (hopping vs. hoping)

Syntax and Morphology • Phrase-level agreement – Subject-Verb • John studies hard (STUDY+3 SG)

Syntax and Morphology • Phrase-level agreement – Subject-Verb • John studies hard (STUDY+3 SG) – Noun-Adjective • Las vacas hermosas

Morphology: Morphemes • Studies how words are built up from morpheme, the minimal meaning

Morphology: Morphemes • Studies how words are built up from morpheme, the minimal meaning bearing unit – Example: foxes - fox + es – could as well be some “ID” numbers: • e. g. fox ~ 2327, es ~ 1278 • Two broad classes of morphemes: – Stem: the main morpheme of a word (supplies the main meaning) – Affixes: provide additional meanings • • prefixes: precede the stem suffixes: follow the stem infixes: inside the stem circumfixes: precede and follow the stem

Morphology • Morpheme combine according to some fixed rules – Concatenative morphology • Prefixes

Morphology • Morpheme combine according to some fixed rules – Concatenative morphology • Prefixes and suffixes • Morphemes combined in complex ways – Non-concatenative/templatic morphology • Morphology can be divided up into two broad classes – Inflectional: – Derivational

Inflectional Morphology • Inflectional morphology concerns the combination of stems and affixes where the

Inflectional Morphology • Inflectional morphology concerns the combination of stems and affixes where the resulting word – Has the same word class as the original – Serves a grammatical/semantic purpose different from the original

Nouns (English) • Nouns are simple (not really) – Markers for plural and possessive

Nouns (English) • Nouns are simple (not really) – Markers for plural and possessive • Plural: – Regular plural: affix –s or -es • Cat – cats • Thrush – thrushes – Irregular plural: • Mouse – mice • Ox - oxen

Verbs • Only main and primary (be, have, do) verbs have inflectional affixes (modals

Verbs • Only main and primary (be, have, do) verbs have inflectional affixes (modals don’t have) – Regular • Walk, walks, walking, walked – Irregular • Eat, eats, eating, ate, eaten • Catch, catches, catching, caught • Cut, cuts, cutting, cut

Regular Verbs Morphological Form Classes Regularly Inflected Verbs Stem Walk Merge Try Map -s

Regular Verbs Morphological Form Classes Regularly Inflected Verbs Stem Walk Merge Try Map -s form Walks Merges Tries Maps -ing participle Walking Merging Trying Mapping Past form or –ed participle Walked Merged Tried mapped

Irregular Verbs Morphological Form Classes Irregularly Inflected Verbs Stem Eat Catch Cut -s form

Irregular Verbs Morphological Form Classes Irregularly Inflected Verbs Stem Eat Catch Cut -s form Eats Catches Cuts -ing participle Eating Catching Cutting Past form Ate Caught Cut –ed participle Eaten Caught cut

Derivational Morphology • Combination of a word stem with a grammatical morpheme • Results

Derivational Morphology • Combination of a word stem with a grammatical morpheme • Results in a word of a different class • Derivational morphology is complex in English – Quasi-systematicity – Irregular meaning change – Changes of word class • Nominalisation – Computerize + ation - computerization

Derivational Examples • Verb/Adj to Noun -ation computerize computerization -ee -er -ness appoint kill

Derivational Examples • Verb/Adj to Noun -ation computerize computerization -ee -er -ness appoint kill fuzzy appointee killer fuzziness

Derivational Examples • Noun/Verb to Adj -al Computational -able Embraceable -less Clueless

Derivational Examples • Noun/Verb to Adj -al Computational -able Embraceable -less Clueless

Compute • Many paths are possible… • Start with compute – Computer -> computerize

Compute • Many paths are possible… • Start with compute – Computer -> computerize -> computerization – Computation -> computational – Computer -> computerize -> computerizable – Compute -> computee

Computational Morphology • Morphological Parsing – Maps surface representation to lexicon entries – Finite

Computational Morphology • Morphological Parsing – Maps surface representation to lexicon entries – Finite State Morphology • Finite State Transducers (FST) • Input/Output • Analysis/Generation

Computational Morphology WORD • Cats • cat • cities • geese • ducks STEM

Computational Morphology WORD • Cats • cat • cities • geese • ducks STEM (+FEATURES)* cat +N +PL cat +N +SG city +N +PL goose +N +PL (duck +N +PL) or (duck +V +3 SG) • merging merge +V +PRES-PART • caught (catch +V +PAST-PART) or (catch +V +PAST)

Computational Morphology • The Rules and the Lexicon – – General versus Specific Regular

Computational Morphology • The Rules and the Lexicon – – General versus Specific Regular versus Irregular Accuracy, speed, space The Morphology of a language • Approaches – Lexicon only – Rules only – Lexicon and Rules • Finite-state Transducers: a Finite-state Automata that maps between lexical and surface level

Lexicon-only Morphology • The lexicon lists all surface level and lexical level pairs •

Lexicon-only Morphology • The lexicon lists all surface level and lexical level pairs • No rules …? • Analysis/Generation is easy • Very large for English • What about Arabic or Turkish? • Chinese? acclaimed acclaiming acclaims acclamations acclimated acclimates acclimating acclaim $N$ acclaim $V+0$ acclaim $V+ed$ acclaim $V+en$ acclaim $V+ing$ acclaim $N+s$ acclaim $V+s$ acclamation acclimate acclimate $N$ $N+s$ $V+0$ $V+ed$ $V+en$ $V+s$ $V+ing$

Lexicon and Rules FSA Inflectional Morphology • English Noun Lexicon reg-noun Irreg-pl-noun Irreg-sg-noun plural

Lexicon and Rules FSA Inflectional Morphology • English Noun Lexicon reg-noun Irreg-pl-noun Irreg-sg-noun plural fox cat dog geese sheep mice goose sheep mouse -s • English Noun Rule

Rules Only: Stemming • Sometimes you just need to know the stem of a

Rules Only: Stemming • Sometimes you just need to know the stem of a word and you don’t care about the structure • In fact you may not even care if you get the right stem, as long as you get a consistent string • This is stemming… it most often shows up in IR (Information Retrieval) applications • IR: the task of locating most relevant documents in a collection given a query as a set of keywords

Stemming in IR • Run a stemmer on the documents to be indexed •

Stemming in IR • Run a stemmer on the documents to be indexed • Run a stemmer on users queries • Match

Porter Stemmer • • No lexicon needed Basically a set of staged sets of

Porter Stemmer • • No lexicon needed Basically a set of staged sets of rewrite rules that strip suffixes Handles both inflectional and derivational suffixes Doesn’t guarantee that the resulting stem is really a stem – Lack of guarantee doesn’t matter for IR – In a test on the well-known Cranfield 200 collection it gave an improvement in retrieval performance when compared with a very much more elaborate program which has been in use in IR research in Cambridge since 1971 • DAWSON, J. L. Suffix Removal and Word Conflation. ALLC Bulletin, Michaelmas 1974 p. 33 -46 • ANDREWS, K. The Development of a Fast Conflation Algorithm for English. Dissertation for the Diploma in Computer Science, Computer Laboratory, University of Cambridge, 1971

Porter Stemmer: Examples • • • • wearable wearer wearied wearier weariest wearily weariness

Porter Stemmer: Examples • • • • wearable wearer wearied wearier weariest wearily weariness wearing wearisomely wears weathercocks wearabl wearer wearier weariest wearili wearisom wear weathercock • • • • web Webber webs Websterville wedded weddings wedged wedges wedging webber webstervil wedd wedg

Definitions • C = consonant = Not A E I O U or (Y

Definitions • C = consonant = Not A E I O U or (Y preceded by consonant) • V = not C • Every word (or part of a word) has one of those forms – – CVCV … C CVCV … V VCVC … C VCVC … V • C* means zero or more consonants; similarly V* • C+ means one or more consonants; similarly V+ • M = Measure of sequences of V+C+: Words = C*(V+C+){M}V* M=0 M=1 M=2 TR, EE, TREE, Y, BY TROUBLE, OATS, TREES, IVY TROUBLES, PRIVATE, OATEN, ORRERY

Definitions • Conditions *S - stem ends with S - (and similarly for the

Definitions • Conditions *S - stem ends with S - (and similarly for the other letters) *v* - stem contains a V *d - stem ends with double C e. g. -DD, -ZZ *o - stem ends CVC, where the second C is not W, X or Y e. g. -WIL, -SOB

Stemming Rules • The rules for removing a suffix will be given in the

Stemming Rules • The rules for removing a suffix will be given in the form (condition) S 1 -> S 2 • This means: if a word ends with the suffix S 1, and the stem before S 1 satisfies the given condition, S 1 is replaced by S 2 • The condition is usually given in terms of m (m > 1) EMENT -> Here S 1 is `EMENT' and S 2 is null. This would map REPLACEMENT to REPLAC, since REPLAC is a word part for which m = 2. • the condition part may also contain expressions with and, or and not – (m>1 and (*S or *T)) tests for a stem with m>1 ending in S or T – (*d and not (*L or *S or *Z)) tests for a stem ending with a double consonant other than L, S or Z

Stemming Rules • In a set of rules written beneath each other, only one

Stemming Rules • In a set of rules written beneath each other, only one is obeyed, and this will be the one with the longest matching S 1 for the given word • For example, with SSES -> SS IES -> I SS -> SS S -> CARESSES maps to CARESS since SSES is the longest match for S 1. Equally CARESS maps to CARESS (S 1=`SS') and CARES to CARE (S 1=`S')

Porter Stemmer Step 1: Plural Nouns and Third Person Singular Verbs SSES SS IES

Porter Stemmer Step 1: Plural Nouns and Third Person Singular Verbs SSES SS IES I *<S> = ends with <S> *v* = contains a V *d = ends with double C *o = ends with CVC second C is not W, X or Y caresses caress ponies poni ties ti caress cats cat SS S Step 2 a: Verbal Past Tense and Progressive Forms (M>0) EED EE i (*v*) ED ii (*v*) ING feed, agreed agree plastered motoring plaster, bled motor, sing Step 2 b: If 2 a. i or 2 a. ii is successful, Cleanup AT ATE conflat(ed) conflate BL BLE troubl(ed) trouble IZ IZE siz(ed) size (*d and not (*L or *S or *Z)) single letter (M=1 and *o) E hopp(ing) hop, tann(ed) tan hiss(ing) hiss, fizz(ed) fizz fail(ing) fail, fil(ing) file

Porter Stemmer *<S> = ends with <S> *v* = contains a V *d =

Porter Stemmer *<S> = ends with <S> *v* = contains a V *d = ends with double C *o = ends with CVC second C is not W, X or Y Step 3: Y I (*v*) Y I happy happi sky

Porter Stemmer *<S> = ends with <S> *v* = contains a V *d =

Porter Stemmer *<S> = ends with <S> *v* = contains a V *d = ends with double C *o = ends with CVC second C is not W, X or Y Step 4: Derivational Morphology I (m>0) ATIONAL -> ATE (m>0) TIONAL -> TION (m>0) ENCI -> ENCE (m>0) ANCI -> ANCE (m>0) IZER -> IZE (m>0) ABLI -> ABLE (m>0) ALLI -> AL (m>0) ENTLI -> ENT (m>0) ELI -> E (m>0) OUSLI -> OUS (m>0) IZATION -> IZE (m>0) ATION -> ATE (m>0) ATOR -> ATE (m>0) ALISM -> AL (m>0) IVENESS -> IVE (m>0) FULNESS -> FUL (m>0) OUSNESS -> OUS (m>0) ALITI -> AL (m>0) IVITI -> IVE (m>0) BILITI -> BLE relational -> relate conditional -> condition rational -> rational valenci -> valence hesitanci -> hesitance digitizer -> digitize conformabli -> conformable radicalli -> radical differentli -> different vileli - > vile analogousli -> analogous vietnamization -> vietnamize predication -> predicate operator -> operate feudalism -> feudal decisiveness -> decisive hopefulness -> hopeful callousness -> callous formaliti -> formal sensitiviti -> sensitive sensibiliti -> sensible

Porter Stemmer *<S> = ends with <S> *v* = contains a V *d =

Porter Stemmer *<S> = ends with <S> *v* = contains a V *d = ends with double C *o = ends with CVC second C is not W, X or Y Step 5: Derivational Morphology II: More Suffixes (m>0) ICATE -> IC (m>0) ATIVE -> (m>0) ALIZE -> AL (m>0) ICITI -> IC (m>0) ICAL -> IC (m>0) FUL -> (m>0) NESS -> triplicate -> triplic formative -> formalize -> formal electriciti -> electrical -> electric hopeful -> hope goodness -> good

Porter Stemmer *<S> = ends with <S> *v* = contains a V *d =

Porter Stemmer *<S> = ends with <S> *v* = contains a V *d = ends with double C *o = ends with CVC second C is not W, X or Y Step 5: Derivational Morphology III: Even More Suffixes (m>1) AL -> (m>1) ANCE -> (m>1) ER -> (m>1) IC -> (m>1) ABLE -> (m>1) IBLE -> (m>1) ANT -> (m>1) EMENT -> (m>1) ENT -> (m>1 and (*S or *T)) ION -> (m>1) OU -> (m>1) ISM -> (m>1) ATE -> (m>1) ITI -> (m>1) OUS -> (m>1) IVE -> (m>1) IZE -> revival -> reviv allowance -> allow inference -> infer airliner -> airlin gyroscopic -> gyroscop adjustable -> adjust defensible -> defens irritant -> irrit replacement -> replac adjustment -> adjust dependent -> depend adoption -> adopt homologou -> homolog communism -> commun activate -> activ angulariti -> angular homologous -> homolog effective -> effect bowdlerize -> bowdler

Porter Stemmer *<S> = ends with <S> *v* = contains a V *d =

Porter Stemmer *<S> = ends with <S> *v* = contains a V *d = ends with double C *o = ends with CVC second C is not W, X or Y Step 7 a: Cleanup (m>1) E (m=1 and not *o) E probate probat rate cease ceas Step 7 b: More Cleanup (m > 1 and *d and *L) single letter controll control roll

Some Insights • The algorithm is careful not to remove a suffix when the

Some Insights • The algorithm is careful not to remove a suffix when the stem is too short, the length of the stem being given by its measure, m • For example, in the following two lists: list A -----RELATE PROBATE CONFLATE PIRATE PRELATE list B -----DERIVATE ACTIVATE DEMONSTRATE NECESSITATE RENOVATE -ATE is removed from the list B words, but not from the list A words. • The fact that no attempt is made to identify prefixes can make the results look rather inconsistent – PRELATE does not lose the -ATE, but ARCHPRELATE becomes ARCHPREL – in practice this does not matter too much, because the presence of the prefix decreases the probability of an erroneous conflation

More Insights • Complex suffixes are removed bit by bit in the different steps

More Insights • Complex suffixes are removed bit by bit in the different steps – GENERALIZATIONS -> GENERALIZATION (Step 1) -> GENERALIZE (Step 2) > GENERAL (Step 3) -> GENER (Step 4) – OSCILLATORS -> OSCILLATOR (Step 1) -> OSCILLATE (Step 2) -> OSCILL (Step 4) -> OSCIL (Step 5) • In a vocabulary of 10, 000 words, the reduction in size of the stem was distributed among the steps as follows: Suffix stripping of a vocabulary of 10, 000 words ------------------------Number of words reduced in step 1: 3597 " 2: 766 " 3: 327 " 4: 2424 " 5: 1373 Number of words not reduced: 3650 The resulting vocabulary of stems contained 6370 distinct entries (reduced the size of the vocabulary by about one third).

Problems with Stemming Errors by the Porter Stemmer Organization Organ Doing Doe Generalization Generic

Problems with Stemming Errors by the Porter Stemmer Organization Organ Doing Doe Generalization Generic Numerical Numerous Policy Police University Universe Easy Easily Addition Additive Negligible Negligent Execute Executive Definite Past paste

Morphology: From Morphemes to Lemmas • Lemma : lexical unit, “pointer” to lexicon –

Morphology: From Morphemes to Lemmas • Lemma : lexical unit, “pointer” to lexicon – Set of lexical forms having same stem, same major part-of-speech, and same word sense – might as well be a number, but typically is represented as the “base form”, or “dictionary headword” • possibly indexed when ambiguous/polysemous: state 1 (verb), state 2 (state- of -affairs), state 3 (government)

Phones vs Phonemes A phone is … A phoneme is … One of many

Phones vs Phonemes A phone is … A phoneme is … One of many possible sounds in the languages of the world. A contrastive unit in the sound system of a particular language. The smallest identifiable unit found in a stream of speech. A minimal unit that serves to distinguish between meanings of words. Pronounced in a defined way. Pronounced in one or more ways, depending on the number of allophones. Represented between brackets by convention. Example: [b], [j], [o] Represented between slashes by convention. Example: /b/, /j/, /o/

Lexeme, Morpheme, Phoneme Syntax Lexeme/Inflected Lexeme Grammars sentences Morphology Morpheme/Allomorph Morphotactics words Phonology Phoneme/Allophone

Lexeme, Morpheme, Phoneme Syntax Lexeme/Inflected Lexeme Grammars sentences Morphology Morpheme/Allomorph Morphotactics words Phonology Phoneme/Allophone Phonotactics letters Lexeme: individual entry in lexicon Allomorph is a variant form of a morpheme. The meaning remains the same, while the sound can vary. (-ed in fished is [t], in buzzed is [d]) Allophone is one of several similar phones or speech sounds, that belong to the same phoneme. Each allophone is used in a specific phonetic context.

Summary • Morphology • Stemming – Porter’s Stemmer; see the link to the code

Summary • Morphology • Stemming – Porter’s Stemmer; see the link to the code on the class web page

Next Time • Part of Speech Tagging

Next Time • Part of Speech Tagging