CSCE 771 Natural Language Processing Lecture 12 Classifiers

CSCE 771 Natural Language Processing
Lecture 12 Classifiers Part 2

Topics
  • Classifiers
  • Maxent Classifiers
  • Maximum Entropy Markov Models
  • Information Extraction and chunking intro
Readings: Chapters 6, 7.1
February 25, 2013

Overview

Last Time
  • Confusion Matrix
  • Brill Demo
  • NLTK Ch 6 - Text Classification
Today
  • Confusion Matrix
  • Brill Demo
  • NLTK Ch 6 - Text Classification
Readings
  • NLTK Ch 6

Evaluation of classifiers again

From last time:
  • Recall
  • Precision
  • F value
  • Accuracy
(A small worked example follows.)
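These four measures come straight from the counts of true/false positives and negatives. The sketch below just spells out the arithmetic; the counts tp, fp, fn, tn are made up for illustration, not results from the Reuters experiments.

# Illustrative only: compute the four measures from raw counts for one binary classifier.
def evaluation_measures(tp, fp, fn, tn):
    precision = float(tp) / (tp + fp)      # of items labeled positive, how many were right
    recall    = float(tp) / (tp + fn)      # of the true positives, how many were found
    f_value   = 2 * precision * recall / (precision + recall)
    accuracy  = float(tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_value, accuracy

print evaluation_measures(tp=40, fp=10, fn=20, tn=30)
# (0.8, 0.6666..., 0.7272..., 0.7)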

Reuters Data set
  • 21,578 documents
  • 118 categories
  • a document can be in multiple classes
  • 118 binary classifiers

Confusion matrix
  • Cij – documents that are really Ci but are classified as Cj
  • Cii – documents that are really Ci and are correctly classified

Micro averaging vs Macro averaging
  • Macro averaging – average the performance of the individual classifiers (an average of averages)
  • Micro averaging – sum up all correct decisions, false positives, and false negatives across classes, then compute the measures from the pooled counts
(A sketch contrasting the two follows.)
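A small contrast of the two schemes using made-up per-class counts (the class names and numbers are hypothetical, not the Reuters results):

# Macro averaging averages the per-class scores; micro averaging pools the
# tp/fp counts over all classes first, so frequent classes dominate.
classes = {'acq':   {'tp': 90, 'fp': 10, 'fn': 10},
           'crude': {'tp': 5,  'fp': 5,  'fn': 15}}

macro_precision = sum(float(c['tp']) / (c['tp'] + c['fp'])
                      for c in classes.values()) / len(classes)

tp = sum(c['tp'] for c in classes.values())
fp = sum(c['fp'] for c in classes.values())
micro_precision = float(tp) / (tp + fp)

print macro_precision    # 0.7     (each class counts equally)
print micro_precision    # 0.8636  (dominated by the frequent class)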

Training, Development and Test Sets

Code_consecutive_pos_tagger.py revisited to trace history development

def pos_features(sentence, i, history):   # [_consec-pos-tag-features]
    if debug == 1: print "pos_features *****************"
    if debug == 1: print "  sentence=", sentence
    if debug == 1: print "  i=", i
    if debug == 1: print "  history=", history
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]
    if debug == 1: print "pos_features=", features
    return features
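For context, pos_features is meant to be called by a classifier-based tagger that grows the history list one tag at a time. The sketch below follows the style of the NLTK book's ConsecutivePosTagger; treat it as a reminder of the surrounding code rather than the exact version used in class.

class ConsecutivePosTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)          # gold tags build the training history
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            history.append(self.classifier.classify(featureset))   # predicted tags build the history
        return zip(sentence, history)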

Trace of one sentence - SIGINT to interrupt

sentence= ['Rookie', 'southpaw', 'George', 'Stepanovich', 'relieved', 'Hyde', 'at', 'the',
           'start', 'of', 'the', 'ninth', 'and', 'gave', 'up', 'the', "A's", 'fifth', 'tally',
           'on', 'a', 'walk', 'to', 'second', 'baseman', 'Dick', 'Howser', ',', 'a', 'wild',
           'pitch', ',', 'and', 'Frank', "Cipriani's", 'single', 'under', 'Shortstop', 'Jerry',
           "Adair's", 'glove', 'into', 'center', '.']
i= 0
history= []
pos_features= {'suffix(3)': 'kie', 'prev-word': '<START>', 'suffix(2)': 'ie',
               'prev-tag': '<START>', 'suffix(1)': 'e'}

Trace continued

pos_features *******************
  sentence= ['Rookie', ... '.']
  i= 1
  history= ['NN']
pos_features= {'suffix(3)': 'paw', 'prev-word': 'Rookie', 'suffix(2)': 'aw',
               'prev-tag': 'NN', 'suffix(1)': 'w'}

pos_features *******************
  sentence= ['Rookie', 'southpaw', ... '.']
  i= 2
  history= ['NN', 'NN']
pos_features= {'suffix(3)': 'rge', 'prev-word': 'southpaw', 'suffix(2)': 'ge',
               'prev-tag': 'NN', 'suffix(1)': 'e'}

nltk.tag Classes

AffixTagger, BigramTagger, BrillTaggerTrainer, DefaultTagger, FastBrillTaggerTrainer,
HiddenMarkovModelTagger, HiddenMarkovModelTrainer, NgramTagger, RegexpTagger, TaggerI,
TrigramTagger, UnigramTagger

Functions: batch_pos_tag, untag

Module nltk.tag.hmm
Source Code for Module nltk.tag.hmm

import nltk.tag.hmm
nltk.tag.hmm.demo()
nltk.tag.hmm.demo_pos_bw()

HMM demo

import nltk.tag.hmm
nltk.tag.hmm.demo()
nltk.tag.hmm.demo_pos_bw()

Common Suffixes

import nltk
from nltk.corpus import brown

suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist.inc(word[-1:])
    suffix_fdist.inc(word[-2:])
    suffix_fdist.inc(word[-3:])
common_suffixes = suffix_fdist.keys()[:100]
print common_suffixes
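A possible next step, continuing from the snippet above in the style of the NLTK book (not necessarily the code used in class), is to turn the common suffixes into boolean features and train a decision tree POS classifier on them; pos_features_suffix is a hypothetical name chosen to avoid clashing with the earlier pos_features.

def pos_features_suffix(word):
    # one boolean feature per common suffix
    features = {}
    for suffix in common_suffixes:
        features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
    return features

tagged_words = brown.tagged_words(categories='news')
featuresets = [(pos_features_suffix(w), t) for (w, t) in tagged_words]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.DecisionTreeClassifier.train(train_set)
print nltk.classify.accuracy(classifier, test_set)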

rtepair = nltk.corpus.rte.pairs(['rte3_dev.xml'])[33]
extractor = nltk.RTEFeatureExtractor(rtepair)

print extractor.text_words
set(['Russia', 'Organisation', 'Shanghai', ...

print extractor.hyp_words
set(['member', 'SCO', 'China'])

print extractor.overlap('word')
set([])

print extractor.overlap('ne')
set(['SCO', 'China'])

print extractor.hyp_extra('word')
set(['member'])

tagged_sents = list(brown.tagged_sents(categories='news'))
random.shuffle(tagged_sents)
size = int(len(tagged_sents) * 0.1)
train_set, test_set = tagged_sents[size:], tagged_sents[:size]

file_ids = brown.fileids(categories='news')
size = int(len(file_ids) * 0.1)
train_set = brown.tagged_sents(file_ids[size:])
test_set = brown.tagged_sents(file_ids[:size])

train_set = brown.tagged_sents(categories='news')
test_set = brown.tagged_sents(categories='fiction')

classifier = nltk.NaiveBayesClassifier.train(train_set)

Traceback (most recent call last):
  File "C:\Users\mmm\Documents\Courses\771\Python771\ch06\ch06d.py", line 80, in <module>
    classifier = nltk.NaiveBayesClassifier.train(train_set)
  File "C:\Python27\lib\site-packages\nltk\classify\naivebayes.py", line 191, in train
    for featureset, label in labeled_featuresets:
ValueError: too many values to unpack
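A likely reading of this error: nltk.NaiveBayesClassifier.train expects a list of (featureset, label) pairs, while train_set above still holds tagged sentences, so the unpacking inside train() fails. One possible fix is to build per-word featuresets first; word_features below is a hypothetical minimal feature extractor used only for illustration.

import nltk
from nltk.corpus import brown

def word_features(word):
    # minimal illustrative featureset: just the last two characters
    return {'suffix(2)': word[-2:]}

train_set = [(word_features(word), tag)
             for sent in brown.tagged_sents(categories='news')
             for (word, tag) in sent]
classifier = nltk.NaiveBayesClassifier.train(train_set)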

from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)

def tag_list(tagged_sents):
    return [tag for sent in tagged_sents for (word, tag) in sent]

def apply_tagger(tagger, corpus):
    return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus]

gold = tag_list(brown.tagged_sents(categories='editorial'))
test = tag_list(apply_tagger(t2, brown.tagged_sents(categories='editorial')))
cm = nltk.ConfusionMatrix(gold, test)
print cm.pp(sort_by_count=True, show_percents=True, truncate=9)

Entropy

import math

def entropy(labels):
    # H = -sum_i p_i * log2(p_i) over the label distribution
    freqdist = nltk.FreqDist(labels)
    probs = [freqdist.freq(l) for l in nltk.FreqDist(labels)]
    return -sum([p * math.log(p, 2) for p in probs])

print entropy(['male', 'male'])
-0.0
print entropy(['male', 'female', 'male'])
0.811278124459
print entropy(['female', 'male'])
1.0
print entropy(['female', 'male', 'female'])
0.811278124459
print entropy(['female', 'female'])
-0.0

The Rest of NLTK Chapter 06

6.5 Naïve Bayes Classifiers
6.6 Maximum Entropy Classifiers
  • nltk.classify.maxent.BinaryMaxentFeatureEncoding(labels, mapping, unseen_features=False, alwayson_features=False)
6.7 Modeling Linguistic Patterns
6.8 Summary

But no more code?!?

Maximum Entropy Models (again)

Features are elements of evidence that connect observations d with categories c:  f : C x D -> R

Example feature:  f(c, d) = { c = LOCATION  &  w-1 = IN  &  isCapitalized(w) }

An "input-feature" is a property of an unlabeled token.
A "joint-feature" is a property of a labeled token.
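Read as an indicator function, the example feature fires only when the candidate class, the previous word, and the capitalization of the current word all line up. The sketch below is one interpretation (it treats w-1 as the previous word "in"; the dictionary keys are made up for illustration).

def f_location(c, d):
    # joint-feature: class LOCATION, previous word "in", current word capitalized
    return 1 if (c == 'LOCATION'
                 and d['prev_word'].lower() == 'in'
                 and d['word'][0].isupper()) else 0

print f_location('LOCATION', {'word': 'Arcadia', 'prev_word': 'in'})   # 1
print f_location('PERSON',   {'word': 'Arcadia', 'prev_word': 'in'})   # 0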

Feature-Based Linear Classifiers

p(c | d, lambda) = exp( sum_i lambda_i * f_i(c, d) ) / sum_{c'} exp( sum_i lambda_i * f_i(c', d) )
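Numerically, the model just exponentiates each class's weighted feature sum and normalizes. A tiny sketch, where the class names and scores are hypothetical:

import math

def maxent_probs(scores):
    # scores[c] = sum_i lambda_i * f_i(c, d); exponentiate and normalize
    exps = dict((c, math.exp(s)) for c, s in scores.items())
    z = sum(exps.values())
    return dict((c, e / z) for c, e in exps.items())

print maxent_probs({'LOCATION': 1.8, 'PERSON': 0.3, 'DRUG': -0.6})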

Maxent Model revisited

Maximum Entropy Markov Models (MEMM)

Repeatedly apply a Maxent classifier along a sequence, using the previous decisions (e.g., the prev-tag feature) as part of the evidence for each new decision; a sketch follows.
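A minimal sketch of that idea, reusing the pos_features function and a trained classifier from the earlier slides (both assumed, not redefined here): tag left to right and feed each prediction back in as the next prev-tag feature. A full MEMM would normally decode with Viterbi over the per-position distributions rather than greedily.

def memm_greedy_tag(classifier, sentence):
    history = []
    for i in range(len(sentence)):
        featureset = pos_features(sentence, i, history)
        history.append(classifier.classify(featureset))   # greedy: commit to the best tag
    return zip(sentence, history)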

Named Entity Recognition (NER)

entity –
1. a : being, existence; especially : independent, separate, or self-contained existence
   b : the existence of a thing as contrasted with its attributes
2. : something that has separate and distinct existence and objective or conceptual reality
3. : an organization (as a business or governmental unit) that has an identity separate from those of its members

A named entity is one of these with a name.

http://nlp.stanford.edu/software/CRF-NER.shtml

Classes of Named Entities

Person (PERS), Location (LOC), Organization (ORG), DATE

Example: Jim bought 300 shares of Acme Corp. in 2006.

And producing an annotated block of text, such as this one:

<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of
<ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>.

http://nlp.stanford.edu/software/CRF-NER.shtml

IOB tagging

B – beginning of a chunk, e.g., B-LOC
I – inside a chunk
O – outside any chunk

Example:
text = '''
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
, , O
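NLTK can also move between chunk trees and these (word, pos, iob) triples; a small sketch using tree2conlltags from nltk.chunk.util, assuming the conll2000 corpus is installed:

from nltk.chunk.util import tree2conlltags
from nltk.corpus import conll2000

tree = conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99]
for triple in tree2conlltags(tree)[:5]:
    print triple          # (word, pos-tag, iob-tag)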

Chunking - partial parsing

NLTK ch07.py

def ie_preprocess(document):
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),   # [_chunkex-sent]
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
            ("the", "DT"), ("cat", "NN")]

grammar = "NP: {<DT>?<JJ>*<NN>}"   # [_chunkex-grammar]
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print result

(S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))

(S (NP money/NN market/NN) fund/NN)

chunkex-draw

grammar = "NP: {<DT>?<JJ>*<NN>}"   # [_chunkex-grammar]
cp = nltk.RegexpParser(grammar)    # [_chunkex-cp]
result = cp.parse(sentence)        # [_chunkex-test]
print result                       # [_chunkex-print]
result.draw()

Chunk two consecutive nouns

nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
grammar = "NP: {<NN><NN>}  # Chunk two consecutive nouns"
cp = nltk.RegexpParser(grammar)
print cp.parse(nouns)

(S (NP money/NN market/NN) fund/NN)

cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')
brown = nltk.corpus.brown
for sent in brown.tagged_sents():
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.node == 'CHUNK': print subtree

(CHUNK combined/VBN to/TO achieve/VB)
...
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
...

nltk.chunk.accuracy example

from nltk.corpus import conll2000
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print nltk.chunk.accuracy(cp, test_sents)
0.41745994892

First attempt ?!?

from nltk.corpus import conll2000
cp = nltk.RegexpParser("")
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print cp.evaluate(test_sents)

ChunkParse score:
    IOB Accuracy:  43.4%
    Precision:      0.0%
    Recall:         0.0%
    F-Measure:      0.0%

from nltk.corpus import conll2000
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print nltk.chunk.accuracy(cp, test_sents)
0.41745994892

Carlyle NNP B-NP
Group NNP I-NP
, , O

Chunking using conll2000

text = '''
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
...
. . O
'''

nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()

from nltk.corpus import conll2000
print conll2000.chunked_sents('train.txt')[99]

(S
  (PP Over/IN)
  (NP a/DT cup/NN)
  (PP of/IN)
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  (VP told/VBD)
  (NP his/PRP$ story/NN)
  ./.)

A Real Attempt

grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)
print cp.evaluate(test_sents)

ChunkParse score:
    IOB Accuracy:  87.7%
    Precision:     70.6%
    Recall:        67.8%
    F-Measure:     69.2%
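The NLTK book's next refinement is a unigram chunker that learns, from conll2000, the most likely IOB tag for each POS tag. The sketch below follows that Chapter 7 code as a point of comparison with the grammar above; it assumes tree2conlltags and conlltags2tree from nltk.chunk and the test_sents defined on the earlier slides.

class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # learn a mapping POS tag -> most likely IOB chunk tag
        train_data = [[(pos, iob) for (word, pos, iob) in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag)
                     for ((word, pos), chunktag) in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
unigram_chunker = UnigramChunker(train_sents)
print unigram_chunker.evaluate(test_sents)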

Information extraction

A step towards understanding:
  • find the named entities
  • figure out what is being said about them (in practice, just the relations among the named entities)

http://en.wikipedia.org/wiki/Information_extraction

Outline of natural language processing

1 What is NLP?
2 Prerequisite technologies
3 Subfields of NLP
4 Related fields
5 Processes of NLP: Applications, Components
6 History of NLP
  6.1 Timeline of NLP software
7 General NLP concepts
8 NLP software
  8.1 Chatterbots
  8.2 NLP toolkits
  8.3 Translation software
9 NLP organizations
10 NLP publications: Books, Journals
11 Persons
12 See also
13 References
14 External links

http://en.wikipedia.org/wiki/Outline_of_natural_language_processing

Persons influential in NLP

Alan Turing – originator of the Turing Test.
Noam Chomsky – author of the seminal work Syntactic Structures, which revolutionized linguistics with 'universal grammar', a rule-based system of syntactic structures. [15]
Daniel Bobrow
Terry Winograd
Kenneth Colby
Rollo Carpenter
David Ferrucci – principal investigator of the team that created Watson, IBM's AI computer that won the quiz show Jeopardy!
William Aaron Woods
Joseph Weizenbaum – author of the ELIZA chatterbot.
Roger Schank – introduced the conceptual dependency theory for natural language understanding. [16]