CS 4705 Corpus Linguistics and Machine Learning Techniques


Review
• What do we know about so far?
  – Words (stems and affixes, roots and templates, …)
  – N-grams (simple word sequences)
  – POS (e.g. nouns, verbs, adjectives, determiners, articles, …)

Some Additional Things We Could Find
• Named Entities
  – Persons
  – Company Names
  – Locations
  – Dates

What useful things can we do with this knowledge?
• Find sentence boundaries, abbreviations
• Find Named Entities (person names, company names, telephone numbers, addresses, …)
• Find topic boundaries and classify articles into topics
• Identify a document’s author and their opinion on the topic, pro or con
• Answer simple questions (factoids)
• Do simple summarization/compression

But first, we need corpora…
• Online collections of text and speech
• Some examples
  – Brown Corpus
  – Wall Street Journal and AP News
  – ATIS, Broadcast News
  – TDTN
  – Switchboard, CallHome
  – TRAINS, FM Radio, BDC Corpus
  – Hansards parallel corpus of French and English
  – And many private research collections

Next, we pose a question… the dependent variable
• Binary questions:
  – Is this word followed by a sentence boundary or not?
  – A topic boundary?
  – Does this word begin a person name? End one?
  – Should this word or sentence be included in a summary?
• Classification:
  – Is this document about medical issues? Politics? Religion? Sports? …
• Predicting continuous variables:
  – How loud or high should this utterance be produced?

Finding a suitable corpus and preparing it for analysis
• Which corpora can answer my question?
  – Do I need to get them labeled to do so?
• Dividing the corpus into training and test corpora
  – To develop a model, we need a training corpus
    • Overly narrow corpus: doesn’t generalize
    • Overly general corpus: doesn’t reflect task or domain
  – To demonstrate how general our model is, we need a test corpus to evaluate the model
    • Development test set vs. held-out test set
  – To evaluate our model we must choose an evaluation metric
    • Accuracy
    • Precision, recall, F-measure, …
    • Cross-validation
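
The evaluation metrics and cross-validation named above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the function names `precision_recall_f` and `k_folds`, the balanced (F1) form of the F-measure, and the toy labels are all assumptions made for the example.

```python
def precision_recall_f(gold, predicted):
    """Precision, recall and balanced F-measure for the positive (True) class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # correctly predicted positives
    fp = sum(1 for g, p in pairs if not g and p)    # false alarms
    fn = sum(1 for g, p in pairs if g and not p)    # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f


def k_folds(items, k):
    """Yield (train, test) splits; fold i tests on items whose index % k == i."""
    for i in range(k):
        test = [x for j, x in enumerate(items) if j % k == i]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test


# Hypothetical gold labels and predictions for "does a sentence end here?"
gold = [True, False, True, True, False, False]
pred = [True, True, False, True, False, False]
precision, recall, f = precision_recall_f(gold, pred)
folds = list(k_folds(list(range(6)), 3))
```

With these toy labels there are two true positives, one false alarm and one miss, so precision and recall are both 2/3. In cross-validation each item is tested exactly once, and the metric is averaged over the k folds.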

Then we build the model…
• Identify the dependent variable: what do we want to predict or classify?
  – Does this word begin a person name? Is this word within a person name?
  – Is this document about sports? The weather? International news?
• Identify the independent variables: what features might help to predict the dependent variable?
  – What is this word’s POS? What is the POS of the word before it? After it?
  – Is this word capitalized? Is it followed by a ‘.’?
  – Does ‘hockey’ appear in this document?
  – How far is this word from the beginning of its sentence?
• Extract the values of each variable from the corpus by some automatic means

A Sample Feature Vector for Sentence-Ending Detection

WordID    POS   Cap?  ,After?  Dist/Sbeg  End?
Clinton   N     y     n        1          n
won       V     n     n        2          n
easily    Adv   n     y        3          n
but       Conj  n     n        4          n
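
One way such feature vectors might be extracted automatically is sketched below. It assumes the corpus is already tokenized, POS-tagged and hand-labeled for sentence ends; the input layout `(token, pos, ends_sentence)`, the tag `Punct` and the function name `extract_features` are hypothetical choices for this example.

```python
PUNCTUATION = {",", ".", "?", "!"}


def extract_features(tagged_tokens):
    """Build one feature row per word from (token, pos, ends_sentence) triples.

    Punctuation tokens are not given rows of their own; they are only used
    to fill in the ",After?" feature of the preceding word.
    """
    rows = []
    dist = 0  # distance (in words) from the beginning of the sentence
    for i, (token, pos, ends_sentence) in enumerate(tagged_tokens):
        if token in PUNCTUATION:
            continue
        dist += 1
        next_token = tagged_tokens[i + 1][0] if i + 1 < len(tagged_tokens) else None
        rows.append({
            "word": token,
            "pos": pos,
            "cap": token[0].isupper(),
            "comma_after": next_token == ",",
            "dist_sbeg": dist,
            "end": ends_sentence,      # the dependent variable, from annotation
        })
        if ends_sentence:
            dist = 0                   # the next word starts a new sentence
    return rows


# The words from the sample table above, with a comma after "easily".
tokens = [
    ("Clinton", "N", False),
    ("won", "V", False),
    ("easily", "Adv", False),
    (",", "Punct", False),
    ("but", "Conj", False),
]
rows = extract_features(tokens)
```

Each dictionary in `rows` corresponds to one line of the table: `cap` is Cap?, `comma_after` is ,After?, `dist_sbeg` is Dist/Sbeg and `end` is End?.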

An Example: Finding Caller Names in Voicemail
SCANMail
• Motivated by interviews, surveys and usage logs of heavy users:
  – Hard to scan new msgs to find those you need to deal with quickly
  – Hard to find msg you want in archive
  – Hard to locate information you want in any msg
• How could we help?

SCANMail Architecture
[Figure: architecture diagram; surviving labels: Caller, SCANMail, Subscriber]

Corpus Collection
• Recordings collected from 138 AT&T Labs employees’ mailboxes
• 100 hours; 10K msgs; 2500 speakers
• Gender balanced; 12% non-native speakers
• Mean message duration 36.4 secs, median 30.0 secs
• Hand-transcribed and annotated with caller id, gender, age, entity demarcation (names, dates, telnos)
• Also recognized using ASR engine

Transcription and Bracketing
[ Greeting: hi R ] [ CallerID: it's me ] give me a call [ um ] right away cos there's [.hn] I guess there's some [.hn] change [ Date: tomorrow ] with the nursery school and they [ um ] [.hn] anyway they had this idea [ cos ] since I think J's the only one staying [ Date: tomorrow ] for play club so they wanted to they suggested that [.hn] well J2 actually offered to take J home with her and then would she

would meet you back at the synagogue at [ Time: five thirty ] to pick her up [.hn] [ uh ] so I don't know how you feel about that otherwise M_ and one other teacher would stay and take care of her till [ Date: five thirty tomorrow ] but if you [.hn] I wanted to know how you feel before I tell her one way or the other so call me [.hn] right away cos I have to get back to her in about an hour so [.hn] okay [ Closing: bye ] [.nhn] [.onhk]
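
Labeled brackets of this kind can be pulled out of a transcript with a regular expression. The sketch below is only an illustration of how the annotation might be read back in; the exact bracket syntax of the SCANMail transcripts is assumed from the excerpt above, and the name `labeled_spans` is made up for this example.

```python
import re

# "[ Label: text ]" where the text contains no nested brackets.
# Unlabeled brackets such as "[ um ]" or "[.hn]" have no "Label:" part and are skipped.
LABELED = re.compile(r"\[\s*(\w+):\s*([^\[\]]+?)\s*\]")


def labeled_spans(transcript):
    """Return (label, text) pairs for every labeled bracket in the transcript."""
    return LABELED.findall(transcript)


excerpt = (
    "would meet you back at the synagogue at [ Time: five thirty ] to pick "
    "her up [.hn] [ uh ] so I don't know how you feel about that otherwise "
    "M_ and one other teacher would stay till [ Date: five thirty tomorrow ]"
)
spans = labeled_spans(excerpt)
```

Filled pauses and non-speech markers are left out deliberately here; a fuller reader would probably keep them, since they are part of the transcription.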

SCANMail Demo
http://www.avatarweb.com/scanmail/
Audix extension: demo
Audix password: (null)

Information Extraction (Martin Jansche and Steve Abney)
• Goals: extract key information from msgs to present in headers
• Approach:
  – Supervised learning from transcripts (phone #’s, caller self-ids)
  – Combine machine learning techniques with simpler alternatives, e.g. hand-crafted rules
  – Two-stage approaches

  – Features exploit structure of key elements (e.g. length of phone numbers) and of surrounding context (e.g. self-ids tend to occur at beginning of msg)

Telephone Number Identification
• Rules convert all numbers to standard digit format
• Predict start of phone number with rules
  – This step over-generates
  – Prune with decision-tree classifier
• Best features:
  – Position in msg
  – Lexical cues
  – Length of digit string
• Performance:
  – .94 F on human-labeled transcripts
  – .95 F on ASR
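
The two-stage idea (rules that over-generate, then pruning) might look roughly like the sketch below. It is not the Jansche and Abney system: the word-to-digit table, the function names, and above all the length filter in `prune` are assumptions. The filter is a hand-written stand-in for the decision-tree classifier, which in the slides also uses position in the message and lexical cues.

```python
WORD_TO_DIGIT = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def normalize(words):
    """Rule step: rewrite spoken digits in a standard digit format."""
    return [WORD_TO_DIGIT.get(w.lower(), w) for w in words]


def candidate_numbers(words):
    """Over-generating step: every maximal run of digit tokens is a candidate.

    Returns (start_index, digit_string) pairs.
    """
    tokens = normalize(words)
    candidates = []
    i = 0
    while i < len(tokens):
        if tokens[i].isdigit():
            j = i
            while j < len(tokens) and tokens[j].isdigit():
                j += 1
            candidates.append((i, "".join(tokens[i:j])))
            i = j
        else:
            i += 1
    return candidates


def prune(candidates, valid_lengths=(7, 10)):
    """Stand-in for the classifier: keep candidates with a plausible length.

    The lengths 7 and 10 are an assumption (US-style local and full numbers).
    """
    return [(start, digits) for start, digits in candidates if len(digits) in valid_lengths]


msg = "call me at five five five one two three four in about one hour".split()
candidates = candidate_numbers(msg)
phone_numbers = prune(candidates)
```

Here the rules propose two candidates, the seven-digit string and the lone "1" from "one hour"; only the first survives pruning. A trained classifier would make that decision from labeled examples instead of a fixed length list.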

Caller Self-Identifications
• Predict start of id with classifier
  – 97% of id’s begin 1-7 words into msg
• Then predict length of phrase
  – Majority are only 2-4 words long
• Avoid risk of relying on correct speech recognition for names
• Best cues to end of phrase are a few common words
  – ‘I’, ‘could’, ‘please’
  – No actual names: they over-fit the data
• Performance
  – .71 F on human-labeled
  – .70 F on ASR
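
The second stage, predicting where the self-identification phrase ends, can be illustrated with a simple rule built from the cue words above. This is a hedged sketch only: the real system learns the phrase length with a classifier, the start position is taken here as given, and the name `self_id_span` and the 4-word cap are assumptions drawn from the figures on the slide.

```python
END_CUES = {"i", "could", "please"}  # the common end-of-phrase cue words from the slide


def self_id_span(words, start, max_len=4):
    """Return (start, end) of the self-id phrase, with end exclusive.

    The phrase is extended word by word from the predicted start and stops
    at the first cue word or once it is max_len words long.
    """
    end = start
    while (
        end < len(words)
        and end - start < max_len
        and words[end].lower() not in END_CUES
    ):
        end += 1
    return start, end


# Hypothetical message; the start index 1 is assumed to come from the first-stage classifier.
msg = "hi it's Bob please call me back".split()
start, end = self_id_span(msg, 1)
phrase = " ".join(msg[start:end])
```

Note that the rule never looks at the name itself, only at the words around it, which matches the slide's point about not depending on correct recognition of names.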

Introduction to Weka