Probabilistic Language Processing (Chapter 23)
Probabilistic Language Models • Goal -- define a probability distribution over a set of strings • Unigram, bigram, n-gram • Count using a corpus, but need smoothing: – add-one – linear interpolation • Evaluate with the perplexity measure • E.g., segment words without spaces w/ Viterbi
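The add-one smoothing above can be sketched for a bigram model; the corpus here is a hypothetical toy example, not from the text.

```python
# Bigram model with add-one (Laplace) smoothing -- toy corpus.
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
vocab = set(corpus)
V = len(vocab)

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w2, w1):
    """P(w2 | w1) with add-one smoothing over the vocabulary."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

# An unseen bigram still gets nonzero probability:
print(p_bigram("mat", "ate") > 0)  # True
```

Note that for any context w1 the smoothed probabilities still sum to 1 over the vocabulary, which is the point of adding V to the denominator.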
PCFGs • Rewrite rules have probabilities. • Prob of a string is sum of probs of its parse trees. • Context-freedom means no lexical constraints. • Prefers short sentences.
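The tree probability in the slide above (product of rule probabilities; string probability is the sum over its trees) can be sketched with a toy PCFG whose rules and probabilities are hypothetical:

```python
# Probability of a parse tree under a toy PCFG: the product of the
# probabilities of the rewrite rules used in the tree.
rules = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("John",)): 0.5,
    ("NP", ("Mary",)): 0.5,
    ("VP", ("sleeps",)): 1.0,
}

def tree_prob(tree):
    """tree = (label, children); a child is a subtree tuple or a word."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    p = rules[(label, rhs)]
    for c in children:
        if isinstance(c, tuple):
            p *= tree_prob(c)
    return p

t = ("S", [("NP", ["John"]), ("VP", ["sleeps"])])
print(tree_prob(t))  # 0.5
```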
Learning PCFGs • Parsed corpus -- count trees. • Unparsed corpus – Rule structure known -- use EM (inside-outside algorithm) – Rules unknown -- Chomsky normal form… problems.
Information Retrieval • Goal: Google. Find docs relevant to the user’s needs. • An IR system has a document collection, a query in some language, a set of results, and a presentation of results. • Ideally, parse docs into a knowledge base… too hard.
IR 2 • Boolean keyword model -- is a doc in or out? • Problems -- a single bit of “relevance”; Boolean combinations are a bit mysterious • How to compute P(R=true | D, Q)? Estimate a language model for each doc; compute the prob of the query given that model. • Can rank documents by P(R|D, Q)/P(~R|D, Q)
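The "estimate a language model for each doc" step can be sketched as query-likelihood ranking with add-one smoothing; the documents and query here are hypothetical toy data.

```python
# Rank documents by log P(query | doc) under each document's
# add-one-smoothed unigram language model -- toy collection.
import math
from collections import Counter

docs = {
    "d1": "probabilistic models of language".split(),
    "d2": "retrieval of relevant documents".split(),
}
vocab = {w for d in docs.values() for w in d}
V = len(vocab)

def query_logprob(query, doc):
    """log P(query | doc's unigram model), add-one smoothed."""
    counts = Counter(doc)
    n = len(doc)
    return sum(math.log((counts[w] + 1) / (n + V)) for w in query)

query = "language models".split()
best = max(docs, key=lambda d: query_logprob(query, docs[d]))
print(best)  # d1
```

Working in log space avoids underflow when queries or documents get long.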
IR 3 • For this, need a model of how queries are related to docs. Bag of words: freq of words in the doc; naïve Bayes. • Good example pp. 842-843.
Evaluating IR • Precision is the proportion of results that are relevant. • Recall is the proportion of relevant docs that appear in the results. • ROC curve (there are several varieties): the standard form plots true positive rate vs. false positive rate (some presentations plot false negatives vs. false positives). • More “practical” for the web: reciprocal rank of the first relevant result, or just “time to answer”
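The precision and recall definitions above, computed on a hypothetical result set:

```python
# Precision and recall for one query -- toy doc IDs.
retrieved = {"d1", "d2", "d3", "d4"}   # what the system returned
relevant = {"d1", "d3", "d5"}          # ground-truth relevant docs

hits = retrieved & relevant
precision = len(hits) / len(retrieved)  # 2/4 = 0.5
recall = len(hits) / len(relevant)      # 2/3

print(precision, round(recall, 2))
```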
IR Refinements • Case • Stems • Synonyms • Spelling correction • Metadata -- keywords
IR Presentation • Give list in order of relevance; deal with duplicates • Cluster results into classes – agglomerative – k-means • How to describe automatically-generated clusters? Word list? Title of centroid doc?
IR Implementation • CSC 172! • Lexicon with “stop list”; “inverted” index: where words occur • Match with vectors: vector of word freqs dotted with query terms.
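The stop list and inverted index above can be sketched as follows; the stop list and documents are hypothetical toy data.

```python
# Inverted index with a stop list, then conjunctive (AND) matching.
from collections import defaultdict

STOP = {"the", "a", "of", "in"}
docs = {
    1: "the theory of parsing",
    2: "parsing in practice",
    3: "the practice of theory",
}

# Map each non-stop word to the set of doc IDs it occurs in.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        if word not in STOP:
            index[word].add(doc_id)

def match(query):
    """Docs containing every non-stop query term."""
    terms = [w for w in query.split() if w not in STOP]
    result = set(docs)
    for t in terms:
        result &= index[t]
    return result

print(sorted(match("theory parsing")))  # [1]
```

A real implementation would store positions (for phrase queries) and frequencies (for the vector dot product), not just doc IDs.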
Information Extraction • Goal: create database entries from docs. • Emphasis on massive data, speed, stylized expressions • Regular-expression grammars OK if the text is stylized enough • Cascaded finite-state transducers: stages of grouping and structure-finding
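Regex extraction from stylized text, as in the slide, can be sketched like this; the pattern and sample sentence are hypothetical.

```python
# Extract price mentions from stylized text with a regular expression.
import re

text = "The XL-500 is priced at $249.99, down from $299.99 last week."
price_re = re.compile(r"\$\d+(?:\.\d{2})?")  # dollars, optional cents
prices = price_re.findall(text)
print(prices)  # ['$249.99', '$299.99']
```

A cascaded system would run several such stages, each grouping the previous stage's output into larger structures.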
Machine Translation Goals • Rough translation (e.g., p. 851) • Restricted domain (mergers, weather) • Pre-edited (Caterpillar or Xerox English) • Literary translation -- not yet! • Interlingua -- a canonical semantic representation like Conceptual Dependency • Basic problem: different languages, different categories
MT in Practice • Transfer -- uses a database of rules for translating small units of language • Memory-based -- memorize sentence pairs • Good diagram p. 853
Statistical MT • Bilingual corpus; find the most likely translation given the corpus. • argmax_F P(F|E) = argmax_F P(E|F) P(F), where P(F) is the language model and P(E|F) is the translation model • Lots of interesting problems: fertility (“home” vs. “à la maison”). • Horribly drastic simplifications and hacks work pretty well!
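The noisy-channel decision rule above, with hand-set toy probabilities (both candidates and numbers are hypothetical, not estimated from any corpus):

```python
# argmax over candidate translations F of P(E|F) * P(F).
candidates = {
    # F: (language model P(F), translation model P(E|F)) -- toy values
    "a la maison": (0.02, 0.30),
    "chez moi":    (0.05, 0.10),
}

best = max(candidates, key=lambda f: candidates[f][0] * candidates[f][1])
print(best)  # a la maison
```

In a real system both models have millions of parameters estimated from the bilingual corpus, and the argmax is a search problem in its own right.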
Learning and MT • Stat. MT needs: language model, fertility model, word-choice model, offset model • Millions of parameters • Counting, estimating, EM