Probabilistic and Lexicalized Parsing CS 4705

Probabilistic CFGs: PCFGs
• Weighted CFGs
  – Attach weights to rules of CFG
  – Compute weights of derivations
  – Use weights to choose preferred parses
• Utility: pruning and ordering the search space, disambiguation, language model for ASR
• Parsing with weighted grammars: among all the possible parses of S, find the parse T’ whose derivation has the maximum weight
• T’(S) = argmax_{T ∈ τ(S)} W(T, S)
• Probabilistic CFGs are one form of weighted CFGs

Rule Probability
• Attach probabilities to grammar rules
• Expansions for a given non-terminal sum to 1
    R1: VP → V          .55
    R2: VP → V NP       .40
    R3: VP → V NP NP    .05
• Estimate probabilities from annotated corpora
  – E.g. Penn Treebank
  – P(R1) = count(R1) / count(VP)
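The relative-frequency estimate above can be sketched in a few lines of Python. This is a minimal sketch, not a treebank reader: the function name `estimate_rule_probs` is illustrative, and the counts are hypothetical values chosen only so that the estimates reproduce the .55 / .40 / .05 example.

```python
from collections import Counter

def estimate_rule_probs(rule_counts):
    """Relative-frequency estimate: P(A -> beta | A) = count(A -> beta) / count(A)."""
    lhs_totals = Counter()
    for (lhs, _rhs), n in rule_counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in rule_counts.items()}

# Hypothetical counts of VP expansions (not real Penn Treebank figures).
counts = {
    ("VP", ("V",)): 55,
    ("VP", ("V", "NP")): 40,
    ("VP", ("V", "NP", "NP")): 5,
}
probs = estimate_rule_probs(counts)
for (lhs, rhs), p in probs.items():
    print(f"{lhs} -> {' '.join(rhs)}: {p:.2f}")
```

Because each count is divided by the total for its own left-hand side, the expansions of every non-terminal sum to 1 by construction.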

Derivation Probability
• For a derivation T = {R1 … Rn}:
  – Probability of the derivation: P(T) = ∏_{i=1..n} P(Ri)
    • Product of probabilities of rules expanded in tree
  – Most probable parse: T’ = argmax_{T ∈ τ(S)} P(T)
  – Probability of a sentence: P(S) = Σ_{T ∈ τ(S)} P(T)
    • Sum over all possible derivations for the sentence
• Note the independence assumption: a rule’s probability does not change based on where in the tree the rule is expanded.
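The three quantities on this slide (derivation probability, most probable parse, sentence probability) can be illustrated as follows. This is a sketch under stated assumptions: a derivation is represented simply as the list of rule names it uses, and the rule names, probabilities, and the two candidate derivations are all hypothetical.

```python
import math

def derivation_prob(rules_used, rule_probs):
    """P(T): product of the probabilities of the rules expanded in T."""
    return math.prod(rule_probs[r] for r in rules_used)

def best_derivation(derivations, rule_probs):
    """Most probable parse: the derivation with the highest probability."""
    return max(derivations, key=lambda t: derivation_prob(t, rule_probs))

def sentence_prob(derivations, rule_probs):
    """P(S): sum over all derivations of the sentence."""
    return sum(derivation_prob(t, rule_probs) for t in derivations)

# Hypothetical rule probabilities and two competing derivations of one sentence.
rule_probs = {"R1": 0.5, "R2": 0.4, "R3": 0.1}
t1 = ["R1", "R2"]          # uses R1 once and R2 once
t2 = ["R1", "R3", "R3"]    # uses R1 once and R3 twice
derivations = [t1, t2]

p_t1 = derivation_prob(t1, rule_probs)
p_t2 = derivation_prob(t2, rule_probs)
best = best_derivation(derivations, rule_probs)
p_sentence = sentence_prob(derivations, rule_probs)
```

The independence assumption shows up in `derivation_prob`: each rule contributes the same factor no matter where in the tree it is applied.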

One Approach: CYK Parser
• Bottom-up parsing via dynamic programming
  – Assign probabilities to constituents as they are completed and placed in a table
  – Use the maximum probability for each constituent type going up the tree to S
• The Intuition:
  – We know probabilities for constituents lower in the tree, so as we construct higher level constituents we don’t need to recompute these

CYK (Cocke-Younger-Kasami) Parser
• Bottom-up parser with top-down filtering
• Uses dynamic programming to store intermediate results (cf. Earley algorithm for top-down case)
• Input: PCFG in Chomsky Normal Form
  – Rules of form A → w or A → B C; no ε
• Chart: array [i, j, A] to hold the probability that non-terminal A spans input positions i to j
  – Start State(s): (i, i+1, A) for each A → w_{i+1}
  – End State: (0, n, S), where n is the input size
  – Next State Rules: (i, k, B), (k, j, C) ⇒ (i, j, A) if A → B C
• Maintain back-pointers to recover the parse
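The chart, start states, next-state rule, and back-pointers described above can be sketched as a small probabilistic CKY routine. This is a minimal sketch, not a reference implementation: positions are 0-based fenceposts (so the whole input is the span from 0 to n), the function name `pcky` and the dictionary encodings of the grammar are illustrative choices, and every probability in the toy grammar is hypothetical.

```python
from collections import defaultdict

def pcky(words, lexical, binary, start="S"):
    """Probabilistic CKY for a grammar in Chomsky Normal Form.

    lexical: {(A, word): prob} for rules A -> word
    binary:  {(A, B, C): prob} for rules A -> B C
    Returns (probability, tree) of the best parse, or (0.0, None) if none exists.
    """
    n = len(words)
    best = defaultdict(float)  # (i, j, A) -> max probability that A spans words[i:j]
    back = {}                  # (i, j, A) -> word (base case) or (k, B, C) split
    # Start states: A -> w_{i+1} fills cell (i, i+1, A).
    for i, w in enumerate(words):
        for (a, word), p in lexical.items():
            if word == w:
                best[i, i + 1, a] = p
                back[i, i + 1, a] = w
    # Next-state rule: (i, k, B) and (k, j, C) give (i, j, A) if A -> B C.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for (a, b, c), p in binary.items():
                    cand = p * best[i, k, b] * best[k, j, c]
                    if cand > best[i, j, a]:  # keep only the max per constituent type
                        best[i, j, a] = cand
                        back[i, j, a] = (k, b, c)

    def build(i, j, a):
        """Follow back-pointers to recover the best tree."""
        entry = back[i, j, a]
        if isinstance(entry, str):
            return (a, entry)
        k, b, c = entry
        return (a, build(i, k, b), build(k, j, c))

    if best[0, n, start] == 0.0:
        return 0.0, None
    return best[0, n, start], build(0, n, start)

# Toy CNF grammar; all probabilities are hypothetical.
lexical = {("NP", "John"): 0.2, ("NP", "Mary"): 0.2, ("NP", "Denver"): 0.2,
           ("V", "called"): 1.0, ("P", "from"): 1.0}
binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3,
          ("NP", "NP", "PP"): 0.4, ("PP", "P", "NP"): 1.0}

prob, tree = pcky("John called Mary from Denver".split(), lexical, binary)
print(prob)
print(tree)
```

With these particular made-up probabilities the NP-attachment analysis wins; changing the weights of `VP -> VP PP` and `NP -> NP PP` flips the preference, which is the only lever a plain PCFG has for this decision.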

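Structural Ambiguity
• Grammar:
    S → NP VP
    VP → V NP
    NP → NP PP
    VP → VP PP
    PP → P NP
    NP → John | Mary | Denver
    V → called
    P → from
• Sentence: John called Mary from Denver
• Parse 1 (PP attached to the VP):
    [S [NP John] [VP [VP [V called] [NP Mary]] [PP [P from] [NP Denver]]]]
• Parse 2 (PP attached to the NP):
    [S [NP John] [VP [V called] [NP [NP Mary] [PP [P from] [NP Denver]]]]]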

Example John called Mary from Denver

Base Case: A → w
[Chart: NP (John), V (called), NP (Mary), P (from), NP (Denver)]

Recursive Cases: A → B C
[Chart animation: cells are filled in order of increasing span length; X marks a span with no constituent]
• Length 2: John called = X; called Mary = VP; Mary from = X; from Denver = PP
• Length 3: John called Mary = S; called Mary from = X; Mary from Denver = NP
• Length 4: John called Mary from = X; called Mary from Denver = VP (two derivations, VP1 and VP2)
• Length 5: John called Mary from Denver = S

Problems with PCFGs
• The probability model is based only on the rules in the derivation.
• Lexical insensitivity:
  – Doesn’t use words in any real way
  – But structural disambiguation is lexically driven
    • PP attachment often depends on the verb, its object, and the preposition
    • I ate pickles with a fork.
    • I ate pickles with relish.
• Context insensitivity of the derivation
  – Doesn’t take into account where in the derivation a rule is used
    • Pronouns are more often subjects than objects
    • She hates Mary.
    • Mary hates her.
• Solution: Lexicalization
  – Add lexical information to each rule
  – I.e., condition the rule probabilities on the actual words

An example: Phrasal Heads
• Phrasal heads can ‘take the place of’ whole phrases, defining the most important characteristics of the phrase
• Phrases are generally identified by their heads
  – Head of an NP is a noun, of a VP is the main verb, of a PP is the preposition
• Each PCFG rule’s LHS shares a lexical item with a non-terminal in its RHS
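Head percolation can be sketched as a recursive lookup. This is a minimal sketch under stated assumptions: trees are nested tuples, the `HEAD_CHILD` table is a hypothetical simplification covering only the three head rules named above (plus S taking its head from the VP, which is an added assumption), and the example phrase is made up for illustration.

```python
# Hypothetical head table: for each phrase label, the child labels that may
# carry the head, in priority order.
HEAD_CHILD = {"S": ["VP"], "VP": ["VP", "V"], "NP": ["NP", "N"], "PP": ["P"]}

def head_word(tree):
    """Return the lexical head of a tree.

    A tree is (tag, word) for a preterminal, or (label, child, child, ...).
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return children[0]  # preterminal: the word itself is the head
    for wanted in HEAD_CHILD.get(label, []):
        for child in children:
            if child[0] == wanted:
                return head_word(child)
    return head_word(children[0])  # fallback assumption: leftmost child

# Illustrative VP: "dumped sacks into a bin".
vp = ("VP",
      ("V", "dumped"),
      ("NP", ("N", "sacks")),
      ("PP", ("P", "into"), ("NP", ("Det", "a"), ("N", "bin"))))

print(head_word(vp))      # head of the VP
print(head_word(vp[2]))   # head of the object NP
print(head_word(vp[3]))   # head of the PP
```

Annotating every node with the result of `head_word` turns a rule such as VP → V NP PP into its lexicalized form VP(dumped) → V(dumped) NP(sacks) PP(into).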

Increase in Size of Rule Set in Lexicalized CFG
• If R is the set of binary-branching rules in the CFG and Σ is the lexicon, the lexicalized rule set has size O(2*|Σ|*|R|)
• For unary rules: O(|Σ|*|R|)

Example (correct parse) Attribute grammar

Example (less preferred)

Computing Lexicalized Rule Probabilities
• We started with rule probabilities as before
  – VP → V NP PP, with P(rule | VP)
    • E.g., count of this rule divided by the number of VPs in a treebank
• Now we want lexicalized probabilities
  – VP(dumped) → V(dumped) NP(sacks) PP(into)
    • i.e., P(rule | VP ∧ dumped is the verb ∧ sacks is the head of the NP ∧ into is the head of the PP)
  – Not likely to have significant counts in any treebank

Exploit the Data You Have
• So, exploit the independence assumption and collect the statistics you can…
• Focus on capturing
  – Verb subcategorization
    • Particular verbs have affinities for particular VPs
  – Objects’ affinity for their predicates
    • Mostly their mothers and grandmothers
    • Some objects fit better with some predicates than others

Verb Subcategorization
• Condition particular VP rules on their heads
  – E.g. for a rule r: VP → V NP PP
    • P(r | VP) becomes P(r ∧ V=dumped | VP ∧ dumped)
  – How do you get the probability?
    • How many times was rule r used with dumped, divided by the number of VPs that dumped appears in, in total
    • How predictive of r is the verb dumped?
  – Captures affinity between VP heads (verbs) and VP rules
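The count ratio described above can be sketched as follows. This is a minimal sketch: the function name `subcat_prob` and the string encoding of rules are illustrative, and the counts are hypothetical, not taken from any treebank.

```python
from collections import Counter

def subcat_prob(rule, head, rule_head_counts):
    """Estimate P(rule | VP, head): uses of the rule with this head verb,
    divided by all VPs headed by this verb."""
    total = sum(n for (_rule, h), n in rule_head_counts.items() if h == head)
    if total == 0:
        return 0.0  # head verb never seen; a real system would smooth or back off
    return rule_head_counts[(rule, head)] / total

# Hypothetical counts of (VP rule, head verb) pairs.
rule_head_counts = Counter({
    ("VP -> V NP PP", "dumped"): 6,
    ("VP -> V NP", "dumped"): 3,
    ("VP -> V", "dumped"): 1,
    ("VP -> V", "slept"): 4,
})

p_dumped = subcat_prob("VP -> V NP PP", "dumped", rule_head_counts)
p_slept = subcat_prob("VP -> V NP PP", "slept", rule_head_counts)
print(p_dumped, p_slept)
```

Conditioning on the head verb alone keeps the counts far denser than conditioning on every head in the rule at once, which is the point of exploiting the independence assumption.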

Example (correct parse)

Example (less preferred)

Affinity of Phrasal Heads for Other Heads: PP Attachment
• Verbs with preps vs. nouns with preps
• E.g. dumped with into vs. sacks with into
  – How often is dumped the head of a VP which includes a PP daughter with into as its head, relative to other PP heads? I.e., what is P(into | PP, dumped is mother VP’s head)?
  – Vs. how often is sacks the head of an NP with a PP daughter whose head is into, relative to other PP heads? I.e., what is P(into | PP, sacks is mother’s head)?
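The two quantities being compared can be written as relative-frequency estimates. This is a sketch under the assumption that C(·) denotes a treebank count and that p ranges over all prepositions heading a PP daughter; the notation is introduced here for illustration only.

```latex
\[
\hat{P}(\text{into} \mid \text{PP},\ \text{mother VP head} = \text{dumped})
  = \frac{C\bigl(\text{VP(dumped) with daughter PP(into)}\bigr)}
         {\sum_{p} C\bigl(\text{VP(dumped) with daughter PP}(p)\bigr)}
\]
\[
\hat{P}(\text{into} \mid \text{PP},\ \text{mother NP head} = \text{sacks})
  = \frac{C\bigl(\text{NP(sacks) with daughter PP(into)}\bigr)}
         {\sum_{p} C\bigl(\text{NP(sacks) with daughter PP}(p)\bigr)}
\]
```

One simple decision rule, assumed here rather than stated on the slide, is to prefer the attachment whose estimate is larger.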

But Other Relationships do Not Involve Heads (Hindle & Rooth ’91)
• Affinity of gusto for ate is greater than for spaghetti; and affinity of marinara for spaghetti is greater than for ate
    [VP(ate) [VP(ate) [V ate] [NP spaghetti]] [PP(with) with gusto]]
    [VP(ate) [V ate] [NP(spag) [NP spaghetti] [PP(with) with marinara]]]

Log-linear models for Parsing
• Why restrict the conditioning to the elements of a rule?
  – Use even larger context… word sequence, word types, sub-tree context etc.
• Compute P(y|x) = exp(Σ_i λ_i f_i(x, y)) / Σ_{y’} exp(Σ_i λ_i f_i(x, y’)), where f_i(x, y) tests properties of the context and λ_i is the weight of feature i
• Use as scores in CKY algorithm to find best parse
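A log-linear score over a small set of candidate analyses can be sketched as below. This is a minimal sketch under stated assumptions: the candidates are just two attachment labels rather than full parse trees, the two feature functions and their weights are hypothetical, and in practice the weights would be learned from data.

```python
import math

def loglinear_probs(x, candidates, features, weights):
    """P(y | x) proportional to exp(sum_i weight_i * f_i(x, y)), normalised over candidates."""
    scores = [sum(w * f(x, y) for f, w in zip(features, weights)) for y in candidates]
    m = max(scores)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return {y: e / z for y, e in zip(candidates, exps)}

# Hypothetical features; x is (verb, object noun, preposition, PP noun).
def f_manner(x, y):
    """Fires when the PP noun names a manner and the PP attaches to the VP."""
    return 1.0 if y == "VP-attach" and x[3] == "gusto" else 0.0

def f_sauce(x, y):
    """Fires when the PP noun names a sauce and the PP attaches to the NP."""
    return 1.0 if y == "NP-attach" and x[3] == "marinara" else 0.0

candidates = ["VP-attach", "NP-attach"]
features = [f_manner, f_sauce]
weights = [1.5, 1.5]  # hypothetical weights

p_gusto = loglinear_probs(("ate", "spaghetti", "with", "gusto"),
                          candidates, features, weights)
p_marinara = loglinear_probs(("ate", "spaghetti", "with", "marinara"),
                             candidates, features, weights)
print(p_gusto)
print(p_marinara)
```

Unlike the rule-based estimates, a feature here may look at any part of the context, including words that are not heads.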

Supertagging: Almost parsing
• Example sentence: Poachers now control the underground trade
[Figure: candidate elementary trees (supertags) for each word of the sentence]

Summary
• Parsing context-free grammars
  – Top-down and Bottom-up parsers
  – Mixed approaches (CKY, Earley parsers)
• Preferences over parses using probabilities
  – Parsing with PCFG and PCKY algorithms
• Enriching the probability model
  – Lexicalization
  – Log-linear models for parsing
  – Super-tagging