Shuffling Non-Constituents: Jason Eisner with David A. Smith
- Slides: 87
Shuffling Non-Constituents
Jason Eisner
with David A. Smith (syntactically-flavored reordering model)
and Roy Tromble (syntactically-flavored reordering search methods)
ACL SSST Workshop, June 2008
Starting point: Synchronous alignment
- Synchronous grammars are very pretty.
- But does parallel text actually have parallel structure?
  - Depends on what kind of parallel text
    - Free translations? Noisy translations?
    - Were the parsers trained on parallel annotation schemes?
  - Depends on what kind of parallel structure
    - What kinds of divergences can your synchronous grammar formalism capture?
    - E.g., wh-movement versus wh in situ
Eisner, D. A. Smith, Tromble - SSST Workshop - June 2008
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English.
- French: “beaucoup d’enfants donnent un baiser à Sam”
- English: “kids kiss Sam quite often”
- French glosses: donnent (“give”), à (“to”), baiser (“kiss”), un (“a”), beaucoup (“lots”), d’ (“of”), enfants (“kids”)
[Figure: the two dependency trees. One build shows a possible alignment in orange, pairing tree fragments at Start, NP, and Adv nodes and aligning some fragments to null. Another build shows a much worse alignment.]
Grammar = Set of Elementary Trees
[Figure: the aligned fragments from the previous example, shown as pairs of elementary trees rooted at Start, NP, and Adv. Some fragments (e.g., “often”, “quite”) are paired with null.]
But many examples are harder
German, with word-by-word glosses:
  Auf diese Frage    habe ich leider keine Antwort bekommen
  To  this  question have I   alas   no    answer  received
English: I did not unfortunately receive an answer to this question
[Figure: dependency trees over both sentences, with a NULL node on the German side.]
Divergences highlighted in successive builds:
- Displaced modifier (negation)
- Displaced argument (here, because projective parser)
- Head-swapping (here, different annotation conventions)
Free Translation
German, with word-by-word glosses:
  Tschernobyl könnte dann etwas     später an die Reihe kommen
  Chernobyl   could  then something later  on the queue come
English: Then we could deal with Chernobyl some time later
[Figure: dependency trees over both sentences, with a NULL node on the German side.]
Divergences highlighted in successive builds:
- Probably not systematic (but words are correctly aligned)
- Erroneous parse
What to do?
- Current practice:
  - Don’t try to model all systematic phenomena!
  - Just use non-syntactic alignments (Giza++).
  - Only care about the fragments that recur often
    - Phrases or gappy phrases
    - Sometimes even syntactic constituents (can favor these, e.g., Marton & Resnik 2008)
  - Use these (gappy) phrases in a decoder
    - Phrase-based or hierarchical
What to do?
- Current practice:
  - Use non-syntactic alignments (Giza++)
  - Keep frequent phrases for a decoder
- But could syntax give us better alignments?
  - Would have to be “loose” syntax …
- Why do we want better alignments?
  1. Throw away less of the parallel training data
  2. Help learn a smarter, syntactic, reordering model
     - Could help decoding: less reliance on LM
  3. Some applications care about full alignments
Quasi-synchronous grammar
- How do we handle “loose” syntax?
- Translation story:
  - Generate target English by a monolingual grammar
    - Any grammar formalism is okay
    - Pick a dependency grammar formalism for now
- Example target: “I did not unfortunately receive an answer to this question”
  - P(PRP | no previous left children of “did”)
  - P(I | did, PRP)
- Parsing: O(n^3)
Quasi-synchronous grammar
- How do we handle “loose” syntax?
- Translation story:
  - Generate target English by a monolingual grammar
  - But probabilities are influenced by source sentence
    - Each English node is aligned to some source node
    - Prefers to generate children aligned to nearby source nodes
- Example target: “I did not unfortunately receive an answer to this question”
- Parsing: O(n^3)
QCFG Generative Story
- Observed source: Auf diese Frage habe ich leider keine Antwort bekommen (plus NULL)
- Target: I did not unfortunately receive an answer to this question
- Each target node is aligned to a source node, which conditions its probability:
  - P(PRP | no previous left children of “did”, habe)
  - P(I | did, PRP, ich)
- Alignment configurations shown: P(parent-child), P(breakage)
- Aligned parsing: O(m^2 n^3)
What’s a “nearby node”?
- Given parent’s alignment, where might child be aligned?
[Figure: the candidate configurations, including the synchronous grammar case, plus “none of the above”.]
Quasi-synchronous grammar
- How do we handle “loose” syntax?
- Translation story:
  - Generate target English by a monolingual grammar
  - But probabilities are influenced by source sentence
- Useful analogies:
  1. Generative grammar with latent word senses
  2. MEMM
     - Generate n-gram tag sequence (target), but probabilities are influenced by word sequence (source)
  3. IBM Model 1
     - Source nodes can be freely reused or unused
     - Future work: Enforce 1-to-1 to allow good decoding (NP-hard to do exactly)
Some results: Quasi-synch. Dep. Grammar
- Alignment (D. Smith & Eisner 2006)
  - Quasi-synchronous much better than synchronous
  - Maybe also better than IBM Model 4
- Question answering (Wang et al. 2007)
  - Align question w/ potential answer
  - Mean average precision: 43% (previous state of the art) → 48% (+ QG) → 60% (+ lexical features)
- Bootstrapping a parser for a new language (D. Smith & Eisner 2007 & ongoing)
  - Learn how parsed parallel text influences target dependencies
    - Along with many other features! (cf. co-training)
  - Unsupervised: German 30% → 69%, Spanish 26% → 65%
Summary of part I
- Current practice:
  - Use non-syntactic alignments (Giza++)
  - Some bits align nicely
  - Use the frequent bits in a decoder
- Suggestion: Let syntax influence alignments.
- So far, loose syntax methods are like IBM Model 1.
  - NP-hard to enforce 1-to-1 in any interesting model.
- Rest of talk:
  - How to enforce 1-to-1 in interesting models?
  - Can we do something smarter than beam search?
Motivation
- MT is really easy! Just use a finite-state transducer! Phrases, morphology, the works!
Permutation search in MT
- Initial order (French), positions 1 to 6: Marie/NNP ne/NEG m’/PRP a/AUX pas/NEG vu/VBN
- Best order (French′): 1 4 2 5 6 3, i.e., Marie a ne pas vu m’
- Easy transduction of French′ into English: Mary hasn’t seen me
Motivation
- MT is really easy! Just use a finite-state transducer! Phrases, morphology, the works!
- Just have to fix that pesky word order.
- Framing it this way lets us enforce 1-to-1 exactly at the permutation step.
- Deletion and fertility > 1 are still allowed in the subsequent transduction.
Often want to find an optimal permutation …
- Machine translation: Reorder French to French-prime (Brown et al. 1992), so it’s easier to align or translate
- MT eval: How much do you need to rearrange MT output so it scores well under an LM derived from ref translations?
- Discourse generation, e.g., multi-doc summarization: Order the output sentences (Lapata 2003), so they flow nicely
- Reconstruct temporal order of events after info extraction
- Learn rule ordering or constraint ranking for phonology?
- Multi-word anagrams that score well under an LM
Permutation search: The problem
- Initial order: 1 2 3 4 5 6
- Best order according to some cost function: 1 4 2 5 6 3
- How can we find this needle in the haystack of N! possible permutations?
Traditional approach: Beam search
- Approx. best path through a really big FSA
- N! paths: one for each permutation
- Only 2^N states: a state remembers what we’ve generated so far (but not in what order)
- Arc weight = cost of picking 5 next if we’ve seen {1, 2, 4} so far
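The search described on this slide can be sketched as follows. This is a minimal illustration, not the talk's implementation: the cost matrix `B` is made up, and the arc weight is assumed to be a pairwise ordering cost so that it depends only on the set of items seen so far.

```python
from itertools import permutations

# Hypothetical pairwise costs: B[i][j] is the cost of item i appearing
# anywhere before item j (an assumption made for this example).
B = [
    [0, 3, -2, 5],
    [1, 0, 4, -1],
    [6, -3, 0, 2],
    [-4, 2, 1, 0],
]

def order_cost(order, B):
    """Sum of B[i][j] over all pairs where i precedes j in `order`."""
    return sum(B[order[a]][order[b]]
               for a in range(len(order))
               for b in range(a + 1, len(order)))

def beam_search(B, beam_width):
    n = len(B)
    # State = set of items generated so far (not their order).
    beam = {frozenset(): (0, [])}
    for _ in range(n):
        candidates = {}
        for seen, (cost, prefix) in beam.items():
            for j in range(n):
                if j in seen:
                    continue
                # Arc weight: cost of picking j next, given the set seen so far.
                arc = sum(B[j][k] for k in range(n) if k != j and k not in seen)
                state = seen | {j}
                if state not in candidates or cost + arc < candidates[state][0]:
                    candidates[state] = (cost + arc, prefix + [j])
        # Prune: keep only the cheapest `beam_width` states at this depth.
        kept = sorted(candidates.items(), key=lambda kv: kv[1][0])[:beam_width]
        beam = dict(kept)
    (cost, order), = beam.values()
    return order, cost

exact = min(order_cost(p, B) for p in permutations(range(len(B))))
wide_order, wide_cost = beam_search(B, beam_width=16)   # wide enough to be exact here
narrow_order, narrow_cost = beam_search(B, beam_width=1)  # fully greedy
```

With a beam at least as wide as the number of states per depth, nothing is pruned and the search is exact. A narrow beam is fast but may miss the optimum.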
An alternative: Local search (“hill climbing”)
The SWAP neighborhood of 1 2 3 4 5 6 (cost=22), neighbors shown:
- 2 1 3 4 5 6 (cost=26)
- 1 3 2 4 5 6 (cost=20)
- 1 2 4 3 5 6 (cost=19)
- 1 2 3 5 4 6 (cost=25)
Move to the best neighbor: 1 2 4 3 5 6 (cost=19)
An alternative: Local search (“hill climbing”)
The SWAP neighborhood. Like the “greedy decoder” of Germann et al. 2001.
- Starting from 1 2 3 4 5 6: cost=22, then cost=19, cost=17, cost=16, ...
- Why are the costs always going down? We pick the best swap.
- How long does it take to pick best swap? O(N) if you’re careful
- How many swaps might you need to reach answer? O(N^2)
- What if you get stuck in a local min? Random restarts
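A minimal sketch of SWAP hill climbing, assuming a pairwise ordering cost (matrix `B` below is made up). Under that assumption, swapping two adjacent items changes only one term of the cost, which is one way to be "careful" and find the best swap in O(N).

```python
# Hypothetical pairwise costs: B[i][j] = cost of item i preceding item j.
B = [
    [0, 4, -3, 2, 7],
    [-1, 0, 5, -6, 1],
    [3, 2, 0, 8, -2],
    [6, -4, 1, 0, 3],
    [-5, 2, 4, -1, 0],
]

def lop_cost(order, B):
    """Sum of B[i][j] over all pairs where i precedes j in `order`."""
    n = len(order)
    return sum(B[order[s]][order[t]] for s in range(n) for t in range(s + 1, n))

def hill_climb_swap(order, B):
    order = list(order)
    while True:
        # Swapping adjacent items x, y replaces the term B[x][y] by B[y][x]
        # and leaves every other pair's relative order unchanged.
        deltas = [B[order[i + 1]][order[i]] - B[order[i]][order[i + 1]]
                  for i in range(len(order) - 1)]
        best = min(range(len(deltas)), key=deltas.__getitem__, default=None)
        if best is None or deltas[best] >= 0:
            return order  # local minimum: no single swap lowers the cost
        order[best], order[best + 1] = order[best + 1], order[best]

start = [0, 1, 2, 3, 4]
result = hill_climb_swap(start, B)
```

The loop terminates because every accepted swap strictly lowers the cost. The result is only a local minimum of the SWAP neighborhood.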
Larger neighborhood
(well-known in the literature; reportedly works well)
INSERT neighborhood, e.g., from 1 2 3 4 5 6 (cost=22) to a neighbor with cost=17
- Fewer local minima? Yes – 3 can move past 4 to get past 5
- Graph diameter (max #moves needed)? O(N) rather than O(N^2)
- How many neighbors? O(N^2) rather than O(N)
- How long to find best neighbor? O(N^2) rather than O(N)
Even larger neighborhood
BLOCK neighborhood, e.g., from 1 2 3 4 5 6 (cost=22) to a neighbor with cost=14
- Fewer local minima? Yes – 2 can get past 45 without having to cross 3 or move 3 first
- Graph diameter (max #moves needed)? Still O(N)
- How many neighbors? O(N^3) rather than O(N), O(N^2)
- How long to find best neighbor? O(N^3) rather than O(N), O(N^2)
Larger yet: Via dynamic programming??
Starting from 1 2 3 4 5 6 (cost=22)
- Fewer local minima?
- Graph diameter (max #moves needed)? Logarithmic
- How many neighbors? Exponential
- How long to find best neighbor? Polynomial
Unifying/generalizing neighborhoods so far
- Exchange two adjacent blocks, of max widths w ≤ w’
- Move is defined by an (i, j, k) triple
- Runtime = # neighbors = O(ww’N):
  - SWAP: w=1, w’=1, O(N)
  - INSERT: w=1, w’=N, O(N^2)
  - BLOCK: w=N, w’=N, O(N^3)
- Everything in this talk can be generalized to other values of w, w’
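The (i, j, k) parameterization can be made concrete with a short sketch. The function names are hypothetical, and for brevity this enumerator filters all O(N^3) triples rather than generating only the O(ww’N) triples of interest.

```python
def block_moves(n, w, w_prime):
    """Yield (i, j, k): exchange order[i:j] with order[j:k], where the
    narrower block has width <= w and the wider one has width <= w_prime."""
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n + 1):
                narrow, wide = sorted((j - i, k - j))
                if narrow <= w and wide <= w_prime:
                    yield i, j, k

def apply_move(order, i, j, k):
    """Return a copy of `order` with the two adjacent blocks exchanged."""
    return order[:i] + order[j:k] + order[i:j] + order[k:]

N = 6
swap_moves = list(block_moves(N, 1, 1))    # SWAP
insert_moves = list(block_moves(N, 1, N))  # INSERT
block_all = list(block_moves(N, N, N))     # BLOCK
moved = apply_move([1, 2, 3, 4, 5, 6], 2, 3, 5)  # item 3 hops over block 4 5
```

For N=6 the three neighborhoods contain 5, 25, and 35 moves respectively, matching the O(N), O(N^2), O(N^3) growth on the slide.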
Very large-scale neighborhoods
- What if we consider multiple simultaneous exchanges that are “independent”? E.g., 1 3 2 5 4 6
- The DYNASEARCH neighborhood (Potts & van de Velde 1995; Congram 2000)
  - [Figure: a lattice whose paths correspond to sets of non-overlapping adjacent swaps.]
  - Lowest-cost neighbor is lowest-cost path
  - Cost of an arc is Δcost of the swap it represents, e.g., swapping (4, 5), here < 0
- Why would this be a good idea?
  - Help get out of bad local minima? No; they’re still local minima
  - Help avoid getting into bad local minima? Yes – less greedy
    - Example: DYNASEARCH takes two independent swaps (-20 + -20), where SWAP takes the single best swap (-30)
  - More efficient? Yes! Shortest-path algorithm finds the best set of swaps in O(N) time, as fast as best single swap.
    - Up to N moves as fast as 1 move: no penalty for “parallelism”!
    - Globally optimizes over exponentially many neighbors (paths).
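The shortest-path idea can be sketched as a small dynamic program. This assumes, as in the slide's example, that the Δcosts of non-overlapping adjacent swaps simply add up. The function name and the list-of-deltas input are illustrative choices.

```python
def best_independent_swaps(deltas):
    """deltas[p] is the cost change of swapping positions p and p+1.
    Returns (total change, swap positions) for the best set of
    non-overlapping swaps, in O(N) time."""
    n = len(deltas) + 1          # number of items
    best = [0] * (n + 1)         # best[p]: best total over the first p items
    took = [False] * (n + 1)     # took[p]: does the best prefix end in a swap?
    for p in range(2, n + 1):
        skip = best[p - 1]                   # leave item p-1 where it is
        take = best[p - 2] + deltas[p - 2]   # swap items p-2 and p-1
        if take < skip:
            best[p], took[p] = take, True
        else:
            best[p] = skip
    # Trace back which swaps were taken.
    swaps, p = [], n
    while p >= 2:
        if took[p]:
            swaps.append(p - 2)
            p -= 2
        else:
            p -= 1
    return best[n], swaps[::-1]

# Three overlapping candidate swaps with changes -20, -30, -20: a greedy
# single step takes the middle one (-30), the DP takes the outer two (-40).
total, chosen = best_independent_swaps([-20, -30, -20])
```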
Can we extend this idea – up to N moves in parallel by dynamic programming – to neighborhoods beyond SWAP?
- Exchange two adjacent blocks, of max widths w ≤ w’
- Move is defined by an (i, j, k) triple
- Runtime = # neighbors = O(ww’N):
  - SWAP: w=1, w’=1, O(N)
  - INSERT: w=1, w’=N, O(N^2)
  - BLOCK: w=N, w’=N, O(N^3)
- Yes. Asymptotic runtime is always unchanged.
Let’s define each neighbor by a “colored tree”
- Just like ITG!
- A colored node means “swap children”.
- [Figure: a binary tree over 1 2 3 4 5 6 with some nodes colored; its yield is 5 6 1 2 3 4.]
- This is like the BLOCK neighborhood, but with multiple block exchanges, which may be nested.
If that was the optimal neighbor …
- … now look for its optimal neighbor: a new tree over 5 6 1 4 2 3!
- … repeat till reach local optimum, here 1 4 2 5 6 3
- Each tree defines a neighbor.
- At each step, optimize over all possible trees by dynamic programming (CKY parsing).
- Use your favorite parsing speedups (pruning, best-first, …)
Very-large-scale versions of SWAP, INSERT, and BLOCK all by the algorithm we just saw …
- Exchange two adjacent blocks, of max widths w ≤ w’
- Move is defined by an (i, j, k) triple
- Runtime of the algorithm we just saw was O(N^3), because we considered O(N^3) distinct (i, j, k) triples
- More generally, restrict to only the O(ww’N) triples of interest to define a smaller neighborhood with runtime of O(ww’N)
  (yes, the dynamic programming recurrences go through)
How many steps to get from here to there?
- Initial order: 6 2 5 8 4 3 7 1
- Best order: 1 2 3 4 5 6 7 8
- One twisted-tree step? No: As you probably know, 3 1 4 2 → 1 2 3 4 is impossible.
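Whether one order can be reached from another in a single tree step can be checked with a short recursive test. This is a sketch under one assumption: the target is written as a sequence of the source positions 1..N, so a span of the target is a valid constituent only if it can be split into two parts where one part's values all lie below the other's. The function name is illustrative.

```python
from functools import lru_cache

def one_step_reachable(perm):
    """True if `perm` (a sequence of source positions) is the yield of some
    binary tree over the source in which nodes may swap their children."""
    perm = tuple(perm)

    @lru_cache(maxsize=None)
    def ok(lo, hi):
        if hi - lo <= 1:
            return True
        for mid in range(lo + 1, hi):
            left, right = perm[lo:mid], perm[mid:hi]
            # The two children must cover adjacent source blocks, in either order.
            if max(left) < min(right) or min(left) > max(right):
                if ok(lo, mid) and ok(mid, hi):
                    return True
        return False

    return ok(0, len(perm))

print(one_step_reachable([3, 1, 4, 2]))              # the impossible case from the slide
print(one_step_reachable([5, 6, 1, 2, 3, 4]))        # a single block exchange
print(one_step_reachable([6, 2, 5, 8, 4, 3, 7, 1]))  # the slide's initial order
```

The slide's initial order fails the check because it contains the subsequence 5 8 4 7, which has the same relative pattern as 2 4 1 3.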
Can you get to the answer in one step?
(German-English, Giza++ alignment)
- Not always (yay, local search)
- Often (yay, big neighborhood)
How many steps to the answer in the worst case?
(what is diameter of the search space?)
- Initial order: 6 2 5 8 4 3 7 1
- Best order: 1 2 3 4 5 6 7 8
- Claim: only log_2 N steps at worst (if you know where to step)
- Let’s sketch the proof!
Quicksort anything into, e.g., 1 2 3 4 5 6 7 8
- [Figure: a right-branching tree over 6 2 5 8 4 3 7 1 that partitions it around a pivot; later builds show a sequence of right-branching trees.]
- Only log_2 N steps to get to 1 2 3 4 5 6 7 8 …
- … or to anywhere!
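One way to read the proof sketch is simulated below. The assumption is that a right-branching tree lets each item go either before or after everything to its right, so items below the pivot stay in front (in order) and the rest end up at the back (reversed). The median pivot and the helper names are choices made for this illustration.

```python
def partition_step(order, blocks):
    """Apply one right-branching-tree partition to every block at once."""
    new_order, new_blocks = [], []
    for lo, hi in blocks:
        block = order[lo:hi]
        if len(block) <= 1:
            new_order.extend(block)
            new_blocks.append((lo, hi))
            continue
        pivot = sorted(block)[len(block) // 2]          # median as pivot
        lows = [x for x in block if x < pivot]          # unswapped nodes
        highs = [x for x in block if x >= pivot][::-1]  # swapped nodes
        new_order.extend(lows + highs)
        mid = lo + len(lows)
        new_blocks.extend([(lo, mid), (mid, hi)])
    return new_order, new_blocks

def sort_in_steps(order):
    """Repeat partition steps until every block has width 1."""
    order, blocks, steps = list(order), [(0, len(order))], 0
    while any(hi - lo > 1 for lo, hi in blocks):
        order, blocks = partition_step(order, blocks)
        steps += 1
    return order, steps

final, steps = sort_in_steps([6, 2, 5, 8, 4, 3, 7, 1])
```

Each step halves the block widths, so N=8 needs 3 steps here.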
Defining “best order”
- Initial order: 1 2 3 4 5 6
- Best order according to some cost function: 1 4 2 5 6 3
- How can we find this needle in the haystack of N! possible permutations?
- What class of cost functions can we handle efficiently?
- How fast can we compute a subtree’s cost from its child subtrees?
Defining “best order”
What class of cost functions? “Traveling Salesperson Problem” (TSP)
- [Matrix: a 6×6 matrix A of costs, where a_ij is the cost of item i being immediately followed by item j.]
- Cost of the best order 1 4 2 5 6 3: a14 + a42 + a25 + a56 + a63 + a31
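The TSP-style cost on this slide is a sum of adjacent-pair costs, including the closing term a31. A minimal sketch with a made-up matrix whose entries encode their own indices (so a14 = 14):

```python
def tour_cost(order, A):
    """Sum A[x][y] over consecutive items x, y, wrapping around at the end.
    `order` uses the slide's 1-based item labels."""
    n = len(order)
    return sum(A[order[t] - 1][order[(t + 1) % n] - 1] for t in range(n))

# Hypothetical matrix for illustration: entry (i, j) has the value 10*i + j.
A = [[10 * (i + 1) + (j + 1) for j in range(6)] for i in range(6)]

cost = tour_cost([1, 4, 2, 5, 6, 3], A)  # 14 + 42 + 25 + 56 + 63 + 31
```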
Defining “best order”
What class of cost functions? “Linear Ordering Problem” (LOP)
- [Matrix: a 6×6 matrix B of pairwise ordering costs.]
- b26 = cost of 2 preceding 6
- Add up n(n-1)/2 such costs (any order will incur either b26 or b62)
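The LOP cost adds one term for every pair of items, depending on which comes first. A minimal sketch, using a made-up matrix that charges 1 whenever a lower-numbered item precedes a higher-numbered one:

```python
def lop_cost(order, B):
    """Sum B[x][y] over all pairs where x precedes y (not just adjacent pairs).
    `order` uses the slide's 1-based item labels."""
    n = len(order)
    return sum(B[order[s] - 1][order[t] - 1]
               for s in range(n) for t in range(s + 1, n))

# Hypothetical matrix for illustration only.
B = [[1 if i < j else 0 for j in range(6)] for i in range(6)]

identity_cost = lop_cost([1, 2, 3, 4, 5, 6], B)  # all 15 pairs are charged
reverse_cost = lop_cost([6, 5, 4, 3, 2, 1], B)   # no pair is charged
best_cost = lop_cost([1, 4, 2, 5, 6, 3], B)      # 15 pairs minus 4 inverted ones
```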
Defining “best order”
What class of cost functions?
- TSP and LOP are both NP-complete
- In fact, believed to be inapproximable
  - Hard even to achieve C * optimal cost (any C ≥ 1)
- Practical approaches:
  - Correct answer, typically fast: branch-and-bound, ILP, …
  - Fast answer, typically close to correct: beam search, this talk, …
Defining “best order”
What class of cost functions?
- Initial order: 1 2 3 4 5 6
- Cost of the order 1 4 2 5 6 3:
  1. Does my favorite WFSA like this string of #s? (generalizes TSP)
  2. Non-local pair order ok? E.g., 4 before 3 …? (LOP)
  3. Non-local triple order ok? E.g., 1… 2… 3?
  4. Can add these all up …
Costs are derived from source sentence features
- Initial order (French), positions 1 to 6: Marie/NNP ne/NEG m’/PRP a/AUX pas/NEG vu/VBN
- [Matrices: the 6×6 cost matrices A and B for this sentence.]
- Example for A: ne would like to be brought adjacent to the next NEG word
- Example for B: one entry is a sum of feature weights, 50 + 27 - 2 + … = 75
  - 50: a verb (e.g., vu) shouldn’t precede its subject (e.g., Marie)
  - +27: words at a distance of 5 shouldn’t swap order
  - -2: words with PRP between them ought to swap
- Can also include phrase boundary symbols in the input!
- FSA costs:
  - Distortion model
  - Language model – looks ahead to next step! (good finite-state translation into good English?)
Dynamic program must pick the tree that leads to the lowest-cost permutation
- Initial order: 1 2 3 4 5 6
- Cost of the order 1 4 2 5 6 3:
  1. Does my favorite WFSA like it as a string?
Scoring with a weighted FSA
- This particular WFSA implements TSP scoring for N=3:
  - After you read 1, you’re in state 1
  - After you read 2, you’re in state 2
  - After you read 3, you’re in state 3
  - … and this state determines the cost of the next symbol you read
- [Figure: the WFSA, with an initial state and states 1, 2, 3.]
- We’ll handle a WFSA with Q states by using a fancier grammar, with nonterminals. (Now runtime goes up to O(N^3 Q^3) …)
Including WFSA costs via nonterminals
- A possible preterminal for word 2 is an arc in A that’s labeled with 2.
- The preterminal for the arc from state 4 to state 2 rewrites as word 2 with a cost equal to the arc’s cost.
- A constituent labeled with a pair of states, e.g., 6 and 3, has as its total cost the total cost of the best path from state 6 to state 3.
- At the root, the path runs from the initial state I, and its total cost is the cost of the new permutation.
- [Figure: a colored tree over 1 2 3 4 5 6 whose nodes are annotated with WFSA state pairs.]
Dynamic program must pick the tree that leads to the lowest-cost permutation
- Initial order: 1 2 3 4 5 6
- Cost of the order 1 4 2 5 6 3:
  1. Does my favorite WFSA like it as a string?
  2. Non-local pair order ok? E.g., 4 before 3 …?
Incorporating the pairwise ordering costs
- This puts {5, 6, 7} before {1, 2, 3, 4}.
- So this hypothesis must add costs 5 < 1, 5 < 2, 5 < 3, 5 < 4, 6 < 1, 6 < 2, 6 < 3, 6 < 4, 7 < 1, 7 < 2, 7 < 3, 7 < 4
- Uh-oh! So now it takes O(N^2) time to combine two subtrees, instead of O(1) time?
- Nope – dynamic programming to the rescue again!
Computing LOP cost of a block move
- This puts {5, 6, 7} before {1, 2, 3, 4}.
- So we have to add O(N^2) costs just to consider this single neighbor!
- Reuse work from other, “narrower” block moves, already computed at earlier steps of parsing. Writing Δ(L | R) for the cost change of moving block R before block L:
  Δ(1234 | 567) = Δ(1234 | 56) + Δ(234 | 567) - Δ(234 | 56) + Δ(1 | 7)
- Computed new cost in O(1)!
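The O(1) update and the tree search can be combined in one short dynamic program. This is a sketch for LOP costs only (no WFSA), with a made-up matrix `B` and 0-based item labels. It is not the talk's implementation, and the function names are illustrative.

```python
def lop_cost(order, B):
    n = len(order)
    return sum(B[order[s]][order[t]] for s in range(n) for t in range(s + 1, n))

def best_itg_neighbor(order, B):
    """Return (cost change, new order) for the best colored tree over `order`."""
    n = len(order)
    delta = {}  # delta[i, j, k]: cost change of exchanging order[i:j] and order[j:k]

    def d(i, j, k):
        return delta[i, j, k] if i < j < k else 0  # an empty block changes nothing

    # best[i, k] = (best cost change, resulting order) for the span order[i:k]
    best = {(i, i + 1): (0, [order[i]]) for i in range(n)}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            first, last = order[i], order[k - 1]
            options = []
            for j in range(i + 1, k):
                # O(1) update from narrower block moves (inclusion-exclusion),
                # plus the one new pair (first item, last item).
                delta[i, j, k] = (d(i, j, k - 1) + d(i + 1, j, k)
                                  - d(i + 1, j, k - 1)
                                  + B[last][first] - B[first][last])
                (lc, lo), (rc, ro) = best[i, j], best[j, k]
                options.append((lc + rc, lo + ro))                   # plain node
                options.append((lc + rc + delta[i, j, k], ro + lo))  # swap children
            best[i, k] = min(options, key=lambda o: o[0])
    return best[0, n]

# Hypothetical cost matrix, generated by an arbitrary formula for illustration.
N = 6
B = [[0 if i == j else (7 * i + 3 * j * j + i * j) % 11 - 5 for j in range(N)]
     for i in range(N)]
start = list(range(N))
change, neighbor = best_itg_neighbor(start, B)
```

The swap delta at a node depends only on which items sit in its two child blocks, not on their internal order, which is why the deltas of nested exchanges add up.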
Incorporating 3-way ordering costs
- See the initial paper (Eisner & Tromble 2006)
- A little tricky, but
  - comes “for free” if you’re willing to accept a certain restriction on these costs
  - more expensive without that restriction, but possible
Another option: Markov chain Monte Carlo
- Random walk in the space of permutations
  - Interpret a permutation’s cost as a log-probability
- Sample a permutation from the neighborhood instead of always picking the most probable
- Why?
  - Simulated annealing might beat greedy-with-random-restarts
  - When learning the parameters of the distribution, can use sampling to compute the feature expectations
Another option: Markov chain Monte Carlo
- Random walk in the space of permutations
  - Interpret a permutation’s cost as a log-probability
- Sample a permutation from the neighborhood instead of always picking the most probable
- How?
  - Pitfall: Sampling a permutation ≠ sampling a tree
    - Spurious ambiguity: some permutations have many trees
  - Solution: Exclude some trees, leaving 1 per permutation
    - Normal form has long been known for colored trees
    - For restricted colored trees (which limit the size of blocks to swap), we have devised a more complicated normal form
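The spurious ambiguity can be seen by brute force on a tiny input. The sketch below enumerates every colored binary tree over four items and counts how many trees yield each permutation. It only illustrates the pitfall and does not implement a normal form.

```python
from collections import Counter

def tree_yields(items):
    """Yield the output of every colored binary tree over `items`
    (one output per tree, so permutations repeat)."""
    if len(items) == 1:
        yield tuple(items)
        return
    for split in range(1, len(items)):
        for left in tree_yields(items[:split]):
            for right in tree_yields(items[split:]):
                yield left + right   # plain node
                yield right + left   # "swap children" node

counts = Counter(tree_yields((1, 2, 3, 4)))
print(sum(counts.values()), "trees,", len(counts), "distinct permutations")
print("trees yielding 1 2 3 4:", counts[(1, 2, 3, 4)])
```

Sampling trees uniformly would therefore over-sample permutations such as 1 2 3 4, which is why one tree per permutation has to be singled out.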
Learning the costs
- Where do these costs come from?
- If we have some examples on which we know the true permutation, could try to learn them
- More precisely, try to learn these weights θ (the knowledge that’s reused across examples):
  - 50: a verb (e.g., vu) shouldn’t precede its subject (e.g., Marie)
  - 27: words at a distance of 5 shouldn’t swap order
  - -2: words with PRP between them ought to swap
  - …
- [Matrices: the 6×6 cost matrices A and B, whose entries are sums of such weights.]
Experimenting with training LOP params
(LOP is quite fast: O(n^3) with no grammar constant)
- Example input, with POS tags: Das/PDS kann/VMFIN ich/PPER so/ADV aus/APPR dem/ART Stand/NN nicht/PTKNEG sagen/VVINF ./$.
- Example cost entry: B[7, 9]
LOP feature templates
- [Table of feature templates not recoverable from the slide text.]
- Only LOP features so far
- And they’re unnecessarily simple (don’t examine syntactic constituency)
- And input sequence is only words (not interspersed with syntactic brackets)
Learning LOP Costs for MT
(interesting, if odd, to try to reorder with only the LOP costs)
- MOSES baseline: German → MOSES → English
- With reordering: German → LOP → German′ → MOSES → English
- Define German′ to be German in English word order
  - To get German′ for training data, use Giza++ to align all German positions to English positions (disallow NULL)
Learning LOP Costs for MT
(interesting, if odd, to try to reorder with only the LOP costs)
- MOSES baseline: German → MOSES → English
- With reordering: German → LOP → German′ → MOSES → English
- Easy first try: Naïve Bayes
  - Treat each feature in θ as independent
  - Count and normalize over the training data
  - No real improvement over baseline
Learning LOP Costs for MT
(interesting, if odd, to try to reorder with only the LOP costs)
- MOSES baseline: German → MOSES → English
- With reordering: German → LOP → German′ → MOSES → English
- Easy second try: Perceptron
  - [Figure: search leads to a local optimum; labels mark the search error, the model error, the global optimum, the gold standard, and the update.]
  - Note: Search error can be beneficial, e.g., just take 1 step from identity permutation
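A perceptron update for ordering costs can be sketched on the French example from earlier in the talk. Everything here is a simplification for illustration: the only features are pairs of POS tags (left tag, right tag), the "search" is brute force over all 720 orders rather than local search, and training uses a single example.

```python
from collections import Counter
from itertools import permutations

TAGS = ["NNP", "NEG", "PRP", "AUX", "NEG", "VBN"]  # Marie ne m' a pas vu
GOLD = (0, 3, 1, 4, 5, 2)                          # "1 4 2 5 6 3", 0-based

def features(order):
    """Count (tag of earlier item, tag of later item) over all ordered pairs."""
    return Counter((TAGS[a], TAGS[b])
                   for s, a in enumerate(order) for b in order[s + 1:])

def cost(order, theta):
    return sum(theta[f] * c for f, c in features(order).items())

def predict(theta):
    # Stand-in for search: exhaustive minimization over all permutations.
    return min(permutations(range(len(TAGS))), key=lambda o: cost(o, theta))

def train(max_updates=1000):
    theta = Counter()
    gold_feats = features(GOLD)
    for _ in range(max_updates):
        guess_feats = features(predict(theta))
        if guess_feats == gold_feats:
            break
        # Costs are minimized, so make the wrong guess more expensive
        # and the gold order cheaper.
        theta.update(guess_feats)
        theta.subtract(gold_feats)
    return theta

theta = train()
```

Because the two NEG words share a tag, these features cannot tell "ne" from "pas", so the learned costs can only recover the gold order up to that tie. Lexical features would be needed to break it.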
Benefit from reordering

Learning method              BLEU vs. German′   BLEU vs. English
No reordering                49.65              25.55
Naïve Bayes (POS)            49.21
Naïve Bayes (POS+lexical)    49.75
Perceptron (POS)             50.05              25.92
Perceptron (POS+lexical)     51.30              26.34

Obviously, not yet unscrambling German: need more features
Alternatively, work back from gold standard
- Contrastive estimation (Smith & Eisner 2005)
- [Figure: the gold standard inside its 1-step very-large-scale neighborhood.]
- Maximize the probability of the desired permutation relative to its ITG neighborhood
- Requires summing all permutations in a neighborhood
  - Must use normal-form trees here
- Stochastic gradient descent
Alternatively, work back from gold standard
- k-best MIRA in the neighborhood
- [Figure: the gold standard and the current winners inside its 1-step very-large-scale neighborhood.]
- Make gold standard beat its local competitors
- Beat the bad ones by a bigger margin
  - Good = close to gold in swap distance?
  - Good = close to gold using BLEU?
  - Good = translates into English that’s close to reference?
Alternatively, train each iterate
- [Figure: the model’s best in the neighborhood of iterate (0) is updated toward the oracle in the neighborhood of iterate (0), and so on for later iterates.]
- Or could do a k-best MIRA version of this, too; even use a loss measure based on lookahead to iterate (n)
Summary of part II
- Local search is fun and easy
  - Popular elsewhere in AI
  - Closely related to MCMC sampling
- Probably useful for translation
  - Maybe other NP-hard problems too
- Can efficiently use huge local neighborhoods
  - Algorithms are closely related to parsing and FSMs
  - Our community knows that stuff better than anyone!