Shuffling Non-Constituents

Jason Eisner
with David A. Smith (syntactically-flavored reordering model)
and Roy Tromble (syntactically-flavored reordering search methods)

ACL SSST Workshop, June 2008

Starting point: Synchronous alignment
- Synchronous grammars are very pretty.
- But does parallel text actually have parallel structure?
  - Depends on what kind of parallel text:
    - Free translations? Noisy translations?
    - Were the parsers trained on parallel annotation schemes?
  - Depends on what kind of parallel structure:
    - What kinds of divergences can your synchronous grammar formalism capture?
    - E.g., wh-movement versus wh in situ

Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English.
[Figure: dependency trees for "beaucoup d'enfants donnent un baiser à Sam" ("lots of kids give a kiss to Sam") and "kids kiss Sam quite often"]

Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.
[Figure: the two dependency trees with aligned node pairs; some nodes (e.g., the adverbs) align to null]

Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. A much worse alignment...
[Figure: the two dependency trees with a much worse alignment between their nodes]


Grammar = Set of Elementary Trees
[Figure: the elementary tree pairs extracted from the aligned example, e.g. the donnent ... un baiser à fragment paired with kiss, beaucoup d'enfants with kids, Sam with Sam, and the null-aligned adverbs often and quite]

But many examples are harder

  Auf  diese  Frage     habe  ich  leider  keine  Antwort  bekommen   NULL
  To   this   question  have  I    alas    no     answer   received

  "I did not unfortunately receive an answer to this question"

But many examples are harder

  Auf  diese  Frage     habe  ich  leider  keine  Antwort  bekommen   NULL
  To   this   question  have  I    alas    no     answer   received

  "I did not unfortunately receive an answer to this question"

Displaced modifier (negation)


But many examples are harder

  Auf  diese  Frage     habe  ich  leider  keine  Antwort  bekommen   NULL
  To   this   question  have  I    alas    no     answer   received

  "I did not unfortunately receive an answer to this question"

Displaced argument (here, because of the projective parser)

But many examples are harder

  Auf  diese  Frage     habe  ich  leider  keine  Antwort  bekommen   NULL
  To   this   question  have  I    alas    no     answer   received

  "I did not unfortunately receive an answer to this question"

Head-swapping (here, different annotation conventions)

Free Translation

  Tschernobyl  könnte  dann  etwas      später  an  die  Reihe  kommen   NULL
  Chernobyl    could   then  something  later   on  the  queue  come

  "Then we could deal with Chernobyl some time later"

Free Translation

  Tschernobyl  könnte  dann  etwas      später  an  die  Reihe  kommen   NULL
  Chernobyl    could   then  something  later   on  the  queue  come

  "Then we could deal with Chernobyl some time later"

Probably not systematic (but words are correctly aligned)

Free Translation

  Tschernobyl  könnte  dann  etwas      später  an  die  Reihe  kommen   NULL
  Chernobyl    could   then  something  later   on  the  queue  come

  "Then we could deal with Chernobyl some time later"

Erroneous parse

What to do?
- Current practice:
  - Don't try to model all systematic phenomena!
  - Just use non-syntactic alignments (Giza++).
  - Only care about the fragments that recur often:
    - Phrases or gappy phrases
    - Sometimes even syntactic constituents (can favor these, e.g., Marton & Resnik 2008)
  - Use these (gappy) phrases in a decoder
    - Phrase-based or hierarchical

What to do?
- Current practice:
  - Use non-syntactic alignments (Giza++)
  - Keep frequent phrases for a decoder
- But could syntax give us better alignments?
  - Would have to be "loose" syntax ...
- Why do we want better alignments?
  1. Throw away less of the parallel training data
  2. Help learn a smarter, syntactic, reordering model
     - Could help decoding: less reliance on the LM
  3. Some applications care about full alignments

Quasi-synchronous grammar
- How do we handle "loose" syntax?
- Translation story:
  - Generate target English by a monolingual grammar
    - Any grammar formalism is okay
    - Pick a dependency grammar formalism for now
[Figure: dependency parse of "I did not unfortunately receive an answer to this question", with factors such as P(I | did, PRP) and P(PRP | no previous left children of "did"); parsing is O(n³)]

Quasi-synchronous grammar
- How do we handle "loose" syntax?
- Translation story:
  - Generate target English by a monolingual grammar
  - But probabilities are influenced by the source sentence
    - Each English node is aligned to some source node
    - Prefers to generate children aligned to nearby source nodes
[Figure: the same English dependency parse; parsing is O(n³)]

QCFG Generative Story
[Figure: the observed German sentence "Auf diese Frage habe ich leider keine Antwort bekommen NULL" above the English dependency parse; the probabilities now condition on aligned source words, e.g. P(I | did, PRP, ich) and P(PRP | no previous left children of "did", habe), plus P(parent-child) and P(breakage) factors; aligned parsing is O(m²n³)]

What's a "nearby node"?
- Given the parent's alignment, where might the child be aligned?
[Figure: possible parent-child alignment configurations: the synchronous grammar cases, plus "none of the above"]

Quasi-synchronous grammar
- How do we handle "loose" syntax?
- Translation story:
  - Generate target English by a monolingual grammar
  - But probabilities are influenced by the source sentence
- Useful analogies:
  1. Generative grammar with latent word senses
  2. MEMM: generate the tag sequence (target), but probabilities are influenced by the word sequence (source)

Quasi-synchronous grammar
- How do we handle "loose" syntax?
- Translation story:
  - Generate target English by a monolingual grammar
  - But probabilities are influenced by the source sentence
- Useful analogies:
  1. Generative grammar with latent word senses
  2. MEMM
  3. IBM Model 1:
     - Source nodes can be freely reused or unused
     - Future work: enforce 1-to-1 to allow good decoding (NP-hard to do exactly)

Some results: Quasi-synchronous Dependency Grammar
- Alignment (D. Smith & Eisner 2006)
  - Quasi-synchronous much better than synchronous
  - Maybe also better than IBM Model 4
- Question answering (Wang et al. 2007)
  - Align question w/ potential answer
  - Mean average precision: 43% (previous state of the art) → 48% (+ QG) → 60% (+ lexical features)
- Bootstrapping a parser for a new language (D. Smith & Eisner 2007 & ongoing)
  - Learn how parsed parallel text influences target dependencies
    - Along with many other features! (cf. co-training)
  - Unsupervised: German 30% → 69%, Spanish 26% → 65%

Summary of part I
- Current practice:
  - Use non-syntactic alignments (Giza++)
  - Some bits align nicely
  - Use the frequent bits in a decoder
- Suggestion: Let syntax influence alignments.
- So far, loose syntax methods are like IBM Model 1.
  - NP-hard to enforce 1-to-1 in any interesting model.
- Rest of talk:
  - How to enforce 1-to-1 in interesting models?
  - Can we do something smarter than beam search?

Shuffling Non-Constituents

Jason Eisner
with David A. Smith (syntactically-flavored reordering model)
and Roy Tromble (syntactically-flavored reordering search methods)

ACL SSST Workshop, June 2008

Motivation
- MT is really easy! Just use a finite-state transducer! Phrases, morphology, the works!

Permutation search in MT
[Figure: French "Marie ne m' a pas vu" (NNP NEG PRP AUX NEG VBN), initial order 1 2 3 4 5 6, is permuted to the best order 1 4 2 5 6 3 (French'), from which an easy transduction yields "Mary hasn't seen me"]

Motivation
- MT is really easy! Just use a finite-state transducer! Phrases, morphology, the works!
- Just have to fix that pesky word order.
- Framing it this way lets us enforce 1-to-1 exactly at the permutation step.
- Deletion and fertility > 1 are still allowed in the subsequent transduction.

Often want to find an optimal permutation ...
- Machine translation: Reorder French to French-prime (Brown et al. 1992)
  - So it's easier to align or translate
- MT eval: How much do you need to rearrange MT output so it scores well under an LM derived from ref translations?
- Discourse generation, e.g., multi-doc summarization: Order the output sentences (Lapata 2003)
  - So they flow nicely
- Reconstruct temporal order of events after info extraction
- Learn rule ordering or constraint ranking for phonology?
- Multi-word anagrams that score well under a LM

Permutation search: The problem
[Figure: 1 2 3 4 5 6 (initial order) → 1 4 2 5 6 3 (best order according to some cost function)]
How can we find this needle in the haystack of N! possible permutations?

Traditional approach: Beam search
Approximate the best path through a really big FSA:
- N! paths: one for each permutation
- only 2^N states: a state remembers what we've generated so far (but not in what order)
- arc weight = cost of picking 5 next if we've seen {1, 2, 4} so far

An alternative: Local search ("hill climbing")
The SWAP neighborhood
[Figure: 123456 (cost=22) and its adjacent-swap neighbors 213456 (cost=26), 132456 (cost=20), 124356 (cost=19), 123546 (cost=25)]

An alternative: Local search ("hill climbing")
The SWAP neighborhood
[Figure: from 123456 (cost=22), step to the best neighbor 124356 (cost=19)]

An alternative: Local search ("hill climbing")
The SWAP neighborhood. Like the "greedy decoder" of Germann et al. 2001.
[Figure: 1 2 3 4 5 6, cost=22 → 19 → 17 → 16 ...]
- Why are the costs always going down? We pick the best swap.
- How long does it take to pick the best swap? O(N) if you're careful.
- How many swaps might you need to reach the answer? O(N²)
- What if you get stuck in a local min? Random restarts.
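The greedy SWAP search above is easy to sketch in code. The following is an illustrative Python sketch, not the talk's implementation; the function name, cost interface, and restart scheme are all assumptions for the example. For clarity it re-evaluates the cost from scratch for each candidate swap; the O(N)-per-step bound on the slide requires updating the Δcosts incrementally instead.

```python
import random

def swap_hill_climb(perm, cost, restarts=10, seed=0):
    """Greedy local search over the SWAP neighborhood: repeatedly apply
    the best adjacent transposition until no swap lowers the cost,
    with random restarts to escape local minima."""
    rng = random.Random(seed)
    best_perm, best_cost = list(perm), cost(perm)
    for _ in range(restarts):
        p = list(perm)
        rng.shuffle(p)                       # random restart
        c = cost(p)
        improved = True
        while improved:
            improved = False
            # scan all N-1 adjacent swaps and take the best one
            best_i, best_delta = None, 0.0
            for i in range(len(p) - 1):
                q = p[:i] + [p[i + 1], p[i]] + p[i + 2:]
                delta = cost(q) - c          # (a careful version updates this in O(1))
                if delta < best_delta:
                    best_i, best_delta = i, delta
            if best_i is not None:
                p[best_i], p[best_i + 1] = p[best_i + 1], p[best_i]
                c += best_delta
                improved = True
        if c < best_cost:
            best_perm, best_cost = p, c
    return best_perm, best_cost
```

With an inversion-count cost, for instance, every descent ends at the sorted order, since an improving adjacent swap always exists while the list is unsorted.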

Larger neighborhood
[Figure: the same SWAP neighborhood star around 123456 (cost=22), about to be enlarged]

Larger neighborhood (well-known in the literature; reportedly works well)
INSERT neighborhood
[Figure: 1 2 3 4 5 6, cost=22 → cost=17]
- Fewer local minima? Yes - 3 can move past 4 to get past 5.
- Graph diameter (max #moves needed)? O(N) rather than O(N²)
- How many neighbors? O(N²) rather than O(N)
- How long to find the best neighbor? O(N²) rather than O(N)

Even larger neighborhood
BLOCK neighborhood
[Figure: 1 2 3 4 5 6, cost=22 → cost=14]
- Fewer local minima? Yes - 2 can get past 45 without having to cross 3 or move 3 first.
- Graph diameter (max #moves needed)? Still O(N).
- How many neighbors? O(N³) rather than O(N), O(N²)
- How long to find the best neighbor? O(N³) rather than O(N), O(N²)

Larger yet: Via dynamic programming?
[Figure: 1 2 3 4 5 6, cost=22]
- Fewer local minima?
- Graph diameter (max #moves needed)? Logarithmic.
- How many neighbors? Exponential.
- How long to find the best neighbor? Polynomial.

Unifying/generalizing the neighborhoods so far
[Figure: positions 1 ... 8, with indices i, j, k marking two adjacent blocks]
- Exchange two adjacent blocks, of max widths w ≤ w'.
- A move is defined by an (i, j, k) triple.
- runtime = # neighbors = O(ww'N):
  - SWAP: w=1, w'=1 → O(N)
  - INSERT: w=1, w'=N → O(N²)
  - BLOCK: w=N, w'=N → O(N³)
- Everything in this talk can be generalized to other values of w, w'.
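The (i, j, k) parameterization is easy to make concrete. A minimal sketch (function names are mine, not from the talk): `block_exchange` applies one move, and `neighbors` enumerates the O(ww'N) moves whose two block widths are bounded by w ≤ w'.

```python
def block_exchange(perm, i, j, k):
    """Exchange the two adjacent blocks perm[i:j] and perm[j:k].
    The move is fully specified by the (i, j, k) triple
    (0-based, half-open intervals)."""
    assert 0 <= i < j < k <= len(perm)
    return perm[:i] + perm[j:k] + perm[i:j] + perm[k:]

def neighbors(perm, w, w_prime):
    """All neighbors that exchange adjacent blocks whose widths (a, b)
    satisfy min(a, b) <= w and max(a, b) <= w_prime; there are
    O(w * w_prime * N) of them (assumes w <= w_prime).
    SWAP is (1, 1), INSERT is (1, N), BLOCK is (N, N)."""
    n = len(perm)
    for i in range(n - 1):
        for j in range(i + 1, min(i + w_prime, n - 1) + 1):
            for k in range(j + 1, min(j + w_prime, n) + 1):
                if min(j - i, k - j) <= w:
                    yield block_exchange(perm, i, j, k)
```

For example, `neighbors(perm, 1, 1)` yields exactly the adjacent transpositions, while `neighbors(perm, len(perm), len(perm))` yields every (i, j, k) block exchange.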

Very large-scale neighborhoods
- What if we consider multiple simultaneous exchanges that are "independent"?
The DYNASEARCH neighborhood (Potts & van de Velde 1995; Congram 2000)
[Figure: a graph over positions 1 ... 6 in which each arc either keeps a position in place or swaps an adjacent pair; the cost of a swap arc is the Δcost of swapping that pair, e.g. (4, 5), here < 0. The lowest-cost neighbor is the lowest-cost path.]

Very large-scale neighborhoods
- The lowest-cost neighbor is the lowest-cost path.
- Why would this be a good idea?
  - Help get out of bad local minima? No; they're still local minima.
  - Help avoid getting into bad local minima? Yes - less greedy.
[Figure: an example cost matrix B on which DYNASEARCH takes two swaps at once (-20 + -20) where greedy SWAP would take the single best swap (-30)]

Very large-scale neighborhoods
- The lowest-cost neighbor is the lowest-cost path.
- Why would this be a good idea?
  - Help get out of bad local minima? No; they're still local minima.
  - Help avoid getting into bad local minima? Yes - less greedy.
  - More efficient? Yes! - a shortest-path algorithm finds the best set of swaps in O(N) time, as fast as the best single swap.
- Up to N moves as fast as 1 move: no penalty for "parallelism"!
- Globally optimizes over exponentially many neighbors (paths).
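The O(N) claim can be sketched with the classic dynasearch recurrence: choosing the best set of non-overlapping adjacent swaps is a shortest path in the figure's graph, i.e. a one-dimensional Viterbi-style DP. This is an illustrative sketch (the names and the `delta` interface are mine); it assumes the Δcosts of disjoint swaps are additive, as they are for TSP- and LOP-style costs.

```python
def dynasearch_step(perm, delta):
    """One DYNASEARCH move: among all sets of pairwise non-overlapping
    adjacent swaps, find the set with the lowest total cost change via
    a shortest-path DP over positions -- O(N), the same asymptotic cost
    as finding the single best swap.

    delta(p, i) must give the cost change of swapping p[i] and p[i+1];
    these changes are assumed additive for disjoint swaps."""
    n = len(perm)
    best = [0.0] * (n + 1)        # best[i]: best total change for the prefix p[:i]
    took_swap = [False] * (n + 1)
    for i in range(2, n + 1):
        keep = best[i - 1]                        # leave p[i-1] in place
        swap = best[i - 2] + delta(perm, i - 2)   # swap p[i-2] and p[i-1]
        best[i] = min(keep, swap)
        took_swap[i] = swap < keep
    # trace back and apply all chosen swaps simultaneously
    q, i = list(perm), n
    while i >= 2:
        if took_swap[i]:
            q[i - 2], q[i - 1] = q[i - 1], q[i - 2]
            i -= 2
        else:
            i -= 1
    return q, best[n]
```

On a permutation with two independent improving swaps, one call performs both at once, whereas greedy SWAP would need two iterations.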

Can we extend this idea - up to N moves in parallel by dynamic programming - to neighborhoods beyond SWAP?
[Figure: positions 1 ... 8 with an (i, j, k) block-exchange triple]
- Exchange two adjacent blocks, of max widths w ≤ w'; a move is defined by an (i, j, k) triple.
- SWAP: w=1, w'=1 (O(N)); INSERT: w=1, w'=N (O(N²)); BLOCK: w=N, w'=N (O(N³)); runtime = # neighbors = O(ww'N).
- Yes. The asymptotic runtime is always unchanged.

Let's define each neighbor by a "colored tree" - just like ITG!
[Figure: a binary tree over 1 2 3 4 5 6; a colored node swaps its children]


Let's define each neighbor by a "colored tree" - just like ITG!
[Figure: swapping at the colored nodes turns 1 2 3 4 5 6 into 5 6 1 2 3 4]
This is like the BLOCK neighborhood, but with multiple block exchanges, which may be nested.

If that was the optimal neighbor ...
... now look for its optimal neighbor - a new tree!
[Figure: a new colored tree over the permutation 5 6 1 4 2 3]


If that was the optimal neighbor ...
... now look for its optimal neighbor ... and repeat until we reach a local optimum.
- Each tree defines a neighbor.
- At each step, optimize over all possible trees by dynamic programming (CKY parsing).
- Use your favorite parsing speedups (pruning, best-first, ...).
[Figure: the current permutation 1 4 2 5 6 3]

Very large-scale versions of SWAP, INSERT, and BLOCK - all by the algorithm we just saw ...
- Exchange two adjacent blocks, of max widths w ≤ w'; a move is defined by an (i, j, k) triple.
- The runtime of the algorithm we just saw was O(N³) because we considered O(N³) distinct (i, j, k) triples.
- More generally, restrict to only the O(ww'N) triples of interest to define a smaller neighborhood with a runtime of O(ww'N).
  (Yes, the dynamic programming recurrences go through.)

How many steps to get from here to there?
[Figure: initial order 6 2 5 8 4 3 7 1 → best order 1 2 3 4 5 6 7 8]
One twisted-tree step? No: as you probably know, 3 1 4 2 → 1 2 3 4 is impossible.

Can you get to the answer in one step?
German-English, Giza++ alignment:
- not always (yay, local search)
- often (yay, big neighborhood)

How many steps to the answer in the worst case? (What is the diameter of the search space?)
[Figure: 6 2 5 8 4 3 7 1 → 1 2 3 4 5 6 7 8]
Claim: only log₂ N steps at worst (if you know where to step). Let's sketch the proof!

Quicksort anything into, e.g., a right-branching tree
[Figure: one colored-tree step partitions 6 2 5 8 4 7 3 1 around a pivot, as in one level of quicksort, on the way to 1 2 3 4 5 6 7 8]

Quicksort anything into, e.g., a right-branching tree
Only log₂ N steps to get to 1 2 3 4 5 6 7 8 ... or to anywhere! (via a sequence of right-branching trees)
[Figure: the successive partitioning steps]

Defining "best order"
- What class of cost functions can we handle efficiently?
- How fast can we compute a subtree's cost from its child subtrees?
[Figure: 1 2 3 4 5 6 (initial order) → 1 4 2 5 6 3 (best order according to some cost function). How can we find this needle in the haystack of N! possible permutations?]

Defining "best order"
What class of cost functions?
"Traveling Salesperson Problem" (TSP): a matrix A of pairwise adjacency costs; the cost of the order 1 4 2 5 6 3 is a₁₄ + a₄₂ + a₂₅ + a₅₆ + a₆₃ + a₃₁.

Defining "best order"
What class of cost functions?
"Linear Ordering Problem" (LOP): a matrix B in which b₂₆ = cost of 2 preceding 6. Add up the n(n-1)/2 such costs; any order will incur either b₂₆ or b₆₂.
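The two cost classes are easy to state in code. A small sketch (function names are mine): `tsp_cost` scores adjacencies against a matrix A, including the wrap-around term as on the slide, and `lop_cost` charges one entry of B for each of the n(n-1)/2 ordered pairs.

```python
def tsp_cost(A, perm):
    """Traveling-salesperson-style cost: sum the adjacency costs
    A[x][y] over consecutive elements, closing the tour at the end."""
    n = len(perm)
    return sum(A[perm[i]][perm[(i + 1) % n]] for i in range(n))

def lop_cost(B, perm):
    """Linear-ordering-problem cost: for each unordered pair, charge
    B[x][y] if x precedes y in the permutation (else B[y][x] would
    have been charged instead)."""
    return sum(B[perm[i]][perm[j]]
               for i in range(len(perm))
               for j in range(i + 1, len(perm)))
```

Note that reversing a permutation flips which member of each {b_xy, b_yx} pair is incurred, which is exactly why any order pays one cost per pair.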

Defining "best order"
What class of cost functions?
- TSP and LOP are both NP-complete.
- In fact, believed to be inapproximable:
  - hard even to achieve C * optimal cost (for any C ≥ 1)
- Practical approaches:
  - branch-and-bound, ILP, ...: correct answer, typically fast
  - beam search, this talk, ...: fast answer, typically close to correct

Defining "best order"
What class of cost functions? The cost of the order 1 4 2 5 6 3 can combine:
1. Does my favorite WFSA like this string of #s? (generalizes TSP)
2. Non-local pair order ok? (e.g., 4 before 3?) (generalizes LOP)
3. Non-local triple order ok?
4. Can add these all up ...

Costs are derived from source sentence features
[Figure: French "Marie ne m' a pas vu" (NNP NEG PRP AUX NEG VBN), initial order 1 ... 6, with cost matrices A and B]
E.g., "ne" would like to be brought adjacent to the next NEG word.

Costs are derived from source sentence features
[Figure: the same French sentence with cost matrices A and B]
Each cost is a sum of weighted features, e.g.:
  50: a verb (e.g., vu) shouldn't precede its subject (e.g., Marie)
  +27: words at this distance shouldn't swap order
  -2: words with a PRP between them ought to swap
  ... = 75
Can also include phrase boundary symbols in the input!

Costs are derived from source sentence features
[Figure: the same French sentence with cost matrices A and B]
FSA costs:
- Distortion model
- Language model - looks ahead to the next step! (Does this order permit a good finite-state translation into good English?)

The dynamic program must pick the tree that leads to the lowest-cost permutation
[Figure: 1 2 3 4 5 6 → 1 4 2 5 6 3]
Cost of this order:
1. Does my favorite WFSA like it as a string?

Scoring with a weighted FSA
This particular WFSA implements TSP scoring for N=3:
- After you read 1, you're in state 1.
- After you read 2, you're in state 2.
- After you read 3, you're in state 3.
... and this state determines the cost of the next symbol you read.
We'll handle a WFSA with Q states by using a fancier grammar, with nonterminals. (Now the runtime goes up to O(N³Q³) ...)

Including WFSA costs via nonterminals
A possible preterminal for word 2 is an arc in A that's labeled with 2. The preterminal rewrites as word 2 with a cost equal to the arc's cost.
[Figure: a tree over the permutation with WFSA arcs as preterminals]

Including WFSA costs via nonterminals
This constituent's total cost is the total cost of the best path - the cost of the new permutation.
[Figure: arc nonterminals are combined bottom-up through the tree]

The dynamic program must pick the tree that leads to the lowest-cost permutation
[Figure: 1 2 3 4 5 6 → 1 4 2 5 6 3]
Cost of this order:
1. Does my favorite WFSA like it as a string?
2. Non-local pair order ok? (4 before 3 ...?)

Incorporating the pairwise ordering costs
[Figure: a block move that puts {5, 6, 7} before {1, 2, 3, 4}]
So this hypothesis must add the costs 5<1, 5<2, 5<3, 5<4, 6<1, 6<2, 6<3, 6<4, 7<1, 7<2, 7<3, 7<4.
Uh-oh! So now it takes O(N²) time to combine two subtrees, instead of O(1) time?
Nope - dynamic programming to the rescue again!

Computing the LOP cost of a block move
This move puts {5, 6, 7} before {1, 2, 3, 4} - so do we have to add O(N²) costs just to consider this single neighbor?
No: reuse work from other, "narrower" block moves, already computed at earlier steps of parsing:
  cost({5,6,7} before {1,2,3,4}) = cost({5,6,7} before {2,3,4}) + cost({6,7} before {1,2,3,4}) - cost({6,7} before {2,3,4}) + b₅₁
The new cost is computed in O(1)!
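The O(1) reuse can be sketched as a memoized inclusion-exclusion recurrence (my formulation of the trick, not the authors' code): the cost of putting block perm[j:k] before block perm[i:j] is obtained from the two narrower moves, minus their overlap, plus a single new entry of B.

```python
from functools import lru_cache

def make_block_cost(B, perm):
    """LOP cost of exchanging the adjacent blocks perm[i:j] and perm[j:k],
    i.e. the sum of B[y][x] over every x in the left block and y in the
    right block.  Computed naively this is O(N^2) per (i, j, k) triple;
    the inclusion-exclusion recurrence below reuses the two narrower
    moves so that each new triple costs O(1)."""
    @lru_cache(maxsize=None)
    def cost(i, j, k):
        if i == j or j == k:
            return 0.0  # an empty block incurs no pairwise costs
        # shrink the left block from the left and the right block from the
        # right, subtract the doubly-counted overlap, then add the one
        # pair counted by neither: (perm[i], perm[k-1])
        return (cost(i + 1, j, k) + cost(i, j, k - 1)
                - cost(i + 1, j, k - 1) + B[perm[k - 1]][perm[i]])
    return cost
```

The memo table plays the role of the chart entries "already computed at earlier steps of parsing."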

Incorporating 3-way ordering costs
- See the initial paper (Eisner & Tromble 2006).
- A little tricky, but:
  - comes "for free" if you're willing to accept a certain restriction on these costs
  - more expensive without that restriction, but possible

Another option: Markov chain Monte Carlo
- Random walk in the space of permutations
  - Interpret a permutation's cost as a (negative) log-probability.
- Sample a permutation from the neighborhood instead of always picking the most probable.
- Why?
  - Simulated annealing might beat greedy-with-random-restarts.
  - When learning the parameters of the distribution, can use sampling to compute the feature expectations.
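A minimal sketch of the sampling idea, using the simplest symmetric proposal (a random adjacent swap) rather than the talk's tree-shaped neighborhoods; all names are illustrative. Because the proposal is symmetric, the Metropolis acceptance probability reduces to min(1, exp(-Δcost / temp)), giving the stationary distribution p(π) ∝ exp(-cost(π) / temp).

```python
import math
import random

def metropolis_permutations(cost, n, steps=10000, temp=1.0, seed=0):
    """Metropolis sampler over permutations of range(n): propose a random
    adjacent swap and accept with probability min(1, exp(-delta / temp)).
    Lowering temp concentrates the walk on low-cost permutations
    (simulated annealing would lower temp over time)."""
    rng = random.Random(seed)
    perm = list(range(n))
    c = cost(perm)
    for _ in range(steps):
        i = rng.randrange(n - 1)
        proposal = perm[:i] + [perm[i + 1], perm[i]] + perm[i + 2:]
        delta = cost(proposal) - c
        # symmetric proposal => Metropolis acceptance test
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            perm, c = proposal, c + delta
    return perm, c
```

At a very low temperature the walk almost never accepts an uphill move, so with an inversion-count cost it stays at (or descends toward) the sorted order.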

Another option: Markov chain Monte Carlo
- Random walk in the space of permutations
  - Interpret a permutation's cost as a (negative) log-probability.
- Sample a permutation from the neighborhood instead of always picking the most probable.
- How?
  - Pitfall: sampling a tree ≠ sampling a permutation.
    - Spurious ambiguity: some permutations have many trees.
  - Solution: exclude some trees, leaving 1 tree per permutation.
    - A normal form has long been known for colored trees.
    - For restricted colored trees (which limit the size of the blocks to swap), we have devised a more complicated normal form.

Learning the costs
- Where do these costs come from?
- If we have some examples on which we know the true permutation, we could try to learn them.
[Figure: the cost matrices A and B]


Learning the costs
- Where do these costs come from?
- If we have some examples on which we know the true permutation, we could try to learn them
- More precisely, try to learn the weights θ (the knowledge that’s reused across examples), e.g.:
  - 50: a verb (e.g., vu) shouldn’t precede its subject (e.g., Marie)
  - 27: words at a distance of 5 shouldn’t swap order
  - -2: words with PRP between them ought to swap
[slide shows the example cost matrices A and B annotated with these weights]


Experimenting with training LOP params
(LOP is quite fast: O(n³) with no grammar constant)
[slide shows a tagged German example:
Das/PDS kann/VMFIN ich/PPER so/ADV aus/APPR dem/ART Stand/NN nicht/PTKNEG sagen/VVINF ./$.
(“I can’t say that off the cuff just like that.”), with the pairwise cost B[7, 9] highlighted]


LOP feature templates


LOP feature templates
- Only LOP features so far
- And they’re unnecessarily simple (they don’t examine syntactic constituency)
- And the input sequence is only words (not interspersed with syntactic brackets)


Learning LOP Costs for MT
(interesting, if odd, to try to reorder with only the LOP costs)
[diagram: baseline is German → MOSES → English; the reordering system is German → LOP → German′ → MOSES → English]
- Define German′ to be German in English word order
  - To get German′ for the training data, use Giza++ to align all German positions to English positions (disallowing NULL)
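Given such one-to-one alignments, building German′ amounts to a stable sort of the German tokens by the English positions they align to. A toy sketch (the sentence and the alignment are invented for illustration; running Giza++ itself is not shown):

```python
def to_target_order(source_tokens, aligned_positions):
    """Reorder source tokens into target word order.

    aligned_positions[i] is the (single, non-NULL) target position
    that source token i aligns to; ties keep source order because
    Python's sort is stable.
    """
    order = sorted(range(len(source_tokens)),
                   key=lambda i: aligned_positions[i])
    return [source_tokens[i] for i in order]

# "das kann ich nicht sagen" ~ "I cannot say that"
german = ["das", "kann", "ich", "nicht", "sagen"]
align = [3, 1, 0, 1, 2]  # hypothetical alignment to English positions
print(to_target_order(german, align))
# ['ich', 'kann', 'nicht', 'sagen', 'das']
```

The result is the German′ string that the LOP reorderer is trained to produce and that MOSES then translates nearly monotonically.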


Learning LOP Costs for MT
(interesting, if odd, to try to reorder with only the LOP costs)
- Easy first try: Naïve Bayes
  - Treat each feature in θ as independent
  - Count and normalize over the training data
  - No real improvement over the baseline


Learning LOP Costs for MT
(interesting, if odd, to try to reorder with only the LOP costs)
- Easy second try: perceptron
[diagram: search error vs. model error, relating the local optimum, the global optimum, and the gold standard; the update moves the model toward the gold standard]
- Note: search error can be beneficial, e.g., just take 1 step from the identity permutation
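A structured-perceptron update on pairwise ordering features might look like the following sketch. Everything here is illustrative rather than the system described in the talk: the feature template is a simple tag-pair pattern, and for a toy sentence the argmax is found by exhaustive enumeration instead of local search.

```python
import itertools

def features(perm, tags):
    """Illustrative pairwise features: for each ordered pair in the
    output, fire the feature (tag that comes first, tag that comes after)."""
    feats = {}
    for a in range(len(perm)):
        for b in range(a + 1, len(perm)):
            key = (tags[perm[a]], tags[perm[b]])
            feats[key] = feats.get(key, 0) + 1
    return feats

def score(feats, theta):
    return sum(theta.get(k, 0.0) * v for k, v in feats.items())

def perceptron_update(tags, gold_perm, theta, lr=1.0):
    """Predict the best-scoring permutation (exhaustively, for a toy
    sentence), then move theta toward the gold permutation's features
    and away from the prediction's features."""
    pred = max(itertools.permutations(range(len(tags))),
               key=lambda p: score(features(list(p), tags), theta))
    pred = list(pred)
    if pred != gold_perm:
        for k, v in features(gold_perm, tags).items():
            theta[k] = theta.get(k, 0.0) + lr * v
        for k, v in features(pred, tags).items():
            theta[k] = theta.get(k, 0.0) - lr * v
    return pred, theta

# After one mistake on a verb-noun pair, the model learns to put N first.
theta = {}
pred1, theta = perceptron_update(["V", "N"], [1, 0], theta)
pred2, theta = perceptron_update(["V", "N"], [1, 0], theta)
print(pred2)  # [1, 0]
```

Replacing the exhaustive `max` with a local-search decoder gives exactly the "search error can be beneficial" setting discussed above.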


Benefit from reordering

Learning method            BLEU vs. German′   BLEU vs. English
No reordering              49.65              25.55
Naïve Bayes—POS            49.21
Naïve Bayes—POS+lexical    49.75
Perceptron—POS             50.05              25.92
Perceptron—POS+lexical     51.30              26.34

Obviously, we’re not yet unscrambling German: we need more features.


Alternatively, work back from the gold standard
- Contrastive estimation (Smith & Eisner 2005)
[diagram: the 1-step very-large-scale neighborhood around the gold standard]
- Maximize the probability of the desired permutation relative to its ITG neighborhood
- Requires summing over all permutations in the neighborhood
  - Must use normal-form trees here
- Train by stochastic gradient descent
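For a log-linear model, the contrastive-estimation gradient is f(gold) minus the expected feature vector under the model’s distribution restricted to the neighborhood. A sketch with a tiny explicit neighborhood (in the paper the sum ranges over the exponentially large ITG neighborhood, computed with normal-form trees; here the neighborhood is simply enumerated, and all names are invented):

```python
import math

def ce_gradient(gold, neighborhood, feats_fn, theta):
    """Gradient of log p(gold | neighborhood) under a log-linear model
    p(perm) ~ exp(theta . f(perm)): equals f(gold) - E_neighborhood[f]."""
    scores = [sum(theta.get(k, 0.0) * v for k, v in feats_fn(p).items())
              for p in neighborhood]
    m = max(scores)                      # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    Z = sum(weights)
    grad = dict(feats_fn(gold))          # f(gold) ...
    for p, w in zip(neighborhood, weights):
        for k, v in feats_fn(p).items():
            grad[k] = grad.get(k, 0.0) - (w / Z) * v   # ... minus E[f]
    return grad

# Two-permutation neighborhood, indicator feature per permutation.
neighborhood = [[0, 1], [1, 0]]
feats_fn = lambda p: {tuple(p): 1}
grad = ce_gradient([0, 1], neighborhood, feats_fn, {})
print(grad)  # {(0, 1): 0.5, (1, 0): -0.5}
```

Taking repeated steps of `theta[k] += lr * grad[k]` is the stochastic gradient descent the slide refers to.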


Alternatively, work back from the gold standard
- k-best MIRA in the neighborhood
[diagram: the gold standard and the current winners in its 1-step very-large-scale neighborhood]
- Make the gold standard beat its local competitors
- Beat the bad ones by a bigger margin
  - Good = close to gold in swap distance?
  - Good = close to gold using BLEU?
  - Good = translates into English that’s close to the reference?


Alternatively, train toward each iterate
[diagram: starting from an initial permutation, repeatedly move to the model’s best permutation in the current neighborhood, updating toward an oracle in that neighborhood]
- Or could do a k-best MIRA version of this, too; could even use a loss measure based on lookahead to the final permutation


Summary of part II
- Local search is fun and easy
  - Popular elsewhere in AI
  - Closely related to MCMC sampling
- Probably useful for translation
  - Maybe other NP-hard problems too
- Can efficiently use huge local neighborhoods
  - Algorithms are closely related to parsing and FSMs
  - Our community knows that stuff better than anyone!