Structural Phrase Alignment Based on Consistency Criteria
Toshiaki Nakazawa, Kun Yu, Sadao Kurohashi (Graduate School of Informatics, Kyoto University)
{nakazawa, kunyu}@nlp.kuee.kyoto-u.ac.jp, kuro@i.kyoto-u.ac.jp

Flow of Our EBMT System
Input: 交差点に入る時 私の信号は青でした。
[Figure: the input dependency tree, 交差 (cross) / 点に (point) / 入る (enter) / 時 (when) / 私の (my) / 信号は (signal) / 青 (blue) / でした。 (was), is matched against Translation Examples (fragments such as "when entering a house", "at the intersection", "The light was green") and the result is combined using Language Models.]
Output: My traffic light was green when entering the intersection.

Core Steps of Alignment
• Searching Correspondence Candidates
  – Fine alignment is effective for translation
  – Search for as many candidates as possible using a variety of linguistic information
    • Bilingual dictionaries
    • Transliteration (Katakana words, NEs)
      ローズワイン → rosuwain ⇔ rose wine (similarity: 0.78)
      新宿 → shinjuku ⇔ shinjuku (similarity: 1.0)
    • Numeral normalization
      二百十六万 → 2,160,000 ← 2.16 million
    • Japanese flexible matching (Odani et al. 2007)
    • Substring co-occurrence measure (Cromieres 2006)
• Selecting Correspondence Candidates
  – More candidates lead to more ambiguities and improper alignments
  – A robust alignment method is needed, one that aligns parallel sentences consistently by selecting an adequate set of candidates

Selecting Correspondence Candidates Using Consistency Score and Dependency Type
[Figure: correspondence candidates between the Japanese nodes 日本で (in Japan) / 保険 (insurance) / 会社に対して (to company) / 保険 (insurance) / 請求の (claim) / 申し立てが (instance) / 可能ですよ (you can) and the English nodes you / will have to file / insurance / an claim / insurance / with the office / in Japan. Candidate pairs are marked "Near!" or "Far!". Callouts: "Ambiguities!", "Improper alignments!" (baseline), "1/1 + 1/2 = 1.5".]
How to reflect the inconsistency?
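The numeral normalization example listed under candidate search (二百十六万 and "2.16 million" both mapping to 2,160,000) can be illustrated with a minimal sketch. The poster does not describe how the system implements this step, so the function names (`kanji_to_int`, `english_to_int`), the lookup tables, and the restriction to plain kanji numerals and simple "number + scale word" English expressions are all assumptions made for illustration.

```python
from decimal import Decimal

# Hypothetical lookup tables; not taken from the poster.
KANJI_DIGITS = {"〇": 0, "一": 1, "二": 2, "三": 3, "四": 4,
                "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10**4, "億": 10**8, "兆": 10**12}
ENGLISH_SCALES = {"thousand": 10**3, "million": 10**6, "billion": 10**9}


def kanji_to_int(text: str) -> int:
    """Convert a kanji numeral such as 二百十六万 to an integer."""
    total = 0    # value closed off by a large unit (万, 億, 兆)
    section = 0  # value below the next large unit
    digit = 0    # pending single digit
    for ch in text:
        if ch in KANJI_DIGITS:
            digit = KANJI_DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare 十/百/千 means 1 x unit.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in LARGE_UNITS:
            # A bare 万/億/兆 means 1 x unit.
            total += ((section + digit) or 1) * LARGE_UNITS[ch]
            section = digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


def english_to_int(text: str) -> int:
    """Convert '2.16 million' or '2,160,000' to an integer."""
    parts = text.replace(",", "").split()
    value = Decimal(parts[0])  # Decimal avoids float rounding on 2.16
    for word in parts[1:]:
        value *= ENGLISH_SCALES[word.lower()]
    return int(value)


def same_number(japanese: str, english: str) -> bool:
    """Two expressions are a correspondence candidate if they normalize alike."""
    return kanji_to_int(japanese) == english_to_int(english)
```

Under these assumptions, `same_number("二百十六万", "2.16 million")` holds because both sides normalize to 2160000, so the pair can be proposed as a correspondence candidate even though no dictionary entry links them.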
Dependency Type Distance

Japanese                    Distance    English                                          Distance
predicate: level C          6           S / SBAR / SQ …                                  5
predicate: level B+/B       5           VP / WHADVP / WHADJP                             4
predicate: level B-/A       4           ADVP / ADJP / NP / PP / INTJ / QP / PRT / PRN    3
predicate: level A-         3           Others                                           1
case no / rentai            2
Inside clause               1
Others                      3

[Figure: the example trees annotated with dependency-type distances.
J-side: 日本で (in Japan) 3, デ格 [case "de"]; 保険 (insurance) 1, 文節内 [inside clause]; 会社に対して (to company) 3, 連用 [renyou]; 保険 (insurance) 1, 文節内 [inside clause]; 請求の (claim) 2, ノ格 [case "no"]; 申し立てが (instance) 3, ガ格 [case "ga"]; 可能ですよ (you can).
E-side: you 3, NP; will have to file; insurance 1, NN; an claim 3, NP; insurance 1, NN; with the office 3, PP; in Japan 3, PP.]

Consistency Score
• "Near-Near" pair → Positive Score
• "Far-Far" pair → 0
• "Near-Far" pair → Negative Score
Pair 1: (Ds, Dt) = (1, 1) → Positive Score
Pair 2: (Ds, Dt) = (1, 7) → Negative Score
[Figure: Distribution of the distance of alignment pairs in hand-annotated data (Mainichi newspaper, 40K sentence pairs) [Uchimoto 04]. Axes: Dist of J-Side, Dist of E-Side, Frequency (log).]
[Figure: Consistency Score Function. Axes: J-Side Distance, E-Side Distance, Score.]

Experimental Result
• 500 test sentences from the Mainichi newspaper parallel corpus
• Bilingual dictionary: KENKYUSYA J-E/J-E, 500K entries
• Evaluation criteria: Precision / Recall / F-measure
• Character-based for Japanese, word-based for English

                               Pre      Rec      F
Baseline                       77.47    64.32    70.29
+Consistency Score             80.30    66.90    72.99
Proposed (+CS, +Dpnd. Type)    80.77    69.14    74.51
Filtering (80%)                82.48    71.31    76.49
Moses (SMT Toolkit)*           60.19    33.15    42.75
Manual (upper bound)           95.58    89.80    92.60
* Using 300K newspaper-domain bi-sentences for training

Quality of Other Language Pairs (AER)

                     HLT-NAACL 2003    ACL 2005    (Gildea, 2003)    GIZA++
English-French       5.71                                            15.89
English-Romanian     28.86             26.55                         27.19
English-Korean                                     32                35

Conclusion
• Proposed a new phrase alignment method using consistency criteria.
• The alignment accuracy is sufficient when compared with that of other language pairs.
• We need to acquire the parameters automatically by machine learning.
• We plan to develop the framework further so that it revises the parse result.
(A translation demo by NICT that uses our system is on show in the exhibition corner!)
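The three-way rule from the Consistency Score panel ("Near-Near" positive, "Far-Far" zero, "Near-Far" negative) can be sketched as a step function over the two tree distances (Ds, Dt). This is only an illustration: the poster plots the actual consistency score function as a surface over J-side and E-side distance and does not give its formula, so the threshold `NEAR_MAX`, the score values +1 / 0 / -1, and the function names below are hypothetical.

```python
# Hypothetical cut-off: distances up to this value count as "near".
NEAR_MAX = 2


def is_near(distance: int) -> bool:
    """Classify a tree distance as near (True) or far (False)."""
    return distance <= NEAR_MAX


def consistency_score(ds: int, dt: int) -> float:
    """Step-function stand-in for the poster's consistency score.

    ds: distance between two aligned nodes on the source (J) side
    dt: distance between their counterparts on the target (E) side
    """
    near_s, near_t = is_near(ds), is_near(dt)
    if near_s and near_t:
        return 1.0   # "Near-Near": the two correspondences support each other
    if not near_s and not near_t:
        return 0.0   # "Far-Far": no evidence either way
    return -1.0      # "Near-Far": the two correspondences are inconsistent


def total_consistency(distance_pairs):
    """Sum the score over every (ds, dt) pair in a candidate set."""
    return sum(consistency_score(ds, dt) for ds, dt in distance_pairs)


# The two pairs shown on the poster.
pair_1 = (1, 1)  # poster: Positive Score
pair_2 = (1, 7)  # poster: Negative Score
print(consistency_score(*pair_1), consistency_score(*pair_2))
```

With this stand-in, a candidate set containing Pair 2 is penalized relative to one that does not, which is the behaviour the selection step relies on. A smooth function fitted to the hand-annotated distance distribution would replace the fixed threshold in a real system.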