Statistical Machine Translation
Raghav Bashyal
Statistical Machine Translation
• Uses pre-translated text (corpora)
• Compares the translated text to the original
• Notices patterns and associates words
SMT Process
• Knight – A Statistical Translation Workbook
• Basic probabilities
  – P(word)
• Conditional probabilities
  – P(word | word)
• …
• Pick the most probable translation
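The last bullet can be sketched as follows. This is a minimal illustration, assuming the noisy-channel scoring used in Knight's workbook, where a candidate English sentence e for a foreign sentence f is scored by P(e) * P(f | e). The tables, their numbers, and the function name `best_translation` are hypothetical and exist only for this example.

```python
# Hypothetical toy tables; the probabilities are invented for illustration.
language_model = {"the house": 0.6, "house the": 0.1}          # P(e)
translation_model = {                                           # P(f | e)
    ("la casa", "the house"): 0.5,
    ("la casa", "house the"): 0.5,
}


def best_translation(foreign, candidates, p_e, p_f_given_e):
    """Return the candidate e that maximizes P(e) * P(f | e)."""
    def score(e):
        # Unseen entries get probability 0.0 in this unsmoothed sketch.
        return p_e.get(e, 0.0) * p_f_given_e.get((foreign, e), 0.0)
    return max(candidates, key=score)


best = best_translation(
    "la casa", ["the house", "house the"], language_model, translation_model
)
```

Both candidates explain the Spanish words equally well here, so the language model P(e) decides the word order.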
SMT process
http://isoft.postech.ac.kr/research/SMT/images/math.jpg
Project
• Translate basic text from Spanish to English
• Test effectiveness with and without hard-coded components (syntax)
• Specific procedures/algorithms that add speed
Literature
• Guides on Statistical Machine Translation
• Most research projects follow the same procedure as outlined by Knight
• “State of the art” implementation – Google
Literature
• NLTK
  – Modifications
• UC Berkeley
  – Larger corpora more useful
  – Syntax based
• Christina Wallin
  – Hard-coded
  – Higher translation quality when used with SMT
Procedure
• NLTK – Natural Language Toolkit
  – Python
  – Made from natural language processing projects
• Current procedure – read the SMT worksheet
  – Code along with the worksheet
Development
• Create corpora
• Tokenization
  – Clean string
• Probability
  – P(word) in corpora
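The tokenization and P(word) steps could look roughly like this. It is a sketch that uses only the Python standard library rather than NLTK, and the function names `tokenize` and `unigram_probabilities` as well as the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Clean the string: lowercase it and keep only runs of letters as tokens."""
    # [^\W\d_]+ matches Unicode letters, so accented Spanish characters survive.
    return re.findall(r"[^\W\d_]+", text.lower())


def unigram_probabilities(tokens):
    """Relative frequency: P(word) = count(word) / total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


corpus = "El gato come. El perro come pescado."   # hypothetical sample corpus
tokens = tokenize(corpus)
p_word = unigram_probabilities(tokens)
```

With a real corpus the same two functions would run over every sentence, and the resulting table would feed the language model.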
Smoothing
• Coefficients used to modify probability
  – Large coefficients for trigrams
  – Small for bigrams and single words
• Normalizes the weight of all the words/phrases
  – Trigrams are more valuable
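One common way to realize this kind of smoothing is linear interpolation of trigram, bigram, and single-word estimates. The sketch below assumes that reading; the weights (0.6, 0.3, 0.1) and the function names are illustrative choices, not values from the project.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def interpolated_probability(w1, w2, w3, tokens, weights=(0.6, 0.3, 0.1)):
    """Estimate P(w3 | w1 w2) as a weighted sum of three estimates.

    weights = (trigram, bigram, unigram) coefficients; they should sum to 1.
    """
    uni = ngram_counts(tokens, 1)
    bi = ngram_counts(tokens, 2)
    tri = ngram_counts(tokens, 3)
    c3, c2, c1 = weights
    # Each estimate falls back to 0.0 when its context was never seen.
    p_tri = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    p_bi = bi[(w2, w3)] / uni[(w2,)] if uni[(w2,)] else 0.0
    p_uni = uni[(w3,)] / len(tokens) if tokens else 0.0
    return c3 * p_tri + c2 * p_bi + c1 * p_uni


sample = "the cat sat on the mat".split()
p_seen = interpolated_probability("the", "cat", "sat", sample)
p_unseen = interpolated_probability("cat", "sat", "mat", sample)
```

Because the single-word term is never zero for a known word, an unseen trigram such as "cat sat mat" still receives a small nonzero probability, while the seen trigram keeps most of its weight.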
Algorithm
For translation, IBM Model 3 is used:
1. For each English word ei indexed by i = 1, 2, ..., l, choose fertility phi-i with probability n(phi-i | ei)
2. Choose the number phi-0 of "spurious" French words to be generated from e0 = NULL, using probability p1 and the sum of fertilities from step 1
3. Let m be the sum of fertilities for all words,
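Steps 1 and 2 can be sketched as a small sampling routine. This is an illustration under assumptions: the fertility table, its probabilities, and the function name `sample_fertilities` are hypothetical, and step 2 is read as each word generated in step 1 independently triggering one spurious word with probability p1.

```python
import random


def sample_fertilities(english_words, fertility_table, p1, rng):
    """Sample phi-i for each English word (step 1), then phi-0 (step 2)."""
    fertilities = []
    for word in english_words:
        options = fertility_table[word]            # {fertility: n(phi | word)}
        values = list(options.keys())
        probs = list(options.values())
        fertilities.append(rng.choices(values, weights=probs, k=1)[0])
    total = sum(fertilities)
    # Assumed reading of step 2: one Bernoulli(p1) trial per generated word.
    phi_0 = sum(1 for _ in range(total) if rng.random() < p1)
    return fertilities, phi_0


# Hypothetical fertility table n(phi | e); the numbers are invented.
n_table = {
    "the": {0: 0.2, 1: 0.8},
    "house": {1: 0.9, 2: 0.1},
}
rng = random.Random(0)   # seeded so repeated runs give the same sample
fertilities, phi_0 = sample_fertilities(["the", "house"], n_table, p1=0.1, rng=rng)
```

The returned fertilities say how many foreign words each English word produces, and phi_0 is the number of extra words attributed to NULL.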
Expected Results
• Probably will be very basic translation
• Highlighted errors
• Usually performs better with “sample” text than “real” text
• Program should use reference data to find some errors
• Error frequency plots for certain words
• Test the effectiveness of adjustments
  – Hard coding, other algorithms
GUI