Parallel Implementation of Word Alignment Model: IBM Model 1

Parallel Implementation of Word Alignment Model: IBM Model 1
Professor: Dr. Azimi
Fateme Ahmadi-Fakhr, Afshin Arefi, Saba Jamalian
Dept. of Electrical and Computer Engineering, Shiraz University
General-purpose Programming of Massively Parallel Graphics Processors

Machine Translation
Suppose we are asked to translate a foreign sentence f into an English sentence e. What should we do?
    f: f1 … fm
    e: e1 … el
- For each word in the foreign sentence f, we find the most suitable English word.
- Based on our knowledge of the English language, we change the order of the generated English words.
- We might also need to change the words themselves.
    f1 f2 f3 … fm
    e1 e2 e3 … em
    e1 e3 e2 … em+1 … el

Example
Source sentence: امروز صبح به مدرسه رفتم
- Translation model (finding the most suitable English word for each source word): today morning to school went
- Language model (reordering and changing the words): this morning I went to school

Statistical Translation Models
Translation model (finding the most suitable English word for each source word):
    امروز صبح به مدرسه رفتم
    today morning to school went
    t(go | رفتم) > t(x | رفتم), where x ranges over all other English words
- The machine must know t(e|f) for all possible e and f to find the maximum.
- The machine must be trained: IBM Models 1-5.
- Calculate t(f|e).
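The "find the max" step on this slide can be sketched as a lookup over a trained table. This is a minimal host-side sketch, not the authors' code: the function name best_english_word and the dense row-major layout (one row per English word, one column per foreign word) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper. t is a dense row-major table with num_e rows and
// num_f columns (an assumed layout): t[e * num_f + f] holds the score that
// links English word e to foreign word f. Returns the index of the English
// word with the highest score for foreign word f (first one wins on ties).
std::size_t best_english_word(const std::vector<float>& t,
                              std::size_t num_e, std::size_t num_f,
                              std::size_t f) {
    assert(num_e > 0 && f < num_f && t.size() == num_e * num_f);
    std::size_t best = 0;
    for (std::size_t e = 1; e < num_e; ++e) {
        if (t[e * num_f + f] > t[best * num_f + f]) best = e;
    }
    return best;
}
```

The scan is linear in the number of English words, which is why the table has to be known for every candidate word before a translation can be chosen.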

IBM Model 1 (Brown et al. [1993])
Corpus (large body of text) → Model 1 → t(f|e) for all e and f that occur in the corpus

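IBM Model 1 (Brown et al. [1993])
Choose an initial value for t(f|e) for all f and e, then repeat the following steps until convergence:
1. Process all sentence pairs, accumulating C(f|e) and total(e) from the current t(f|e).
2. Update t(f|e) from C(f|e) and total(e).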

IBM Model 1 (Brown et al. [1993])
The problem is to find t(f|e) for all e and f.
t(f|e) is a table indexed by the English word ei and the foreign word fj. Entry (ei, fj) states how probable it is that fj is the translation of ei.

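IBM Model 1 (Brown et al. [1993])
- t(f|e): table indexed by (ei, fj); initialized to a starting value.
- c(f|e): table of the same shape, indexed by (ei, fj); initialized to zero.
- total(e): one entry per ei, the sum of the corresponding row of c(f|e); initialized to zero.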

IBM Model 1 (Brown et al. [1993])
In each sentence pair, for each f in the foreign sentence, we calculate ∑ t(f|e) over all e in the English sentence. These sums are called totals.
Suppose we are given <f(s), e(s)> = <(f1 f2 f3), (e1 e2 e3 e4)>. Then, for f2 and e1:
    totals[2]    = t(f|e)[1,2] + t(f|e)[2,2] + t(f|e)[3,2] + t(f|e)[4,2]
    C(f|e)[1,2] += t(f|e)[1,2] / totals[2]
    total_e[1]  += t(f|e)[1,2] / totals[2]
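As a worked instance of the three updates above, assume purely for illustration that every t(f|e) entry touched by this sentence pair currently equals 0.25 (a value that does not come from the slides). The contribution of f2 to e1 then works out as:

```latex
\begin{align*}
\text{totals}[2] &= t[1,2] + t[2,2] + t[3,2] + t[4,2] = 4 \times 0.25 = 1 \\
C[1,2] &\mathrel{+}= \frac{t[1,2]}{\text{totals}[2]} = \frac{0.25}{1} = 0.25 \\
\text{total}_e[1] &\mathrel{+}= \frac{t[1,2]}{\text{totals}[2]} = 0.25
\end{align*}
```

Each of the four English words receives the same share of 0.25 from f2 under this assumption, so the fractional counts contributed by one foreign word add up to 1.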

IBM Model 1 (Brown et al. [1993])
After processing all sentence pairs in the corpus, update the value of t(f|e) for all e and f:
    t(f|e)[i,j] = C(f|e)[i,j] / total(e)[i]
Then process the sentence pairs again, calculating C(f|e) and total(e) from the updated t(f|e).
Continue the process until the value of t(f|e) has converged.

IBM Model 1 (Pseudocode)

    initialize t(f|e)                              // initialization
    do until convergence
        c(f|e) = 0 for all e and f                 // initialize to zero
        total(e) = 0 for all e
        for all sentence pairs s do
            total(s, f) = 0 for all f
            for all f in f(s) do                   // calculating totals for each f in f(s)
                for all e in e(s) do
                    total(s, f) += t(f|e)
            for all e in e(s) do                   // calculating c(f|e) and total(e)
                for all f in f(s) do
                    c(f|e)   += t(f|e) / total(s, f)
                    total(e) += t(f|e) / total(s, f)
        for all e do                               // updating t(f|e) using c(f|e) and total(e)
            for all f do
                t(f|e) = c(f|e) / total(e)
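The pseudocode above can be turned into a small sequential reference, which is useful for checking a parallel version on toy data. This is a sketch under stated assumptions, not the authors' CPU implementation: the names SentencePair and train_model1, the use of integer word indices, the row-major table layout, and the fixed iteration count are all illustrative choices. It also keeps total(s, f) in a local variable per foreign word instead of an array.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One sentence pair as word indices: first = foreign words f(s),
// second = English words e(s). The type name is hypothetical.
using SentencePair = std::pair<std::vector<int>, std::vector<int>>;

// Sequential IBM Model 1 training sketch. Returns a row-major table with
// num_e rows and num_f columns; entry [e * num_f + f] estimates t(f|e).
std::vector<float> train_model1(const std::vector<SentencePair>& corpus,
                                int num_e, int num_f, int iterations) {
    assert(num_e > 0 && num_f > 0);
    const std::size_t cols = static_cast<std::size_t>(num_f);
    // Uniform start value, as on the initialization slide (unscaled here).
    std::vector<float> t(static_cast<std::size_t>(num_e) * cols, 1.0f / num_f);
    std::vector<float> count(t.size());
    std::vector<float> total_e(static_cast<std::size_t>(num_e));

    for (int it = 0; it < iterations; ++it) {
        std::fill(count.begin(), count.end(), 0.0f);
        std::fill(total_e.begin(), total_e.end(), 0.0f);

        for (const SentencePair& pair : corpus) {
            for (int f : pair.first) {
                // total(s, f): sum of t(f|e) over the English words of this pair.
                float total_sf = 0.0f;
                for (int e : pair.second) total_sf += t[e * cols + f];
                // Spread one fractional count for f over those English words.
                for (int e : pair.second) {
                    const float share = t[e * cols + f] / total_sf;
                    count[e * cols + f] += share;
                    total_e[e] += share;
                }
            }
        }

        // Re-estimate t(f|e) from the collected counts.
        for (std::size_t e = 0; e < total_e.size(); ++e) {
            if (total_e[e] == 0.0f) continue;  // English word never seen
            for (std::size_t f = 0; f < cols; ++f)
                t[e * cols + f] = count[e * cols + f] / total_e[e];
        }
    }
    return t;
}
```

On a two-pair toy corpus in which one foreign word and one English word occur in both pairs, the entry linking those two words grows with each iteration, which gives a quick sanity check before comparing against GPU output.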

Parallelizing IBM Model 1

    initialize t(f|e)
    do until convergence
        c(f|e) = 0 for all e and f
        total(f) = 0 for all f
        for all sentence pairs s do                // the processing of each sentence pair is independent of the others
            total(s, f) = 0 for all f
            for all e in e(s) do                   // each (f, e) is independent of the others
                for all f in f(s) do
                    total(s, f) += t(f|e)
            for all e in e(s) do
                for all f in f(s) do
                    c(f|e)   += t(f|e) / total(s, f)
                    total(f) += t(f|e) / total(s, f)
        for all e do                               // updating each t(f|e), for all e and f, is independent of the others
            for all f do
                t(f|e) = c(f|e) / total(f)

Initialize t(f|e)
Each thread initializes one entry of t(f|e) to a specified value:

    __global__ void initialize(float* device_t_f_e) {
        int pos = blockIdx.x * blockDim.x + threadIdx.x;
        // Unscaled uniform value; underflow is possible:
        // device_t_f_e[pos] = 1.0f / NUM_F;
        // Scaled by 100000. The float literal avoids integer division.
        device_t_f_e[pos] = 100000.0f / NUM_F;
    }

Process of Each Sentence Pair
Each thread processes one sentence pair.

    for all sentence pairs s do
        total(s, f) = 0 for all f                  // using shared memory
        for all e in e(s) do
            for all f in f(s) do
                total(s, f) += t(f|e)
        for all e in e(s) do
            for all f in f(s) do
                c(f|e)   += t(f|e) / total(s, f)
                total(f) += t(f|e) / total(s, f)

- No reduction is used. Why?
- Use atomicAdd(), because two or more threads may add a value to the same c(f|e) or total(f) entry simultaneously. Whether this happens is data dependent.

Updating t(f|e)
Each thread updates one entry of t(f|e) and sets one entry of c(f|e) to zero for the next iteration:

    __global__ void update(float* device_t_f_e, float* device_count_f_e,
                           float* device_total_f, int block_size, int Col) {
        int pos = blockIdx.x * block_size + threadIdx.x;
        float total = device_total_f[pos / Col];   // total of the row this entry belongs to
        float count = device_count_f_e[pos];
        device_t_f_e[pos] = 100000 * count / total;
        device_count_f_e[pos] = 0;
    }

total(f) cannot be set to zero here, because there is no synchronization between threads of different blocks: another thread may still need to read the same total(f) entry.
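The index arithmetic in the update kernel can be checked on the host with a plain loop in which each loop iteration plays the role of one thread. This is only a sketch: the name update_host, the scale parameter (standing in for the constant 100000), and the guard against a zero total are assumptions added for illustration and are not part of the kernel on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side mirror of the per-thread work in the update kernel. Entry pos
// of the flat table lies in row pos / cols, so it is divided by that row's
// total. Every count entry is reset for the next iteration.
void update_host(std::vector<float>& t, std::vector<float>& count,
                 const std::vector<float>& total, std::size_t cols,
                 float scale) {
    assert(cols > 0 && t.size() == count.size());
    assert(total.size() * cols == t.size());
    for (std::size_t pos = 0; pos < t.size(); ++pos) {
        const float row_total = total[pos / cols];
        // Assumption: a row whose total is zero keeps its old values, which
        // avoids a division by zero that the kernel does not guard against.
        if (row_total != 0.0f) t[pos] = scale * count[pos] / row_total;
        count[pos] = 0.0f;
    }
}
```

Because every position reads only its own count and one shared row total, the iterations do not depend on each other, which matches the one-thread-per-entry mapping described on the slide.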

Setting total(f) to Zero
Each thread sets one entry of total(f) to zero:

    __global__ void total(float* device_total_f) {
        int pos = threadIdx.x + blockDim.x * blockIdx.x;
        device_total_f[pos] = 0;
    }

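Results
Columns on the slide: NUM_F, NUM_E, #SENTPAIR, CPU-Time, GPU-Time, Speed-Up. Each row retains only two of the three size values, so they are listed together.

    Sizes         CPU-Time   GPU-Time   Speed-Up
    2048  512     0.452049   0.061639    7.33
    4096  1024    1.736251   0.157878   10.99
    4096  2048    1.857686   0.157961   11.76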

Future Goals
- Convergence condition:
    - We currently repeat the calculation of C(f|e) and t(f|e) for 5 iterations.
    - Instead, the stopping condition should be derived from the value of t(f|e).
    - We wish to add this to our code, as it can be parallelized.
- This is just one of IBM Models 1-5, which are implemented in the GIZA++ package.
- We wish to parallelize the 4 other models.
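One possible way to derive the stopping condition from t(f|e) is to compare the table before and after an update and stop once no entry has moved by more than a small threshold. The slides do not specify a criterion, so the rule below, the names max_abs_change and has_converged, and the epsilon parameter are all assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute change of any entry of t between two iterations.
float max_abs_change(const std::vector<float>& previous,
                     const std::vector<float>& current) {
    assert(previous.size() == current.size());
    float largest = 0.0f;
    for (std::size_t i = 0; i < current.size(); ++i)
        largest = std::max(largest, std::fabs(current[i] - previous[i]));
    return largest;
}

// Hypothetical stopping rule: stop once no entry moved by more than epsilon.
bool has_converged(const std::vector<float>& previous,
                   const std::vector<float>& current, float epsilon) {
    return max_abs_change(previous, current) <= epsilon;
}
```

The per-entry differences are independent and the maximum is a reduction, so a check of this shape should map onto the GPU; if t(f|e) is stored scaled by 100000 as in the kernels, epsilon would need the same scaling.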

We Want to Express Our Appreciation to:
- Dr. Fazly, for her useful comments and valuable remarks.
- Dr. Azimi, for his kindness and full support.
