Large Scale Machine Translation Architectures
Qin Gao, LTI, CMU
2020/9/10
Outline
Typical problems in machine translation
Programming model for machine translation: MapReduce
Required system components
Supporting software
◦ Distributed streaming data storage system
◦ Distributed structured data storage system
Integrating the system: how to build a fully distributed MT system

Why large-scale MT?
We need more data. But…

Some representative MT problems
Counting events in corpora
◦ N-gram counting
Sorting
◦ Phrase table extraction
Preprocessing data
◦ Parsing, tokenization, etc.
Iterative optimization
◦ GIZA++ (all EM algorithms)

Characteristics of different tasks
Counting events in corpora
◦ Extract knowledge from the data
Sorting
◦ Process the data; the knowledge is inside the data
Preprocessing data
◦ Process the data; requires external knowledge
Iterative optimization
◦ In each iteration, process the data using existing knowledge, then update the knowledge

Components required for large-scale MT
Data
Knowledge

Components required for large-scale MT
Data stream
Data processor
Structured knowledge

Problems for each component
Stream data:
◦ As the amount of data grows, even a single complete pass over the data becomes impractical.
Processor:
◦ A single processor's computation power is not enough.
Knowledge:
◦ The tables are too large to fit into memory.
◦ Cache-based or distributed knowledge bases suffer from low access speed.

Making it simple: what is the underlying problem?
We have a huge cake, and we want to cut it into pieces and eat them. Different cases:
◦ We just need to eat the cake.
◦ We also want to count how many peanuts are inside the cake.
◦ (Sometimes) we have only one fork!

Parallelization
Data
Knowledge

Solutions
Large-scale distributed processing
◦ MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean, Sanjay Ghemawat. Communications of the ACM, vol. 51, no. 1 (2008), pp. 107-113.
Handling huge streaming data
◦ The Google File System. Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. Proceedings of the 19th ACM Symposium on Operating Systems Principles, 2003, pp. 20-43.
Handling structured data
◦ Large Language Models in Machine Translation. Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, Jeffrey Dean. Proceedings of EMNLP-CoNLL 2007, pp. 858-867.
◦ Bigtable: A Distributed Storage System for Structured Data. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber. 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006, pp. 205-218.

MapReduce can refer to:
◦ A programming model for massive, unordered, streaming data processing tasks (MUD)
◦ A supporting software environment implemented by Google Inc.
Alternative implementation:
◦ Hadoop, by the Apache Foundation

The MapReduce programming model
Abstracts the computation into two functions:
◦ Map
◦ Reduce
The user is responsible for implementing the Map and Reduce functions; the supporting software takes care of executing them.

Representation of data
The streaming data is abstracted as a sequence of key/value pairs. Example:
◦ (sentence_id : sentence_content)

Map function
The Map function takes an input key/value pair and outputs a set of intermediate key/value pairs:

(Key1 : Value1) → Map() → (Key1 : Value1), (Key2 : Value2), (Key3 : Value3), …
(Key2 : Value2) → Map() → (Key1 : Value2), (Key2 : Value1), (Key3 : Value3), …

Reduce function
The Reduce function accepts one intermediate key and a set of intermediate values, and produces the result:

(Key1 : Value1), (Key1 : Value2), (Key1 : Value3), … → Reduce() → Result
(Key2 : Value1), (Key2 : Value2), (Key2 : Value3), … → Reduce() → Result
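As a sketch of how the two functions fit together, here is a toy single-process simulation of the model. `run_mapreduce` and its arguments are illustrative names, not the Google API; the `sort`/`groupby` step stands in for the framework's shuffle phase.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: each input key/value pair may emit any number of
    # intermediate key/value pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle/sort phase: the framework groups intermediate pairs by key.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one call per distinct intermediate key, with all its values.
    return {key: reduce_fn(key, [v for _, v in group])
            for key, group in groupby(intermediate, key=itemgetter(0))}

# Word count as map/reduce: map emits (word, 1); reduce sums the ones.
counts = run_mapreduce(
    [(1, "the cat"), (2, "the dog")],
    lambda sid, text: [(w, 1) for w in text.split()],
    lambda word, ones: sum(ones))
# counts == {"cat": 1, "dog": 1, "the": 2}
```

Note that the reducer for one key never sees the values of any other key, which is exactly the independence requirement discussed below.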

The architecture of MapReduce
Map function → Distributed sort → Reduce function

Benefits of MapReduce
Automatic data splitting
Fault tolerance
High-throughput computing that uses the nodes efficiently
Most important: simplicity. You just need to convert your algorithm to the MapReduce model.

Requirements for expressing an algorithm in MapReduce
Process unordered data
◦ The data must be unordered: no matter in what order the data is processed, the result should be the same.
Produce independent intermediate keys
◦ The Reduce function cannot see the values of other keys.

Example: Distributed word count
Distributed word count (1)
◦ Input key: word
◦ Input value: 1
◦ Intermediate key: constant
◦ Intermediate value: 1
◦ Reduce(): sum all intermediate values
Distributed word count (2)
◦ Input key: document/sentence ID
◦ Input value: document/sentence content
◦ Intermediate key: constant
◦ Intermediate value: number of words in the document/sentence
◦ Reduce(): sum all intermediate values
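Variant (2) above can be sketched in a few lines; `map_total_words` and `reduce_sum` are illustrative names, and the framework's shuffle is simulated by hand since there is only one intermediate key.

```python
def map_total_words(doc_id, text):
    # Variant (2): one intermediate pair per document, all under a single
    # constant key, valued with the document's word count.
    return [("TOTAL", len(text.split()))]

def reduce_sum(key, values):
    # Reduce(): sum all intermediate values for the key.
    return sum(values)

# Map every document, then reduce the single constant key.
docs = [(1, "the quick brown fox"), (2, "jumps over")]
pairs = [pair for doc in docs for pair in map_total_words(*doc)]
total = reduce_sum("TOTAL", [v for _, v in pairs])
# total == 6
```

Because every pair shares one key, all values funnel to a single reducer; variant (2) just emits far fewer intermediate pairs than variant (1).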

Example 2: Distributed unigram count
◦ Input key: document/sentence ID
◦ Input value: document/sentence content
◦ Intermediate key: word
◦ Intermediate value: number of occurrences of the word in the document/sentence
◦ Reduce(): sum all intermediate values
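A minimal sketch of this scheme, with the shuffle hand-rolled as a dictionary of lists (`map_unigrams` and `reduce_counts` are illustrative names):

```python
from collections import Counter

def map_unigrams(doc_id, text):
    # Emit (word, count-in-this-document): the map side pre-aggregates,
    # so fewer intermediate pairs cross the network.
    return list(Counter(text.split()).items())

def reduce_counts(word, values):
    return sum(values)

# Hand-rolled shuffle: group intermediate values by word, then reduce.
docs = [(1, "a b a"), (2, "b c")]
grouped = {}
for doc in docs:
    for word, n in map_unigrams(*doc):
        grouped.setdefault(word, []).append(n)
unigrams = {w: reduce_counts(w, vs) for w, vs in grouped.items()}
# unigrams == {"a": 2, "b": 2, "c": 1}
```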

Example 3: Distributed sort
◦ Input key: entry key
◦ Input value: entry content
◦ Intermediate key: entry key (modification may be needed for ascending/descending order)
◦ Intermediate value: entry content
◦ Reduce(): output all the entry contents
Makes use of the built-in sorting functionality.
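The trick here is that the map and reduce functions do almost nothing: the framework's shuffle/sort phase does the work. In this sketch, `sorted()` plays the role of the distributed sort, and `map_entry` is an illustrative name.

```python
def map_entry(key, content):
    # Identity map: the entry key becomes the intermediate key, so the
    # framework's shuffle/sort phase does the actual sorting work.
    # (For descending numeric order, one could emit a negated key instead.)
    return [(key, content)]

# sorted() here stands in for the framework's distributed sort.
entries = [(3, "c"), (1, "a"), (2, "b")]
intermediate = sorted(pair for entry in entries for pair in map_entry(*entry))
ordered = [content for _, content in intermediate]
# ordered == ["a", "b", "c"]
```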

Supporting MapReduce: distributed storage
Reminder of what we are dealing with in MapReduce:
◦ Massive, unordered, streaming data
Motivation:
◦ We need to store large amounts of data
◦ Make use of the storage on all the nodes
◦ Automatic replication: fault tolerance, and avoiding hot spots (a client can read from many servers)
Examples: Google FS and Hadoop FS (HDFS)

Design principles of Google FS
Optimized for a special workload:
◦ Large streaming reads, small random reads
◦ Large streaming writes, rare modification
Supports concurrent appending
◦ It effectively assumes data are unordered
High sustained bandwidth is more important than low latency; fast response time is not important
Fault tolerant

Google FS architecture
Optimized for large streaming reads and large, concurrent writes
Small random reads/writes are also supported, but not optimized
Allows appending to existing files
Files are split into chunks and stored on several chunk servers
A master is responsible for the storage and lookup of chunk locations

Google FS architecture (figure)

Replication
When a chunk is frequently or simultaneously read by many clients, the serving node may become overloaded or fail
A fault in one node may make the file unusable
Solution: store each chunk on multiple machines. The number of replicas of each chunk is the replication factor.

HDFS
Shares the design principles of Google FS
Write-once-read-many: a file can only be written once; even appending is not allowed
"Moving computation is cheaper than moving data"

Are we done? NO…
Problems with the existing architecture

We are good at dealing with data
What about knowledge, i.e. structured data?
What if the size of the knowledge is HUGE?

A good example: GIZA++
A typical EM algorithm (flowchart): align each sentence and collect counts; when there are no more sentences, normalize the counts; repeat until there are no more iterations.

When parallelized, it seems to be a perfect MapReduce application: the per-sentence word alignment and count collection run on the cluster; the counts are then normalized, and the next iteration begins.

However: memory
(figure) The large parallel corpus is split into corpus chunks; Map workers produce count tables; Reduce combines them into a combined count table; renormalization produces the statistical lexicon, which is redistributed for the next iteration. Both the Map workers and the renormalization step hold large tables in memory, with heavy data I/O in between.

Huge tables
The lexicon probability table (T-table) is up to 3 GB in the early stages
As the number of workers increases, every worker needs to load this 3 GB file!
And all the nodes need 3 GB+ of memory: do we need a cluster of supercomputers?

Another example: decoding
Consider language models: what can we do if the language model grows to several terabytes?
We need a storage/query mechanism for large, structured data
Considerations:
◦ Distributed storage
◦ Fast access: the network has high latency

The Google language model
Storage:
◦ Central storage or distributed storage
How to deal with latency?
◦ Modify the decoder: collect a number of queries and send them in one batch.
It is a specific application; we still need something more general.
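The batching idea can be sketched as follows. `batched_lm_lookup`, `ToyLMServer`, and `query_batch` are all hypothetical names for illustration, not the actual Google LM interface; the point is only that many n-gram queries share one round trip.

```python
class ToyLMServer:
    # Stand-in for one remote LM shard; a real server answers over the network.
    def __init__(self, table):
        self.table = table

    def query_batch(self, ngrams):
        return {ng: self.table.get(ng, 0.0) for ng in ngrams}

def batched_lm_lookup(ngrams, server, batch_size=1000):
    # Amortize network latency: send many n-gram probability queries per
    # round trip instead of one at a time.
    probs = {}
    batch = []
    for ng in ngrams:
        batch.append(ng)
        if len(batch) == batch_size:
            probs.update(server.query_batch(batch))
            batch = []
    if batch:  # flush the final partial batch
        probs.update(server.query_batch(batch))
    return probs

server = ToyLMServer({("the", "cat"): 0.2})
probs = batched_lm_lookup([("the", "cat"), ("a", "dog")], server, batch_size=2)
# probs == {("the", "cat"): 0.2, ("a", "dog"): 0.0}
```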

Again, made by Google: Bigtable
It is specially optimized for structured data
Serving many applications now
It is not a complete database
Definition:
◦ A Bigtable is a sparse, distributed, persistent, multi-dimensional, sorted map

Data model in Bigtable
A four-dimensional table:
◦ Row
◦ Column family
◦ Column
◦ Timestamp
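The logical model ("a sparse, sorted map") can be mimicked with a plain dictionary keyed by the four dimensions. `bt_set`/`bt_get_latest` are illustrative names, and the example row key follows the style of the Bigtable paper's examples; a real Bigtable additionally keeps the map sorted and distributed across tablets.

```python
def bt_set(table, row, family, column, timestamp, value):
    # The logical model: a sparse map keyed by
    # (row, "family:qualifier", timestamp) -> an uninterpreted byte string.
    table[(row, f"{family}:{column}", timestamp)] = value

def bt_get_latest(table, row, family, column):
    # Return the value with the largest timestamp for this cell, if any.
    cells = [(ts, v) for (r, c, ts), v in table.items()
             if r == row and c == f"{family}:{column}"]
    return max(cells)[1] if cells else None

t = {}
bt_set(t, "com.cnn.www", "contents", "", 1, b"<html>v1")
bt_set(t, "com.cnn.www", "contents", "", 2, b"<html>v2")
# bt_get_latest(t, "com.cnn.www", "contents", "") == b"<html>v2"
```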

Distributed storage unit: tablet
A tablet consists of a range of rows
Tablets can be stored on different nodes and served by different servers
Concurrent reading of multiple rows can be fast

Random access unit: column family
Each tablet is a string-to-string map
(Though not stated explicitly, the API suggests that) at the level of a column family, the index is loaded into memory, so fast random access is possible
The set of column families should be fixed

Tables inside tables: column and timestamp
A column can be any arbitrary string value
A timestamp is an integer
A value is a byte array
Effectively, it is a table of tables

Performance
Measured as the number of 1000-byte values read/written per second. What is shocking:
◦ Effective I/O for random reads (from GFS) is more than 100 MB/second
◦ Effective I/O for random reads from memory is more than 3 GB/second

An example: phrase table
Row: first bigram/trigram of the source phrase
Column family: length of the source phrase, or some hash of the remaining part of the source phrase
Column: remaining part of the source phrase
Value: all the phrase pairs for the source phrase
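A minimal sketch of one possible key layout following the slide (using the first-bigram row key and the phrase-length column family); `phrase_table_keys` is a hypothetical helper, not part of any real system.

```python
def phrase_table_keys(source_phrase):
    # Hypothetical layout: row key = first bigram of the source phrase,
    # column family = source phrase length, column = the remaining words.
    words = source_phrase.split()
    row = " ".join(words[:2])
    family = str(len(words))
    column = " ".join(words[2:])
    return row, family, column

# phrase_table_keys("the big red dog") == ("the big", "4", "red dog")
```

Because rows sharing a first bigram land in the same tablet, all phrases with a common prefix can be scanned from one server while distinct prefixes spread across the cluster.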

Benefits
Different source phrases come from different servers
The load is balanced, and reading can be concurrent and much faster
Filtering the phrase table before decoding becomes much more efficient

Another example: GIZA++
Lexicon table:
◦ Row: source word ID
◦ Column family: (nothing)
◦ Column: target word ID
◦ Value: the probability value
With a simple local cache, table loading can be extremely efficient compared to the current implementation
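The local-cache idea can be sketched as a read-through cache in front of the remote table; `CachedLexicon` and `fetch_row` are hypothetical names, and the in-memory `remote_rows` dict stands in for the distributed store.

```python
class CachedLexicon:
    # Hypothetical read-through cache in front of a Bigtable-like lexicon
    # table: rows are source word IDs, columns are target word IDs,
    # values are translation probabilities.
    def __init__(self, fetch_row):
        self.fetch_row = fetch_row  # callback that reads one row remotely
        self.cache = {}

    def prob(self, src_id, tgt_id):
        if src_id not in self.cache:  # one remote read per source word
            self.cache[src_id] = self.fetch_row(src_id)
        return self.cache[src_id].get(tgt_id, 0.0)

remote_rows = {0: {5: 0.7, 9: 0.3}}
lex = CachedLexicon(lambda src: remote_rows.get(src, {}))
# lex.prob(0, 5) == 0.7; a second query for source word 0 hits the cache.
```

Each worker then fetches only the rows for the source words in its corpus chunk, instead of loading the whole multi-gigabyte T-table.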

Conclusion
Strangely, this talk is all about how Google does it
A useful framework for distributed MT systems requires three components:
◦ MapReduce software
◦ A distributed streaming data storage system
◦ A distributed structured data storage system

Open-source alternatives
MapReduce library → Hadoop
Google FS → Hadoop FS (HDFS)
Bigtable → Hypertable

THANK YOU!