Google N-gram Data Analyzer
Project and Presentation by: Anagha Dharasurkar, Andrew Norgren, Premchand Bellamkonda, Shruti Pandey, Salil Bapat
System Overview (architecture diagram)
1-gram and 2-gram data --> Unigram Cutoff / Associative Cutoff
Disjoint Network Module --> Disjoint Networks
Reverse Network Creator --> Reverse Networks --> Network Interface Module
Target Query --> Path Finder Module --> Output Paths
Module Breakdown
Ø Reverse Network Building (Salil). Input: 2-gram data. Output: Reverse network.
Ø Network Interface Module (Salil). Input: Reverse network. Output: Linked list.
Ø Distribution of Data (Premchand). Input: Unigram data. Output: Array of unigrams.
Ø Find Disjoint Network (Shruti). Input: Array of unigrams, bigram data. Output: Linked list, files.
Ø Agglomeration (Andrew). Input: Number of networks for each processor, linked lists. Output: Number of disjoint networks.
Ø Path Finder Module (Anagha). Input: Target query file. Output: Query paths.
Ø Number of Edges (Premchand). Input: Number of networks. Output: Number of edges.
Overall outputs: query paths, number of disjoint networks, number of nodes in each disjoint network, number of edges.
Network Building Background
Ø Using what we have: the 2-gram data.
Ø Building a reverse network.
Ø Storing whatever is built.
Network Details
Ø The folder structure: 1 root directory, 63 second-level directories, and a third level of directories with a file inside each.
Ø Consider the bigram “match day 2000”.
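The reversal step for a 2-gram line can be sketched as follows. This is a minimal illustration, not the project's actual code (which writes the reversed entries into the directory-per-prefix file structure above); the function name and output format are hypothetical. Given the line "match day 2000", the reverse network records that "match" precedes "day" with frequency 2000:

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* Parse one 2-gram line "first second count" and emit the reversed
 * entry "second <- first (count)" used to build the reverse network.
 * Returns 0 on success, -1 if the line does not parse. */
static int reverse_bigram(const char *line, char *out, size_t outsz) {
    char first[64], second[64];
    long count;
    if (sscanf(line, "%63s %63s %ld", first, second, &count) != 3)
        return -1;
    snprintf(out, outsz, "%s <- %s (%ld)", second, first, count);
    return 0;
}
```

For example, `reverse_bigram("match day 2000", buf, sizeof buf)` fills `buf` with `day <- match (2000)`, which is the lookup direction the path finder needs for "to" links.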
Parallelism Details
Ø Block allocation.
Ø Lines are distributed among processors instead of files.
Ø Processor 0 sends to each processor:
  - the number of lines it has to process
  - the file number from which it should start
  - the starting line
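The block allocation above can be sketched with the standard block-decomposition formulas; this is an assumption about the exact scheme (the slide does not give formulas), and the real code additionally maps the global start line to a (file number, line-in-file) pair:

```c
#include <assert.h>

/* Block decomposition of n lines among np processors: processor p
 * owns lines [block_low(p), block_low(p+1)), so sizes differ by at
 * most one and the blocks exactly cover all n lines. */
static long block_low(int p, int np, long n) {
    return (long)p * n / np;
}

static long block_size(int p, int np, long n) {
    return block_low(p + 1, np, n) - block_low(p, np, n);
}
```

Processor 0 would evaluate these for every rank and send each processor its line count and starting position.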
Benchmarking
Processors   Timing          Number of 2-gram files
16           3 hrs 56 mins   32
32           2 hrs 47 mins   32
64           1 hr 58 mins    32
Space Requirements
Ø Not much to store in memory.
Ø Large disk requirements: around 5.5 GB for the Google 2-gram data.
Ø As a general rule, the reverse network will be approximately the same size as the original data.
Finding Disjoint Networks Module
Description: This module finds the disjoint networks in the Google 2-gram data. It takes unigrams as input and, for each unigram, gets all the tokens connected to it and processes them as described later to find the disjoint network.
Approach
We exploit the simple fact that if any word is common to two networks of words, then the two networks are connected.
Example:
Network 1: A --> B --> C --> D --> E --> Z
Network 2: Q --> Y --> P --> R --> S --> A --> V
Here A is common to both networks, so the two networks are connected.
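The connectivity test above reduces to checking for a shared word. A minimal sketch, using string arrays in place of the project's linked lists (the function name is illustrative):

```c
#include <string.h>
#include <assert.h>

/* Two networks are connected if and only if they share at least one
 * word. Returns 1 if a common word exists, 0 otherwise. */
static int share_word(const char **a, int na, const char **b, int nb) {
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
            if (strcmp(a[i], b[j]) == 0)
                return 1;
    return 0;
}
```

On the slide's example, the two networks share "A", so `share_word` reports them as connected and they would be merged into one.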
Distribution of Data
Ø Distributes the unigram data.
Ø Follows block distribution.
Ø Finds the number of lines in the unigram file.
Ø Then finds the interval for the block distribution.
Data Structure Used
We have used a two-dimensional linked list structure. The first linked list (the network list) contains all the connected words, and the second linked list (the base list) connects all the network lists.
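The two-dimensional structure can be sketched as two node types, one chaining words within a network and one chaining the networks themselves. The type and helper names are assumptions; only the shape of the structure comes from the slide:

```c
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* One word inside a network (the "network list" dimension). */
typedef struct WordNode {
    char word[64];
    struct WordNode *next;    /* next word in the same network */
} WordNode;

/* One network in the base list (the second dimension). */
typedef struct NetworkNode {
    WordNode *words;          /* head of this network's word list */
    struct NetworkNode *next; /* next network in the base list */
} NetworkNode;

/* Prepend a word to a network's word list. */
static WordNode *add_word(WordNode *head, const char *w) {
    WordNode *n = malloc(sizeof *n);
    strncpy(n->word, w, sizeof n->word - 1);
    n->word[sizeof n->word - 1] = '\0';
    n->next = head;
    return n;
}

/* Prepend a network to the base list. */
static NetworkNode *add_network(NetworkNode *base, WordNode *words) {
    NetworkNode *n = malloc(sizeof *n);
    n->words = words;
    n->next = base;
    return n;
}

/* Walk the base list to count the networks found so far. */
static int count_networks(const NetworkNode *base) {
    int c = 0;
    for (; base; base = base->next)
        c++;
    return c;
}
```

Merging two connected networks then amounts to splicing one word list onto another and dropping the emptied base node.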
Working
1. Get the root tokens.
2. Get the words connected to the root tokens.
3. If it is the first root token, create a new network.
4. If it is not the first root token, process each word one by one. The nature of this network is one of:
  - connected to one existing network
  - connected to some network different from the marked network
  - not connected at all
Working (contd.)
Cases:
1. None of the words of root token 2 exist in root token 1's network: a new network is created.
2. Some word already exists in an existing network: the new words are added to that network.
Working (contd.)
3. A word is common to a network different from the marked network: the two existing networks are merged into one.
Animated Example
(animated diagram of network construction)
Observations & Conclusion
Ø Execution takes a lot of time.
Ø The 2gm-0031 data has 1869 networks.
Ø Initially fast; execution slows down as the network size increases.
Ø Use of a linked list of arrays for speeding up the process.
Agglomeration
Ø Combines the work of all processors.
Ø Finds:
  - the number of disjoint networks
  - the number of nodes in each network
Ø For this step to work:
  - Processors 0 and k (k = np/2) have networks in linked lists
  - the other processors have written out their networks to files
How It Works
Ø Processors 1 to k-1 send their “local” number of networks to Processor 0.
Ø Processors k+1 to (number of processors) - 1 send theirs to Processor k.
Ø Processors 0 and k combine networks:
  - open the files and check whether a word is in their networks
  - yes: combine the two networks (eliminating redundancy)
  - no: add that network to the list of networks
Final Step
Ø Processor k writes its networks to files.
Ø It sends its number of networks to Processor 0.
Ø Processor 0 then combines those networks.
Ø Results:
  - Processor 0 has the list of disjoint networks
  - it prints out the number of disjoint networks
  - it prints out the number of nodes in each network
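The "combine, eliminating redundancy" step during agglomeration can be sketched as a duplicate-skipping merge. This is a simplified stand-in using fixed-size string arrays rather than the project's linked lists; the function name and capacity handling are assumptions:

```c
#include <string.h>
#include <assert.h>

/* Merge network src into network dst, skipping words dst already
 * contains, as Processors 0 and k do when two networks share a word.
 * Returns the new size of dst; cap bounds dst's capacity. */
static int merge_networks(char dst[][64], int ndst,
                          char src[][64], int nsrc, int cap) {
    for (int i = 0; i < nsrc; i++) {
        int dup = 0;
        for (int j = 0; j < ndst; j++) {
            if (strcmp(src[i], dst[j]) == 0) {
                dup = 1;   /* word already present: redundant */
                break;
            }
        }
        if (!dup && ndst < cap)
            strcpy(dst[ndst++], src[i]);
    }
    return ndst;
}
```

Merging {A, B, C} with {B, D} yields {A, B, C, D}: the shared word B appears once, which is exactly the redundancy elimination the slide describes.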
Unigram Cut-Off
Ø Happens during the distribution of data to processors.
Ø When distributing to processors, check the condition:
  - if the frequency of the unigram is > the cut-off, store it in the array for distribution
  - else ignore that unigram
Associative Cut-Off
Ø Happens during the path finding module.
Ø For each path found, find the association score:
  - if > the association cut-off, include it in the path
  - else don't include it in the path
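Both cut-offs reduce to simple threshold predicates. A sketch, with illustrative function names and with the threshold values left to the caller (the slides do not specify defaults or the exact association measure used):

```c
#include <assert.h>

/* Unigram cut-off: a unigram survives distribution only if its
 * frequency strictly exceeds the cut-off. */
static int keep_unigram(long freq, long cutoff) {
    return freq > cutoff;
}

/* Associative cut-off: an edge is kept in a found path only if its
 * association score strictly exceeds the cut-off. */
static int keep_edge(double assoc_score, double cutoff) {
    return assoc_score > cutoff;
}
```

The unigram test runs once, at distribution time, shrinking the input to every later module; the associative test runs per edge during path finding, pruning weak links from the output paths.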
Path Finding
Ø This module supports queries against the constructed network.
Ø The aim is to explore the built network by path finding.
Ø A query lets the user specify a target word and displays the paths of a given length leading to and from that word, to the words connected to those words, and so on.
Requirements
Ø The specified target word should be at the center of the paths that lead into and out from it. Path lengths are defined in terms of the number of edges in the path to and from the target word. E.g., the query “was 3” (path length 3):
Ø Italian --> (34) --> poor --> (34) --> girl --> (43) --> was --> (34) --> hardworking --> (432) --> and --> (23) --> beautiful TIME: 0.432 (plus more path-length-3 variations)
Ø The number between two words is the frequency of that bigram.
Algorithm (Broader View)
Ø Read the query (target-list) file according to the file format, which is <token> <path length>.
Ø Distribute each query from the target list to the processors in parallel (using MPI).
Ø Each processor builds its internal tree structure and finds the entire paths.
Ø One processor must be dedicated to printing: if all processors print, chaos results, since the full result set for a single query word must be clubbed together.
Ø All processors send the path results they obtain to the processor with rank 0, which is responsible for printing the individual paths obtained by each processor.
Ø Caching logic helped with cycle detection to some extent.
Algorithm Details (A Closer Look)
StartQuerySearch(), ReadQueryFile(), BuildNetworkForTarget(), BuildToNetwork() (recursive), BuildFromNetwork() (recursive), GetLinksFromNode() (network interface), GetLinksToNode() (network interface), AddLinks(), AddToResults(), PrintOutput(), CreateGraphNode(), GetNodeFromCache(), AddNodeToCache()
Challenges
Ø The recursive traversal done for all the 'from' and 'to' nodes of the given target node limits the scope of parallelism.
Ø Memory issue, the maximum memory limit: for path lengths up to one there was no problem. E.g., the entry “Bush 1” has 20,000+ 'from'/'to' words associated with it. When the 'from' and 'to' lists of each of these 20,000+ words are processed recursively, there is a huge investment of memory and time. This easily hits the maximum memory limit on the blade before path processing is complete.
Ø Blade: maximum memory limit of 7 GB for user programs (4 million nodes in our case before it crashes).
Alternatives to Overcome the Challenges
Ø Fix memory leaks: the code had memory leaks in some places. We identified the major culprits in memory consumption and freed them appropriately for optimum memory consumption.
Ø Major bottlenecks:
  - anytime a 'from' list or 'to' list for a token was obtained, the memory was not getting freed
  - the function AddToResults() was allocating memory on every path found but was not freeing it
Alternatives to Overcome the Challenges (continued)
Ø Migration to ALTIX: since the amount of memory available on the Altix is a lot more than on the blade, the chances of path finding working for path lengths greater than 1 were high.
Ø We exploited the good memory support on the Altix by writing data to files.
Ø This gave good results for path lengths up to 8 or 9 for smaller-scope target words, and 4 or 5 for somewhat higher-scope words. The files to which data was written were as big as 20 GB.
Change in Methodology and Performance
Ø The new method made better use of memory and also enhanced performance in terms of the time required to find paths.
Ø However, because of the recursive nature of the algorithm, there is a fixed, inherently sequential component, which limits performance according to Amdahl's Law.
New Method (A Closer Look)
StartQuerySearch(), ReadQueryFile(), SetupOutputFiles(), BuildNetworkForTarget(), BuildToNetwork() (recursive), BuildFromNetwork() (recursive), GetLinksFromNode() (network interface), GetLinksToNode() (network interface), AddLinks(), AddToResults(), WriteStagingOutput() (recursive), CombineFinalOutputFiles(), CreateGraphNode(), GetNodeFromCache(), AddNodeToCache()
How Is This Better?
Merging: left-side paths + target word + right-side paths --> final output of combined files.
For each line of the left file, combine it with every line of the right file to form a complete path.
This architecture allows parallelism at a much more granular level than just the query level.
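The merge step above can be sketched as a cross product of the left and right path sets through the target word. This sketch works on in-memory strings rather than the staging files the project actually used, and the function name is illustrative:

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* Combine every left-side path with every right-side path through
 * the target word, yielding nl * nr complete paths. Returns the
 * number of paths written; cap bounds the output array. */
static int combine_paths(const char **left, int nl, const char *target,
                         const char **right, int nr,
                         char out[][256], int cap) {
    int n = 0;
    for (int i = 0; i < nl; i++) {
        for (int j = 0; j < nr; j++) {
            if (n >= cap)
                return n;
            snprintf(out[n++], 256, "%s --> %s --> %s",
                     left[i], target, right[j]);
        }
    }
    return n;
}
```

Because each left file line and each right file line can be produced independently, the halves of every path can be generated in parallel and only this cheap combination step is serialized, which is what makes the granularity finer than one-query-per-processor.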
Questions?
Ø Thank you!