
Labs 3: Bi-Grams

Step 1: Get Started
• Login:
  – Username: nombrecc5212
  – Password on board
• http://aidanhogan.com/teaching/cc5212-1/mdp-lab3.zip
  – C:/Program Files (x86)/eclipse/ (in Spanish)
  – File > Import > …
• http://aidanhogan.com/teaching/cc5212-1/ExternalMergeSort.java
  – Only if you weren’t here last week (half marks)
• Use es-abstracts.txt.gz from last time

Scale!
… knowing how to build a scalable system over many machines requires knowing how to build a scalable system on one machine first.
How can we count a large set of bi-grams on one machine?
• Won’t fit in memory, so what do we do?

Phrasing
• Bi-grams!
  – Phrase of two adjacent words
• When we counted words …
  – Counting done in memory
  – Merging done in memory
  – Faster on one machine!
• More bi-grams than single words!
  – So how can we scale the computation?
  – Won’t fit in memory! (or will it?)
Tengo a? Tengo de? Tengo que?

Step 2: Fix Some Noise
… org.mdp.wc.WordParserIterator loadNext()

Step 2: Extract Bigrams to a File
• org.mdp.cli.ExtractBigrams
  – Small file for testing:
    -i [path]es-abstracts.txt.gz -igz -o [path]bigrams-10k.txt -n 10000
  – Large file for real run (GZipped):
    -i [path]es-abstracts.txt.gz -igz -o [path]bigrams.txt.gz -ogz
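
To make "phrase of two adjacent words" concrete, here is a minimal sketch of pulling bi-grams out of one line of text. It is not the lab's ExtractBigrams class; the name BigramSketch, the whitespace tokenisation and the lowercasing are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the lab code.
class BigramSketch {
    // Returns every pair of adjacent words in the line, in order.
    // Assumption: words are separated by whitespace; punctuation is not handled.
    static List<String> bigrams(String line) {
        List<String> out = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.toLowerCase().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            out.add(words[i] + " " + words[i + 1]);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [tengo que, que estudiar]
        System.out.println(bigrams("Tengo que estudiar"));
    }
}
```

A line of w words yields w - 1 bi-grams, so the bi-gram file has roughly as many entries as the text has words, but far more distinct values than a word list.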

Step 3: Try In-memory Count
Will it run for the big file?
• org.mdp.cli.RunBigramCountInMemory
  -i [path]bigrams.txt.gz -igz -k 500
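
For comparison with the external approach that follows, this is roughly what an in-memory count looks like. It is a sketch only, not the lab's RunBigramCountInMemory; the class and method names are made up, and k plays the role of the -k argument.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the lab's implementation.
class InMemoryCountSketch {
    // One map entry per distinct bi-gram: this map is what may not fit in memory.
    static Map<String, Integer> count(Iterable<String> bigrams) {
        Map<String, Integer> counts = new HashMap<>();
        for (String b : bigrams) {
            counts.merge(b, 1, Integer::sum);
        }
        return counts;
    }

    // Returns the k most frequent entries, ties broken alphabetically.
    static List<Map.Entry<String, Integer>> topK(Map<String, Integer> counts, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return entries.subList(0, Math.min(k, entries.size()));
    }
}
```

Memory use grows with the number of distinct bi-grams, not with the length of the input, which is why the answer to "will it run for the big file?" depends on the data and the heap size.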

External Merge-Sort 1: Batch
• Sort in batches
[Figure: the on-disk input (size n) is read in batches of size b; each batch is sorted in memory and written back to disk, giving ⌈n/b⌉ sorted batches.]
Input on disk (n = 12): bigram121, bigram42, bigram732, bigram42, bigram123, bigram149, bigram42, bigram1294, bigram123, bigram42, bigram6, bigram123
Output batches on disk (b = 3):
  – bigram42, bigram121, bigram732
  – bigram42, bigram123, bigram149
  – bigram42, bigram123, bigram1294
  – bigram6, bigram42, bigram123

Step 4: Implement Batching
org.mdp.cli.ExternalMergeSort
• Implement writeSortedBatches()
  – Load batchSize lines into memory
    • ArrayList<String> list
  – When list.size() == batchSize
    • Dump the data to a batch
    • String batchName = getBatchFileName(tmpFolder, batchId);
    • PrintWriter batch = openBatchFileForWriting(batchName);
    • Clear the list and close the batch file
    • Add the batch-name to batchNames()
  – Do some logging!
  – Forget about reverseOrder for now
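
The batching steps above can be sketched as follows. This is one possible shape under stated assumptions, not the lab solution: the class BatchSketch, the dump() helper and the "batch-N.txt" naming are hypothetical stand-ins for the lab's getBatchFileName() and openBatchFileForWriting(), whose bodies are not shown here. Note two details the list does not spell out: each batch is sorted before it is dumped, and the last, partially filled batch still has to be written.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not the lab's ExternalMergeSort.
class BatchSketch {
    static List<Path> writeSortedBatches(BufferedReader in, Path tmpFolder, int batchSize)
            throws IOException {
        List<Path> batchNames = new ArrayList<>();
        List<String> list = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            list.add(line);
            if (list.size() == batchSize) {
                batchNames.add(dump(list, tmpFolder, batchNames.size()));
                list.clear();
            }
        }
        // The input rarely divides evenly: write whatever is left over.
        if (!list.isEmpty()) {
            batchNames.add(dump(list, tmpFolder, batchNames.size()));
        }
        return batchNames;
    }

    // Sorts the lines in memory and writes them to a new batch file.
    // Assumption: batch files are plain UTF-8 text named batch-<id>.txt.
    private static Path dump(List<String> list, Path tmpFolder, int batchId) throws IOException {
        Collections.sort(list);
        Path batch = tmpFolder.resolve("batch-" + batchId + ".txt");
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(batch, StandardCharsets.UTF_8))) {
            for (String s : list) {
                out.println(s);
            }
        }
        return batch;
    }
}
```

Only batchSize lines are held in memory at any time, so the batch size is the knob that trades memory for the number of files to merge later.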

Step 5: Implement Merging
org.mdp.cli.ExternalMergeSort
• Implement mergeSortedBatches()
  – Open files for reading
    • BufferedReader[] brs = new BufferedReader[batches.size()];
  – Read a line from each file into memory
  – Select the lowest line (from file i), write to out
    • Load the next line from file i
  – Do some logging!
  – Forget about reverseOrder for now
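
A minimal sketch of the merge loop described above, assuming each batch is a sorted UTF-8 text file. MergeSketch and the heads array are illustrative names, not part of the lab code; heads[i] holds the line currently loaded from file i, or null once that file is exhausted.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch, not the lab's ExternalMergeSort.
class MergeSketch {
    static void mergeSortedBatches(List<Path> batches, PrintWriter out) throws IOException {
        BufferedReader[] brs = new BufferedReader[batches.size()];
        String[] heads = new String[batches.size()];
        try {
            // Open every batch and read its first line.
            for (int i = 0; i < brs.length; i++) {
                brs[i] = Files.newBufferedReader(batches.get(i));
                heads[i] = brs[i].readLine();
            }
            while (true) {
                // Select the lowest line among the current heads.
                int min = -1;
                for (int i = 0; i < heads.length; i++) {
                    if (heads[i] != null
                            && (min < 0 || heads[i].compareTo(heads[min]) < 0)) {
                        min = i;
                    }
                }
                if (min < 0) {
                    break; // every file is exhausted
                }
                out.println(heads[min]);
                heads[min] = brs[min].readLine(); // load the next line from that file
            }
        } finally {
            for (BufferedReader br : brs) {
                if (br != null) {
                    br.close();
                }
            }
        }
    }
}
```

The linear scan costs one comparison per open batch for every output line. With many batches, a java.util.PriorityQueue over the heads is a common alternative, though the simple scan is usually fine for a handful of files.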

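External Merge-Sort 2: Merge
[Figure: the ⌈n/b⌉ sorted batches on disk are combined in memory into a single sorted output of size n.]
Input batches on disk (⌈n/b⌉ batches):
  – bigram42, bigram121, bigram732
  – bigram42, bigram123, bigram149
  – bigram42, bigram123, bigram1294
  – bigram6, bigram42, bigram123
Sorted output (n = 12): bigram6, bigram42, bigram42, bigram42, bigram42, bigram121, bigram123, bigram123, bigram123, bigram149, bigram732, bigram1294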

Step 6: Try Sorting 10k Bigrams
org.mdp.cli.ExternalMergeSort
  -i [path]bigrams-10k.txt -o [path]bigrams-10k-sorted.txt -b 3000
If successful, try sorting the large file! Use batches of size 250000. (Don’t forget -igz/-ogz)
If not successful, try debugging. If stuck, ask me.

Counting bigrams is then easy?
Sorted input: bigram6, bigram42, bigram42, bigram42, bigram42, bigram121, bigram123, bigram123, bigram123, bigram149, bigram732, bigram1294
Counts:
  bigram6, 1
  bigram42, 4
  bigram121, 1
  bigram123, 3
  bigram149, 1
  bigram732, 1
  bigram1294, 1
Could use merge-sort again to order by occurrence!

Step 7: Implement Counting
org.mdp.cli.CountDuplicates
• Implement countDuplicates()
  – Store two lines: current and last
  – If the current line is the same as the last line, increment the counter
  – If the current line differs from the last line, print the count and the last line to a file, then reset the count
• Use String sortNum = StringWithNumber.getSortableNumber(dupes);
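
The counting pass above can be sketched as a single scan over the sorted file. This is an illustration, not the lab's CountDuplicates. In particular, the output format of StringWithNumber.getSortableNumber() is not shown in these slides, so the sketch substitutes a zero-padded count followed by a tab; that padding is an assumption, chosen only so that sorting the lines as strings also orders them by count.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Locale;

// Hypothetical sketch, not the lab's CountDuplicates.
class CountSketch {
    // Assumes the input is sorted, so equal lines are adjacent.
    static void countDuplicates(BufferedReader in, PrintWriter out) throws IOException {
        String last = in.readLine();
        if (last == null) {
            return; // empty input
        }
        long dupes = 1;
        String current;
        while ((current = in.readLine()) != null) {
            if (current.equals(last)) {
                dupes++;
            } else {
                write(out, dupes, last);
                last = current;
                dupes = 1;
            }
        }
        write(out, dupes, last); // the final run is only complete at end of input
    }

    // Assumption: a zero-padded count stands in for getSortableNumber().
    private static void write(PrintWriter out, long dupes, String line) {
        out.println(String.format(Locale.ROOT, "%010d", dupes) + "\t" + line);
    }
}
```

Only two lines and one counter are in memory at a time, so this step scales to any input size once the sort is done. Forgetting to write the last run after the loop is an easy bug to make here.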

Step 8: Try Counting 10k Bigrams
org.mdp.cli.CountDuplicates
  -i [path]bigrams-10k-sorted.txt -o [path]bigrams-10k-counts.txt
If successful, try counting the large file! (Don’t forget -igz/-ogz)
If not successful, try debugging. If stuck, ask me.

Step 9: Implement Reverse Order
org.mdp.cli.ExternalMergeSort
• In writeSortedBatches() & externalMergeSort()
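
One way to support reverse order, sketched under the assumption that both the in-memory sort and the merge compare lines through a single Comparator. OrderSketch, sortBatch() and pickNext() are hypothetical names used only to show where the comparator plugs in; the lab's own method signatures may differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not the lab's ExternalMergeSort.
class OrderSketch {
    // A single place that decides the ordering for both phases.
    static Comparator<String> order(boolean reverseOrder) {
        return reverseOrder
                ? Comparator.<String>reverseOrder()
                : Comparator.<String>naturalOrder();
    }

    // Batching phase: sort a batch in the requested order.
    static List<String> sortBatch(List<String> batch, boolean reverseOrder) {
        List<String> copy = new ArrayList<>(batch);
        copy.sort(order(reverseOrder));
        return copy;
    }

    // Merging phase: index of the line to write next, or -1 if all heads are null.
    static int pickNext(String[] heads, boolean reverseOrder) {
        Comparator<String> cmp = order(reverseOrder);
        int best = -1;
        for (int i = 0; i < heads.length; i++) {
            if (heads[i] != null
                    && (best < 0 || cmp.compare(heads[i], heads[best]) < 0)) {
                best = i;
            }
        }
        return best;
    }
}
```

Both phases must use the same ordering: batches sorted ascending but merged descending (or the reverse) would produce output that is not sorted at all.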

Step 10: Merge-Sort the Counts
org.mdp.cli.ExternalMergeSort
  -i [path]bigrams-10k-counts.txt -o [path]bigrams-10k-counts-sorted.txt -b 3000 -r
If successful, try sorting the large file! Use batches of size 250000. (Don’t forget -igz/-ogz)
If not successful, try debugging. If stuck, ask me.

Step 11: Get the top 500
org.mdp.cli.CopyLinesFromFile
  -i [path]bigrams-counts-sorted.txt.gz -igz -o [path]bigrams-counts-sorted-top500.txt -n 500

Final Step: Profiling (Optional)
Java Interactive Profiler
• Run ExternalMergeSort for a large file
• Use VM arguments: -javaagent:libprofile.jar -noverify
• When finished, check profile.txt in your project’s root directory
• See if you can optimise something in “Most Expensive Methods”

Final Steps
• Remove the tmp/ folder from the mdp-lab3/ folder and from the recycle bin (Shift + Del)
• I set up tareas.