Algorithms and Data Structures for Massive Datasets Acube

Algorithms and Data Structures for Massive Datasets (Acube Lab)
Rossano Venturini
Dipartimento di Informatica, Università di Pisa
Paolo Ferragina, Giuseppe Prencipe, Marco Cornolti, Andrea Farruggia, Giovanni Micale, Francesco Piccinno, Giorgio Audrito

A³ Lab (acube.di.unipi.it)
Algorithms and data structures for massive datasets
– Data Compression
– Compressed Indexing
  • Web or arbitrary texts
  • Storage and analysis of massive graphs
– Information Retrieval on news, tweets, …
Submitted US patents: 3 with Yahoo, 1 with NYU
Accepted US patents: 1 with U. Rutgers, 1 with AT&T-Lucent

Social Networks and Social Data
Given an idea, you need the right platform to implement it:
• HW + SW (IT Center)
• Algorithms (our Lab)
Graph structure + Textual Content
• Nodes = users (~1 bil)
• Edges explicit = friend, follower, retweet, +1, … (~10 bil)
• Edges implicit = similarity, co-occurrence, click, … (» 100 bil)

Data Compression: Theory & Engineering
Key issue:
• Minimize space occupancy
• Maximize decompression speed
J. ACM ’05, ACM-SIAM SODA ’09-’14, ACM WSDM ’10, ESA ’11-’14, Algorithmica ’12, SIAM J. Computing ’13

Compressor (on DBLP)   Compressed space (MB)   Decompression time (secs)
Gzip                   191                     11.6
bzip2                  121                     49
Snappy                 323                     2.1
LZ4                    215                     1.9
Our result             130                     2.9
Our result             149                     1.9

A new algorithmic concept: multi-objective design of compressors
Can we fix the space occupancy and minimize the decompression time? Or, vice versa?
Two interesting scenarios:
- Energy-efficiency issues
- Cloud computing
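The question "fix the space occupancy and minimize the decompression time" can be illustrated with a small sketch. It is not the lab's multi-objective compressor: it only measures three compressors from the Python standard library and picks, among those that fit a space budget, the one that decompresses fastest. The sample input, the budget, and the helper names `measure` and `pick_fastest_within_budget` are assumptions made for illustration.

```python
import bz2
import lzma
import time
import zlib


def measure(name, compress, decompress, data):
    """Return (name, compressed size in bytes, decompression time in seconds)."""
    blob = compress(data)
    start = time.perf_counter()
    restored = decompress(blob)
    elapsed = time.perf_counter() - start
    assert restored == data  # the round trip must be lossless
    return name, len(blob), elapsed


def pick_fastest_within_budget(results, max_bytes):
    """Among results that fit the space budget, return the fastest to decompress."""
    feasible = [r for r in results if r[1] <= max_bytes]
    return min(feasible, key=lambda r: r[2]) if feasible else None


# Hypothetical sample input; any bytes object works.
data = b"Algorithms and data structures for massive datasets. " * 5000

results = [
    measure("zlib", zlib.compress, zlib.decompress, data),
    measure("bz2", bz2.compress, bz2.decompress, data),
    measure("lzma", lzma.compress, lzma.decompress, data),
]
for name, size, secs in results:
    print(f"{name:5s} {size:8d} bytes  {secs:.4f} s")

# Space budget (assumed): one tenth of the input size.
best = pick_fastest_within_budget(results, max_bytes=len(data) // 10)
print("fastest within budget:", best[0] if best else None)
```

Swapping the roles (fix a time budget, minimize space) only changes the filter and the key in `pick_fastest_within_budget`.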

Compressed Indexing: Theory & Engineering
Key issue:
• Minimize space occupancy
• Maximize substring-search throughput
Suffix array compressible ↔ Bzip searchable
J. ACM ’05, ACM SIGIR ’07, J. ACM ’09, ACM Trans. Algo. ’10, ESA ’13, ACM-SIAM SODA ’13 … and many others
December 2003: performance over hundreds of MBs on a commodity PC
• Count(P) takes 5 microsecs/char, taking about bzip’s space
• Locate(P) outputs 100K occ/sec, taking +10% space
This may be 4x faster than IL, within <35% space occupancy
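To make Count(P) and Locate(P) concrete, here is a minimal sketch on a plain, uncompressed suffix array. It is only the textbook baseline that compressed indexes aim to match in functionality, not the compressed index measured above; the function names and the naive construction are assumptions chosen for brevity.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def _bound(text, sa, pattern, strict):
    """Binary search over suffixes truncated to len(pattern) characters.

    strict=True  -> first position whose truncated suffix is >= pattern
    strict=False -> first position whose truncated suffix is >  pattern
    """
    lo, hi = 0, len(sa)
    m = len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + m]
        go_right = prefix < pattern if strict else prefix <= pattern
        if go_right:
            lo = mid + 1
        else:
            hi = mid
    return lo


def count(text, sa, pattern):
    """Count(P): number of occurrences of pattern in text."""
    return _bound(text, sa, pattern, False) - _bound(text, sa, pattern, True)


def locate(text, sa, pattern):
    """Locate(P): sorted starting positions of pattern in text."""
    first = _bound(text, sa, pattern, True)
    last = _bound(text, sa, pattern, False)
    return sorted(sa[first:last])


text = "abracadabra"
sa = build_suffix_array(text)
print(count(text, sa, "abra"), locate(text, sa, "abra"))
```

The suffixes starting with P form a contiguous range of the suffix array, so two binary searches are enough for both operations.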

Compressed Indexing: Theory & Engineering
The <key, value> problem: NoSQL DB
§ Trie: 14x more space than input data.
§ Front-coding & two-level indexing:
  § 110% of input data
  § 4 microsecs/char
§ Our Compressed Permuterm:
  § < 25% of input data, i.e. close to bzip2
  § 10-60 microsecs/char
§ So, time close to FC but one-fourth of its space
Under Y!-patenting
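Front coding (FC) can be sketched in a few lines: each key of a sorted dictionary is stored as the length of the prefix it shares with the previous key plus the remaining suffix. This is a minimal illustration only; the two-level organization and the Compressed Permuterm are not shown, and the sample keys and function names are assumptions.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def front_encode(sorted_keys):
    """Encode each key as (prefix length shared with previous key, remaining suffix)."""
    encoded = []
    prev = ""
    for key in sorted_keys:
        lcp = common_prefix_len(prev, key)
        encoded.append((lcp, key[lcp:]))
        prev = key
    return encoded


def front_decode(encoded):
    """Rebuild the keys by copying the shared prefix from the previous key."""
    keys = []
    prev = ""
    for lcp, suffix in encoded:
        key = prev[:lcp] + suffix
        keys.append(key)
        prev = key
    return keys


# Hypothetical sorted dictionary of keys.
keys = ["compress", "compressed", "compression", "compute", "index"]
encoded = front_encode(keys)
print(encoded)
```

Decoding a key requires scanning from the start of the list, which is why practical layouts restart the encoding at regular intervals.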

We know how to “manage” everything…

Information Retrieval
“Diego Maradona won against Mexico”
Dictionary:    against  Diego  Maradona  Mexico  won
TF-IDF vector: 2.2      5.1    9.1       1.0     0.1
Vector Space model
[Figure: vectors v and w in the term space with axes t1, t2, t3; a is the angle between them]
Similarity(v, w) ≈ cos(a)
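The similarity in the vector space model is the cosine of the angle between two term-weight vectors. The sketch below stores vectors as sparse dictionaries; pairing the slide's weights with the dictionary terms in the order listed is an assumption, and the query vector is hypothetical.

```python
import math


def cosine_similarity(v, w):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * w.get(term, 0.0) for term, weight in v.items())
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm_w = math.sqrt(sum(x * x for x in w.values()))
    if norm_v == 0.0 or norm_w == 0.0:
        return 0.0  # convention: an empty vector is similar to nothing
    return dot / (norm_v * norm_w)


# Weights from the slide, paired with terms in listed order (assumed pairing).
doc = {"against": 2.2, "diego": 5.1, "maradona": 9.1, "mexico": 1.0, "won": 0.1}
# Hypothetical query vector with unit weights.
query = {"maradona": 1.0, "mexico": 1.0}

print(round(cosine_similarity(doc, query), 3))
```

Terms missing from one of the two vectors contribute nothing to the dot product, so only shared terms raise the similarity.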

Topic Annotators
• “Diego Maradona won against Mexico”
  – “Diego Maradona”: the soccer player
  – “Mexico”: the Mexico soccer team
Detect mentions and annotate them with an entity/topic extracted from a catalog: Wikipedia!
We serve about 170k requests/day
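A toy version of mention detection against a catalog can be sketched as a greedy longest-match lookup. This is only an illustration of the idea: the two-entry catalog is hypothetical, and the sketch does no disambiguation, which a real annotator needs when a mention can refer to several entities.

```python
def annotate(text, catalog, max_len=3):
    """Greedy longest-match annotation.

    catalog maps a lowercase mention to an entity label.
    Returns a list of (mention, entity) pairs in text order.
    """
    tokens = text.split()
    annotations = []
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate mention starting at token i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            mention = " ".join(tokens[i:i + n])
            entity = catalog.get(mention.lower())
            if entity is not None:
                annotations.append((mention, entity))
                matched = n
                break
        # Skip the matched tokens, or move on by one token if nothing matched.
        i += matched if matched else 1
    return annotations


# Hypothetical catalog; labels follow the wording on the slide.
catalog = {
    "diego maradona": "the soccer player",
    "mexico": "Mexico soccer team",
}
print(annotate("Diego Maradona won against Mexico", catalog))
```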

Paper at IEEE Software 2012
Paper at ACM WSDM 2012
Paper at ECIR 2012
Details on... http://acube.di.unipi.it/tagme