Data structures time IOs entropy joules Paolo Ferragina

  • Slides: 43
Download presentation
Data structures: time, I/Os, entropy, joules Paolo Ferragina Dipartimento di Informatica Università di Pisa

Data structures: time, I/Os, entropy, joules Paolo Ferragina Dipartimento di Informatica Università di Pisa

Our driving moral. . . Big steps come from theory. . . but do

Our driving moral. . . Big steps come from theory. . . but do NOT forget practice ; -)

Strings. . . why? v Ubiquitous: any datum is a sequence of bits, hence

Strings. . . why? v Ubiquitous: any datum is a sequence of bits, hence a string v Spur new problems in many areas: Geometry ü String-similarity search Points in high-dim space and NN-search ü Lower/upper-bounds to indexing via reductions to geo-problem Graphs ü Doc-doc similarity graph ü Query-log graphs ü Data compression ubiquitous in Text/Web mining edge iff 2 queries clicked on the same res-page Shortest paths on char-based weighted graphs [Ferragina et al, SODA 09, ESA 09]

(String-)Dictionary Problem Given a dictionary D of K strings, of total length N, store

(String-)Dictionary Problem Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix-searches for a pattern P. Exact search Hashing Mitzenmacher, ESA invited ‘ 09

(Compacted) Dominated the string-matching scene [Fredkin, CACM 1960] in the ‘ 80 s-90 s

(Compacted) Dominated the string-matching scene [Fredkin, CACM 1960] in the ‘ 80 s-90 s with its suffix-version: the Suffix Tree Trie Performance: 0 • Search ≈ O(|P|) time • Space ≈ O(K + N) 1 y 2 stile zyg (2; 3, 5) 5 1 etic ial 2 3 s z aibelyite 2 omo 7 5 czecin ygy 4 systile syzygetic syzygial syzygy 6 szaibelyite szczecin szomo

Timeline: theory and practice. . . What about Software Engineers ? ? re T

Timeline: theory and practice. . . What about Software Engineers ? ? re T x i f f u rie ’ 9 ’ 8 0’ 7 0 S 0 0 T ‘ 6 e

(Compacted) Dominated the string-matching scene [Fredkin, CACM 1960] in the ‘ 80 s-90 s

(Compacted) Dominated the string-matching scene [Fredkin, CACM 1960] in the ‘ 80 s-90 s with its suffix-version: the Suffix Tree Trie Performance: 0 • Search ≈ O(|P|) time • Space ≈ O(K + N) 1 y . . . But in practice… 2 • Search: random memory accesses stile • Space: len + pointers + strings zyg (2; 3, 5) 5 1 etic ial 2 3 s z aibelyite 2 omo 7 5 czecin ygy 4 systile syzygetic syzygial syzygy 6 szaibelyite szczecin szomo

I What did systems implement? Used the Compacted trie, of course, but with 2

I What did systems implement? Used the Compacted trie, of course, but with 2 other concerns because of large data

1° issue: space concern …. systile syzygetic syzygial syzygy…. 2 http: //checkmate. com/All_Natural/Applied. html

1° issue: space concern …. systile syzygetic syzygial syzygy…. 2 http: //checkmate. com/All_Natural/Applied. html http: //checkmate. com/All_Natural/Aroma 1. html http: //checkmate. com/All_Natural/Aromatic_Art. html http: //checkmate. com/All_Natural/Ayate. html http: //checkmate. com/All_Natural/Ayer_Soap. html http: //checkmate. com/All_Natural/Ayurvedic_Soap. html http: //checkmate. com/All_Natural/Bath_Salt_Bulk. html http: //checkmate. com/All_Natural/Bath_Salts. html http: //checkmate. com/All/Essence_Oils. html http: //checkmate. com/All/Mineral_Bath_Crystals. html http: //checkmate. com/All/Mineral_Bath_Salt. html http: //checkmate. com/All/Mineral_Cream. html 5 33 45% Front Coding 5 0 33 34 38 38 34 35 35 33 42 25 25 38 33 http: //checkmate. com/All_Natural/ Applied. html roma. html 1. html tic_Art. html yate. html er_Soap. html urvedic_Soap. html Bath_Salt_Bulk. html s. html Essence_Oils. html Mineral_Bath_Crystals. html Salt. html Cream. html 0 http: //checkmate. com/All/Natural/Washcloth. html. . . http: //checkmate. com/All/Natural/Washcloth. html . . . Bender et al. , PODS 2006 Ferragina et al. , PODS 2008

2° issue: hierarchical memory track M Spatial locality or Temporal locality caching: less I/Os

2° issue: hierarchical memory track M Spatial locality or Temporal locality caching: less I/Os Less and Faster I/Os CPU 1 Internal Memory Count I/Os B HD

2 -level indexing 2 advantages: • Search ≈ typically 1 I/O • Space ≈

2 -level indexing 2 advantages: • Search ≈ typically 1 I/O • Space ≈ Front-coding over buckets Internal Memory CT on a sample systile szaielyite 2 limitations: • Sampling rate ≈ lengths of sampled strings • Trade-off ≈ speed vs space (because of bucket size) (Prefix) B-tree Disk …. 0 systile 2 zygetic 5 ial 5 y 0 szaibelyite 2 czecin 2 omo….

Timeline: theory and practice. . . Do we need to trade space by I/Os

Timeline: theory and practice. . . Do we need to trade space by I/Os ? Space + Hierarchical Memory e el v -le ffi B g rin t S 95 19 0 e e -tr 2 ’ 9 ’ 8 0 Su 0’ 7 ‘ 6 0 ie r T re T x g in x e d n i

[Morrison, J. ACM 1968] An old idea: Patricia Trie 0 1 y 2 stile

[Morrison, J. ACM 1968] An old idea: Patricia Trie 0 1 y 2 stile zyg etic 2 3 2 omo 7 5 ygy ial z aibelyite 5 1 s 4 czecin 6

[Ferragina-Grossi, J. ACM 1999] A new search 0 Search(P): • Phase 1: tree navigation

[Ferragina-Grossi, J. ACM 1999] A new search 0 Search(P): • Phase 1: tree navigation • Phase 2: Compute LCP • Phase 3: tree navigation y 1 2 s 5 z g < ya z Three-phase search: P = syzyyea s P’s position 2 0 1 2 o 5 c Only 1 string ise checked y i Trie Space ≈ #strings, NOT their length …. systile syzygetic syzygial syzygy szaibelyite szczecin szomo….

[Ferragina-Grossi, J. ACM 1999] > 15 US-patents cite it !! The String B-tree [Handbook

[Ferragina-Grossi, J. ACM 1999] > 15 US-patents cite it !! The String B-tree [Handbook of Comp. Biology, 2009] + Search(P) • O((p/B) log. B K) I/Os • O(occ/B) I/Os 1 string checked : O(p/B) PT 29 13 20 18 3 O(log. B K) levels 23 It is dynamic. . . PT 29 2 PT 26 13 PT 29 1 9 20 25 PT 5 2 26 10 4 PT 6 PT 7 13 Lexicographic position of P 20 16 28 18 3 14 PT 8 25 6 12 15 22 18 21 23 PT 3 27 24 11 PT 14 21 17 23 Knuth, vol 3°, pag. 489: “elegant”

I/O-aware algorithms & data structures I/Os was the main concern [CACM 1988] [2006] Huge

I/O-aware algorithms & data structures I/Os was the main concern [CACM 1988] [2006] Huge literature !!

Timeline: theory and practice. . . Not just 2 memory levels 95 19 0

Timeline: theory and practice. . . Not just 2 memory levels 95 19 0 S ’ 7 ’ 9 0 ’ 8 B g trin 2 - S 0 - ‘ 6 0 ie r T in 99 i f f u el v le e e -tr 19 ee r x. T g n i x de Space CPU registers v Parameter-free solutions ü L 1 L 2 Cache Anywhere, anytime, anyway. . . I/O-optimal !! RAM HD net Cache-oblivious Algo. and Data Str. See chap by Arge, Brodal, Fagerberg

Some precious achievements. . . v Cache-oblivious trie ü Static dictionary of strings [Brodal

Some precious achievements. . . v Cache-oblivious trie ü Static dictionary of strings [Brodal et al, SODA 2006] v Cache-oblivious String B-tree ü Dynamic dictionary of strings [Bender et al, PODS 2006] v Cache-oblivious tree mapping Patricia Trie ü Split-and-Refine that applies to any B-fixed tree partitioning [Alstrup et al, manuscript 2003] ü Worst-case solution [Demaine et al, manuscript 2004]

Timeline: theory and practice. . . Not just 2 memory levels 19 0 ’

Timeline: theory and practice. . . Not just 2 memory levels 19 0 ’ 8 0 - 95 2 S Compressed data structures B g rt in S ’ 7 ‘ 6 0 T e v e -l tr 99 i f f u rie e d n li Cache-oblivious data structures ee 19 r T x ee g n i x Space

A challenging question [Ken Church, AT&T 1995] Soft. Eng. use many “squeezing heuristics” that

A challenging question [Ken Church, AT&T 1995] Soft. Eng. use many “squeezing heuristics” that compress data and still support fast access to them Can we “automate” and “guarantee” the process ?

Aka: Compressed self-indexes Opportunistic Data Structures with Applications P. Ferragina, G. Manzini. . .

Aka: Compressed self-indexes Opportunistic Data Structures with Applications P. Ferragina, G. Manzini. . . now, J. ACM 2005 n n Space for text + (full-text) index compressed text ( Hk) Query/Decompression time theoretically (quasi-)optimal

[Burrows-Wheeler, 1994] The big (unconscious) step. . . Let us given a text T

[Burrows-Wheeler, 1994] The big (unconscious) step. . . Let us given a text T = mississippi#m ssissippi#mis issippi#missi sippi#mississ ppi#mississip i#mississippi Sort the rows # i i m p p s s mississipp #mississip ppi#missis ssippi#mis ssissippi# ississippi i#mississi pi#mississ ippi#missippi#mi sippi#miss sissippi#m i p s s m # p i s s i i Can we compress it ?

[Burrows-Wheeler, 1994] The big (unconscious) step. . . Let us given a text T

[Burrows-Wheeler, 1994] The big (unconscious) step. . . Let us given a text T = mississippi#m ssissippi#mis issippi#missi sippi#mississ ppi#mississip i#mississippi Sort the rows bwt(T) # i i m p p s s mississipp #mississip ppi#missis ssippi#mis ssissippi# ississippi i#mississi pi#mississ ippi#missippi#mi sippi#miss sissippi#m i p s s m # p i s s i i T bzip 2 = BWT + other simple compressors

[Burrows-Wheeler, 1994] The big (unconscious) step. . . Let us given a text T

[Burrows-Wheeler, 1994] The big (unconscious) step. . . Let us given a text T = mississippi#m ssissippi#mis issippi#missi sippi#mississ ppi#mississip i#mississippi Sort the rows Suffix Array bwt(T) # i i m p p s s mississipp #mississip ppi#missis ssippi#mis ssissippi# ississippi i#mississi pi#mississ ippi#missippi#mi sippi#miss sissippi#m i p s s m # p i s s i i T bzip 2 = BWT + other simple compressors

From practice to theory. . . [Ferrag bwt(T) # i i m p p

From practice to theory. . . [Ferrag bwt(T) # i i m p p s s mississipp #mississip ppi#missis ssippi#mis ssissippi# ississippi i#mississi pi#mississ ippi#missippi#mi sippi#miss sissippi#m i p s s m # p i s s i i ina-Man zini, Fo cs ‘ 00] FM-index = BWT is searchable. . . or Suffix Array is compressible • Space = l |T| Hk + o(|T|) bits • Search(P) = O(p + occ * polylog(|T|)) Nowadays tons of papers: theory & experiments [Navarro-Makinen, ACM Comp. Surveys 2007]

Compressed & Searchable data formats Texts Trees FOCS 2000 SODA 2003, 04 SODA 2007

Compressed & Searchable data formats Texts Trees FOCS 2000 SODA 2003, 04 SODA 2007 SPIRE 2007 CPM 2008 CPM 2010 ICALP 2010 Integer Sets SODA 2002 … FOCS 2008 STACS 2009 SODA 2002 SODA 2007 ICALP 2007 SWAT 2008 ICALP 2009 SODA 2010 Labeled Trees Functions Point Sets SODA 2002 FOCS 2005 WWW 2006 SODA 2007 ICDE 2010 ICALP 2003, 04 SODA 2004 ICALP 2008 ESA 2009 LATIN 2010 SODA 2003 TALG 2007 WADS 2009 SODA 2009 Graphs DCC 2001 WWW 2004 ISAAC 2007 ESA 2008 FOCS 2009 Images DCC 2008

[December 2003] [January 2005]

[December 2003] [January 2005]

ACM J. on Experimental Algorithmics, 2009

ACM J. on Experimental Algorithmics, 2009

> 103 faster than Smith-W. >102 faster than SOAP & Maq What about the

> 103 faster than Smith-W. >102 faster than SOAP & Maq What about the Web ? [Ferragina-Manzini, ACM WSDM 2010]

Where we are nowadays 19 0 ’ 8 Compressed data structures B g rt

Where we are nowadays 19 0 ’ 8 Compressed data structures B g rt in S 95 2 S 0’ 7 ‘ 6 0 T e v e -l tr 99 i f f u rie e d n li Cache-oblivious data structures 19 r T x ee g n i x ee Something is known. . . yet very preliminary [PODS ‘ 08, Navarro, Vitter, . . . ] Bellazougui et al, this ESA

What else. . . [E. Gal, S. Toledo. ACM Comp. Surv. , 2005] [Ajwani

What else. . . [E. Gal, S. Toledo. ACM Comp. Surv. , 2005] [Ajwani et al, WEA 2009] v Solid-state disks: no mechanical parts ü. . . very fast reads, but slow writes & wear leveling v Self-adjusting or Weighted design ü Time ops depend on some (un/known) distribution ü Challenging : no pointers, self-adjust (perf) vs compression (space)

A bigger challenge: from micro to macro ! IEEE Computer, 2007

A bigger challenge: from micro to macro ! IEEE Computer, 2007

Approach #1 (engineering oriented) News: Proper system components + specific algorithms v Sanders &

Approach #1 (engineering oriented) News: Proper system components + specific algorithms v Sanders & Meyer’s groups, IEEE Conf. on Green Comp. 2010 [SSDisks + Atom + Sort]

Approach #2 (Manage resources) Goal: Develop on-line algorithms that dynamically manage power by trading

Approach #2 (Manage resources) Goal: Develop on-line algorithms that dynamically manage power by trading off performance, energy and reliability v Susanne Albers, Comm. ACM 2010

Approach #3 (models and algorithms) IEEE Computer, 2009 “Algorithmics offers benefits that Wo IEE

Approach #3 (models and algorithms) IEEE Computer, 2009 “Algorithmics offers benefits that Wo IEE rksho Gre E Con p in f. o en C om n extend far beyond p. 20 1 TCS into the design of systems. ” 0

Sometimes energy is a primary resource!

Sometimes energy is a primary resource!

Energy-aware Algo+Ds ? Memory-level impacts Lo ca lity pa ys off I/Os and compression

Energy-aware Algo+Ds ? Memory-level impacts Lo ca lity pa ys off I/Os and compression are obviously important BUT here there is a new twist

MIPS per Watt ? Battery life !! Ap Who cares whether your application: pri

MIPS per Watt ? Battery life !! Ap Who cares whether your application: pri 1. is y% slower than optimal, but it is more energy efficient ? pro nci ach ple dw in a 2. occupies x% more space than optimal, but decompr is faster ? ay

MIPS per Watt ? Idea: Multi-objective optimization in data-structure design Battery life !! Stay

MIPS per Watt ? Idea: Multi-objective optimization in data-structure design Battery life !! Stay tuned: Algorithm Library for Mobile Phones

v Hbase - Hadoop Big. Table, 2006 Cosmos Hyper. Table Cassandra Real-time search Q&A

v Hbase - Hadoop Big. Table, 2006 Cosmos Hyper. Table Cassandra Real-time search Q&A social search Knowledge search

Many ingredients v Items are graphs, vectors, strings, … v Number and size are

Many ingredients v Items are graphs, vectors, strings, … v Number and size are VERY large v Involve many resources to be optimized: ü Time (speed/patience) ü Space (#disks/management costs) ü Bandwidth (speed/€) ü Energy (€) Multi-objective optimization in data-structure design!

That’s all ! I Look at my paper in the proceedings

That’s all ! I Look at my paper in the proceedings