FLASK Coherence: A Morphable Hybrid Coherence Protocol to Balance Energy, Performance and Scalability
Lucía G. Menezo, Valentín Puente, Jose Ángel Gregorio
Motivation
• Complex on-chip cache hierarchies in CMPs
• To maintain coherence (and programmers):
– Hardware mechanism: coherence protocol
• What type of protocol should be used?
– No universal solution
– Trade-offs in cost, energy and performance
HPCA 2015 Lucia G. Menezo
Pure coherence: directory-based
• Structure with all the coherence information
• Demands directory inclusivity
• Duplicate-tag directories are inefficient with a large number of cores due to the associativity required
• To meet energy constraints: overprovision the directory capacity to minimize evictions in private caches
• As private cache sizes grow, the number of tracked blocks also grows
Pure coherence: broadcast-based
• A miss in the private caches triggers a broadcast request
• Better resource utilization: neither inclusivity nor additional structures to track block copies are required
• More traffic and cache snoops, hence lower energy efficiency
• Traffic impact in medium/large-scale systems is noticeable or possibly unsustainable
• Better performance, although on-chip resource contention could degrade it
Hybrid coherence? FLASK
Performance of broadcast-based coherence with the energy efficiency of directory-based coherence
• Structure to track:
– Shared blocks precisely (the actively shared ones)
– Private blocks approximately
• Broadcast after some misses in the structure
• Additional features to minimize on-chip and off-chip traffic
FLASK Coherence Controller
• Directory only tracks actively shared blocks
• Filter to know (approx.) when a block is inside the chip
• Token coherence:
– To guarantee coherence invariants
– To monitor when to update the FLASK structure
Token Coherence protocol:
– Initially, each block has #tokens == #procs
– Read request: data and 1 token
– Write request: data and all tokens
[Diagram: cores with private caches connected through the on-chip network to the LLC slices (LLC 0, LLC 1, …, LLC n) and their FLASK controllers (FC).]
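The token rules on this slide can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `BlockTokens`, the 4-core setting and the policy of taking a token from the cache that holds the most tokens are all assumptions made for illustration. Real token coherence also handles ownership, data transfer and in-flight messages, which are omitted here.

```python
from dataclasses import dataclass, field

NUM_PROCS = 4  # assumption: a 4-core chip, chosen only to keep the example small


@dataclass
class BlockTokens:
    """Where the tokens of one block currently are (hypothetical model)."""
    llc: int = NUM_PROCS                          # tokens held at the LLC
    private: dict = field(default_factory=dict)   # core id -> tokens held

    def total(self) -> int:
        # Invariant: the number of tokens per block never changes.
        return self.llc + sum(self.private.values())

    def read(self, core: int) -> None:
        """Read request: the core ends up with the data and one token."""
        if self.private.get(core, 0) > 0:
            return  # already holds a token, so it can already read
        if self.llc > 0:
            self.llc -= 1
        else:
            # Assumed policy: take one token from the private cache
            # that currently holds the most tokens.
            donor = max(self.private, key=self.private.get)
            self.private[donor] -= 1
        self.private[core] = 1

    def write(self, core: int) -> None:
        """Write request: the core collects the data and all tokens."""
        self.llc = 0
        self.private = {core: NUM_PROCS}
```

Because tokens are only moved, never created or destroyed, `total()` stays equal to `NUM_PROCS`, which is what lets a controller infer the sharing state of a block from the token counts it receives.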
FLASK Coherence Basics
• Private block
– Block only in one private cache: no entry allocated in the directory
– Replacement will move the block to further levels until the LLC
• Actively shared block
– A block present in any private cache is requested by another core
– Need to allocate an entry in the directory
FLASK Conceptual Approach
[Figure: FLASK coherence controllers at the last-level caches, each with a directory (tag, sharers) and a filter, above the private caches P0 to P3, which hold data, state and token count per block.]
Reconstruction process
• All coherence agents respond with the number of tokens they own for the requested data
• 3 possible cases:
1) LLC has all the tokens
– No other copy of the block in the chip
– Block not actively shared: no entry allocated
2) Any private cache has a token
– Block actively shared and an entry is allocated in the directory
3) None of them has any tokens
– Filter gave a false positive
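The three cases can be written down as a small classification function. This is a sketch under assumptions: the names `Outcome` and `classify` are hypothetical, and the fourth outcome (`NOT_COVERED`) is added only because the slide does not say what happens when the LLC holds some but not all tokens and no private cache holds any.

```python
from enum import Enum


class Outcome(Enum):
    NOT_ACTIVELY_SHARED = 1   # case 1: the LLC holds every token
    ACTIVELY_SHARED = 2       # case 2: some private cache holds a token
    FALSE_POSITIVE = 3        # case 3: nobody on chip holds a token
    NOT_COVERED = 4           # assumption: token split the slide does not discuss


def classify(llc_tokens, private_tokens, total_tokens):
    """Classify a reconstruction from the token counts in the responses.

    private_tokens maps a core id to the tokens its private cache reported.
    Returns the outcome and the list of sharers to record in the directory.
    """
    sharers = sorted(c for c, t in private_tokens.items() if t > 0)
    if sharers:
        return Outcome.ACTIVELY_SHARED, sharers
    if llc_tokens == total_tokens:
        return Outcome.NOT_ACTIVELY_SHARED, []
    if llc_tokens == 0:
        return Outcome.FALSE_POSITIVE, []
    return Outcome.NOT_COVERED, []
```

For example, with 4 tokens per block, `classify(2, {1: 2}, 4)` reports an actively shared block with core 1 as sharer, so a directory entry would be allocated, while `classify(0, {}, 4)` reports a filter false positive.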
FLASK Filter
• d-left Counting Bloom Filters (dlCBF)
– Double the efficiency of the counters of a CBF
• Each filter:
– attached to each directory entry
– tracks all the tags that map to that entry
• Combination of hash function and permutations to know which bits are modified
F. Bonomi et al., "An improved construction for counting Bloom filters," in ESA'06: 14th Annual European Symposium, 2006.
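To make the "hash function plus permutations" idea concrete, here is a simplified software sketch of a d-left counting Bloom filter. It is not the hardware structure used in FLASK. The sizes (`d`, `buckets`, `cells`, `rem_bits`), the use of BLAKE2b as the hash and the odd-multiplier permutations are all assumptions chosen for illustration.

```python
import hashlib


class DLeftCBF:
    """Simplified d-left counting Bloom filter (illustrative parameters)."""

    def __init__(self, d=4, buckets=64, cells=8, rem_bits=12):
        assert buckets & (buckets - 1) == 0, "buckets must be a power of two"
        self.cells, self.rem_bits = cells, rem_bits
        self.space = buckets << rem_bits  # size of the fingerprint space
        # Multiplying by an odd constant modulo a power of two is a permutation.
        self.mults = [2 * (0x9E3779B1 + 0x51ED27 * i) + 1 for i in range(d)]
        # tables[i][b] maps a stored remainder to its small counter.
        self.tables = [[{} for _ in range(buckets)] for _ in range(d)]

    def _candidates(self, tag):
        """One (subtable, bucket, remainder) candidate per subtable."""
        digest = hashlib.blake2b(tag.to_bytes(8, "little"), digest_size=8).digest()
        f = int.from_bytes(digest, "little") % self.space
        mask = (1 << self.rem_bits) - 1
        out = []
        for i, mult in enumerate(self.mults):
            p = (f * mult) % self.space
            out.append((i, p >> self.rem_bits, p & mask))
        return out

    def insert(self, tag):
        cands = self._candidates(tag)
        for i, b, r in cands:
            if r in self.tables[i][b]:
                self.tables[i][b][r] += 1  # fingerprint present: count it again
                return True
        # Otherwise use the least-loaded candidate bucket, ties going left.
        i, b, r = min(cands, key=lambda c: len(self.tables[c[0]][c[1]]))
        if len(self.tables[i][b]) >= self.cells:
            return False  # bucket overflow; not handled in this sketch
        self.tables[i][b][r] = 1
        return True

    def contains(self, tag):
        return any(r in self.tables[i][b] for i, b, r in self._candidates(tag))

    def remove(self, tag):
        for i, b, r in self._candidates(tag):
            bucket = self.tables[i][b]
            if r in bucket:
                bucket[r] -= 1
                if bucket[r] == 0:
                    del bucket[r]
                return True
        return False
```

Storing a short remainder plus a counter per cell, instead of several counters per element as in a plain CBF, is where the space saving claimed for the d-left construction comes from. As with any Bloom-style filter, `contains` can return false positives but not false negatives as long as no insert overflows.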
When is the FLASK filter modified?
• Increment filter: after an on-chip miss
– Counter saturation avoidance mechanism
• Decrement filter: the LLC evicts a block with all its tokens
– FLASK is prepared in case the LLC needs to evict a block without all its tokens (<0.01%)
• Updates overlap with the main memory access or happen after the LLC replacement (outside the critical path)
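The update rules above can be sketched with a plain array of counters standing in for the real filter. The class `PresenceFilter`, the modulo indexing and the counter count are assumptions. The saturation-avoidance mechanism and the handling of evictions without all tokens are not described on the slide, so the sketch does not model the former and simply leaves the counter untouched in the latter case.

```python
class PresenceFilter:
    """Stand-in counting filter that approximates 'block is on chip'."""

    def __init__(self, num_counters=1024):
        self.counters = [0] * num_counters

    def _index(self, tag):
        # Assumption: trivial modulo indexing instead of a real hash.
        return tag % len(self.counters)

    def on_chip_miss(self, tag):
        """Increment: the block is brought on chip after an on-chip miss."""
        self.counters[self._index(tag)] += 1

    def llc_evict(self, tag, tokens, total_tokens):
        """Decrement only when the LLC evicts the block with all its tokens.

        Returns True if the filter was updated. Evictions with fewer tokens
        are left untouched here (assumption; the slide gives no detail).
        """
        if tokens != total_tokens:
            return False
        i = self._index(tag)
        if self.counters[i] > 0:
            self.counters[i] -= 1
        return True

    def maybe_on_chip(self, tag):
        """True means 'possibly on chip'; False means 'not on chip'."""
        return self.counters[self._index(tag)] > 0
```

In this model both updates are simple counter operations, which is consistent with the slide's point that they can be overlapped with the memory access or performed after the LLC replacement.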
Resource Partitioning
• Directory and filter are "complementary"
– More actively shared blocks: less pressure on the filter
– More private blocks: less pressure on the directory
[Diagram: directory entries (tag, sharers) alongside filter entries (counters C0, C1, …, Cn).]
• Dynamically decide according to workload characteristics
Flask Evaluation methodology
Number of cores: 16 @ 3 GHz
IWin size / Issue width: 196, 6-way
Block size: 64 B
Private L1 size / associativity: 32 KB I/D, 4-way
Private L2 size / associativity: 256 KB, 8-way (exclusive with L1)
Shared L3 size / associativity: 16 MB, 16-way
Shared L3 NUCA mapping: static, interleaved across slices
Memory capacity: 4 GB
Max. outstanding mem. operations: 64
Topology: 4×4 mesh
[Figure: floorplan of the 16 tiles, each labeled with a core (C), private caches ($1, $2), an LLC slice (LLC 0 to LLC 15) and elements marked F and R.]
FLASK Execution Time
[Chart: normalized execution time for TokenB, Dir and FLASK (160% - 80% - 40% - 20% - 10% - 5%) on Astar, Hmmer, Omnet, FT, LU, MG, Apache, Jbb, OLTP, Zeus and the average.]
• X%: space available to track X% of the private cache block tags
• Dir: 160% = 8K entries; 5% = 32 entries
• FLASK: 160% = 4K entries; 5% = 16 entries
Analyzing On-Chip Traffic
[Chart: on-chip traffic for TokenB and for configurations from 160% down to 5%, annotated "-- directory, ++ filter" and "++ directory, -- filter" and labeled "Adapting resources".]
On-chip memory hierarchy EDP
[Chart: normalized memory hierarchy EDP, broken down into Network, Directory, LLC, L2 and L1, for TokenB, Sparse Dir (160%) and Flask (5%).]
Conclusions and future work
• FLASK: re-architects directory coherence protocols with benefits from snoop-based coherence
• Improves performance and power (with extreme configurations) without a high cost
• Cloud-computing scenarios
– Morphing dir & filter during application execution
• Multi-CMP: hierarchical coherence
– Use directory + filter + token information to minimize traffic between chips
Thank you
Questions?
Sketch of a dlCBF filter
[Diagram: the address goes through a hash function; Permutation 1 yields (bucket 1, remainder 1) and Permutation 2 yields (bucket 2, remainder 2); each side shows rows b0 to b4, with an entry holding a remainder (R1) and a counter (cnt).]
Memory latency overhead
[Chart: normalized memory access time for TokenB, Dir and FLASK (160% - 80% - 40% - 20% - 10% - 5%) on Astar, Hmmer, Omnet, FT, LU, MG, Apache, Jbb, OLTP and Zeus.]
• Off-chip requests are induced by LLC capacity misses, and both protocols handle the LLC the same way
• Numerical: false positives have a negative effect due to the delay in off-chip access
• Commercial: extra traffic from compulsory reconstructions increases contention, delaying memory access
• Effect < 5% in the most adverse configuration: unnoticeable in average access time
Performance with different associativities
[Chart: normalized execution time for Dir (D) and Flask (F) with 1-, 2-, 4-, 8-, 16-, 32- and 64-way associativity on Astar, Hmmer, Omnetpp, FT, LU, MG, Apache, Jbb, OLTP, Zeus and the average.]
• Negligible impact of directory conflicts on FLASK performance:
1) No need to perform external invalidations after a directory eviction
2) Directory only tracks actively shared blocks
MOSAIC vs. FLASK
[Chart: normalized bandwidth, split into data and control, for MOSAIC 160% and FLASK 160% on Apache.]
• FLASK only reconstructs shared blocks
• Multi-programmed: less traffic (increases as the filter is reduced)
• Commercial: high sharing degree; compulsory reconstructions are unavoidable