Improving Prefetching Mechanisms for Tiled CMP Platforms • 28th November 2016 • Candidate: Martí Torrents • Advisors: Raúl Martínez, Carlos Molina • Tutor: Antonio González

Motivation - Prefetching • Well-known technique for hiding memory latency • Brings the data the CPU needs into a nearby cache • The first prefetchers date from the 1970s • Thousands of prefetching papers published (source: ACM Digital Library, “prefetcher” search)

Motivation – So, why prefetching? • Implemented in most commercial processors • Changes in technology have challenged prefetching • Prefetching has evolved with each technology generation • What about the next generations?

Motivation – New generations? • The number of cores on a single chip grows every year – Intel Xeon E7: 12 cores, 24 threads – Tilera: 64~100 cores – Intel Polaris: 80 cores – Nvidia GeForce: up to 256 cores

Motivation – Tiled CMP platforms [figure: tiled CMP diagram; label: CPU]

Motivation - Objectives 1. Prefetching in multi-core platforms 2. Confidence predictor for prefetching in CMPs 3. Dynamic management of prefetch requests 4. Prefetching challenges in DSMs

Prefetch basics: Main prefetchers • Sequential prefetching: Tagged prefetcher • Stride prefetching: RPT prefetcher • Correlation prefetching: GHB prefetcher [figure: example access deltas labeled +1, +x, and +Y]
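As a rough illustration of stride prefetching, the sketch below keeps one entry per load PC with the last address and last stride, and issues prefetches once the same non-zero stride is seen twice in a row. The class name, the table layout, and the `degree` parameter are assumptions made for illustration; this is not the RPT design evaluated in this work.

```python
class StridePrefetcher:
    """Toy per-PC stride detector (simplified, hypothetical RPT-like table)."""

    def __init__(self, degree=2):
        self.degree = degree  # assumed knob: how many strides ahead to prefetch
        self.table = {}       # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        """Record a demand access and return the addresses to prefetch."""
        last = self.table.get(pc)
        if last is None:
            self.table[pc] = (addr, 0)
            return []
        last_addr, last_stride = last
        stride = addr - last_addr
        self.table[pc] = (addr, stride)
        # Prefetch only when a non-zero stride repeats.
        if stride != 0 and stride == last_stride:
            return [addr + stride * i for i in range(1, self.degree + 1)]
        return []


pf = StridePrefetcher(degree=2)
issued = [pf.access(pc=0x400, addr=a) for a in (100, 108, 116, 124)]
```

In this sketch the `degree` value plays the role of aggressiveness: a larger value issues more requests per detected pattern.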

Prefetch basics: Metrics • Accuracy • Aggressiveness • Other: – Lateness – Pollution

Outline • Motivation • Prefetching in multi-core platforms • Confidence predictor for prefetching in CMPs • Dynamic management of prefetch requests • Prefetching challenges in DSMs • Conclusions

Prefetching in multi-core platforms • Erroneous prefetching may produce slowdown – Cache pollution – Resource consumption (queues, bandwidth, etc.) – Power consumption – Interference with other cores' requests • Simulation tools should include this capability – Current tiled CMP simulators do not include prefetching – Core and NoC must be simulated together – Updated simulator

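Main open-source simulators • Table columns: Core, Memory, NoC, Updated, Prefetch • marss86: Core X, Memory X • Simics GEMS: Core X, Memory X • gem5 Classic: Core X, Memory X • gem5 Ruby: Core X, Memory X • NoC / Updated / Prefetch: X X X X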

The gem5 simulator [figure]

Modifications in gem5 [figure] • Requested by the gem5 community

Experimental framework • 3 classical prefetch engines: – Tagged, RPT, and GHB • With different aggressiveness – Low, medium, and high • With the gem5 simulator using – 16 tiled x86 CPUs – L1 prefetchers – Ruby memory system – MOESI coherence protocol • PARSEC 2.1 benchmark suite

Results [figures: L1 average miss latency (cycles); misses per 1000 instructions (MPKI)]

Results [figure: IPC] • Increasing the level of detail of the simulations may increase the accuracy of their results

Confidence predictor • Many techniques try to improve prefetching • Dynamic management techniques depend on profiling strategies • Profiling strategies are used to collect different statistics • Most of them focus on accuracy

State-of-the-art predictor • Last phase heuristic (LP) [figure: execution time divided into time phases; last phase (x) and current phase (x+1)]
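A minimal sketch of the last phase heuristic as the slide describes it: the accuracy measured in phase x is used as the prediction for phase x+1. How a phase boundary is detected is not stated here, so the caller decides when to call `end_phase`; the class and method names are hypothetical.

```python
class LastPhasePredictor:
    """Predicts next-phase prefetch accuracy from the last completed phase."""

    def __init__(self):
        self.useful = 0
        self.not_useful = 0
        self.predicted = None  # no prediction until one phase has completed

    def record(self, was_useful):
        """Count one prefetch outcome observed in the current phase."""
        if was_useful:
            self.useful += 1
        else:
            self.not_useful += 1

    def end_phase(self):
        """Close the current phase; its accuracy becomes the next prediction."""
        total = self.useful + self.not_useful
        if total:  # assumption: keep the old prediction if nothing was observed
            self.predicted = self.useful / total
        self.useful = self.not_useful = 0
        return self.predicted


lp = LastPhasePredictor()
for outcome in (True, True, True, False):
    lp.record(outcome)
prediction_for_next_phase = lp.end_phase()
```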

Proposed predictors • Phase historic heuristic (PH) [figure: execution time divided into time phases; current phase (x+1)]

Proposed predictors • Balanced phase historic heuristic (BP) [figure: execution time divided into time phases; current phase (x+1)]

Stream-based heuristic (SB) • The prefetcher triggers a stream of requests • Profile according to the position in the stream

Code region heuristic (CR) • Identify where the prefetcher has been triggered • Profile according to this region of code

Combined heuristic (COMB) • Combines the previous two techniques • Profiles according to the code region and the position in the stream

Prediction flow [figure: CPU, prefetcher, prefetch profile table (@, Reg Id, Pos), region profile table (Reg Id; counters U and !U for Pos 1 to Pos 4), dynamic management, prefetch queue (@, Reg Id, Pos, P), L1 cache, and NoC; example entries: addresses 0x04, 0x08, 0x0B, 0x0D in region 0x21 at positions 1 to 4]

Experimental framework • 3 classical prefetch engines: – Tagged, RPT, and GHB • With 6 different predictors – Upper accuracy threshold: 60% accuracy – Lower accuracy threshold: 20% accuracy – Code region granularity: basic block • With the gem5 simulator • PARSEC 2.1 benchmark suite
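The combined heuristic can be pictured as a table of useful / not-useful counters indexed by code region and stream position. The sketch below is one possible reading: the 60% and 20% thresholds come from the slide, while the class name, the three-way high / medium / low mapping, and the "medium" default for unseen entries are assumptions.

```python
UPPER_THRESHOLD = 0.60  # from the slide: upper accuracy threshold
LOWER_THRESHOLD = 0.20  # from the slide: lower accuracy threshold


class RegionProfileTable:
    """Hypothetical (region id, stream position) -> [useful, not useful] table."""

    def __init__(self):
        self.counters = {}

    def update(self, region_id, pos, was_useful):
        """Record the outcome of one prefetch issued from this region/position."""
        entry = self.counters.setdefault((region_id, pos), [0, 0])
        entry[0 if was_useful else 1] += 1

    def confidence(self, region_id, pos):
        """Classify the predicted accuracy of a new request."""
        useful, not_useful = self.counters.get((region_id, pos), (0, 0))
        total = useful + not_useful
        if total == 0:
            return "medium"  # assumption: no history yet
        accuracy = useful / total
        if accuracy > UPPER_THRESHOLD:
            return "high"
        if accuracy < LOWER_THRESHOLD:
            return "low"
        return "medium"


table = RegionProfileTable()
for _ in range(4):
    table.update(0x21, 1, True)   # position 1: always useful
table.update(0x21, 2, True)
table.update(0x21, 2, False)      # position 2: half useful
for _ in range(5):
    table.update(0x21, 4, False)  # position 4: never useful
```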

Result analysis: Predicted accuracy [figures: Tagged prefetcher, RPT prefetcher, GHB prefetcher] • The combined technique provides the most accurate prediction

Dynamic management of requests • Novel dynamic management technique • Uses our confidence predictor • Based on filtering and prioritization • Discard low-confidence requests • Prioritize high- and medium-confidence requests

Confidence predictor based filter • Strategy: – Discard requests according to their predicted confidence • Problem: – Discarded requests do not generate statistics – Confidence is not updated – Current filtering does not solve this problem • Solution: – Warm-up to train the prefetch profile table – Dynamic limit of accesses

Dynamic warmup strategy • p threshold: 0.7 • f threshold: 0.3 • warm-up limit: 3 [figure: prefetch profile table (@, Reg Id, Pos) and region profile table (Reg Id; counters U, !U, and W for Pos 1 to Pos N) for region 0x21, moving from not warmed up to warmed up, with the resulting accuracy and confidence (High, Medium)]
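The warm-up idea can be sketched as follows: an entry is never filtered until it has issued a minimum number of requests, so that statistics exist before any request is discarded. The 0.3 filter threshold and the warm-up limit of 3 are taken from the slide; the class name, the meaning given to the W counter (requests issued during warm-up), and the fixed rather than dynamic limit are assumptions.

```python
class WarmupFilter:
    """Hypothetical filter that waits for a warm-up period before discarding."""

    def __init__(self, f_threshold=0.3, warmup_limit=3):
        self.f_threshold = f_threshold
        self.warmup_limit = warmup_limit
        self.entries = {}  # (region_id, pos) -> [useful, not_useful, warm_count]

    def should_issue(self, region_id, pos):
        """Return True if the request may be issued, False if it is filtered."""
        entry = self.entries.setdefault((region_id, pos), [0, 0, 0])
        useful, not_useful, warm = entry
        if warm < self.warmup_limit:
            entry[2] += 1  # still warming up: always issue
            return True
        total = useful + not_useful
        if total == 0:
            return True    # assumption: no feedback yet, do not filter
        return useful / total >= self.f_threshold

    def feedback(self, region_id, pos, was_useful):
        """Record whether an issued prefetch turned out to be useful."""
        entry = self.entries.setdefault((region_id, pos), [0, 0, 0])
        entry[0 if was_useful else 1] += 1


flt = WarmupFilter()
decisions = []
for _ in range(4):
    issued = flt.should_issue(0x21, 1)
    decisions.append(issued)
    if issued:
        flt.feedback(0x21, 1, was_useful=False)
```

Because this sketch uses a fixed limit, an entry that is filtered after warm-up receives no further feedback and stays filtered; the dynamic limit mentioned on the previous slide is not modeled here.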

Confidence predictor based priority [figure]

Priority + Filter • p threshold: 0.7 • f threshold: 0.3 [figure: prefetcher, region profile table (Reg Id; counters U and !U for Pos 1 to Pos 4) for region 0x21, entries marked warmed up or not warmed up, High Prefetch Queue and Low Prefetch Queue (@, Reg Id, Pos); example addresses 0x04, 0x08, 0x0B, 0x0D]
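One way to read the combined scheme is as a router in front of two queues. The 0.7 and 0.3 thresholds are the p and f thresholds from the slide; interpreting them as "prioritize" and "filter" thresholds, sending not-yet-warmed-up requests to the low-priority queue, and draining the high-priority queue first are all assumptions of this sketch.

```python
from collections import deque

P_THRESHOLD = 0.7  # assumed meaning: at or above this, use the high-priority queue
F_THRESHOLD = 0.3  # assumed meaning: below this, filter the request


class PrefetchQueues:
    """Hypothetical high/low prefetch queues with confidence-based routing."""

    def __init__(self):
        self.high = deque()
        self.low = deque()

    def enqueue(self, addr, accuracy, warmed_up):
        """Route a request; returns 'high', 'low', or 'dropped'."""
        if not warmed_up:
            self.low.append(addr)  # assumption: never drop before warm-up
            return "low"
        if accuracy >= P_THRESHOLD:
            self.high.append(addr)
            return "high"
        if accuracy < F_THRESHOLD:
            return "dropped"
        self.low.append(addr)
        return "low"

    def next_request(self):
        """Issue from the high-priority queue first, then the low one."""
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None


q = PrefetchQueues()
routes = [
    q.enqueue(0x04, 0.5, True),
    q.enqueue(0x08, 0.9, True),
    q.enqueue(0x0B, 0.1, True),
    q.enqueue(0x0D, 0.0, False),
]
order = [q.next_request() for _ in range(4)]
```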

Experimental framework • Focused on the RPT stride prefetcher • With 2 baselines (filtering and prioritization) – Upper accuracy threshold: 60% accuracy – Lower accuracy threshold: 20% accuracy – Code region granularity: basic block – Warm-up period: 50 iterations – Prefetch profile table: 64 entries – Region profile table: 32 entries • PARSEC 2.1 benchmark suite

Filtering results [figures: baseline filtering; combined predictor filtering]

Prioritization results [figure]

Prioritization + Filtering results [figure] • Improving the confidence prediction of the requests can reduce network congestion

Distributed memories • Distribution of the memory access pattern @, @+2, @+4, @+6, @+8, @+10, @+12, @+14 across tiles: – TILE 00: @ – TILE 01: @+2 – TILE 02: @+4 – TILE 03: @+6 – TILE 04: @+8 – TILE 05: @+10 – TILE 06: @+12 – TILE 07: @+14
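To see why a strided stream ends up scattered across tiles, the following sketch maps addresses to home L2 tiles with simple block interleaving. The mapping function, the block size of 2 address units, and the count of 8 tiles are assumptions chosen to mirror the figure; they are not the address mapping of the simulated system.

```python
def home_tile(addr, num_tiles=8, block_size=2):
    """Hypothetical block-interleaved mapping from an address to its home tile."""
    return (addr // block_size) % num_tiles


base = 0x100
stream = [base + 2 * i for i in range(8)]  # @, @+2, ..., @+14
tiles = [home_tile(a) for a in stream]

# Accesses observed by each tile: every tile sees exactly one address.
per_tile = {t: [a - base for a in stream if home_tile(a) == t] for t in range(8)}
```

Each tile observes a single access from the stream, so a prefetcher attached to one L2 tile has no local stride to detect.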

Prefetch Distributed Memory Systems • Analysis phase: pattern detection challenge [figure: distributed L2 memory; L1 miss for @]

Prefetch Distributed Memory Systems • Request generation phase: queue filtering challenge [figure: distributed L2 memory; requests for @, @+2, @+4]

Prefetch Distributed Memory Systems • Evaluation phase: dynamic profiling challenge [figure: distributed L2 memory; requests for @, @+2, @+4; L1 miss for @+2]

Challenge evaluation methodology • Three environments to test the challenges • Pattern detection challenge: ideal prefetcher – Prefetcher: GHB • Prefetch queue filtering challenge: centralized queue – Prefetcher: Tagged prefetcher • Dynamic profiling challenge: hardware counters – Prefetcher: Tagged prefetcher

Experimental framework • GHB and Tagged prefetchers • With the gem5 simulator using – 64 tiled x86 CPUs – L2 prefetchers – Ruby memory system – MOESI coherence protocol – Garnet network simulator • PARSEC 2.1 benchmark suite

Pattern Detection Challenge [figure]

Prefetch Queue Filtering Challenge [figure]

Dynamic Profiling Challenge [figure]

Facing the challenges • There are two main options – Redesign the entire prefetch philosophy – Adapt the current techniques to work with DSMs • Moreover, there are two main directions – Centralize the information in this kind of architecture (handicap: increased communication) – Distribute the prefetcher (handicap: distributing the prefetcher smartly) • There is still room for improvement

Conclusions • We have seen that new technologies provide opportunities for improving prefetching. • We have shown that incorporating the effect of the NoC may increase the accuracy of the simulation and lead to better decisions. • We have demonstrated that the contention injected by the prefetcher in the NoC is not negligible and can increase the memory latency.

Conclusions • We have proposed a novel confidence predictor that can accurately predict the usefulness of prefetch requests before they are issued. • We have successfully used our confidence predictor to improve techniques that try to reduce contention on the network-on-chip. • We have identified unexpected challenges that appear when prefetching in DSMs and we have provided research directions to face them.

Contributions • Prefetching evaluation in multi-core platforms – M. Torrents, R. Martínez, C. Molina, An Accurate and Detailed Prefetching Simulation Framework for gem5, The Second gem5 User Workshop (GEM5’15), held in conjunction with the 42nd International Symposium on Computer Architecture (ISCA’15), Portland (USA), June 2015. – M. Torrents, R. Martínez, C. Molina, Network Aware Performance Evaluation of Prefetching Techniques in CMPs, Journal of Simulation Modelling Practice and Theory, Volume 45, June 2014. – M. Torrents, R. Martínez, P. Lopez, J. M. Codina, A. Gonzalez, Comparative Study of Prefetching Mechanisms, In XXI Jornadas de Paralelismo, 2010. 7.4.2.

Contributions • Confidence predictor mechanisms for prefetching in CMPs – M. Torrents, R. Martínez, C. Molina, CRaSP: A novel confidence predictor to improve prefetching performance in CMPs, Submitted to The 44th International Symposium on Computer Architecture (ISCA’17), Toronto, ON, Canada, June 2017. – M. Torrents, R. Martínez, C. Molina, Improving the Prefetching Performance Through Code Region Profiling, In Proceedings of the 2nd International BSC Doctoral Symposium (BSC’15), Barcelona (Spain), May 2015. – R. Martínez, E. Gibert, P. Lopez, M. Torrents, et al., Profiling asynchronous events resulting from the execution of software at code region granularity. US Patent App. 13/993,054, 2011. 7.4.3.

Contributions • Improving the Prioritization and Filtering of Prefetch Requests – M. Torrents, R. Martínez, C. Molina, Improving the prioritization and filtering of prefetching request, Tech. Report UPC-DAC-RR-2016-5, July 2016. 7.4.4. • Prefetching Challenges in Distributed Shared Memories for CMPs – M. Torrents, R. Martínez, C. Molina, Facing Prefetching Challenges in Distributed Shared Memories, The Journal of Supercomputing (JSC’16), February 2016. – M. Torrents, R. Martínez, C. Molina, Prefetching Challenges in Distributed Memories for CMPs, In Proceedings of the International Conference on Computational Science (ICCS’15), Reykjavík (Iceland), June 2015.

Improving Prefetching Mechanisms for Tiled CMP Platforms • 22nd July 2016 • Candidate: Martí Torrents • Advisors: Raúl Martínez, Carlos Molina • Tutor: Antonio González

Reducing the division costs
• Calculate and evaluate the accuracy: is ACC > TH?
– ACC = USEFUL / (USEFUL + NON USEFUL)
– USEFUL > USEFUL · TH + NON USEFUL · TH
– USEFUL - USEFUL · TH > NON USEFUL · TH
– USEFUL · (1 - TH) > NON USEFUL · TH
• If we force TH to be a percentage that is a multiple of 10 (here TH = 0.7):
– USEFUL · (1 - 0.7) > NON USEFUL · 0.7
– 3 · USEFUL > 7 · NON USEFUL
– 2 · USEFUL + USEFUL > 8 · NON USEFUL - NON USEFUL
– (USEFUL << 1) + USEFUL > (NON USEFUL << 3) - NON USEFUL
• Cost: 2 shifts + 2 ADD operations + a comparison
• If the shift and ADD operations are done in parallel: 3 cycles
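The final shift-and-add inequality can be checked against the original threshold test. The sketch below hard-codes TH = 0.7 as on the slide; the function name is hypothetical, and the reference comparison uses exact integer arithmetic (10 · USEFUL > 7 · (USEFUL + NON USEFUL)) so that no floating-point rounding is involved.

```python
def exceeds_70_percent(useful, non_useful):
    """True when useful / (useful + non_useful) > 0.7, computed without dividing."""
    lhs = (useful << 1) + useful          # 3 * useful: one shift, one add
    rhs = (non_useful << 3) - non_useful  # 7 * non_useful: one shift, one subtract
    return lhs > rhs


# Compare with the exact form of ACC > 0.7 for every small counter pair.
agrees_with_division = all(
    exceeds_70_percent(u, n) == (10 * u > 7 * (u + n))
    for u in range(64)
    for n in range(64)
    if u + n > 0
)
```

For example, 7 useful and 3 non-useful prefetches give an accuracy of exactly 0.7, which does not exceed the threshold, while 8 useful and 3 non-useful do.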