The 1st JILP Data Prefetching Championship (DPC-1)


Enhancement for Accurate Stream Prefetching

Gang Liu (1), Zhuo Huang (1), Jih-Kwon Peir (1), Xudong Shi (2), Lu Peng (3)
1. University of Florida
2. Google Inc
3. Louisiana State University

Outline
- Introduction
- Enhancement techniques
    - Integrating stride prefetching
    - Stream repetition
    - Noise removal
    - Dead stream removal
- Performance Evaluation

Background
- Data prefetching
    - Miss address regularity
        - Stride
        - Stream
        - Distance
    - Miss address correlation
        - Correlation
        - Markov
        - Hot Stream

Stream Prefetcher
- Training a stream: 3 consecutive block misses in a small region (16 blocks) in the same direction
- Example miss sequence: 100, 102, 104, 503, 504, 501, 499
[Figure: training attempts on the miss sequence, with the 1st, 2nd, and 3rd miss of each attempt marked; two attempts are labeled "Trained!" and one is labeled "Fail!"]
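The training rule can be sketched in C as a sliding window over the three most recent misses. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `stream_trainer` and `trainer_observe` and the sliding-window bookkeeping are hypothetical; only the three-miss rule, the 16-block region, and the same-direction requirement come from the slide.

```c
#include <stdbool.h>
#include <stdlib.h>

#define REGION_BLOCKS 16  /* training region size stated on the slide */

/* Hypothetical trainer state: the two most recent miss block addresses. */
typedef struct {
    long miss[2];  /* oldest first */
    int  n;        /* number of valid entries in miss[] (0..2) */
} stream_trainer;

/* Feed one missed block address.
 * Returns +1 (ascending) or -1 (descending) when the last three misses
 * move in one direction inside a 16-block region, and 0 otherwise. */
int trainer_observe(stream_trainer *t, long addr)
{
    int dir = 0;
    if (t->n == 2) {
        long d1 = t->miss[1] - t->miss[0];
        long d2 = addr - t->miss[1];
        bool same_dir  = (d1 > 0 && d2 > 0) || (d1 < 0 && d2 < 0);
        bool in_region = labs(addr - t->miss[0]) < REGION_BLOCKS;
        if (same_dir && in_region)
            dir = (d2 > 0) ? 1 : -1;
    }
    /* Slide the window so it always holds the two latest misses. */
    if (t->n == 2) {
        t->miss[0] = t->miss[1];
        t->miss[1] = addr;
    } else {
        t->miss[t->n++] = addr;
    }
    return dir;
}
```

Under this sketch, the example miss sequence trains an ascending stream at 104 (window 100, 102, 104), rejects the window 503, 504, 501 because the direction changes, and trains a descending stream at 499 (window 504, 501, 499).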

Stream Prefetcher
- Prefetching
[Figure: a stream laid out along the address space, showing the stream direction, the memory access, the monitored region between start addr and end addr, the original addr, the prefetch distance, and the prefetch degree]

Enhance #1 – integrating stride prefetching
- Constant stride (from Art)
    - Example: code segment from Art

Memory allocation:

    for (i=0; i<numf1s; i++)
    {   // numf1s = 10000, numf2s = 11
        bus[i] = (double *)malloc(numf2s*sizeof(double));
        tds[i] = (double *)malloc(numf2s*sizeof(double));
    }

Memory access:

    for (tj=0; tj<numf2s; tj++)
    {
        Y[tj].y = 0;
        if ( !Y[tj].reset )
            for (ti=0; ti<numf1s; ti++)
                Y[tj].y += f1_layer[ti].P * bus[ti][tj];
    }

- &bus[i+1][j] - &bus[i][j] = 192 bytes = 3 blocks, so the inner loop touches every third block
- Regular stream prefetches: … 100, 101, 102, 103, 104, 105 … (every block, including the two out of three that the loop never touches)

Enhance #1 – integrating stride prefetching
- Stream w/ stride
[Figure: a stream with stride, showing the stream direction, the memory access, the monitored region between start addr and end addr, the original addr, and spans of prefetch distance * stride and prefetch degree * stride]
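One way to picture a stream with stride is the following C sketch, in which the monitored region and the prefetch window are both scaled by the stride. It assumes a conventional stream-prefetcher policy (an access inside the monitored region issues `degree` prefetches just past the region's end, then the region advances by the same amount); that policy, the names `stream_entry`, `stream_init`, and `stream_access`, and the use of block addresses are illustrative assumptions. The distance/degree values 64/4 are the ones listed in the evaluation configuration.

```c
#include <stdbool.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 64  /* from the evaluation configuration */
#define PREFETCH_DEGREE   4   /* from the evaluation configuration */

/* Hypothetical stream table entry; all addresses are block addresses. */
typedef struct {
    long start;   /* first block of the monitored region */
    long end;     /* last block of the monitored region */
    int  dir;     /* +1 ascending, -1 descending */
    long stride;  /* stride in blocks; 1 gives a plain stream */
} stream_entry;

/* Create a stream at the original address; the region spans distance * stride. */
void stream_init(stream_entry *s, long orig, int dir, long stride)
{
    s->start  = orig;
    s->dir    = dir;
    s->stride = stride;
    s->end    = orig + dir * stride * PREFETCH_DISTANCE;
}

static bool in_region(const stream_entry *s, long addr)
{
    return s->dir > 0 ? (addr >= s->start && addr <= s->end)
                      : (addr <= s->start && addr >= s->end);
}

/* On an access inside the monitored region, write PREFETCH_DEGREE prefetch
 * addresses (spaced by the stride) into out[] and advance the region.
 * Returns the number of prefetches issued (0 if the access is outside). */
size_t stream_access(stream_entry *s, long addr, long out[PREFETCH_DEGREE])
{
    if (!in_region(s, addr))
        return 0;
    long step = s->dir * s->stride;
    for (int k = 0; k < PREFETCH_DEGREE; k++)
        out[k] = s->end + step * (k + 1);
    s->start += step * PREFETCH_DEGREE;
    s->end   += step * PREFETCH_DEGREE;
    return PREFETCH_DEGREE;
}
```

With the 3-block stride from the Art example, a stream created at block 100 monitors blocks 100 to 292, and an access to block 103 issues prefetches for blocks 295, 298, 301, and 304 rather than for four consecutive blocks.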

Enhance #2 – stream repetition
- Early prefetching of repeated streams
- Repeated stream: separated by a few instructions

Memory access:

    for (tj=0; tj<numf2s; tj++)
    {
        Y[tj].y = 0;
        if ( !Y[tj].reset )
            for (ti=0; ti<numf1s; ti++)
                Y[tj].y += f1_layer[ti].P * bus[ti][tj];
    }

[Figure: a repeated stream over the same memory, showing the stream direction, the monitored region between start addr and end addr, the original addr, the prefetch distance, and the prefetch degree]

Enhance #3 – noise removal
- Special noise prevents a stream from being trained
- Missed block sequence: 106, 107, 104, 105, 102, 103
[Figure: two training attempts on the sequence, each with the 1st, 2nd, and 3rd miss marked. Regular training: Fail! Training w/ noise removal: ignore noise, Succeed!]
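The slide does not spell out how noise is identified, so the following C sketch shows only one possible reading: each candidate direction keeps its own tracker, and a miss that goes against that direction is skipped as noise instead of failing the training. The two-tracker design, the noise budget `MAX_NOISE`, and all names are assumptions made for illustration; the 16-block region and the three-miss requirement are taken from the stream training rule.

```c
#include <stdlib.h>

#define REGION_BLOCKS 16  /* training region, from the training rule */
#define MAX_NOISE     2   /* assumed: misses a tracker may skip as noise */

/* Hypothetical per-direction tracker. */
typedef struct {
    long last;   /* last miss accepted in this direction */
    int  count;  /* accepted misses so far */
    int  noise;  /* misses skipped as noise */
} dir_tracker;

typedef struct {
    int         valid;
    long        first;     /* first miss of this training attempt */
    dir_tracker up, down;  /* ascending and descending candidates */
} noise_trainer;

static void trainer_restart(noise_trainer *t, long addr)
{
    t->valid = 1;
    t->first = addr;
    t->up    = (dir_tracker){ addr, 1, 0 };
    t->down  = (dir_tracker){ addr, 1, 0 };
}

/* Accept addr if it continues direction dir, otherwise count it as noise.
 * Returns 1 once three misses have been accepted. */
static int tracker_step(dir_tracker *d, long addr, int dir)
{
    if (d->noise > MAX_NOISE)
        return 0;  /* this candidate has used up its noise budget */
    if ((addr - d->last) * dir > 0) {
        d->last = addr;
        d->count++;
    } else {
        d->noise++;
    }
    return d->count == 3;
}

/* Returns +1 or -1 when a stream is trained, 0 otherwise. */
int noise_trainer_observe(noise_trainer *t, long addr)
{
    if (!t->valid || labs(addr - t->first) >= REGION_BLOCKS) {
        trainer_restart(t, addr);
        return 0;
    }
    int dir = 0;
    if (tracker_step(&t->up, addr, +1))
        dir = +1;
    else if (tracker_step(&t->down, addr, -1))
        dir = -1;

    if (dir != 0)
        t->valid = 0;  /* trained: free the training entry */
    else if (t->up.noise > MAX_NOISE && t->down.noise > MAX_NOISE)
        trainer_restart(t, addr);
    return dir;
}
```

On the sequence 106, 107, 104, 105, 102, 103, every window of three consecutive misses changes direction, so a strict three-consecutive-miss rule never trains. In this sketch the descending tracker accepts 106, 104, and 102, skips 107 and 105 as noise, and reports a descending stream at 102.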

Enhance #4 – dead stream removal
- Dead stream
    - Inactive for a long time (10k/100k cycles)
    - Stream is short (<128 blocks)
- Dead streams' first prefetching
[Figure: stream table size = 128; 85% unused]
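A replacement policy built on these two properties might look like the C sketch below. It is an illustration under assumptions, not the authors' design: it treats a stream as dead only when it is both idle and short, picks 10,000 cycles (one of the two values on the slide) as the idle threshold, and falls back to least-recently-used replacement when no dead stream exists. The names `stream_slot`, `stream_is_dead`, and `pick_victim` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEAD_IDLE_CYCLES    10000u  /* assumed; the slide lists 10k/100k cycles */
#define SHORT_STREAM_BLOCKS 128u    /* "short" stream bound from the slide */

/* Hypothetical stream table slot. */
typedef struct {
    bool     valid;
    uint64_t last_use;  /* cycle of the most recent access to this stream */
    unsigned length;    /* blocks the stream has covered so far */
} stream_slot;

/* A stream counts as dead here if it has been idle for a long time and is
 * short. Assumes now >= s->last_use. */
bool stream_is_dead(const stream_slot *s, uint64_t now)
{
    return s->valid
        && now - s->last_use > DEAD_IDLE_CYCLES
        && s->length < SHORT_STREAM_BLOCKS;
}

/* Choose the slot for a newly trained stream: a free or dead slot if one
 * exists, otherwise the least recently used slot. Requires n > 0. */
size_t pick_victim(const stream_slot *tab, size_t n, uint64_t now)
{
    size_t lru = 0;
    for (size_t i = 0; i < n; i++) {
        if (!tab[i].valid || stream_is_dead(&tab[i], now))
            return i;
        if (tab[i].last_use < tab[lru].last_use)
            lru = i;
    }
    return lru;
}
```

The design choice this illustrates is that a short, idle stream is evicted ahead of an older but long stream, which plain least-recently-used replacement would have evicted first.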

Performance Evaluation
- Evaluation
    - 12 high-MPKI SPEC 2000/SPEC 2006 benchmarks
    - CMPsim, 3 configurations (c1, c2, c3)
    - L2 prefetching only

Prefetcher      Configuration                                                            Size
GHB-distance    256 IT entries, 256 GHB entries, prefetch width/depth = 16/16            4 KB
Stream          8 combined entries, prefetch distance/degree = 64/4                      64 B
Enhance-Stream  8 stream entries, 16 training entries, prefetch distance/degree = 64/4   256 B

CPI Comparison
[Figure: CPI comparison, plotted as improvement over no prefetching. Callouts name stride prefetching (art), noise removal (soplex), and stream repetition; percentages shown in the chart include 46.4%, 23.2%, 37.6%, 5.5%, and about 1%.]
- Stream vs. Enhanced: C1: 1.8%, C2: 17.6%, C3: 18.7%

Sensitivity on stream table size
- Best case: 8/16 (8 stream entries / 16 training entries)

Effect of dead stream removal
- With a stream table of size 16, swim is 5% better than with size 8

Conclusion
- 37.6%, 41.6%, and 54.5% better than no prefetching for c1, c2, and c3, respectively.
- 1.8%, 17.6%, and 18.7% better than the original stream prefetcher.
- Hardware overhead is very small.