The 1st JILP Data Prefetching Championship (DPC-1)
Enhancement for Accurate Stream Prefetching
Gang Liu (1), Zhuo Huang (1), Jih-Kwon Peir (1), Xudong Shi (2), Lu Peng (3)
(1) University of Florida; (2) Google Inc; (3) Louisiana State University
Outline
- Introduction
- Enhancement techniques
  - Integrating stride prefetching
  - Stream repetition
  - Noise removal
  - Dead stream removal
- Performance evaluation
Background
- Data prefetching
  - Miss address regularity
    - Stride
    - Stream
    - Distance
  - Miss address correlation
    - Correlation
    - Markov
    - Hot stream
Stream Prefetcher
- Training a stream: 3 consecutive block misses in a small region (16 blocks) in the same direction
- Example miss sequence: 100, 102, 104, 503, 504, 501, 499
  - 100, 102, 104 (1st, 2nd, 3rd miss): trained
  - 503, 504, 501 (1st, 2nd, 3rd miss): fail, the direction changes
  - The later misses 501, 499 continue downward: trained
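The training rule above can be sketched in C as follows. This is a minimal illustration, not the authors' implementation: the `train_entry` layout, the function names, and the choice to restart training from the previous miss when the direction flips are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define REGION_BLOCKS 16  /* training region size, from the slide */
#define TRAIN_MISSES  3   /* misses needed to train a stream */

/* Hypothetical training entry; field names are illustrative. */
typedef struct {
    long first;  /* first miss of the current candidate stream */
    long last;   /* most recent miss */
    int  dir;    /* +1 ascending, -1 descending, 0 unknown */
    int  count;  /* misses counted so far */
} train_entry;

/* Begin training at the first miss. */
static void train_start(train_entry *e, long block) {
    e->first = e->last = block;
    e->dir = 0;
    e->count = 1;
}

/* Feed one miss (block address). Returns true once the entry has
   seen TRAIN_MISSES misses in one direction inside the region. */
static bool train_miss(train_entry *e, long block) {
    assert(e->count >= 1);
    long delta = block - e->last;
    int dir = (delta > 0) - (delta < 0);
    if (dir == 0) return false;            /* same block: nothing to learn */
    if (e->dir != 0 && dir != e->dir) {
        /* Direction flipped (assumed policy): the previous miss
           becomes the first miss of a new candidate stream. */
        e->first = e->last;
        e->dir = 0;
        e->count = 1;
    }
    if (labs(block - e->first) >= REGION_BLOCKS) {
        train_start(e, block);             /* left the region: start over */
        return false;
    }
    e->dir = dir;
    e->last = block;
    e->count++;
    return e->count >= TRAIN_MISSES;
}
```

Under these assumptions, feeding 100, 102, 104 trains an ascending stream on the third miss, while 503, 504, 501 does not train until the following miss at 499 completes a descending run.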
Stream Prefetcher
- Prefetching (diagram labels): stream direction, memory access, start addr, end addr, monitored region, original addr, prefetch distance, prefetch degree
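The diagram labels suggest the conventional scheme in which an access inside the monitored region triggers `degree` prefetches beyond the end address and slides the region forward. The sketch below assumes exactly that behavior; the `stream_entry` type and function names are hypothetical, and the 64/4 distance/degree values in the usage note are taken from the evaluation slide.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stream entry (block addresses). */
typedef struct {
    long start;  /* first block of the monitored region */
    long end;    /* last block of the monitored region */
    int  dir;    /* +1 ascending, -1 descending */
} stream_entry;

/* Place the monitored region at the original (training) address,
   extending `distance` blocks in the stream direction. */
static void stream_init(stream_entry *s, long orig, int dir, int distance) {
    assert(dir == 1 || dir == -1);
    s->start = orig;
    s->dir = dir;
    s->end = orig + (long)dir * distance;
}

/* True if `block` lies between start and end, inclusive. */
static bool in_region(const stream_entry *s, long block) {
    return s->dir > 0 ? (block >= s->start && block <= s->end)
                      : (block <= s->start && block >= s->end);
}

/* On an access inside the region, write `degree` prefetch addresses
   beyond the end address into out[] and advance the region by
   `degree` blocks. Returns the number of prefetches issued. */
static int stream_access(stream_entry *s, long block, int degree, long out[]) {
    if (!in_region(s, block)) return 0;
    for (int k = 1; k <= degree; k++)
        out[k - 1] = s->end + (long)s->dir * k;
    s->start += (long)s->dir * degree;
    s->end   += (long)s->dir * degree;
    return degree;
}
```

With an ascending stream initialised at block 100 and distance/degree = 64/4, an access to block 101 would request blocks 165 through 168 and move the region to 104..168.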
Enhance #1 – integrating stride prefetching
- Constant stride (from Art)
- Example: code segment from Art
  Memory allocation:
      for (i=0; i<numf1s; i++)
      { //numf1s = 10000, numf2s = 11
          …
          bus[i] = (double *)malloc(numf2s*sizeof(double));
          tds[i] = (double *)malloc(numf2s*sizeof(double));
      }
  Memory access:
      for (tj=0; tj<numf2s; tj++)
      {
          Y[tj].y = 0;
          if ( !Y[tj].reset )
              for (ti=0; ti<numf1s; ti++)
                  Y[tj].y += f1_layer[ti].P * bus[ti][tj];
      }
- &bus[i+1][j] - &bus[i][j] = 192 bytes = 3 blocks
- Regular stream prefetches: …100, 101, 102, 103, 104, 105…
Enhance #1 – integrating stride prefetching
- Stream w/ stride (diagram labels): stream direction, memory access, start addr, end addr, monitored region, original addr, prefetch distance * stride, prefetch degree * stride
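As a rough illustration of how the stride scales the region and the prefetch targets, the sketch below derives a constant stride from three training misses and then spaces the prefetches one stride apart. The function names are hypothetical, and the 64-byte block size is inferred from the slide's "192 bytes = 3 blocks".

```c
#include <assert.h>

#define BLOCK_BYTES 64  /* inferred: 192 bytes = 3 blocks */

/* Returns the signed stride in blocks if the three training misses
   are equally spaced, or 0 if no constant stride is present. */
static long detect_stride(long m1, long m2, long m3) {
    long d1 = m2 - m1;
    long d2 = m3 - m2;
    return (d1 == d2) ? d1 : 0;
}

/* End of the monitored region: distance * stride blocks from the
   original address. */
static long strided_region_end(long orig, long stride, int distance) {
    return orig + stride * distance;
}

/* Fill out[] with `degree` prefetch addresses beyond `end`, one per
   stride, so the prefetches span degree * stride blocks. */
static void strided_prefetches(long end, long stride, int degree, long out[]) {
    assert(stride != 0);
    for (int k = 1; k <= degree; k++)
        out[k - 1] = end + stride * k;
}
```

For the Art access pattern, misses at blocks 100, 103, 106 give a stride of 3, so only every third block is requested instead of 100, 101, 102, and so on.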
Enhance #2 – stream repetition
- Early prefetching of repeated streams
- Repeated stream: separated by a few instructions
- Memory access:
      for (tj=0; tj<numf2s; tj++)
      {
          Y[tj].y = 0;
          if ( !Y[tj].reset )
              for (ti=0; ti<numf1s; ti++)
                  Y[tj].y += f1_layer[ti].P * bus[ti][tj];
      }
- Diagram labels: stream direction, memory access, start addr, end addr, monitored region, original addr, prefetch distance, prefetch degree
Enhance #3 – noise removal
- Special noise prevents a stream from being trained
- Missed block sequence: 106, 107, 104, 105, 102, 103
- Regular training (1st, 2nd, 3rd miss): fail
- Training w/ noise removal (1st, 2nd, 3rd miss): ignore noise, succeed
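The slide does not spell out how noise is detected, so the following is only one hypothetical policy that makes the example sequence train: track how far the misses have extended above and below the first miss, and ignore any miss that extends neither side. All names are illustrative, and the 16-block region check is omitted for brevity.

```c
#include <assert.h>

/* Hypothetical training entry that tolerates interleaved noise. */
typedef struct {
    long anchor;     /* first miss of the training window */
    long last_up;    /* farthest miss seen above the anchor */
    long last_down;  /* farthest miss seen below the anchor */
    int  up;         /* misses that extended the upward run */
    int  down;       /* misses that extended the downward run */
} noise_entry;

/* Begin training at the first miss. */
static void noise_start(noise_entry *e, long block) {
    e->anchor = e->last_up = e->last_down = block;
    e->up = e->down = 0;
}

/* Feed one miss. Returns +1 or -1 once a direction has three misses
   (the anchor plus two extensions), or 0 while still training.
   A miss that extends neither direction is ignored as noise. */
static int noise_miss(noise_entry *e, long block) {
    assert(e->last_down <= e->anchor && e->anchor <= e->last_up);
    if (block > e->last_up) {
        e->last_up = block;
        e->up++;
    } else if (block < e->last_down) {
        e->last_down = block;
        e->down++;
    }
    if (e->up >= 2) return 1;
    if (e->down >= 2) return -1;
    return 0;
}
```

On 106, 107, 104, 105, 102, 103 this policy counts 106, 104, 102 as a descending run, skips 105, and reports a descending stream at the miss on block 102.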
Enhance #4 – dead stream removal
- Dead stream
  - Inactive for a long time (10k/100k cycles)
  - Stream is short (<128 blocks)
- Dead-streams' first prefetching
  - Stream table size = 128: 85% unused
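The two dead-stream criteria can be expressed as a simple predicate over the stream table. The sketch below assumes both conditions must hold together (the slide lists them without saying how they combine) and uses a hypothetical `stream_state` record; the idle threshold is passed in so that either 10k or 100k cycles can be tried.

```c
#include <assert.h>
#include <stdbool.h>

#define DEAD_MAX_BLOCKS 128  /* "short" stream bound, from the slide */

/* Hypothetical per-stream bookkeeping. */
typedef struct {
    unsigned long last_access_cycle;  /* cycle of the latest access */
    unsigned long length_blocks;      /* blocks covered so far */
    bool valid;
} stream_state;

/* A stream is treated as dead when it has been idle for at least
   `idle_threshold` cycles AND is still short. Combining the two
   conditions with AND is an assumption. */
static bool is_dead(const stream_state *s, unsigned long now,
                    unsigned long idle_threshold) {
    assert(now >= s->last_access_cycle);
    return s->valid
        && now - s->last_access_cycle >= idle_threshold
        && s->length_blocks < DEAD_MAX_BLOCKS;
}

/* Invalidate every dead entry; returns how many entries were freed. */
static int remove_dead(stream_state table[], int n, unsigned long now,
                       unsigned long idle_threshold) {
    int freed = 0;
    for (int i = 0; i < n; i++) {
        if (is_dead(&table[i], now, idle_threshold)) {
            table[i].valid = false;
            freed++;
        }
    }
    return freed;
}
```

Freed entries become available for new streams, which is what would let a small table behave like a larger one.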
Performance Evaluation
- Evaluation
  - 12 high-MPKI SPEC 2000/SPEC 2006 benchmarks
  - CMPsim, 3 configurations (c1, c2, c3)
  - L2 prefetching only
- Prefetcher configurations
  Prefetcher | Configuration | Size
  GHB-distance | 256 IT entries, 256 GHB entries, prefetch width/depth = 16/16 | 4 KB
  Stream | 8 combined entries, prefetch distance/degree = 64/4 | 64 B
  Enhance-Stream | 8 stream entries, 16 training entries, prefetch distance/degree = 64/4 | 256 B
CPI Comparison
- Metric: improvement over no prefetching
- Stream vs Enhanced: C1: 1.8%, C2: 17.6%, C3: 18.7%
- Other chart callouts: stride prefetching, art, soplex, C3 noise removal, stream repetition, with the values 46.4%, 23.2%, 37.6%, 5.5%, and about 1% (the pairing of labels and values is not recoverable here)
Sensitivity to stream table size
- Best case: 8/16
Effect of dead stream removal
- With a table size of 16, swim performs 5% better than with size 8
Conclusion
- 37.6%, 41.6%, and 54.5% better than no prefetching for c1, c2, and c3, respectively
- 1.8%, 17.6%, and 18.7% better than the original stream prefetcher
- The hardware overhead is very small