Lecture on High Performance Processor Architecture CS 05162

Lecture on High Performance Processor Architecture (CS 05162)
Limits on Instruction-Level Parallelism
An Hong, han@ustc.edu.cn
Fall 2007
University of Science and Technology of China
Department of Computer Science and Technology
CS of USTC AN Hong

Limits to ILP
• Conflicting studies of the amount of ILP, depending on:
  - Benchmarks (vectorized Fortran FP vs. integer C programs)
  - Hardware sophistication
  - Compiler sophistication
• How much ILP is available using existing mechanisms with increasing HW budgets?
• Do we need to invent new HW/SW mechanisms to keep on the processor performance curve?
  - DLPs: Intel MMX, SSE2; stream processors
  - TLPs: IBM Power 5 (SMT/CMP)
  - PCAs: RAW, Smart Memory, TRIPS
  - etc.
2021/9/6 CS of USTC AN Hong

Overcoming Limits
• Advances in compiler technology plus significantly new and different hardware techniques may be able to overcome the limitations assumed in these studies
• However, it is unlikely that such advances, when coupled with realistic hardware, will overcome these limits in the near future

Limits to ILP
• Initial HW model here; MIPS compilers. Assumptions for the ideal/perfect machine to start:
  1. Register renaming: infinite virtual registers => all register WAW & WAR hazards are avoided
  2. Branch prediction: perfect; no mispredictions
  3. Jump prediction: all jumps perfectly predicted (returns, case statements)
     2 & 3 => no control dependencies; perfect speculation & an unbounded buffer of instructions available
  4. Memory-address alias analysis: addresses known & a load can be moved before a store provided the addresses are not equal
     1 & 4 eliminate all but RAW hazards
• Also: perfect caches; 1-cycle latency for all instructions (including FP *, /); unlimited instructions issued per clock cycle
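Under these assumptions only true (RAW) data dependences delay an instruction, so the ideal IPC of a trace is its length divided by its dataflow depth. A minimal Python sketch of that calculation follows; the `Instr` record, the register names, and the four-instruction trace are illustrative assumptions, not the traces or tools used in the study.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Instr:
    dst: Optional[str]            # destination register, or None
    srcs: Tuple[str, ...] = ()    # source registers

def ideal_ipc(trace):
    """IPC on the ideal machine: unit latency, only RAW dependences delay execution."""
    produced_at = {}   # register -> cycle in which its most recent writer executes
    depth = 0
    for ins in trace:  # program order, so each read sees its true producer
        cycle = 1 + max((produced_at.get(r, 0) for r in ins.srcs), default=0)
        if ins.dst is not None:
            produced_at[ins.dst] = cycle   # renaming: WAW/WAR never delay this write
        depth = max(depth, cycle)
    return len(trace) / depth if depth else 0.0

trace = [
    Instr("r1", ("r0",)),        # cycle 1
    Instr("r2", ("r1",)),        # cycle 2 (RAW on r1)
    Instr("r1", ("r3",)),        # cycle 1 (WAW/WAR on r1 removed by renaming)
    Instr("r4", ("r1", "r2")),   # cycle 3 (RAW on r2)
]
print(ideal_ipc(trace))          # 4 instructions over a dataflow depth of 3
```

The third instruction reuses `r1` yet still executes in cycle 1, which is the effect of assumption 1.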

Limits to ILP: HW Model Comparison

                              | Model    | Power 5
Instructions issued per clock | Infinite | 4
Instruction window size       | Infinite | 200
Renaming registers            | Infinite | 48 integer + 40 Fl. Pt.
Branch prediction             | Perfect  | 2% to 6% misprediction (tournament branch predictor)
Cache                         | Perfect  | 64KI, 32KD, 1.92MB L2, 36MB L3
Memory alias analysis         | Perfect  | ??

Upper Limit to ILP: Ideal Machine
[Figure: instructions per clock on the ideal machine. FP: 75 - 150; Integer: 18 - 60]

Limits to ILP: HW Model Comparison

                              | New Model                  | Model    | Power 5
Instructions issued per clock | Infinite                   | Infinite | 4
Instruction window size       | Infinite, 2K, 512, 128, 32 | Infinite | 200
Renaming registers            | Infinite                   | Infinite | 48 integer + 40 Fl. Pt.
Branch prediction             | Perfect                    | Perfect  | 2% to 6% misprediction (tournament branch predictor)
Cache                         | Perfect                    | Perfect  | 64KI, 32KD, 1.92MB L2, 36MB L3
Memory alias                  | Perfect                    | Perfect  | ??

More Realistic HW: Window Impact
Change from an infinite window to 2048, 512, 128, 32
[Figure: IPC by window size. Ideal machine: FP 75 - 150, Integer 18 - 60. With limited windows: FP 9 - 150, Integer 8 - 63]
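A rough sketch of how a finite window caps IPC. It assumes unit latency, perfect renaming and prediction, and a window that slides in program order (instruction i may execute only after instruction i - W has retired); the function name and the trace are hypothetical and are not the simulator behind the figure.

```python
def windowed_ipc(trace, window):
    """IPC for a trace of (dst, srcs) tuples with a sliding window of `window` entries."""
    if window < 1:
        raise ValueError("window must hold at least one instruction")
    produced_at = {}   # register -> cycle in which its latest writer executes
    retired_by = []    # retired_by[i]: cycle by which instructions 0..i have all executed
    for i, (dst, srcs) in enumerate(trace):
        # A slot frees up only when the instruction `window` places earlier retires.
        enter = retired_by[i - window] + 1 if i >= window else 1
        operands = 1 + max((produced_at.get(r, 0) for r in srcs), default=0)
        cycle = max(enter, operands)
        if dst is not None:
            produced_at[dst] = cycle
        retired_by.append(max(cycle, retired_by[-1]) if retired_by else cycle)
    return len(trace) / retired_by[-1] if retired_by else 0.0

# A 4-long dependent chain followed by 4 independent instructions.
chain = [("r1", ("r0",)), ("r2", ("r1",)), ("r3", ("r2",)), ("r4", ("r3",))]
independent = [("a", ()), ("b", ()), ("c", ()), ("d", ())]
for w in (8, 2, 1):
    print(w, windowed_ipc(chain + independent, w))
```

With a window of 8 the independent instructions overlap the chain; with smaller windows they wait behind it, which is the qualitative trend the slide describes.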

Limits to ILP: HW Model Comparison

                              | New Model                                                    | Model    | Power 5
Instructions issued per clock | 64                                                           | Infinite | 4
Instruction window size       | 2048                                                         | Infinite | 200
Renaming registers            | Infinite                                                     | Infinite | 48 integer + 40 Fl. Pt.
Branch prediction             | Perfect vs. 8K tournament vs. 512 2-bit vs. profile vs. none | Perfect  | 2% to 6% misprediction (tournament branch predictor)
Cache                         | Perfect                                                      | Perfect  | 64KI, 32KD, 1.92MB L2, 36MB L3
Memory alias                  | Perfect                                                      | Perfect  | ??

More Realistic HW: Branch Impact
Change from an infinite window to examine to 2048 instructions, with a maximum issue of 64 instructions per clock cycle
[Figure: IPC by branch predictor (Perfect, Tournament, BHT (512), Profile, No prediction). Ideal machine: FP 75 - 150, Integer 18 - 60. With realistic prediction: FP 15 - 45, Integer 6 - 12]
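For reference, a 512-entry 2-bit scheme like the BHT (512) point above can be sketched as a table of saturating counters. The indexing by `pc >> 2`, the initial counter value, and the loop-branch trace are assumptions made for illustration, not details taken from the study.

```python
def mispredict_rate(branches, entries=512):
    """branches: iterable of (pc, taken) pairs; returns the fraction mispredicted."""
    table = [1] * entries   # 2-bit counters 0..3; start weakly not-taken (assumed)
    wrong = total = 0
    for pc, taken in branches:
        idx = (pc >> 2) % entries        # drop the byte offset of 4-byte instructions
        predict_taken = table[idx] >= 2  # upper half of the counter range means "taken"
        wrong += int(predict_taken != taken)
        total += 1
        # Saturating update: move toward 3 on taken, toward 0 on not taken.
        table[idx] = min(3, table[idx] + 1) if taken else max(0, table[idx] - 1)
    return wrong / total if total else 0.0

# A loop-closing branch: taken 9 times, then not taken, repeated for 10 loop runs.
loop_branch = [(0x400, k != 9) for _ in range(10) for k in range(10)]
print(mispredict_rate(loop_branch))
```

After warm-up the counter mispredicts only the loop exit, so the rate settles near one miss per loop run.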

Misprediction Rates
[Figure: branch misprediction rates]

Limits to ILP: HW Model Comparison

                              | New Model                             | Model    | Power 5
Instructions issued per clock | 64                                    | Infinite | 4
Instruction window size       | 2048                                  | Infinite | 200
Renaming registers            | Infinite vs. 256, 128, 64, 32, none   | Infinite | 48 integer + 40 Fl. Pt.
Branch prediction             | 8K 2-bit                              | Perfect  | Tournament branch predictor
Cache                         | Perfect                               | Perfect  | 64KI, 32KD, 1.92MB L2, 36MB L3
Memory alias                  | Perfect                               | Perfect  |

More Realistic HW: Renaming Register Impact (N int + N fp)
Change: 2048-instruction window, 64-instruction issue, 8K 2-level prediction
[Figure: IPC by number of renaming registers (Infinite, 256, 128, 64, 32, None). FP: 11 - 45; Integer: 5 - 15]
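A minimal sketch of what renaming does, assuming an unbounded pool of physical registers (the "Infinite" case). The `p0, p1, ...` names and the three-instruction trace are hypothetical; a finite pool would additionally have to stall when no free register is left, which this sketch does not model.

```python
from itertools import count

def rename(trace):
    """Give every destination in a (dst, srcs) trace a fresh physical register."""
    fresh = (f"p{i}" for i in count())
    mapping = {}   # architectural register -> current physical register
    renamed = []
    for dst, srcs in trace:
        phys_srcs = []
        for r in srcs:                   # sources read the mapping before dst changes it
            if r not in mapping:
                mapping[r] = next(fresh) # live-in value: name it on first use
            phys_srcs.append(mapping[r])
        phys_dst = None
        if dst is not None:
            phys_dst = mapping[dst] = next(fresh)
        renamed.append((phys_dst, tuple(phys_srcs)))
    return renamed

code = [
    ("r1", ("r2",)),   # r1 = f(r2)
    ("r3", ("r1",)),   # r3 = g(r1)   RAW on r1
    ("r1", ("r4",)),   # r1 = h(r4)   WAW with the first write, WAR with the read above
]
for ins in rename(code):
    print(ins)
```

After renaming, the first and third instructions write different physical registers, so only the RAW dependence between the first two remains.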

Limits to ILP: HW Model Comparison

                              | New Model                               | Model    | Power 5
Instructions issued per clock | 64                                      | Infinite | 4
Instruction window size       | 2048                                    | Infinite | 200
Renaming registers            | 256 Int + 256 FP                        | Infinite | 48 integer + 40 Fl. Pt.
Branch prediction             | 8K 2-bit                                | Perfect  | Tournament
Cache                         | Perfect                                 | Perfect  | 64KI, 32KD, 1.92MB L2, 36MB L3
Memory alias                  | Perfect vs. Stack vs. Inspect vs. none  | Perfect  | Perfect

More Realistic HW: Memory Address Alias Impact
Change: 2048-instruction window, 64-instruction issue, 8K 2-level prediction, 256 renaming registers
[Figure: IPC by alias analysis model (Perfect; Global/Stack perf, heap conflicts; Inspec. Assem.; None). FP: 4 - 45 (Fortran, no heap); Integer: 4 - 9]
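The two extreme alias models can be sketched as a single check on whether a load may move above the stores that precede it. The function name and the addresses are hypothetical, and the intermediate models (global/stack perfect, inspection) are deliberately left out.

```python
def can_hoist_load(load_addr, earlier_store_addrs, model="perfect"):
    """May a load be moved above all earlier, not yet executed stores?"""
    stores = list(earlier_store_addrs)
    if model == "perfect":
        # Addresses are known exactly: only a store to the same address blocks the load.
        return all(addr != load_addr for addr in stores)
    if model == "none":
        # No alias analysis: every earlier store must be assumed to conflict.
        return not stores
    raise ValueError(f"unknown alias model: {model}")

print(can_hoist_load(0x100, [0x200, 0x300]))          # different addresses
print(can_hoist_load(0x100, [0x200, 0x100]))          # one store aliases the load
print(can_hoist_load(0x100, [0x200], model="none"))   # conservative model
```

The gap between the two models is what shrinks the loads-and-stores parallelism as alias analysis gets weaker.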

Limits to ILP: HW Model Comparison

                              | New Model                       | Model    | Power 5
Instructions issued per clock | 64 (no restrictions)            | Infinite | 4
Instruction window size       | Infinite vs. 256, 128, 64, 32   | Infinite | 200
Renaming registers            | 64 Int + 64 FP                  | Infinite | 48 integer + 40 Fl. Pt.
Branch prediction             | 1K 2-bit                        | Perfect  | Tournament
Cache                         | Perfect                         | Perfect  | 64KI, 32KD, 1.92MB L2, 36MB L3
Memory alias                  | HW disambiguation               | Perfect  |

Realistic HW: Window Impact
Perfect disambiguation (HW), 1K selective prediction, 16-entry return, 64 registers, issue as many as window
[Figure: IPC by window size (Infinite, 256, 128, 64, 32, 16, 8, 4). FP: 8 - 45; Integer: 6 - 12]

Analysis of the ILP Limit: What Went Wrong?
• Preserving sequential semantics while reordering instructions is hard, especially in hardware
• Limits to reordering
  - Branches: control-flow limit
  - Loads and stores: data-flow limit

Analysis of the ILP Limit: The Dynamically-Scheduled Superscalar Model
Instructions are sequenced and retired in program order (fixed-size window)
• How ILP is extracted
  - Build an instruction window and determine control dependences
  - Determine and minimize the data dependences among the instructions in the window
  - Schedule instructions for parallel execution
• Software approaches to extracting ILP vs. hardware approaches to extracting ILP

Analysis of the ILP Limit: The Dynamically-Scheduled Superscalar Model
Instructions are sequenced and retired in program order (fixed-size window)
• In-order sequencing establishes the correct data dependences between instructions required to implement the meaning of the program

Analysis of the ILP Limit: The Dynamically-Scheduled Superscalar Model
Instructions are sequenced and retired in program order (fixed-size window)
Issue 1: Not enough ready-to-execute useful instructions, due to two kinds of interruptions:
• At the sequencing end (direct interruptions)
  - Instruction cache misses
  - Branch mispredictions
• At the retirement end (indirect interruptions)
  - Long execution latencies
  - FP divide
  - Load (data cache miss)

Analysis of the ILP Limit: The Dynamically-Scheduled Superscalar Model
Instructions are sequenced and retired in program order (fixed-size window)
Issue 2: Not conducive to high processor utilization, because sequencing order and global data-driven order rarely match!
• Execution should take place in global data-driven order (dataflow order),
• but execution order is constrained by sequencing order (control-flow order).

Analysis of the ILP Limit
• Improvements in ILP are limited by hardware complexity
  - Dynamically reorder instructions to fill multiple execution units
  - Must preserve sequential semantics => requires dependency checking
  - Complexity grows as the product of the number of instructions in flight and the number of execution units
  - Work by Sun, IBM, and Compaq indicates that a superscalar width of about 4 is the current cost vs. performance point

Analysis of the ILP Limit
• Doubling issue rates above today's 3-6 instructions per clock, say to 6 to 12 instructions, probably requires a processor to
  - issue 3 or 4 data memory accesses per cycle,
  - resolve 2 or 3 branches per cycle,
  - rename and access more than 20 registers per cycle, and
  - fetch 12 to 24 instructions per cycle.
• The complexity of implementing these capabilities is likely to mean sacrifices in the maximum clock rate
  - E.g., the widest-issue processor is the Itanium 2, but it also has the slowest clock rate, despite the fact that it consumes the most power!

Limits to ILP
• Most ILP techniques for increasing performance also increase power consumption
• Multiple-issue processor techniques are all energy inefficient:
  1. Issuing multiple instructions incurs some overhead in logic that grows faster than the issue rate grows
  2. Growing gap between peak issue rates and sustained performance
• Number of transistors switching = f(peak issue rate), and performance = f(sustained rate); the growing gap between peak and sustained performance => increasing energy per unit of performance
• The key question is whether a technique is energy efficient: does it increase power consumption faster than it increases performance?
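The peak vs. sustained argument can be made concrete with a toy calculation. It assumes, purely for illustration, that switching energy per cycle is linear in the peak issue width and that work per cycle equals the sustained IPC; the function name and all numbers are hypothetical, not measurements.

```python
def relative_energy_per_instruction(peak_issue, sustained_ipc):
    """Energy per completed instruction, in arbitrary units, under a linear model."""
    if sustained_ipc <= 0:
        raise ValueError("sustained IPC must be positive")
    # Energy per cycle ~ peak_issue (assumed); instructions per cycle = sustained_ipc.
    return peak_issue / sustained_ipc

narrow = relative_energy_per_instruction(peak_issue=4, sustained_ipc=2.0)
wide = relative_energy_per_instruction(peak_issue=8, sustained_ipc=2.8)
print(narrow, wide, wide > narrow)
```

In this made-up case doubling the peak width raises sustained IPC by only 40%, so each instruction costs more energy, which is the inefficiency the slide describes.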

How to Exceed the ILP Limits of This Study?

Performance beyond single-thread ILP
• There can be much higher natural parallelism in some applications (e.g., database or scientific codes)
• Explicit Thread Level Parallelism or Data Level Parallelism
• Thread: process with its own instructions and data
  - A thread may be a process that is part of a parallel program of multiple processes, or it may be an independent program
  - Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
• Data Level Parallelism: perform the same operations on data, and lots of data
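A small sketch of the two forms of explicit parallelism. The helper names are hypothetical, and CPython threads are used only to show the structure (independent tasks with their own state); they are not expected to show a speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

def scale(xs, k):
    """Data-level parallelism: the same operation applied to every element."""
    return [k * x for x in xs]

def run_threads(tasks):
    """Thread-level parallelism: independent callables, each with its own state."""
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [f.result() for f in futures]   # results in submission order

print(scale([1, 2, 3], 2))
print(run_threads([lambda: sum(range(10)), lambda: "x".upper()]))
```

In `scale` every element undergoes an identical operation, so the work maps onto SIMD-style hardware; in `run_threads` the tasks need not resemble each other at all.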