

CS 15-447: Computer Architecture
Lecture 27: Power Aware Architecture Design
November 24, 2007
Nael Abu-Ghazaleh
naelag@cmu.edu
http://www.qatar.cmu.edu/~msakr/15447-f08
15-447 Computer Architecture Fall 2008 ©

Uniprocessor Performance (SPECint)
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006
[Figure: SPECint performance over time; plot annotations: 3X, ??%/year]
Sea change in chip design: what is emerging?
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present

Three Walls
1. ILP Wall:
   1. Wall: not enough parallelism available in one thread
   2. Very costly to find more
   3. Implications: can't continue to grow IPC
   4. VLIW? SIMD ISA extensions?
2. Memory Wall:
   1. Growing gap between DRAM and processor speed
   2. Caching helps, but only so much
   3. Implications: cache misses are getting more expensive
   4. Multithreaded processors?
3. Physics/Power Wall:
   1. Can't continue to shrink devices; running into physical limits
   2. Power dissipation is also increasing (more today)
   3. Implications: can't rely on a performance boost from shrinking transistors
   4. But we will continue to get more transistors

Multithreaded Processors
• What support is needed?
• I can use it to help ILP as well
  – Which designs help ILP in the picture to the right?

Power-Efficient Processor Design
Goals:
1. Understand why energy efficiency is important
2. Learn the sources of energy dissipation
3. Overview a selection of approaches to reduce energy

Why Worry About Power?
• Embedded systems:
  – Battery life
• High-end processors:
  – Cooling (costs $1 per chip per watt if operating at >40 W)
  – Power cost: 15 cents per kilowatt-hour (kWh)
    • A single 900 W server costs about 100 USD/month to run, not including cooling costs!
  – Packaging
  – Reliability
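
The 100 USD/month figure can be checked with a few lines of arithmetic. The sketch below assumes a 30-day month and a constant 900 W draw; the function name is illustrative and not part of the lecture.

```python
def monthly_energy_cost_usd(power_watts, usd_per_kwh=0.15, hours_per_month=24 * 30):
    """Cost of running a constant electrical load for one month.

    Assumes a 30-day month and a flat tariff (both are simplifications).
    """
    energy_kwh = power_watts / 1000.0 * hours_per_month
    return energy_kwh * usd_per_kwh


server_cost = monthly_energy_cost_usd(900)
print(f"900 W server: about ${server_cost:.2f} per month")
```

At 15 cents per kWh this comes to roughly 97 USD, which the slide rounds to 100 USD.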

Why Worry About Power: Oak Ridge Lab Jaguar
• Current highest-performance supercomputer
  – 1.3 sustained petaflops (quadrillion FP operations per second)
  – 45,000 processors, each a quad-core AMD Opteron
    • 180,000 cores!
  – 362 terabytes of memory; 10 petabytes of disk space
  – Check top500.org for a list of the most powerful supercomputers
• Power consumption? (without cooling)
  – 7 megawatts!
  – 0.75 million USD/month to power
  – There is a green500.org that rates computers based on flops/watt

Peak Power in Today’s CPUs
• Processors listed: Alpha 21264, AMD Athlon XP, HP PA-8700, IBM Power 4, Intel Itanium, Intel Xeon
• Peak power values listed: 95 W, 67 W, 75 W, 130 W, 59 W
• Even worse when we consider power density (W/cm²)


Where is This Power Coming From?
• Sources of power consumption in CMOS:
  – Dynamic or active power (due to the switching of transistors)
  – Short-circuit power
  – Leakage power
• High temperature increases power consumption
  – Silicon is a bad conductor: higher temperature -> higher leakage current -> even higher temperature…

Power Consumption in CMOS
– Dynamic Power Consumption
  • Charging and discharging capacitors
[Figure: CMOS inverter with supply Vdd and load capacitance C, shown for In = 0, Out = 1 and for In = 1, Out = 0; each case is labeled $E = CV^2$]
$P = E \cdot f = C V^2 f$

Dynamic Power Consumption
$P = \alpha C V^2 f$
• Activity factor ($\alpha$): how often do wires switch
• Capacitance ($C$): function of wire length, transistor size
• Supply voltage ($V$): has been dropping with successive process generations
• Clock frequency ($f$): increasing
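
A minimal sketch of the dynamic power relation on this slide (activity factor times capacitance times supply voltage squared times clock frequency). The numeric values are made up for illustration and do not describe any real chip.

```python
def dynamic_power(activity, capacitance_f, vdd, freq_hz):
    """Dynamic (switching) power in watts: activity * C * V^2 * f."""
    return activity * capacitance_f * vdd ** 2 * freq_hz


# Hypothetical operating point: 20% activity, 1 nF switched capacitance,
# 1.5 V supply, 2 GHz clock.
base = dynamic_power(0.2, 1e-9, 1.5, 2e9)

# Halving only the supply voltage cuts dynamic power to one quarter,
# because voltage enters the formula squared.
half_v = dynamic_power(0.2, 1e-9, 0.75, 2e9)

print(f"base = {base:.3f} W, half voltage = {half_v:.3f} W")
```

The quadratic dependence on voltage is why voltage is the most effective knob in the formula, as the later slides on voltage scaling discuss.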

Power Consumption in CMOS
– Short-circuit power
  • Both PMOS and NMOS are conducting
[Figure: CMOS inverter with supply Vdd, input at 1/2, short-circuit current Isc, and load capacitance C]
About 2% of the overall power.

Power Consumption in CMOS
– Leakage power
  – Transistors are not perfect switches and they leak.
[Figure: CMOS inverter with supply Vdd, In = 0, Out = 1, subthreshold leakage current Isub, and load capacitance C]
20% now, expect 40% in the next technology and growing

Cooling
• All of the consumed power has to be dissipated
• Done by means of heat pipes, heat sinks, fans, etc.
• Different segments use different cooling mechanisms.
• Costs $1-$3 or more per chip per watt if operating at >40 W
• We may soon need budgets for liquid-cooling or refrigeration hardware.

Voltage Scaling
• Transistors switch slower at lower voltage.
• Leakage current grows exponentially with decreases in threshold voltage
• Leakage power goes through the roof

Technology Scaling: the Enabler
• New process generation every 2-3 years
• Ideal shrink for 30% reduction in size:
  – Voltage scales down by 30%
  – Gate delays are shortened by 30%: ~50% frequency gain (500 ps cycle = 2 GHz clock, 333 ps cycle = 3 GHz clock)
  – Transistor density increases by 2X
    • 0.7X shrink on a side, 2X area reduction
  – Capacitance/transistor reduced by 30%

Ideal Process Shrink: the Results
• 2/3 reduction in energy/transition ($CV^2$: $0.7 \times 0.7^2 = 0.34\times$)
• 1/2 reduction in power ($CV^2 f$: $0.7 \times 0.7^2 \times 1.5 = 0.5\times$)
• But twice as many transistors, or more if area increases
• Power density unchanged
Looks good!
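
The ideal-shrink arithmetic can be reproduced directly. This sketch uses the slide's rounded scaling factors (0.7X capacitance and voltage, 1.5X frequency, 2X transistor density); the variable names are illustrative.

```python
SHRINK = 0.7            # linear shrink per process generation

cap_scale = SHRINK      # capacitance per transistor
volt_scale = SHRINK     # supply voltage
freq_scale = 1.5        # slide's rounded figure (1 / 0.7 is about 1.43)
density_scale = 2.0     # transistors per unit area

# Energy per transition scales as C * V^2.
energy_scale = cap_scale * volt_scale ** 2

# Power per transistor scales as C * V^2 * f.
power_per_transistor = energy_scale * freq_scale

# Twice as many transistors in the same area, each using about half the power.
power_density_scale = power_per_transistor * density_scale

print(f"energy/transition: {energy_scale:.2f}x")
print(f"power/transistor:  {power_per_transistor:.2f}x")
print(f"power density:     {power_density_scale:.2f}x")
```

The power density factor lands very close to 1, which is the "unchanged" result on the slide.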

Process Technology: the Reality*
• Performance does not scale with frequency
  – New designs increase frequency by 2X
  – New designs use 2X-3X more transistors to get 1.4X-1.8X performance*
• So, every new process generation:
  – Power goes up by about 2X (3X transistors * 2X switches * 1/3 energy)
  – Leakage power is also increasing
  – Power density goes up 30%~80% (2X power / 1.X area)
• Will get worse in future technologies, because voltage will scale down less
*Source: “Power – the Next Frontier: a Microarchitecture Perspective”, Ronny Ronen, keynote speech at PACS’02 Workshop.

Ugly Numbers*
                 i486 (0.8 µm)   Pentium 4 (0.18 µm)   Factor
Transistors      1.2 M           42 M                  35x
Frequency        50 MHz          2000 MHz              40x
Voltage          5 V             1.65 V                1/3x
Peak Power       5 W             100 W                 20x
Die size         0.73 cm²        2.17 cm²              3x
Power density    6.8 W/cm²       46 W/cm²              7x
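
As a cross-check, the Factor column and the power density row can be recomputed from the raw figures on this slide. The dictionary keys below are illustrative names, not terms from the lecture.

```python
# Figures copied from the "Ugly Numbers" slide.
i486 = {"transistors_m": 1.2, "freq_mhz": 50, "vdd": 5.0,
        "power_w": 5.0, "die_cm2": 0.73}
pentium4 = {"transistors_m": 42, "freq_mhz": 2000, "vdd": 1.65,
            "power_w": 100.0, "die_cm2": 2.17}


def power_density(chip):
    """Peak power divided by die area, in W/cm^2."""
    return chip["power_w"] / chip["die_cm2"]


# Ratio of Pentium 4 to i486 for every raw figure, plus power density.
factors = {key: pentium4[key] / i486[key] for key in i486}
factors["power_density"] = power_density(pentium4) / power_density(i486)

for name, value in factors.items():
    print(f"{name:15s} {value:6.2f}x")
```

Power grew 20X while die area grew only about 3X, so power density grew about 7X.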

The Bottom Line
• Circuits and process scaling alone can no longer solve all power problems
• SYSTEMS must also be power-aware
  – OS
  – Compilers
  – Architecture
• Techniques at the architectural level are needed to reduce the absolute power dissipation as well as the power density

Microarchitectural Techniques for Power Reduction

A Superscalar Datapath
Performance = N*f*IPC
[Figure: superscalar datapath with fetch (F1, F2), decode/dispatch (D1, D2), instruction dispatch, issue queue (IQ), instruction issue, function units (FU1, FU2, ..., FUm), EX, load/store queue (LSQ), D-cache, reorder buffer (ROB), architectural register file (ARF), and result/status forwarding buses]
Actually, it’s the whole system, but we focus on the processor

Microarchitectural Techniques: General Approach
• Dynamic power:
  – Reduce the activity factor
  – Reduce the switching capacitance (usually not possible)
  – Reduce the voltage/frequency (SpeedStep; e.g., a 1.6 GHz Pentium M can be clocked down to 600 MHz, and its voltage can be dropped from 1.48 V to 0.95 V)
• Leakage power:
  – Put some portions of the on-chip storage structures in a low-power stand-by mode, or even completely shut off the power supply to these partitions
  – Resizing
• We usually give up some performance to save energy, but how much?
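
A rough estimate of what the Pentium M operating points above mean for dynamic power. This sketch assumes the activity factor and switched capacitance stay the same at both operating points and ignores leakage, so it is only an approximation.

```python
def dynamic_power_ratio(v_low, v_high, f_low, f_high):
    """Ratio of dynamic power at the low operating point to the high one.

    Assumes power is proportional to V^2 * f with everything else constant.
    """
    return (v_low / v_high) ** 2 * (f_low / f_high)


# Operating points quoted on the slide for the 1.6 GHz Pentium M.
ratio = dynamic_power_ratio(v_low=0.95, v_high=1.48, f_low=600e6, f_high=1.6e9)
freq_ratio = 600e6 / 1.6e9

print(f"clock frequency: {freq_ratio:.3f} of maximum")
print(f"dynamic power:   {ratio:.3f} of maximum")
```

Under these assumptions the clock drops to 37.5% of its maximum while dynamic power drops to roughly 15%, because the voltage reduction is applied on top of the frequency reduction.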

Guideline
• If we reduce voltage, linear drop in maximum frequency (and performance)
• “The cube law”: $P = kV^3$ (~1% V = 3% P)
  – If we use voltage scaling we can approximately trade 1% of performance loss for 3% of power reduction.
• Any architectural technique that trades performance for power should do better than that (or at least as good). Otherwise simple voltage scaling can be used to achieve better tradeoffs.
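
The guideline can be turned into a simple acceptance test for a proposed technique. This is a sketch under the slide's idealized assumptions (frequency and performance scale linearly with voltage, so power scales with the cube of performance); the function names and the example numbers are hypothetical.

```python
def voltage_scaling_power_saving(perf_loss):
    """Power saving that plain voltage scaling buys for a given performance loss.

    Uses the cube law: power is proportional to V^3 and performance to V.
    """
    return 1.0 - (1.0 - perf_loss) ** 3


def beats_voltage_scaling(perf_loss, power_saving):
    """True if a technique is at least as good as simple voltage scaling."""
    return power_saving >= voltage_scaling_power_saving(perf_loss)


print(f"1% slower via voltage scaling saves "
      f"{voltage_scaling_power_saving(0.01) * 100:.2f}% power")

# Hypothetical techniques: (performance loss, power saving).
print(beats_voltage_scaling(0.01, 0.05))   # 5% saving for 1% loss
print(beats_voltage_scaling(0.05, 0.10))   # 10% saving for 5% loss
```

A 1% performance loss buys about 2.97% power with voltage scaling, which is the slide's "1% for 3%" rule of thumb; the second hypothetical technique fails the test because voltage scaling alone would save about 14% at a 5% performance loss.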

Examples: Front-End Throttling
• Speculation is used to increase performance
• Wasted energy if it is wrong
• Can we speculate only when we think we’ll be right?
• Gating: temporarily prevent new instructions from entering the pipeline
• Use gating to avoid speculation beyond the branches with low prediction accuracy
  – The number of unresolved low-confidence branches is used to determine when to gate the pipeline and for how long
  – Reported: 38% energy savings in the wrong-path instructions with about 1% IPC loss
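
The gating decision described above can be sketched as a counter of unresolved low-confidence branches. The class name, method names, and the threshold of 2 are assumptions made for illustration; the slide does not specify them.

```python
class PipelineGate:
    """Stalls fetch while too many low-confidence branches are unresolved."""

    def __init__(self, gate_threshold=2):
        # Hypothetical threshold: gate once this many low-confidence
        # branches are in flight.
        self.gate_threshold = gate_threshold
        self.low_confidence_in_flight = 0

    def on_branch_fetched(self, low_confidence):
        if low_confidence:
            self.low_confidence_in_flight += 1

    def on_branch_resolved(self, was_low_confidence):
        if was_low_confidence and self.low_confidence_in_flight > 0:
            self.low_confidence_in_flight -= 1

    def fetch_allowed(self):
        return self.low_confidence_in_flight < self.gate_threshold


gate = PipelineGate(gate_threshold=2)
gate.on_branch_fetched(low_confidence=True)
print(gate.fetch_allowed())    # one low-confidence branch: keep fetching
gate.on_branch_fetched(low_confidence=True)
print(gate.fetch_allowed())    # two unresolved: fetch is gated
gate.on_branch_resolved(was_low_confidence=True)
print(gate.fetch_allowed())    # one resolved: fetch resumes
```

The gate stays closed only as long as the count is at or above the threshold, which covers both the "when" and the "for how long" in the bullet above.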

Front-End Throttling (continued)
• Just-in-Time Instruction Delivery
  – Fetch stage is throttled based on the number of in-flight instructions.
  – If the number of in-flight instructions exceeds a predetermined threshold, the fetch is throttled
  – Threshold is adjusted through the “tuning cycle”
  – Reasons for energy savings:
    • Fewer instructions are processed along the mispredicted path
    • Instructions spend fewer cycles in the issue queue

Energy Reduction in the Register Files
• General solutions:
  – Use of multi-banked RFs. Each bank has fewer entries and fewer ports than the monolithic RF.
    • Problems:
      – Possible bank conflicts -> IPC loss
      – Overhead of the port arbitration logic
  – Use of smaller cache-like structures to exploit the access locality

Energy Reduction in the Register Files
• Value Aging Buffer
  – At the time of writeback, the results are written into a FIFO-style cache called the VAB
  – The RF is updated only when the values are evicted from the VAB.
  – In many situations, this can be avoided because a register may be deallocated during its residency in the VAB
  – If a register is read from the VAB, there is no need to access the RF.
  – Some performance loss due to the sequential access to the VAB and the RF.
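
A behavioral sketch of the Value Aging Buffer idea, counting how many register file accesses are avoided. The class, its capacity, and its method names are hypothetical; a real VAB is a hardware structure with timing effects that this model ignores.

```python
from collections import OrderedDict


class ValueAgingBuffer:
    """FIFO buffer in front of a register file (RF), modeled behaviorally."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()   # physical register -> value, oldest first
        self.register_file = {}
        self.rf_writes = 0
        self.rf_reads = 0

    def writeback(self, reg, value):
        # The RF is written only when the oldest entry is evicted.
        if len(self.entries) >= self.capacity:
            old_reg, old_value = self.entries.popitem(last=False)
            self.register_file[old_reg] = old_value
            self.rf_writes += 1
        self.entries[reg] = value

    def deallocate(self, reg):
        # A register freed while still in the VAB never reaches the RF.
        self.entries.pop(reg, None)
        self.register_file.pop(reg, None)

    def read(self, reg):
        if reg in self.entries:
            return self.entries[reg]   # VAB hit: no RF access needed
        self.rf_reads += 1
        return self.register_file[reg]


vab = ValueAgingBuffer(capacity=2)
vab.writeback("P1", 10)
vab.writeback("P2", 20)
vab.deallocate("P1")        # P1 dies inside the VAB
vab.writeback("P3", 30)
vab.writeback("P4", 40)     # evicts P2 into the RF
print(vab.read("P3"), vab.read("P2"), vab.rf_writes, vab.rf_reads)
```

In this trace four results are produced but only one RF write occurs, since P1 is deallocated before eviction and P3 and P4 are still resident.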

Out-of-Order Execution and In-Order Retirement
[Figure: pipeline with an in-order front end (F, R, D), an out-of-order core (Inst. Queue, Ex), and in-order retirement (ROB, ARF)]

Register Renaming
• Used to cope with false data dependencies.
• A new physical register is allocated for EVERY new result
• P6 style: ROB slots serve as physical registers
Original code         Renamed code
LOAD R1, R2, 100      LOAD P31, P2, 100
SUB R5, R1, R3        SUB P32, P31, P3
ADD R1, R5, R4        ADD P33, P32, P4

Register Renaming: the Implementation
– Register Alias Table (RAT) maintains the mappings between logical and physical registers
Arch. Reg   Phys. Reg.   Location (0 - ROB, 1 - ARF)
0           0            1
1           1            1
2           2            1
3           3            1
4           4            1
5           5            1
Original code
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4

Register Renaming: the Implementation
– Register Alias Table (RAT) maintains the mappings between logical and physical registers
Arch. Reg   Phys. Reg.   Location (0 - ROB, 1 - ARF)
0           0            1
1           31           0
2           2            1
3           3            1
4           4            1
5           5            1
Original code         Renamed code
LOAD R1, R2, 100      LOAD P31, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4

Register Renaming: the Implementation
– Rename Table (RT) is used to maintain the mappings between logical and physical registers
Arch. Reg   Phys. Reg.   Location (0 - ROB, 1 - ARF)
0           0            1
1           31           0
2           2            1
3           3            1
4           4            1
5           32           0
Original code         Renamed code
LOAD R1, R2, 100      LOAD P31, R2, 100
SUB R5, R1, R3        SUB P32, P31, R3
ADD R1, R5, R4

Register Renaming: the Implementation
– Rename Table (RT) is used to maintain the mappings between logical and physical registers
Arch. Reg   Phys. Reg.   Location (0 - ROB, 1 - ARF)
0           0            1
1           33           0
2           2            1
3           3            1
4           4            1
5           32           0
Original code         Renamed code
LOAD R1, R2, 100      LOAD P31, R2, 100
SUB R5, R1, R3        SUB P32, P31, R3
ADD R1, R5, R4        ADD P33, P32, R4
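
The renaming walk-through on these slides can be reproduced with a small software model of the rename table. This is a sketch only: the tuple format, the function name, and starting the physical register numbering at 31 (to match the slides) are choices made for illustration.

```python
def rename(instructions, first_free_phys=31):
    """Rename destination registers in program order.

    Each instruction is (opcode, dest, src1, src2). A source with no entry in
    the table still lives in the architectural register file, so it keeps its
    architectural name, as in the slides.
    """
    rename_table = {}            # architectural register -> physical register
    next_phys = first_free_phys
    renamed = []
    for op, dest, *srcs in instructions:
        # Sources are looked up BEFORE the destination mapping is updated.
        new_srcs = [rename_table.get(src, src) for src in srcs]
        phys = f"P{next_phys}"
        next_phys += 1
        rename_table[dest] = phys
        renamed.append((op, phys, *new_srcs))
    return renamed, rename_table


program = [
    ("LOAD", "R1", "R2", 100),
    ("SUB", "R5", "R1", "R3"),
    ("ADD", "R1", "R5", "R4"),
]
renamed, table = rename(program)
for instr in renamed:
    print(instr)
print(table)
```

The final table maps R1 to P33 and R5 to P32, matching the last step of the walk-through; the earlier mapping of R1 to P31 has been overwritten by the ADD.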

Short-Lived Values
• Definition: a value is short-lived if the destination register is renamed by the time of the result generation.
• Identified one cycle before the result writeback
• A large percentage of all generated results are short-lived for SPEC 2000 benchmarks.
Original code         Renamed code (after RENAMER)
LOAD R1, R2, 100      LOAD P31, R2, 100
SUB R5, R1, R3        SUB P32, P31, R3
ADD R1, R5, R4        ADD P33, P32, R4

Why Keep Them?
• Reasons for maintaining short-lived values:
  – Recovering from branch mispredictions
  – Reconstructing precise state if interrupts or exceptions occur
Original code         Renamed code
LOAD R1, R2, 100      LOAD P31, R2, 100
SUB R5, R1, R3        SUB P32, P31, R3
ADD R1, R5, R4        ADD P33, P32, R4

Energy-dissipating Events
[Figure: pipeline with an in-order front end (F, R, D), an out-of-order core (Inst. Queue, Ex), and in-order retirement (ROB, ARF), with the Write and Read accesses marked]

Isolating Short-Lived Values: the Idea
Write short-lived values into a small dedicated RF (SRF)
[Figure: pipeline with an in-order front end (F, R, D), an out-of-order core (Inst. Queue, Ex), the SRF alongside the ROB, and in-order retirement (ARF), with the Write and Read accesses marked]
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4

Energy Reduction in Caches
• Dynamically resizable caches
  – Dynamically estimates the program requirements and adapts to the required cache size
  – Cache is upsized or downsized at the end of periodic intervals based on the value of the cache miss counter
  – Downsizing puts the higher-numbered sets into a low-leakage mode using sleep transistors
  – A bit mask is used to specify the number of address bits that are used for indexing into the set
  – The cache size always changes by a factor of two
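
The index masking and the factor-of-two resizing policy can be sketched as follows. The line size, thresholds, and function names are hypothetical; the slide gives only the general mechanism.

```python
def set_index(address, active_sets, line_bytes=32):
    """Set index using only as many address bits as there are active sets.

    active_sets must be a power of two, so (active_sets - 1) is the bit mask.
    """
    mask = active_sets - 1
    return (address // line_bytes) & mask


def resize(active_sets, misses, upsize_threshold, downsize_threshold,
           min_sets, max_sets):
    """Decide the cache size for the next interval from the miss counter."""
    if misses > upsize_threshold and active_sets < max_sets:
        return active_sets * 2      # wake up the higher-numbered sets
    if misses < downsize_threshold and active_sets > min_sets:
        return active_sets // 2     # put the higher-numbered sets to sleep
    return active_sets


print(set_index(0x1234, active_sets=256))
print(set_index(0x1234, active_sets=128))   # one fewer index bit after downsizing
print(resize(256, misses=10, upsize_threshold=100, downsize_threshold=20,
             min_sets=64, max_sets=512))
```

After downsizing, an address that previously mapped to a now-sleeping set maps into the active lower half, because the mask drops the top index bit.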

Energy Reduction within the Execution Units
• Gating off portions of the execution units
  – Disables the upper bits of the ALUs where they are not needed (for small operands)
  – Energy can be reduced by 54% for integer programs
• Packaging multiple narrow-width operations in a single ALU in the same cycle
• Steering instructions to FUs based on the criticality information
  – Critical instructions are steered to fast and power-hungry execution units, non-critical instructions are steered to slow and power-efficient units

Encoding Addresses for Low Power
• Using Gray code for the addresses to reduce switching activity on the address buses (Su et al., IEEE Design and Test, 1994)
  – Exploits the observation that programs often generate consecutive addresses
  – Gray code: there is only a single transition on the address bus when consecutive addresses are accessed
  – 37% reduction in the switching activity is reported
  – A Gray code encoder is placed at the transmitting end of the bus, and a decoder is needed at the receiving end
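
A small sketch of the encoder and decoder, with a count of bus transitions for a run of consecutive addresses. The helper names are illustrative, and the 16-address sequence is a toy workload, not the benchmark behind the 37% figure.

```python
def to_gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def from_gray(g):
    """Inverse of to_gray: XOR together all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def transitions(sequence):
    """Total number of bit flips between successive values on the bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(sequence, sequence[1:]))


addresses = list(range(16))                  # purely sequential accesses
binary_toggles = transitions(addresses)
gray_toggles = transitions([to_gray(a) for a in addresses])

print(f"binary: {binary_toggles} transitions, Gray: {gray_toggles} transitions")
```

For purely sequential addresses every Gray-coded step flips exactly one bus line; real address streams include jumps, so the measured saving is smaller than in this toy case.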

Encoding Data for Low Power
• Bus-invert encoding
  – Uses redundancy to reduce the number of transitions
  – Adds one line to the bus to indicate if the actual data or its complement is transmitted
  – If the Hamming distance between the current value and the previous one is less than or equal to n/2 (for n bits), the value is transmitted as such and the value of 0 is transmitted on the extra line.
  – Otherwise, the complement of the value is transmitted and the extra line is set to 1
  – The average number of bus transitions per clock cycle is lowered by 25% as a result
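
The encoding rule above can be sketched in a few lines. One assumption is made explicit here: the "previous value" is taken to be the word last driven on the bus (which may itself have been inverted), and the bus is assumed to start at 0. Function names and the 8-bit width are illustrative.

```python
def bus_invert_encode(words, width=8):
    """Return a list of (bus_value, invert_line) pairs for the given words."""
    mask = (1 << width) - 1
    prev_bus = 0                       # assumed initial bus state
    encoded = []
    for word in words:
        distance = bin((word ^ prev_bus) & mask).count("1")
        if distance <= width // 2:
            bus, invert = word & mask, 0       # send the value as is
        else:
            bus, invert = ~word & mask, 1      # send the complement
        encoded.append((bus, invert))
        prev_bus = bus
    return encoded


def bus_invert_decode(encoded, width=8):
    """Recover the original words at the receiving end."""
    mask = (1 << width) - 1
    return [(~bus & mask) if invert else bus for bus, invert in encoded]


words = [0x00, 0xFF, 0x0F, 0xF0]
encoded = bus_invert_encode(words)
print(encoded)
print(bus_invert_decode(encoded) == words)
```

With this rule no transfer flips more than half of the data lines, at the cost of one extra line that may itself toggle.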

OS and Compiler Techniques
• Can the compiler help?
• Can the OS help?
  – E.g., control voltage scaling
  – Control turning off devices