ELEC 5270/6270 Spring 2009
Low-Power Design of Electronic Circuits
Power Aware Microprocessors

Vishwani D. Agrawal
James J. Danaher Professor
Dept. of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849
vagrawal@eng.auburn.edu
http://www.eng.auburn.edu/~vagrawal/COURSE/E6270_Spr09/course.html

Copyright Agrawal, 2007. ELEC 6270 Spring 09, Lecture 12.

SIA Roadmap for Processors (1999)

Year                     1999   2002   2005   2008   2011   2014
Feature size (nm)        180    130    100    70     50     35
Logic transistors/cm²    6.2M   18M    39M    84M    180M   390M
Clock (GHz)              1.25   2.1    3.5    6.0    10.0   16.9
Chip size (mm²)          340    430    520    620    750    900
Power supply (V)         1.8    1.5    1.2    0.9    0.6    0.5

High-perf. power (W): 90, 130, 160, 175, 183

Source: http://www.semichips.org

Power Reduction in Processors

• Just about everything is used.
• Hardware methods:
  • Voltage reduction for dynamic power
  • Dual-threshold devices for leakage reduction
  • Clock gating, frequency reduction
  • Sleep mode
• Architecture:
  • Instruction set
  • Hardware organization
• Software methods

SPEC CPU 2000 Benchmarks

• Twelve integer and 14 floating point programs, CINT2000 and CFP2000.
• Each program's run time is normalized to obtain a SPEC ratio with respect to the run time on a Sun Ultra 5_10 with a 300 MHz processor.
• CINT2000 and CFP2000 summary measurements are the geometric means of the SPEC ratios.
• LINPACK is a numerically intensive floating point linear system (Ax = b) program used for benchmarking supercomputers.
• SPECpower_ssj2008 measures power and performance of a computer system.
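The normalization and the geometric-mean summary can be sketched in a few lines of Python. This is only an illustration: the program names and run times are hypothetical, and the scale factor of 100 (so that the reference machine scores 100) is an assumption about the reporting convention, not something stated on the slide.

```python
import math

def spec_ratio(reference_time_s, measured_time_s, scale=100.0):
    """Ratio of reference run time to measured run time.
    The scale of 100 is an assumed reporting convention."""
    return scale * reference_time_s / measured_time_s

def geometric_mean(values):
    """n-th root of the product of n positive values (computed via logs)."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical (reference time, measured time) pairs in seconds.
runs = {
    "prog_a": (1400.0, 140.0),
    "prog_b": (1800.0, 90.0),
    "prog_c": (1000.0, 250.0),
}

ratios = {name: spec_ratio(ref, t) for name, (ref, t) in runs.items()}
summary = geometric_mean(ratios.values())

for name, ratio in ratios.items():
    print(f"{name}: SPEC ratio = {ratio:.0f}")
print(f"summary (geometric mean) = {summary:.1f}")
```

The geometric mean is used rather than the arithmetic mean so that the relative ranking of two machines does not depend on which machine is chosen as the reference.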

Reference CPU s: Sun Ultra 5_10 300 MHz Processor

CINT2000: 3.4 GHz Pentium 4, HT Technology (D850MD Motherboard)

SPECint2000_base = 1341
SPECint2000 = 1389

Source: www.spec.org

Two Benchmark Results

• Baseline: A uniform configuration not optimized for a specific program:
  • Same compiler with same settings and flags used for all benchmarks
  • Other restrictions
• Peak: Run is optimized for obtaining the peak performance for each benchmark program.

CFP2000: 3.6 GHz Pentium 4, HT Technology (D925XCV/AA-400 Motherboard)

SPECfp2000_base = 1627
SPECfp2000 = 1630

Source: www.spec.org

CINT2000: 1.7 GHz Pentium 4 (D850MD Motherboard)

SPECint2000_base = 579
SPECint2000 = 588

Source: www.spec.org

CFP2000: 1.7 GHz Pentium 4 (D850MD Motherboard)

SPECfp2000_base = 648
SPECfp2000 = 659

Source: www.spec.org

Energy SPEC Benchmarks

• Energy efficiency mode: Besides the execution time, energy efficiency of SPEC benchmark programs is also measured. Energy efficiency of a benchmark program is given by:

$$\text{Energy efficiency} = \frac{1/(\text{Execution time})}{\text{joules consumed}}$$

Energy Efficiency

• Efficiency averaged on n benchmark programs:

$$\text{Efficiency} = \left( \prod_{i=1}^{n} \text{Efficiency}_i \right)^{1/n}$$

  where $\text{Efficiency}_i$ is the efficiency for program i.

• Relative efficiency:

$$\text{Relative efficiency} = \frac{\text{Efficiency of a computer}}{\text{Eff. of reference computer}}$$
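The three efficiency definitions chain together as follows. This is a minimal sketch with hypothetical measurements (execution time in seconds, energy in joules); the function names are illustrative and do not come from any SPEC tool.

```python
import math

def energy_efficiency(execution_time_s, energy_j):
    """Per-program efficiency: (1 / execution time) / joules consumed."""
    return (1.0 / execution_time_s) / energy_j

def mean_efficiency(runs):
    """Geometric mean of per-program efficiencies over n programs."""
    effs = [energy_efficiency(t, e) for t, e in runs]
    return math.prod(effs) ** (1.0 / len(effs))

def relative_efficiency(machine_runs, reference_runs):
    """Efficiency of a computer divided by that of the reference computer."""
    return mean_efficiency(machine_runs) / mean_efficiency(reference_runs)

# Hypothetical (execution time in s, energy in J) for two benchmark programs.
machine_runs = [(10.0, 200.0), (20.0, 400.0)]
reference_runs = [(40.0, 500.0), (50.0, 1000.0)]

rel = relative_efficiency(machine_runs, reference_runs)
print(f"relative efficiency = {rel:.2f}")
```

Because efficiency divides by both time and energy, a machine that is twice as fast at the same power gains a factor of four: it halves the execution time and also halves the joules consumed.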

SPEC 2000 Relative Energy Efficiency

[Figure: relative energy efficiency for three configurations: always max. clock; laptop, adaptive clk.; min. power, min. clock]

Voltage Scaling

• Dynamic: Reduce voltage and frequency during idle or low activity periods.
• Static: Clustered voltage scaling
  • Logic on non-critical paths given lower voltage.
  • 47% power reduction with 10% area increase reported.
  • M. Igarashi et al., "Clustered Voltage Scaling Techniques for Low-Power Design," Proc. IEEE Symp. Low Power Design, 1997.

Processor Utilization

Throughput = Operations / second

[Figure: throughput versus time, showing compute-intensive processes at maximum throughput, low throughput (background) processes, and system idle]

Examples of Processes

• Compute-intensive: spreadsheet, spelling check, video decoding, scientific computing.
• Low throughput: data entry, screen updates, low bandwidth I/O data transfer.
• Idle: no computation, no expected output.

Effects of Voltage Reduction

• Voltage reduction increases delay, decreases throughput:
  • Slow reduction in throughput at first
  • Rapid reduction in throughput for VDD ≤ Vth
  • Time per operation (TPO) increases
• Voltage reduction continues to reduce power consumption:
  • Energy per operation (EPO) = Power × TPO

Energy per Operation (EPO)

[Figure: EPO, Power, and TPO curves, vertical axis from 0.0 to 1.0, versus VDD / Vth from 1 to 5]

Dynamic Voltage and Clock

                         Time spent in:
Throughput mode          Fast    Slow    Idle    Battery life
Always full speed        10%     0%      90%     1 hr
Sometimes full speed     1%      90%     9%      5.3 hrs
Rarely full speed        0.1%    99%     0.9%    9.2 hrs

T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002, pp. 35-36.

Example: Find Minimum Energy Mode

• Processor data (rated operation):
  • 2 GHz clock
  • 1.5 volt supply voltage
  • 0.5 volt threshold voltage
• Power consumption
  • 50 watts dynamic power
  • 50 watts static power
• Maximum clock frequency for V volt supply: $f \propto (V - V_{TH})/V$

Example Cont.

• Dynamic power:
  $P_d = C V^2 f = C (1.5)^2 \times 2 \times 10^9 = 50$ W
  $C = 11.11$ nF, capacitance switching/cycle
  $P_d = 11.11\, V^2 f$ (in watts, with f in GHz)
• Dynamic energy per cycle:
  $E_d = P_d / f = 11.11\, V^2$ (in nJ)

Example Cont.

• Clock frequency:
  $f = k (V - V_{TH})/V = k (1.5 - 0.5)/1.5 = 2$ GHz
  $k = 3$ GHz, a proportionality constant
  $f = 3 (V - 0.5)/V$ GHz

Example Cont.

• Static power:
  $P_s = k' V^2 = k' (1.5)^2 = 50$ W
  $k' = 22.22$ mho, total leakage conductance
  $P_s = 22.22\, V^2$ (in watts)
• Static energy per cycle:
  $E_s = P_s / f = 22.22\, V^3 / [3 (V - 0.5)] = 7.41\, V^3 / (V - 0.5)$ (in nJ)

Example Cont.

• Total energy per cycle:
  $E = E_d + E_s = 11.11\, V^2 + 7.41\, V^3 / (V - 0.5)$
• To minimize E, set $\partial E / \partial V = 0$, which gives
  $5 V^2 - 4.5 V + 0.75 = 0$
• Solutions of the quadratic equation: V = 0.679 volt, 0.221 volt
• Discard the second solution, which is lower than the threshold voltage of 0.5 volt.
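The step from the energy expression to the quadratic is worth writing out. The derivation below uses only the constants already defined in the example (11.11 nF, 7.41, and the 0.5 V threshold), with 22.22/7.41 rounded to 3.

```latex
\begin{aligned}
E(V) &= 11.11\,V^2 + \frac{7.41\,V^3}{V - 0.5} \\[4pt]
\frac{\partial E}{\partial V}
  &= 22.22\,V + 7.41\,\frac{3V^2 (V - 0.5) - V^3}{(V - 0.5)^2}
   = 22.22\,V + 7.41\,\frac{2V^3 - 1.5\,V^2}{(V - 0.5)^2} = 0 \\[4pt]
&\Rightarrow\; 22.22\,(V - 0.5)^2 + 7.41\,(2V^2 - 1.5\,V) = 0
  \qquad \text{(divide by } V \text{, multiply by } (V - 0.5)^2\text{)} \\[4pt]
&\Rightarrow\; 3\,(V^2 - V + 0.25) + 2V^2 - 1.5\,V = 0
  \qquad \text{(divide by } 7.41\text{)} \\[4pt]
&\Rightarrow\; 5V^2 - 4.5\,V + 0.75 = 0 \\[4pt]
V &= \frac{4.5 \pm \sqrt{4.5^2 - 4 \cdot 5 \cdot 0.75}}{2 \cdot 5}
   = \frac{4.5 \pm 2.291}{10} \approx 0.679,\; 0.221
\end{aligned}
```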

Example: Result

                         Rated mode   Low energy mode   Reduction (%)
Voltage                  1.5 V        0.679 V           54.7%
Clock frequency          2 GHz        791 MHz           60%
Dynamic energy/cycle     25.00 nJ     5.12 nJ           79.52%
Static energy/cycle      25.00 nJ     12.96 nJ          48.16%
Total energy/cycle       50.0 nJ      18.08 nJ          63.84%
Dynamic power            50.0 W       4.05 W            91.90%
Static power             50.0 W       10.25 W           79.50%
Total power              100.0 W      14.30 W           85.70%
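The numbers in this example can be recomputed with a short script. The sketch below follows the model used in the example (dynamic power C V² f, static power k' V², frequency k (V − V_TH)/V); the variable and function names are illustrative choices. Units are chosen so that watts divided by GHz gives nanojoules.

```python
import math

# Rated operating point, taken from the example.
V_RATED, V_TH, F_RATED_GHZ = 1.5, 0.5, 2.0
P_DYN_RATED_W = 50.0
P_STAT_RATED_W = 50.0

# Model constants derived from the rated point.
C_NF = P_DYN_RATED_W / (V_RATED**2 * F_RATED_GHZ)   # about 11.11 nF
K_GHZ = F_RATED_GHZ * V_RATED / (V_RATED - V_TH)    # 3 GHz
K_LEAK = P_STAT_RATED_W / V_RATED**2                # about 22.22 mho

def freq_ghz(v):
    return K_GHZ * (v - V_TH) / v

def dyn_energy_nj(v):
    return C_NF * v**2

def stat_energy_nj(v):
    return K_LEAK * v**2 / freq_ghz(v)

def total_energy_nj(v):
    return dyn_energy_nj(v) + stat_energy_nj(v)

# Roots of 5V^2 - 4.5V + 0.75 = 0; keep the one above the threshold voltage.
disc = math.sqrt(4.5**2 - 4 * 5 * 0.75)
roots = [(4.5 + disc) / 10, (4.5 - disc) / 10]
v_min = next(v for v in roots if v > V_TH)

f_min = freq_ghz(v_min)
p_dyn = dyn_energy_nj(v_min) * f_min     # W
p_stat = stat_energy_nj(v_min) * f_min   # W

print(f"V = {v_min:.3f} V, f = {1000 * f_min:.0f} MHz")
print(f"energy/cycle: dynamic {dyn_energy_nj(v_min):.2f} nJ, "
      f"static {stat_energy_nj(v_min):.2f} nJ, "
      f"total {total_energy_nj(v_min):.2f} nJ")
print(f"power: dynamic {p_dyn:.2f} W, static {p_stat:.2f} W, "
      f"total {p_dyn + p_stat:.2f} W")
```

Note that the minimum-energy point is not the minimum-power point: power keeps falling as V approaches the threshold, but each cycle takes so long that static energy per cycle grows without bound.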

Problem of Process Variation in Nanometer Technologies

[Figure: number of chips versus Vth, from lower Vth through nominal Vth to higher Vth. Labels: power specification, clock specification, lower voltage operation, higher voltage operation, nominal voltage, yield loss due to high leakage (lower Vth), yield loss due to slow speed (higher Vth)]

From a presentation: Power Reduction using LongRun2 in Transmeta's Efficeon Processor, by D. Ditzel, May 17, 2006.

Pipeline Gating

• A pipeline processor uses speculative execution.
  • Incorrect branch prediction results in pipeline stalls and wasted energy.
• Idea: Stop fetching instructions if a branch hazard is expected:
  • If the count (M) of incorrect predictions exceeds a prespecified number (N), then suspend fetching instructions for some k cycles.
• Ref.: S. Manne, A. Klauser and D. Grunwald, "Pipeline Gating: Speculation Control for Energy Reduction," Proc. 25th Annual International Symp. Computer Architecture, June 1998.
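The gating rule can be illustrated with a small cycle-by-cycle model. This is a toy sketch, not the mechanism from the cited paper: the class name, the choice to reset the counter M when gating begins, and the one-event-per-cycle simulation are all assumptions made for illustration.

```python
class PipelineGate:
    """Suspends instruction fetch for k cycles once the misprediction
    count M exceeds the threshold N."""

    def __init__(self, threshold_n, stall_cycles_k):
        self.threshold_n = threshold_n
        self.stall_cycles_k = stall_cycles_k
        self.mispredict_count_m = 0
        self.stall_remaining = 0

    def record_misprediction(self):
        self.mispredict_count_m += 1
        if self.mispredict_count_m > self.threshold_n:
            self.stall_remaining = self.stall_cycles_k
            # Assumption: the counter restarts once gating is triggered.
            self.mispredict_count_m = 0

    def fetch_allowed(self):
        """Call once per cycle; returns True if fetch may proceed."""
        if self.stall_remaining > 0:
            self.stall_remaining -= 1
            return False
        return True


def simulate(mispredict_flags, threshold_n, stall_cycles_k):
    """One entry per cycle: True means a misprediction is detected that cycle.
    Returns, per cycle, whether instruction fetch was allowed."""
    gate = PipelineGate(threshold_n, stall_cycles_k)
    fetched = []
    for mispredicted in mispredict_flags:
        fetched.append(gate.fetch_allowed())
        if mispredicted:
            gate.record_misprediction()
    return fetched


# Three mispredictions in a row with N = 2 and k = 2.
trace = simulate([True, True, True, False, False, False],
                 threshold_n=2, stall_cycles_k=2)
print(trace)
```

In this run the third misprediction pushes M above N, so fetch is suspended for the next two cycles and resumes on the sixth.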

Slack Scheduling

• Application: Superscalar, out-of-order execution:
  • An instruction is executed as soon as the required data and resources become available.
  • A commit unit reorders the results.
• Delay the completion of instructions whose result is not immediately needed.
• Example of RISC instructions:
    add r0, r1, r2;     (A)
    sub r3, r4, r5;     (B)
    and r9, x1, r9;     (C)
    or  r5, r9, r10;    (D)
    xor r2, r10, r11;   (E)
• J. Casmira and D. Grunwald, "Dynamic Instruction Scheduling Slack," Proc. ACM Kool Chips Workshop, Dec. 2000.
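One way to see which of the five instructions has slack is to compare earliest and latest start cycles. The sketch below rests on several assumptions that the slide does not state: the last operand is the destination register, every instruction takes one cycle, execution units are unlimited, and only true (read-after-write) dependences matter.

```python
# (label, source registers, destination register).
# Destination-last operand order is an assumption.
program = [
    ("A", ("r0", "r1"), "r2"),
    ("B", ("r3", "r4"), "r5"),
    ("C", ("r9", "x1"), "r9"),
    ("D", ("r5", "r9"), "r10"),
    ("E", ("r2", "r10"), "r11"),
]

def dependencies(program):
    """Map each instruction to the earlier instructions whose results it reads."""
    last_writer = {}
    deps = {}
    for label, sources, dest in program:
        deps[label] = {last_writer[r] for r in sources if r in last_writer}
        last_writer[dest] = label
    return deps

def slack(program):
    """Slack = latest start cycle minus earliest start cycle (unit latency)."""
    deps = dependencies(program)
    labels = [label for label, _, _ in program]

    # Earliest start: one cycle after the latest producer.
    asap = {}
    for label in labels:
        asap[label] = max((asap[d] + 1 for d in deps[label]), default=0)

    # Latest start: one cycle before the earliest consumer, without
    # lengthening the overall schedule.
    last_cycle = max(asap.values())
    alap = {label: last_cycle for label in labels}
    for label in reversed(labels):
        for d in deps[label]:
            alap[d] = min(alap[d], alap[label] - 1)

    return {label: alap[label] - asap[label] for label in labels}

result = slack(program)
print(dependencies(program))
print(result)
```

Under these assumptions D waits for B and C, and E waits for A and D, so only A can be delayed by a cycle without lengthening the schedule. That is the kind of instruction a slack scheduler could steer to a slower, lower-power execution unit.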

Slack Scheduling Example

[Figure: timing of instructions A, B, C, D, E under standard scheduling and under slack scheduling]

Slack Scheduling

[Figure: block diagram with scheduling logic, re-order buffer, low-power execution units, and slack bit]

Clock Distribution

[Figure: clock distribution network]

Clock Power

$$P_{clk} = C_L V_{DD}^2 f + \frac{C_L V_{DD}^2 f}{\lambda} + \frac{C_L V_{DD}^2 f}{\lambda^2} + \cdots = C_L V_{DD}^2 f \sum_{n=0}^{\text{stages}-1} \frac{1}{\lambda^n}$$

where
  $C_L$ = total load capacitance
  $\lambda$ = constant fanout at each stage in distribution network

Clock consumes about 40% of total processor power.
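The geometric series in the clock power formula is easy to evaluate numerically. The sketch below uses made-up round values (1 nF load, 1 V supply, 1 GHz clock, fanout of 4, three stages) purely for illustration; none of them describe a real processor.

```python
def clock_power_w(load_capacitance_f, vdd_v, freq_hz, fanout, stages):
    """P_clk = C_L * VDD^2 * f * sum_{n=0}^{stages-1} 1 / fanout^n."""
    series = sum(1.0 / fanout**n for n in range(stages))
    return load_capacitance_f * vdd_v**2 * freq_hz * series

# Hypothetical values: 1 nF, 1 V, 1 GHz, fanout 4, 3 stages.
load_only = clock_power_w(1e-9, 1.0, 1e9, fanout=4, stages=1)
three_stage = clock_power_w(1e-9, 1.0, 1e9, fanout=4, stages=3)

print(f"final-stage load only: {load_only:.4f} W")
print(f"three-stage network:   {three_stage:.4f} W")
print(f"overhead of the distribution tree: {three_stage / load_only - 1:.1%}")
```

With many stages the sum approaches λ/(λ − 1), so for a fanout of 4 the buffers of the distribution tree add at most about one third to the power of the final load.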

Clock Network Examples

Processors: Alpha 21064, Alpha 21164, Alpha 21264
Technology: 0.75μ CMOS, 0.35μ CMOS
Frequency (MHz): 200, 300, 600
Total capacitance: 12.5 nF
Clock load: 3.25 nF, 3.75 nF
Clock power: 40% (20 W)
Max. clock skew: 200 ps (<10%), 90 ps
Clock gating used.
Total power: 80, 110 W

D. W. Bailey and B. J. Benschneider, "Clocking Design and Analysis for a 600-MHz Alpha Microprocessor," IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1627-1633, Nov. 1998.

Power Reduction Example

• Alpha 21064: 200 MHz @ 3.45 V, power dissipation = 26 W
• Reduce voltage to 1.5 V, power (5.3x) = 4.9 W
• Eliminate FP, power (3x) = 1.6 W
• Scale 0.75 → 0.35μ, power (2x) = 0.8 W
• Reduce clock load, power (1.3x) = 0.6 W
• Reduce frequency 200 → 160 MHz, power (1.25x) = 0.5 W
• J. Montanaro et al., "A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor," IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1703-1714, Nov. 1996.
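The chain of reductions can be checked by dividing the starting power by each factor in turn. The short script below uses only the figures listed above; the two sanity checks at the end assume that dynamic power scales as V² and linearly with f.

```python
start_power_w = 26.0  # Alpha 21064 at 200 MHz, 3.45 V

# (step, reduction factor) as listed in the example.
steps = [
    ("Reduce voltage to 1.5 V", 5.3),
    ("Eliminate FP", 3.0),
    ("Scale 0.75 -> 0.35 micron", 2.0),
    ("Reduce clock load", 1.3),
    ("Reduce frequency 200 -> 160 MHz", 1.25),
]

power_w = start_power_w
trace = []
for name, factor in steps:
    power_w /= factor
    trace.append(round(power_w, 1))
    print(f"{name:34s} /{factor:<5} -> {power_w:.2f} W")

# Sanity checks on two of the factors, assuming P scales with V^2 and f.
voltage_factor = (3.45 / 1.5) ** 2
frequency_factor = 200 / 160
print(f"(3.45/1.5)^2 = {voltage_factor:.2f}, 200/160 = {frequency_factor:.2f}")
```

The overall reduction is the product of the factors, about 52x, and the voltage step alone accounts for roughly a tenth of that product.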

For More on Microprocessors

• T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002.
• R. Graybill and R. Melhem, Power Aware Computing, New York: Plenum Publishers, 2002.