ELEC 5270/6270 Spring 2015: Low-Power Design of Electronic Circuits


ELEC 5270/6270 Spring 2015
Low-Power Design of Electronic Circuits
Power Aware Microprocessors

Vishwani D. Agrawal
James J. Danaher Professor
Dept. of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849
vagrawal@eng.auburn.edu
http://www.eng.auburn.edu/~vagrawal/COURSE/E6270_Spr15/course.html

Copyright Agrawal, 2007. ELEC 5270/6270 Spr 15, Lecture 8

SIA Roadmap for Processors (1999)

Year                      1999   2002   2005   2008   2011   2014
Feature size (nm)          180    130    100     70     50     35
Logic transistors/cm²     6.2M    18M    39M    84M   180M   390M
Clock (GHz)               1.25    2.1    3.5    6.0   10.0   16.9
Chip size (mm²)            340    430    520    620    750    900
Power supply (V)           1.8    1.5    1.2    0.9    0.6    0.5

High-perf. power (W): 90, 130, 160, 175, 183

Untrue predictions.
Source: http://www.semichips.org

Power Reduction in Processors
• Hardware methods:
  • Voltage reduction for dynamic power
  • Dual-threshold devices for leakage reduction
  • Clock gating, frequency reduction
  • Sleep mode
• Architecture:
  • Instruction set
  • Hardware organization
• Software methods

Performance Criteria
• Throughput: computations per unit time.
• Performance is the inverse of time: increasing CPU time indicates lower performance.
• Power: computations per watt.
• Energy efficiency: performance/joule.

SPEC CPU2006 Benchmarks
• Standard Performance Evaluation Corporation (SPEC), http://www.spec.org
• Twelve integer and 17 floating point programs, CINT2006 and CFP2006.
• Each program's run time is normalized to obtain a SPEC ratio with respect to the run time on a Sun Ultra Enterprise 2 system with a 296 MHz UltraSPARC II processor.
• It takes about 12 days to run all benchmarks on the reference system.
• CINT2006 and CFP2006 metrics are the geometric means of SPEC ratios:
  • Peak metric: each program is individually optimized (aggressive compilation).
  • Base metric: common optimization for all programs.

SPEC CINT2006 Results
• http://www.spec.org/cpu2006/results/cint2006.html
• Dell Inc., PowerEdge R610
  • CPU: Intel Xeon X5670, 2.93 GHz
  • Number of chips 2, cores 12, threads/core 2
  • Performance metric 36.6 base, 39.4 peak
• Dell Inc., PowerEdge M905
  • CPU: AMD Opteron 8381 HE, 2.50 GHz
  • Number of chips 4, cores 16, threads/core 1
  • Performance metric 15.8 base, 19.1 peak

SPEC CFP2006 Results
• http://www.spec.org/cpu2006/results/cfp2006.html
• Dell Inc., PowerEdge R610
  • CPU: Intel Xeon X5670, 2.93 GHz
  • Number of chips 2, cores 12, threads/core 2
  • Performance metric 42.5 base, 45.8 peak
• Dell Inc., PowerEdge M905
  • CPU: AMD Opteron 8381 HE, 2.50 GHz
  • Number of chips 4, cores 16, threads/core 1
  • Performance metric 17.4 base, 21.5 peak

Other Benchmarks
• LINPACK is a numerically intensive floating point linear system (Ax = b) program used for benchmarking supercomputers.
• SPECpower_ssj2008 measures the power and performance of a computer system.
  • The initial benchmark addresses the performance of server-side Java; additional workloads are planned.
  • http://www.spec.org/benchmarks.html#power

Second Quarter 2010 SPECpower_ssj2008 Results
• http://www.spec.org/power_ssj2008/results/res2010q2/
• Apr 7, 2010: Hewlett-Packard ProLiant DL385 G7
  • CPU: AMD Opteron 6174, 2.2 GHz
  • Number of chips 2, cores 12, threads/core 2
  • Total memory 16 GB
  • ssj operations @ 100%: 888,819
  • Average power @ 100%: 271 W
  • Average power @ active idle: 101 W
  • Overall ssj operations per watt: 2,355

Second Quarter 2010 SPECpower_ssj2008 Results
• http://www.spec.org/power_ssj2008/results/res2010q2/
• May 19, 2010: Dell Inc., PowerEdge R610
  • CPU: Intel Xeon X5670, 2.93 GHz
  • Number of chips 2, cores 12, threads/core 2
  • Total memory 12 GB
  • ssj operations @ 100%: 914,076
  • Average power @ 100%: 244 W
  • Average power @ active idle: 62.3 W
  • Overall ssj operations per watt: 2,938

Energy SPEC Benchmarks
• Energy efficiency mode: besides the execution time, the energy efficiency of SPEC benchmark programs is also measured. The energy efficiency of a benchmark program is given by:

  $\text{Energy efficiency} = \dfrac{1/(\text{Execution time})}{\text{Average power}}$

D. A. Patterson and J. L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, 4th Edition, Morgan Kaufmann Publishers (Elsevier), 2009.

Energy Efficiency
• Efficiency averaged over n benchmark programs:

  $\text{Efficiency} = \left( \prod_{i=1}^{n} \text{Efficiency}_i \right)^{1/n}$

  where $\text{Efficiency}_i$ is the efficiency for program i.
• Relative efficiency:

  $\text{Relative efficiency} = \dfrac{\text{Efficiency of a computer}}{\text{Efficiency of reference computer}}$
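The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names and the (execution time, average power) pairs are hypothetical and are not SPEC measurements.

```python
import math

def energy_efficiency(exec_time_s, avg_power_w):
    """Energy efficiency of one program: (1 / execution time) / average power."""
    return (1.0 / exec_time_s) / avg_power_w

def mean_efficiency(efficiencies):
    """Geometric mean of per-program efficiencies, computed through logarithms."""
    return math.exp(sum(math.log(e) for e in efficiencies) / len(efficiencies))

def relative_efficiency(machine_effs, reference_effs):
    """Mean efficiency of a computer divided by that of the reference computer."""
    return mean_efficiency(machine_effs) / mean_efficiency(reference_effs)

# Hypothetical (execution time in s, average power in W) pairs for two programs.
machine = [energy_efficiency(t, p) for t, p in [(10.0, 50.0), (20.0, 40.0)]]
reference = [energy_efficiency(t, p) for t, p in [(40.0, 100.0), (50.0, 80.0)]]

print(f"relative efficiency = {relative_efficiency(machine, reference):.3f}")
```

The geometric mean is used so that the ratio of two machines' mean efficiencies equals the geometric mean of their per-program efficiency ratios.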

SPEC 2000 Relative Energy Efficiency
[Figure: relative energy efficiency for three modes: always max. clock, laptop adaptive clock, and min. power min. clock.]

Voltage Scaling
• Dynamic: Reduce voltage and frequency during idle or low activity periods.
• Static: Clustered voltage scaling
  • Logic on non-critical paths is given a lower voltage.
  • 47% power reduction with 10% area increase reported.
  • M. Igarashi et al., “Clustered Voltage Scaling Techniques for Low-Power Design,” Proc. IEEE Symp. Low Power Design, 1997.

Processor Utilization
Throughput = Operations / second
[Figure: throughput versus time, showing compute-intensive processes at maximum throughput, low throughput (background) processes, and system idle periods.]

Examples of Processes
• Compute-intensive: spreadsheet, spelling check, video decoding, scientific computing.
• Low throughput: data entry, screen updates, low bandwidth I/O data transfer.
• Idle: no computation, no expected output.

Effects of Voltage Reduction
• Voltage reduction increases delay and decreases throughput:
  • Slow reduction in throughput at first
  • Rapid reduction in throughput for VDD ≤ Vth
  • Time per operation (TPO) increases
• Voltage reduction continues to reduce power consumption:
  • Energy per operation (EPO) = Power × TPO

Energy per Operation (EPO)
[Figure: EPO, power, and TPO on a vertical scale from 0.0 to 1.0, plotted against VDD/Vth from 1 to 5.]

Dynamic Voltage and Clock

                          Time spent in:
Throughput mode           Fast     Slow     Idle     Battery life
Always full speed         10%      0%       90%      1 hr
Sometimes full speed      1%       90%      9%       5.3 hrs
Rarely full speed         0.1%     99%      0.9%     9.2 hrs

T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002, pp. 35-36.

Example: Find Minimum Energy Mode
• Processor data (rated operation):
  • 2 GHz clock
  • 1.5 volt supply voltage
  • 0.5 volt threshold voltage
• Power consumption:
  • 50 watts dynamic power
  • 50 watts static power
• Maximum clock frequency for V volt supply (alpha-power law):

  $f \propto (V - V_{TH})/V$

Alpha-Power Law Model
• Variation of delay with supply voltage:

  $\text{delay} \propto V_{DD} / (V_{DD} - V_{TH})^{\alpha}$

• $V_{TH}$ = threshold voltage
• α = 1 for short-channel devices, ≈ 2 for long-channel devices
• T. Sakurai and A. R. Newton, “Delay analysis of series-connected MOSFET circuits,” IEEE Journal of Solid-State Circuits, Vol. 26, pp. 122–131, Feb. 1991.
• T. Sakurai and A. R. Newton, “A simple MOSFET model for circuit analysis,” IEEE Transactions on Electron Devices, Vol. 38, No. 4, pp. 887–894, Apr. 1991.
• T. Sakurai, “High-speed circuit design with scaled-down MOSFETs and low supply voltage (invited),” Proc. IEEE ISCAS, pp. 1487–1490, Chicago, May 1993.
• T. Sakurai, “Alpha-Power Law MOS Model,” IEEE Solid-State Circuits Society Newsletter, Vol. 9, No. 4, pp. 4–5, Oct. 2004.

Example Cont.
• Dynamic power:
  $P_d = C V^2 f = C (1.5)^2 \times 2 \times 10^9 = 50$ W
  $C = 11.11$ nF, capacitance switching per cycle
  $P_d = 11.11 V^2 f$ (in watts, with f in GHz)
• Dynamic energy per cycle:
  $E_d = P_d / f = 11.11 V^2$ (in nJ)

Example Cont.
• Clock frequency:
  $f = k (V - V_{TH})/V = k (1.5 - 0.5)/1.5 = 2$ GHz
  $k = 3$ GHz, a proportionality constant
  $f = 3 (V - 0.5)/V$ GHz

Example Cont.
• Static power:
  $P_s = k' V^2 = k' (1.5)^2 = 50$ W
  $k' = 22.22$ mho, total leakage conductance
  $P_s = 22.22 V^2$ (in watts)
• Static energy per cycle:
  $E_s = P_s / f = 22.22 V^3 / [3 (V - 0.5)] = 7.41 V^3 / (V - 0.5)$ (in nJ)

Example Cont.
• Total energy per cycle:
  $E = E_d + E_s = 11.11 V^2 + 7.41 V^3 / (V - 0.5)$
• To minimize E, set $\partial E / \partial V = 0$, which gives
  $5 V^2 - 4.5 V + 0.75 = 0$
• Solutions of the quadratic equation: V = 0.679 volt, 0.221 volt
• Discard the second solution, which is lower than the threshold voltage of 0.5 volt.

Example: Result

                         Rated mode   Low energy mode   Reduction (%)
Voltage                  1.5 V        0.679 V           54.7%
Clock frequency          2 GHz        791 MHz           60%
Dynamic energy/cycle     25.00 nJ     5.12 nJ           79.52%
Static energy/cycle      25.00 nJ     12.96 nJ          48.16%
Total energy/cycle       50.0 nJ      18.08 nJ          63.84%
Dynamic power            50.0 W       4.05 W            91.90%
Static power             50.0 W       10.25 W           79.50%
Total power              100.0 W      14.30 W           85.70%
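The minimum-energy operating point of this example can also be located numerically. The sketch below, in Python, rebuilds the example's model from the rated data (2 GHz, 1.5 V, 0.5 V threshold, 50 W dynamic, 50 W static) and searches for the supply voltage that minimizes energy per cycle. The function names and the use of a ternary search are choices made for this illustration, not part of the lecture.

```python
V_TH = 0.5          # threshold voltage, V
V_RATED = 1.5       # rated supply voltage, V
F_RATED_GHZ = 2.0   # rated clock, GHz
P_DYN_W = 50.0      # dynamic power at rated operation, W
P_STAT_W = 50.0     # static power at rated operation, W

# Constants fitted at the rated operating point.
C_NF = P_DYN_W / (V_RATED ** 2 * F_RATED_GHZ)      # switched capacitance, nF
K_GHZ = F_RATED_GHZ * V_RATED / (V_RATED - V_TH)   # frequency constant, GHz
G_MHO = P_STAT_W / V_RATED ** 2                    # leakage conductance, mho

def clock_ghz(v):
    """Maximum clock frequency from the alpha-power law with alpha = 1."""
    return K_GHZ * (v - V_TH) / v

def energy_per_cycle_nj(v):
    """Dynamic plus static energy per cycle in nJ (W / GHz = nJ)."""
    dynamic = C_NF * v ** 2
    static = G_MHO * v ** 2 / clock_ghz(v)
    return dynamic + static

def minimum_energy_voltage(lo=V_TH + 1e-6, hi=V_RATED, iterations=200):
    """Ternary search; assumes energy per cycle has one minimum in (V_TH, V_RATED]."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if energy_per_cycle_nj(m1) < energy_per_cycle_nj(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2.0

v_opt = minimum_energy_voltage()
print(f"V = {v_opt:.3f} V, f = {clock_ghz(v_opt) * 1000:.0f} MHz, "
      f"E = {energy_per_cycle_nj(v_opt):.2f} nJ/cycle")
```

The search lands at about 0.679 V, in agreement with the larger root of the quadratic in the example.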

Cycle Efficiency
• Cycle efficiency is a rating similar to the maximum clock frequency rating.
• Analogy:
  • Cycle efficiency is similar to miles per gallon (mpg)
  • Maximum clock frequency is similar to miles per hour (mph)
• Reference: A. Shinde and V. D. Agrawal, “Managing Performance and Efficiency of a Processor,” Proc. 45th IEEE Southeastern Symp. System Theory, March 2013.

Performance in Time
• Performance is measured with respect to a program.
• Performance = 1 / (execution time)

D. A. Patterson and J. L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, Fourth Edition, San Francisco, California: Morgan Kaufmann Publishers, Inc., 2008.

Performance in Energy (Efficiency)
• Efficiency is measured with respect to a program.
• Efficiency = Performance / (average power) = 1 / (energy consumed)

D. A. Patterson and J. L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, Fourth Edition, San Francisco, California: Morgan Kaufmann Publishers, Inc., 2008.

Two Performances
• Time performance
• Energy performance

D. A. Patterson and J. L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, Fourth Edition, San Francisco, California: Morgan Kaufmann Publishers, Inc., 2008.

Time Performance
• Speed of a processor is measured in cycles per second, or clock frequency (f).
• Clock period (1/f) is the time per cycle.
• Execution time of a program using C clock cycles = C/f
• Time performance = 1/(execution time) = f/C

Energy Performance
• Energy efficiency of a processor may be measured in cycles per joule, or cycle efficiency (η).
• 1/η is the energy per cycle (EPC).
• Energy dissipated by a program using C clock cycles = C × EPC = C/η
• Energy performance = η/C

Characterizing Device Technology Speed and Efficiency
• Consider 90 nm CMOS technology.
• Use the predictive technology model (PTM).
• Example circuit: eight-bit ripple carry adder.
• Nominal voltage = 1.2 volts.
• Simulation for varying operating conditions (VDD = 100 mV through 1.2 V) using Spice:
  • With random vectors for energy per cycle (EPC = 1/η).
  • With critical path vectors for clock period (1/f).
• Reference: W. Zhao and Y. Cao, “New Generation of Predictive Technology Model for Sub-45 nm Early Design Exploration,” IEEE Trans. Electron Devices, vol. 53, no. 11, pp. 2816–2823, 2006.

Energy per Cycle of 8-Bit Adder
[Figure: energy per cycle of the 8-bit adder.]
• K. Kim, “Ultra Low Power CMOS Design,” Ph.D. Dissertation, Auburn University, Dept. of ECE, Auburn, Alabama, May 2011.

Cycle Time of 8-Bit Adder
[Figure: cycle time of the 8-bit adder.]
• K. Kim, “Ultra Low Power CMOS Design,” Ph.D. Dissertation, Auburn University, Dept. of ECE, Auburn, Alabama, May 2011.

Pentium M Processor
• Published data: H. Hanson, K. Rajamani, S. Keckler, F. Rawson, S. Ghiasi, J. Rubio, “Thermal Response to DVFS: Analysis with an Intel Pentium M,” Proc. International Symp. Low Power Electronics and Design, 2007, pp. 219-224.
• VDD = 1.2 V
• Maximum clock rate = 1.8 GHz
• Critical path delay, td = 1/(1.8 GHz) = 555.56 ps
• Power consumption = 120 W
• EPC = 120 W/(1.8 GHz) = 66.67 nJ

Cycle Efficiency and Frequency
[Figure: cycle efficiency and frequency.]

Example
• For a program that executes in 1.8 billion clock cycles:

Voltage   Frequency f       Cycle efficiency η    Execution time   Total energy    Power f/η
VDD       (megacycles/s)    (megacycles/joule)    (seconds)        consumed (J)    (W)
1.2 V     1800              15                    1.0              120             120
0.6 V     277               70                    6.5              25.7            3.96
200 mV    54.5              660                   33               2.73            0.083
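The columns of this example follow directly from the two ratings: execution time is C/f, energy is C/η, and power is f/η. A minimal Python sketch (the function name `run_profile` is hypothetical) reproduces the rows from the listed f and η values:

```python
def run_profile(cycles, f_cycles_per_s, eta_cycles_per_joule):
    """Return (execution time in s, energy in J, power in W) for a program."""
    exec_time = cycles / f_cycles_per_s          # C / f
    energy = cycles / eta_cycles_per_joule       # C / eta
    power = f_cycles_per_s / eta_cycles_per_joule  # f / eta
    return exec_time, energy, power

CYCLES = 1.8e9  # program length from the example
# (label, f in cycles/s, eta in cycles/joule), taken from the table above
OPERATING_POINTS = [
    ("1.2 V", 1800e6, 15e6),
    ("0.6 V", 277e6, 70e6),
    ("200 mV", 54.5e6, 660e6),
]

for label, f, eta in OPERATING_POINTS:
    t, e, p = run_profile(CYCLES, f, eta)
    print(f"{label:>7}: time = {t:6.2f} s, energy = {e:7.2f} J, power = {p:7.3f} W")
```

Lowering the voltage trades execution time for energy: the 200 mV point takes about 33 times longer but uses roughly 1/44 of the energy of the 1.2 V point.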

Cycle Efficiency
• New energy performance rating: cycle efficiency η; unit is cycles per joule.
• Clock frequency f in cycles per second is a similar rating for time performance.
• Similarity to other popular ratings:
  • η → mpg
  • f → mph
• The two ratings allow effective time and energy management of an electronic system.

Problem of Process Variation in Nanometer Technologies
[Figure: number of chips versus threshold voltage, from lower Vth to higher Vth around the nominal Vth. Labels: yield loss due to high leakage (power specification), lower voltage operation, nominal voltage, higher voltage operation, yield loss due to slow speed (clock specification).]
From a presentation: Power Reduction using LongRun2 in Transmeta’s Efficeon Processor, by D. Ditzel, May 17, 2006.

Clock Distribution
[Figure: H-tree distributing the clock to N flip-flops.]
• Fanout, λ = 4
• Tree depth, $s = \log_{\lambda} N$
• No. of flip-flops = N

Clock Network Power

$P_{clk} = C_L V_{DD}^2 f + C_L V_{DD}^2 f/\lambda + C_L V_{DD}^2 f/\lambda^2 + \cdots = C_L V_{DD}^2 f \sum_{n=0}^{\text{stages}-1} \frac{1}{\lambda^n}$

where
  $C_L$ = total load capacitance of N flip-flops (a flip-flop is assumed similar to a clock buffer)
  $\lambda$ = constant fanout at each stage in the distribution network

Clock consumes about 40% of total processor power, because:
(1) The clock is always active.
(2) It makes two transitions per cycle (α = 2).
Clock gating is useful: inhibit the clock to unused blocks.

Upper Bound on Clock Power

$P_{clk} = C_L V_{DD}^2 f + C_L V_{DD}^2 f/\lambda + C_L V_{DD}^2 f/\lambda^2 + \cdots$
$\quad \le C_L V_{DD}^2 f \sum_{n=0}^{\infty} \frac{1}{\lambda^n}$
$\quad \le C_L V_{DD}^2 f \cdot \frac{1}{1 - 1/\lambda}$
$\quad \le C_L V_{DD}^2 f \cdot \frac{\lambda}{\lambda - 1}$
$\quad \le 1.333\, C_L V_{DD}^2 f$, because λ = 4

Properties of H-Tree
• Balanced clock skew.
• Small delay and power consumption.
• Requires fine-tuning for complex layout.

Clock Power and Delay
• Unit size buffer or inverter delay = d
• Total dynamic power supplied to N flip-flops, $P = C_L V_{DD}^2 f$
• Total power consumption of clock network:

Flip-flops, N    Clock network power    Clock delay
1                P                      d
4                P                      4d
16               1.25 P                 8d
64               1.3125 P               12d
256              1.328125 P             16d
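The power multipliers and delays for a fanout-4 tree can be regenerated from the geometric series for clock power. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, N must be a power of the fanout, and each tree stage is assumed to contribute a delay of λ·d (a unit buffer driving λ loads), with a single unit delay d when N = 1.

```python
def tree_stages(n_flipflops, fanout=4):
    """Number of buffer stages s such that fanout**s == n_flipflops."""
    stages, reach = 0, 1
    while reach < n_flipflops:
        reach *= fanout
        stages += 1
    if reach != n_flipflops:
        raise ValueError("n_flipflops must be a power of the fanout")
    return stages

def clock_power_factor(n_flipflops, fanout=4):
    """Clock network power as a multiple of P = C_L * VDD^2 * f."""
    stages = tree_stages(n_flipflops, fanout)
    # The flip-flop load itself always counts, so at least one term is summed.
    return sum(fanout ** -n for n in range(max(stages, 1)))

def clock_delay_units(n_flipflops, fanout=4):
    """Clock delay in units of d (assumed: fanout * d per stage, d for N = 1)."""
    stages = tree_stages(n_flipflops, fanout)
    return 1 if stages == 0 else fanout * stages

def power_factor_upper_bound(fanout=4):
    """Limit of the geometric series: fanout / (fanout - 1)."""
    return fanout / (fanout - 1)

for n in (1, 4, 16, 64, 256):
    print(f"N = {n:3d}: power = {clock_power_factor(n)} P, delay = {clock_delay_units(n)} d")
print(f"upper bound = {power_factor_upper_bound():.3f} P")
```

The multiplier approaches but never reaches λ/(λ − 1) = 1.333 for λ = 4, so adding tree levels costs little extra power while the delay grows linearly with depth.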

Clock Network Examples
Processors compared: Alpha 21064, Alpha 21164, Alpha 21264
• Technology: 0.75μ CMOS, 0.35μ CMOS
• Frequency (MHz): 200, 300, 600
• Total capacitance: 12.5 nF
• Clock load: 3.25 nF, 3.75 nF
• Clock power: 40% (20 W)
• Max. clock skew: 200 ps (<10%), 90 ps
• Clock gating used.
• Total power: 80, 110 W

D. W. Bailey and B. J. Benschneider, “Clocking Design and Analysis for a 600-MHz Alpha Microprocessor,” IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1627-1633, Nov. 1998.

Architecture Level: Pipeline Gating
• A pipelined processor uses speculative execution.
  • Incorrect branch prediction results in pipeline stalls and wasted energy.
• Idea: Stop fetching instructions if a branch hazard is expected:
  • If the count (M) of incorrect predictions exceeds a prespecified number (N), then suspend fetching instructions for some k cycles.
• Ref.: S. Manne, A. Klauser and D. Grunwald, “Pipeline Gating: Speculation Control for Energy Reduction,” Proc. 25th Annual International Symp. Computer Architecture, June 1998.
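The gating rule on this slide (suspend fetch for k cycles once the count M exceeds N) can be sketched as a small state machine. This Python sketch is illustrative only: the class name `FetchGate` and its methods are hypothetical, and resetting M to zero when gating starts is an assumption, since the slide does not say when the count is cleared.

```python
class FetchGate:
    """Suspends instruction fetch for k cycles when the count M exceeds N."""

    def __init__(self, n_threshold, k_cycles):
        self.n_threshold = n_threshold   # N: allowed count before gating
        self.k_cycles = k_cycles         # k: cycles to suspend fetch
        self.count = 0                   # M: incorrect predictions counted so far
        self.stall_remaining = 0         # gated cycles still to serve

    def record_misprediction(self):
        """Count one incorrect prediction and start gating if M > N."""
        self.count += 1
        if self.count > self.n_threshold:
            self.stall_remaining = self.k_cycles
            self.count = 0               # assumption: restart the count after gating

    def fetch_allowed(self):
        """Call once per clock cycle; returns False while fetch is gated."""
        if self.stall_remaining > 0:
            self.stall_remaining -= 1
            return False
        return True


gate = FetchGate(n_threshold=2, k_cycles=3)
for _ in range(3):                       # three mispredictions: M = 3 > N = 2
    gate.record_misprediction()
print([gate.fetch_allowed() for _ in range(5)])
```

With N = 2 and k = 3, the third misprediction gates fetch for three cycles, after which fetching resumes.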

Slack Scheduling
• Application: superscalar, out-of-order execution:
  • An instruction is executed as soon as the required data and resources become available.
  • A commit unit reorders the results.
• Delay the completion of instructions whose result is not immediately needed.
• Example of RISC instructions:

  add r0, r1, r2;    (A)
  sub r3, r4, r5;    (B)
  and r9, r1, r9;    (C)
  or  r5, r9, r10;   (D)
  xor r2, r5, r11;   (E)

J. Casmira and D. Grunwald, “Dynamic Instruction Scheduling Slack,” Proc. ACM Kool Chips Workshop, Dec. 2000.
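One way to see which of the five instructions have slack is to compare the earliest and latest cycle in which each can start. The Python sketch below is an illustration under assumptions that are not stated in the slides: the first register of each instruction is taken as the destination, every instruction takes one cycle, and only true (read-after-write) dependences are considered. The function name is hypothetical.

```python
# (label, destination, source 1, source 2); destination-first order is assumed.
PROGRAM = [
    ("A", "r0", "r1", "r2"),
    ("B", "r3", "r4", "r5"),
    ("C", "r9", "r1", "r9"),
    ("D", "r5", "r9", "r10"),
    ("E", "r2", "r5", "r11"),
]

def slack_per_instruction(program):
    """Slack = latest start cycle minus earliest start cycle, unit latency."""
    n = len(program)
    preds = [[] for _ in range(n)]   # producers each instruction waits for
    succs = [[] for _ in range(n)]   # consumers of each instruction's result
    last_writer = {}                 # register -> index of most recent writer
    for i, (_, dest, *sources) in enumerate(program):
        for reg in sources:
            producer = last_writer.get(reg)
            if producer is not None and i not in succs[producer]:
                preds[i].append(producer)
                succs[producer].append(i)
        last_writer[dest] = i

    # Earliest start: one cycle after the latest producer starts.
    earliest = [0] * n
    for i in range(n):
        earliest[i] = max((earliest[p] + 1 for p in preds[i]), default=0)

    # Latest start that does not lengthen the schedule.
    last_cycle = max(earliest)
    latest = [last_cycle] * n
    for i in reversed(range(n)):
        latest[i] = min((latest[s] - 1 for s in succs[i]), default=last_cycle)

    return {program[i][0]: latest[i] - earliest[i] for i in range(n)}

print(slack_per_instruction(PROGRAM))
```

Under these assumptions C, D, and E form a dependence chain with no slack, while A and B can each be delayed by two cycles, which makes them candidates for slower, lower-power execution.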

Slack Scheduling Example
[Figure: timing of instructions A, B, C, D, E under standard scheduling and under slack scheduling.]

Slack Scheduling
[Figure: scheduling logic, re-order buffer with a slack bit, and low-power execution units.]

Power Reduction Example
• Alpha 21064: 200 MHz @ 3.45 V, power dissipation = 26 W
• Reduce voltage to 1.5 V, power (× 0.189) = 4.9 W
• Eliminate FP unit, power (× 0.33) = 1.6 W
• Scale 0.75μ → 0.35μ, power (× 0.5) = 0.8 W
• Reduce clock load, power (× 0.75) = 0.6 W
• Reduce frequency 200 → 160 MHz, power (× 0.8) = 0.48 W
• J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor,” IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1703-1714, Nov. 1996.
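The chain of reductions is a running product of scale factors. The short Python sketch below recomputes it; the voltage and frequency factors are derived from the quoted numbers (power scales with the square of voltage and linearly with frequency), while the other factors are taken from the slide as given. The function name is hypothetical.

```python
INITIAL_POWER_W = 26.0  # Alpha 21064 at 200 MHz and 3.45 V

# (step description, multiplicative power factor)
STEPS = [
    ("Reduce voltage 3.45 V -> 1.5 V", (1.5 / 3.45) ** 2),  # power ~ V^2
    ("Eliminate FP unit", 0.33),
    ("Scale 0.75u -> 0.35u", 0.5),
    ("Reduce clock load", 0.75),
    ("Reduce frequency 200 -> 160 MHz", 160 / 200),         # power ~ f
]

def power_after_each_step(initial_power_w, steps):
    """Return a list of (description, power in W) after each step is applied."""
    results = []
    power = initial_power_w
    for description, factor in steps:
        power *= factor
        results.append((description, power))
    return results

for description, power in power_after_each_step(INITIAL_POWER_W, STEPS):
    print(f"{description:<34} {power:6.2f} W")
```

Without intermediate rounding the final figure comes out slightly above the 0.48 W quoted on the slide, which rounds after each step.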

Why Asynchronous Design?
• Clock consumes about 40% of total power and limits performance.
• Benefits of asynchronous design:
  • Low power: clock power is eliminated.
  • Higher performance: in a clocked pipeline, the clock speed is set by the slowest stage.
  • Modularity: modules in a clock-less system operate autonomously.
• Hurdles: design, verification, testing, yield.

Clock Power / Total Power
[Figure: clock power / total power (0.0 to 1.0) versus activity α (0 to 0.3), with curves for logic to flip-flop ratios of 0, 5, 10, and 20.]
K. Van Berkel, et al., “Asynchronous Does Not Imply Low Power, But...,” Low-Power CMOS Design, A. P. Chandrakasan and R. Brodersen (Eds.), New York: IEEE Press, 1998, pp. 227-232.

Asynchronous Systems
• No clock.
• Self-timed systems:
  • Encoded signals
  • Timing signal
• Signaling protocols:
  • Sender sends a request
  • Receiver acknowledges

GALS: Globally Asynchronous, Locally Synchronous
[Figure: synchronous modules, each with a locally generated clock, communicating through self-timed or protocol-driven signals.]

AMULET2e (1996)
[Figure: AMULET2e.]

AMULET2e (1996)
• Asynchronous ARM8.
• 0.5 micron CMOS, 6.4 mm × 6.4 mm, 3.3 V.
• 454k transistors (cache 328k, processor core 93k, control and I/O 33k).
• 150 mW at 40 MIPS, similar to sync. ARM8.
• 3 μW in idle state.
• http://apt.cs.manchester.ac.uk/projects/processors/amulet/AMULET2_uP.php

U. Manchester, CS Dept.
http://apt.cs.manchester.ac.uk/projects/processors/amulet/AMULET3i_seminar.pdf

Asynchronous logic:
• can be competitive with ‘conventional’ designs
• has particular advantages with low-power and low EMI (think portable systems)
• may be the only solution to some tasks on big chips, especially block interconnections
But:
• designing big systems is a lot of work
• it’s hard to catch up with the big companies

References on Async. Design
• David A. Huffman, The Synthesis of Sequential Switching Circuits, MIT, 1953.
• Stephen H. Unger, Asynchronous Sequential Switching Circuits, Wiley-Interscience, 1969.
• Chris J. Myers, Asynchronous Circuit Design, John Wiley & Sons, Inc., 2001.

References on Async. Processors
• S. B. Furber, “Asynchronous Design,” Chapter 7 in Low Power Design in Deep Submicron Electronics, W. Nebel and J. Mermet (Editors), Springer, 1997.
• A. J. Martin, M. Nyström and C. G. Wong, “Three Generations of Asynchronous Microprocessors,” Caltech, CS Dept., Pasadena, CA, available from http://www.async.caltech.edu/Pubs/PDF/2003_threegen.pdf
• I. E. Sutherland, "Turing Award: Micropipeline," Comm. ACM, vol. 32, no. 6, pp. 720-738, June 1989. http://www.eng.auburn.edu/~vagrawal/COURSE/READING/ARCH/Micropipeline_sutherland.pdf
• I. E. Sutherland, "The Tyranny of the Clock," Comm. ACM, vol. 55, no. 10, pp. 35-36, Oct 2012. http://www.eng.auburn.edu/~vagrawal/COURSE/READING/ARCH/Sutherland_Tyranny_o_Clock.pdf

For More on Microprocessors
• T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002.
• R. Graybill and R. Melhem, Power Aware Computing, New York: Plenum Publishers, 2002.

Class Project
• Assigned April 6, 2015.
• Clear understanding of the problem expected.
• Conduct to-the-point analysis.
• Reliable (reproducible) data.
• Meaningful conclusions usable by others.
• A readable four to six page report (due on 4/27/15), written and formatted like a technical paper (PDF). Include data but do not attach printouts.