Low-Power Design of Digital VLSI Circuits: Multicore Design

Low-Power Design of Digital VLSI Circuits
Multicore Design for Low Power
Vishwani D. Agrawal
James J. Danaher Professor
Dept. of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849
vagrawal@eng.auburn.edu
http://www.eng.auburn.edu/~vagrawal
Copyright Agrawal, 2011. Lecture 15: Multicore Design

Low-Power Datapath Architecture
• Lower supply voltage
  • This slows down circuit speed
  • Use parallel computing to gain the speed back
  • Works well when threshold voltage is also lowered
About 60% reduction in power obtainable.
Reference: A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Boston: Kluwer Academic Publishers (now Springer), 1995.

A Reference Datapath
[Figure: reference datapath with input, combinational logic, register, and output, clocked by CK; labeled Cref.]
Supply voltage = $V_{ref}$
Total capacitance switched per cycle = $C_{ref}$
Clock frequency = $f$
Power consumption: $P_{ref} = C_{ref} V_{ref}^2 f$

A Parallel Architecture
[Figure: an input register feeds N copies of the combinational logic (Copy 1, Copy 2, ..., Copy N), each with a register clocked at f/N; an N-to-1 multiplexer and an output register clocked at f produce the output; a multiphase clock generator and mux control are driven by CK.]
• N = degree of parallelism
• Each copy processes every Nth input and operates at reduced voltage
• Supply voltage: $V_N \le V_1 = V_{ref}$

Level Converter: L to H
[Figure: low-to-high level converter circuit with input Vin_L, output Vout_H, and supplies VDDL and VDDH. Annotation: transistors with thicker oxide and longer channels.]
Reference: N. H. E. Weste and D. Harris, CMOS VLSI Design, Third Edition, Section 12.4.3, Addison-Wesley, 2005.

Level Converter: H to L
[Figure: high-to-low level converter circuit with input Vin_H, output Vout_L, and supply VDDL. Annotation: transistors with thicker oxide and longer channels.]
Reference: N. H. E. Weste and D. Harris, CMOS VLSI Design, Third Edition, Section 12.4.3, Addison-Wesley, 2005.

Control Signals, N = 4
[Figure: timing diagram of CK and the four clock phases, Phase 1 through Phase 4.]

Power
$P_N = P_{proc} + P_{overhead}$
$P_{proc} = N (C_{inreg} + C_{comb}) V_N^2 \frac{f}{N} + C_{outreg} V_N^2 f = (C_{inreg} + C_{comb} + C_{outreg}) V_N^2 f = C_{ref} V_N^2 f$
$P_{overhead} = C_{overhead} V_N^2 f \approx \delta C_{ref} (N - 1) V_N^2 f$
$P_N = [1 + \delta (N - 1)] C_{ref} V_N^2 f$
$\frac{P_N}{P_1} = [1 + \delta (N - 1)] \frac{V_N^2}{V_{ref}^2}$
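The power ratio on this slide can be evaluated directly. The sketch below is only an illustration: the function name `power_ratio` and the overhead fraction δ = 0.1 are assumptions made for the example, while the 2.9 V and 5 V values are the N = 2 and reference supplies quoted in this lecture.

```python
def power_ratio(n, v_n, v_ref, delta):
    """Return P_N / P_1 = [1 + delta*(N - 1)] * (V_N / V_ref)**2.

    n      -- degree of parallelism N (at least 1)
    v_n    -- reduced supply voltage V_N of the parallel design
    v_ref  -- supply voltage V_ref of the reference datapath
    delta  -- overhead capacitance per extra copy, as a fraction of C_ref
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return (1.0 + delta * (n - 1)) * (v_n / v_ref) ** 2


# Two copies at 2.9 V instead of one at 5 V; delta = 0.1 is an assumed value.
ratio = power_ratio(2, 2.9, 5.0, 0.1)
print(f"P_2 / P_1 = {ratio:.3f}")  # prints P_2 / P_1 = 0.370
```

With these numbers the two-copy design would dissipate roughly 37% of the reference power, in line with the "about 60% reduction" figure quoted earlier in the lecture.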

Voltage vs. Speed
Delay of a gate, $T \approx \frac{C_L V_{ref}}{I} = \frac{C_L V_{ref}}{k (W/L) (V_{ref} - V_t)^2}$
where
  $I$ is the saturation current
  $k$ is a technology parameter
  $W/L$ is the width-to-length ratio of the transistor
  $V_t$ is the threshold voltage
[Figure: normalized gate delay T (0.0 to 4.0) versus supply voltage for 1.2μ CMOS. Marked points: N = 1 at Vref = 5 V, N = 2 at V2 = 2.9 V, N = 3 at V3. Annotation: voltage reduction slows down as we get closer to Vt.]
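As a worked step, the delay expression on this slide can be solved for the supply voltage at which each of N copies is N times slower than the reference. This is a sketch that assumes the same delay model holds at the reduced voltage; the shorthand $K$ is introduced here for convenience and is not from the slides. The plotted curve may come from measured data, so the result need not match the marked points exactly.

```latex
\begin{align*}
T(V) &\propto \frac{V}{(V - V_t)^2}, \qquad T(V_N) = N \, T(V_{ref}) \\
\frac{V_N}{(V_N - V_t)^2} &= \frac{N V_{ref}}{(V_{ref} - V_t)^2} \equiv K \\
K V_N^2 - (2 K V_t + 1) V_N + K V_t^2 &= 0 \\
V_N &= \frac{2 K V_t + 1 + \sqrt{4 K V_t + 1}}{2K}
\end{align*}
```

The product of the two roots is $V_t^2$, so only the larger root lies above $V_t$ and is physically meaningful. For $V_t = 0$ the expression reduces to $V_N = 1/K = V_{ref}/N$.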

Increasing Multiprocessing
[Figure: PN/P1 (0.0 to 1.0) versus N (1 to 12) for 1.2μ CMOS, Vref = 5 V, with curves for Vt = 0.8 V, Vt = 0.4 V, and Vt = 0 V (extreme case).]

Extreme Cases: Vt = 0
Delay, $T \propto 1/V_{ref}$
For N processing elements, delay $= NT \rightarrow V_N = V_{ref}/N$
$\frac{P_N}{P_1} = [1 + \delta (N - 1)] \frac{1}{N^2}$, which falls off as $1/N$ for large N (approximately $\delta/N$)
For negligible overhead, $\delta \to 0$: $\frac{P_N}{P_1} \approx \frac{1}{N^2}$
For Vt > 0, power reduction is less and there will be an optimum value of N.
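The existence of an optimum N for Vt > 0 can be explored numerically. The following is a minimal sketch, assuming the gate-delay model from this lecture (delay proportional to V/(V − Vt)²) and an assumed overhead fraction δ = 0.1; the function names and the search limit of 32 are hypothetical choices for illustration, and the results are not meant to reproduce the plotted curves.

```python
import math


def supply_for_slowdown(n, v_ref, v_t):
    """Supply voltage at which gate delay is n times the delay at v_ref.

    Assumes delay is proportional to V / (V - v_t)**2. Setting
    T(V_N) = n * T(v_ref) gives a quadratic in V_N; the larger root
    is the one that lies above v_t.
    """
    k = n * v_ref / (v_ref - v_t) ** 2
    return (2 * k * v_t + 1 + math.sqrt(4 * k * v_t + 1)) / (2 * k)


def scaled_power_ratio(n, v_ref, v_t, delta):
    """P_N / P_1 when the supply is lowered just enough to allow n-fold delay."""
    v_n = supply_for_slowdown(n, v_ref, v_t)
    return (1 + delta * (n - 1)) * (v_n / v_ref) ** 2


def best_n(v_ref, v_t, delta, n_max):
    """Degree of parallelism in 1..n_max with the lowest power ratio."""
    return min(range(1, n_max + 1),
               key=lambda n: scaled_power_ratio(n, v_ref, v_t, delta))


# Sweep the three threshold voltages discussed in the lecture (delta assumed).
for v_t in (0.0, 0.4, 0.8):
    n_opt = best_n(5.0, v_t, 0.1, 32)
    ratio = scaled_power_ratio(n_opt, 5.0, v_t, 0.1)
    print(f"Vt = {v_t:.1f} V: best N = {n_opt}, P_N/P_1 = {ratio:.3f}")
```

Under these assumptions, the Vt = 0 case keeps improving up to the search limit, while a nonzero Vt yields an interior minimum because the supply cannot drop far below the threshold and the overhead term keeps growing.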

Example: Multiplier Core
• Specification:
  • 200 MHz clock
  • 15 W dissipation @ 5 V
  • Low voltage operation, VDD ≥ 1.5 volts
  • Relative clock rate $= \frac{(V_{DD} - 0.5)^2}{20.25}$
• Problem:
  • Integrate multiplier core on a SOC
  • Power budget for multiplier ~ 5 W

A Multicore Design
[Figure: a 200 MHz input register feeds five multiplier cores (Multiplier Core 1, Multiplier Core 2, ..., Multiplier Core 5), each with a register clocked at 40 MHz; a 5-to-1 mux and a 200 MHz output register produce the output; a multiphase clock generator and mux control are driven by the 200 MHz CK.]
Core clock frequency = 200/N MHz; N should divide 200.

How Many Cores?
• For N cores:
  • Clock frequency = 200/N MHz
  • Supply voltage, $V_{DDN} = 0.5 + (20.25/N)^{1/2}$ volts
  • Assuming 10% overhead per core, power dissipation $= 15 [1 + 0.1 (N - 1)] \left(\frac{V_{DDN}}{5}\right)^2$ watts

Design Tradeoffs
Number of cores, N    Clock (MHz)    Core supply VDDN (Volts)    Total Power (Watts)
1                     200            5.00                        15.0
2                     100            3.68                        8.94
4                     50             2.75                        5.90
5                     40             2.51                        5.29
8                     25             2.10                        4.50
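The tradeoff table can be regenerated from the multiplier specification and the "How Many Cores?" formulas. The sketch below uses only the numbers given in the example (15 W at 5 V and 200 MHz, a 0.5 V offset in the clock-rate model, 10% overhead per core); the constant and function names are hypothetical. Computed values may differ from the table in the last digit because the table appears to use rounded supply voltages.

```python
import math

P_REF_W = 15.0      # single-core dissipation at 5 V
V_REF = 5.0         # reference supply voltage, volts
F_REF_MHZ = 200.0   # required throughput clock, MHz
V_OFFSET = 0.5      # offset in the relative clock rate (VDD - 0.5)^2 / 20.25
OVERHEAD = 0.10     # overhead per additional core


def core_supply(n):
    """Supply voltage that lets each of n cores run at 200/n MHz."""
    return V_OFFSET + math.sqrt((V_REF - V_OFFSET) ** 2 / n)


def total_power(n):
    """Total dissipation in watts for n cores, including overhead."""
    return P_REF_W * (1 + OVERHEAD * (n - 1)) * (core_supply(n) / V_REF) ** 2


def design_table(core_counts=(1, 2, 4, 5, 8)):
    """Rows of (N, core clock in MHz, core supply in V, total power in W)."""
    return [(n, F_REF_MHZ / n, core_supply(n), total_power(n))
            for n in core_counts]


for n, clock, volts, watts in design_table():
    print(f"N={n:2d}  clock={clock:6.1f} MHz  VDD={volts:.2f} V  P={watts:.2f} W")
```

Among the core counts listed, N = 8 is the first to fall below the 5 W budget, and its supply stays above the 1.5 V minimum stated in the specification.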

Power Reduction in Processors
• Just about everything is used.
• Hardware methods:
  • Voltage reduction for dynamic power
  • Dual-threshold devices for leakage reduction
  • Clock gating, frequency reduction
  • Sleep mode
• Architecture:
  • Instruction set
  • Hardware organization
• Software methods

Parallel Architecture
[Figure: a single processor clocked at f, compared with two processors in parallel, each clocked at f/2, with input and output at rate f.]
Single processor: Capacitance = C, Voltage = V, Frequency = f, Power = $C V^2 f$
Two parallel processors: Capacitance = 2.2C, Voltage = 0.6V, Frequency = 0.5f, Power = $0.396\, C V^2 f$

Pipeline Architecture
[Figure: a single processor between registers clocked at f, compared with the processor split into two half-processor stages separated by a pipeline register, also clocked at f.]
Single processor: Capacitance = C, Voltage = V, Frequency = f, Power = $C V^2 f$
Two-stage pipeline: Capacitance = 1.2C, Voltage = 0.6V, Frequency = f, Power = $0.432\, C V^2 f$
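The power figures for the two-way parallel and two-stage pipeline designs follow from scaling each factor of CV²f. The short check below uses the scale factors given on the parallel and pipeline slides; the helper name `relative_power` is a hypothetical choice.

```python
def relative_power(cap_scale, volt_scale, freq_scale):
    """Power relative to C*V^2*f when C, V, and f are each scaled."""
    return cap_scale * volt_scale ** 2 * freq_scale


# Two-way parallel: 2.2C, 0.6V, 0.5f
parallel = relative_power(2.2, 0.6, 0.5)
# Two-stage pipeline: 1.2C, 0.6V, f
pipeline = relative_power(1.2, 0.6, 1.0)

print(f"parallel: {parallel:.3f} CV^2f")  # prints parallel: 0.396 CV^2f
print(f"pipeline: {pipeline:.3f} CV^2f")  # prints pipeline: 0.432 CV^2f
```

Both designs benefit equally from the lower voltage; they differ in that the parallel version pays in capacitance (and area) while halving the clock, whereas the pipeline adds only register capacitance and keeps the full clock rate.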

Approximate Trend
               n-parallel proc.    n-stage pipeline proc.
Capacitance    nC                  C
Voltage        V/n                 V/n
Frequency      f/n                 f
Power          $CV^2 f/n^2$        $CV^2 f/n^2$
Chip area      n times             10-20% increase
Reference: G. K. Yeap, Practical Low Power Digital VLSI Design, Boston: Springer, 1998.

Multicore Processors
[Figure: performance of multicore versus single-core processors, 2000 to 2008, based on SPECint2000 and SPECfp2000 benchmarks. Source: Computer, May 2005, p. 12.]

Multicore Processors
• D. Geer, “Chip Makers Turn to Multicore Processors,” Computer, vol. 38, no. 5, pp. 11-13, May 2005.
• A. Jerraya, H. Tenhunen and W. Wolf, “Multiprocessor Systems-on-Chips,” Computer, vol. 38, no. 7, pp. 36-40, July 2005; this special issue contains three more articles on multicore processors.
• S. K. Moore, “Winner Multimedia Monster – Cell’s Nine Processors Make It a Supercomputer on a Chip,” IEEE Spectrum, vol. 43, no. 1, pp. 20-23, January 2006.

Cell - Cell Broadband Engine Architecture
Nine-processor chip: 192 Gflops
[Photo, left to right: Atsushi Kameyama, Toshiba; James Kahle, IBM; Masakazu Suzuoki, Sony. © IEEE Spectrum, January 2006.]

Cell’s Nine-Processor Chip
[Figure: Cell chip layout. © IEEE Spectrum, January 2006.]
Eight identical processors, f = 5.6 GHz (max), 44.8 Gflops