Multi-Core Parallelism for Low-Power Design

Multi-Core Parallelism for Low-Power Design
Vishwani D. Agrawal, James J. Danaher Professor
Department of Electrical and Computer Engineering, Auburn University
http://www.eng.auburn.edu/~vagrawal
vagrawal@eng.auburn.edu
2/8/06 D&T Seminar

Power Consumption of VLSI Chips
Why is it a concern?

SIA Roadmap for Processors (1999)
Year: 1999, 2002, 2005, 2008, 2011, 2014
Feature size (nm): 180, 130, 100, 70, 50, 35
Logic transistors/cm²: 6.2M, 18M, 39M, 84M, 180M, 390M
Clock (GHz): 1.25, 2.1, 3.5, 6.0, 10.0, 16.9
Chip size (mm²): 340, 430, 520, 620, 750, 900
Power supply (V): 1.8, 1.5, 1.2, 0.9, 0.6, 0.5
High-perf. power (W): 90, 130, 160, 175, 183
Source: http://www.semichips.org

ISSCC, Feb. 2001, Keynote
Patrick P. Gelsinger, Senior Vice President, General Manager, Digital Enterprise Group, Intel Corp.
“Ten years from now, microprocessors will run at 10 GHz to 30 GHz and be capable of processing 1 trillion operations per second -- about the same number of calculations that the world's fastest supercomputer can perform now.
“Unfortunately, if nothing changes these chips will produce as much heat, for their proportional size, as a nuclear reactor...”

VLSI Chip Power Density (Source: Intel)
[Figure: power density (W/cm²), log scale from 1 to 10000, versus year, 1970 to 2010, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® and P6 processors, with reference levels marked for a hot plate, a nuclear reactor, a rocket nozzle and the Sun's surface.]

Power Dissipation in CMOS Logic (0.25µ)
$P_{total}(0 \to 1) = C_L V_{DD}^2 + t_{sc} V_{DD} I_{peak} + V_{DD} I_{leakage}$
The three terms contribute about 75%, 20% and 5% of the total, respectively.

Low-Power Datapath Architecture
• Lower supply voltage
  – This slows down circuit speed
  – Use parallel computing to gain the speed back
• Works well when threshold voltage is also lowered.
• About 60% reduction in power obtainable.
• Reference: A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Boston: Kluwer Academic Publishers (now Springer), 1995.

A Reference Datapath
[Figure: Input → Register → Combinational logic → Register → Output, with registers clocked by CK.]
Supply voltage = $V_{ref}$
Total capacitance switched per cycle = $C_{ref}$
Clock frequency = $f$
Power consumption: $P_{ref} = C_{ref} V_{ref}^2 f$

A Parallel Architecture
[Figure: the input feeds N copies of the combinational logic (Copy 1 to Copy N), each behind a register clocked at f/N. An N-to-1 multiplexer selects among the copies and drives an output register clocked at f. A multiphase clock generator and mux control block is driven by CK.]
N = degree of parallelism
Supply voltage: $V_N \le V_1 = V_{ref}$
A copy processes every Nth input and operates at reduced voltage.

Control Signals, N = 4
[Figure: timing diagram showing CK and the four derived clock phases, Phase 1 to Phase 4.]

Power
$P_N = P_{proc} + P_{overhead}$
$P_{proc} = N (C_{inreg} + C_{comb}) V_N^2 f/N + C_{outreg} V_N^2 f = (C_{inreg} + C_{comb} + C_{outreg}) V_N^2 f = C_{ref} V_N^2 f$
$P_{overhead} = C_{overhead} V_N^2 f \approx \delta C_{ref} (N - 1) V_N^2 f$
$P_N = [1 + \delta (N - 1)] C_{ref} V_N^2 f$
$\frac{P_N}{P_1} = [1 + \delta (N - 1)] \frac{V_N^2}{V_{ref}^2}$
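As a rough illustration of the power ratio on this slide, the sketch below evaluates P_N/P_1 for given N, δ, V_N and V_ref. The function name `power_ratio` and the overhead value δ = 0.1 are assumptions made for illustration only. The 5 V reference and the 2.9 V supply for N = 2 are the values quoted on the Voltage vs. Speed slide.

```python
def power_ratio(n, delta, v_n, v_ref):
    """Return P_N / P_1 = [1 + delta*(N - 1)] * (V_N / V_ref)^2.

    n      -- degree of parallelism N
    delta  -- overhead capacitance per extra copy, as a fraction of C_ref
              (hypothetical value; the slides do not give one here)
    v_n    -- reduced supply voltage of the N-copy design
    v_ref  -- supply voltage of the reference (N = 1) design
    """
    return (1 + delta * (n - 1)) * (v_n / v_ref) ** 2


# N = 2 at 2.9 V against a 5 V reference, with an assumed 10% overhead.
ratio = power_ratio(n=2, delta=0.1, v_n=2.9, v_ref=5.0)
print(f"P_2 / P_1 = {ratio:.3f}")
```

With no voltage reduction (V_N = V_ref) the ratio is 1 + δ(N - 1), which is greater than 1 for N > 1: parallelism only saves power when it is paired with a lower supply.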

Voltage vs. Speed
Delay of a gate, $T \approx \frac{C_L V_{ref}}{I} = \frac{C_L V_{ref}}{k (W/L) (V_{ref} - V_t)^2}$
where
• $I$ is saturation current
• $k$ is a technology parameter
• $W/L$ is width to length ratio of transistor
• $V_t$ is threshold voltage
[Figure: normalized gate delay T (0 to 4.0) versus supply voltage for 1.2μ CMOS. Marked points: N = 1 at V_ref = 5 V, N = 2 at V_2 = 2.9 V, N = 3 at V_3. The curve rises steeply as the supply approaches V_t.]
Voltage reduction slows down as we get closer to $V_t$.

Increasing Multiprocessing
[Figure: P_N/P_1 (0 to 1.0) versus N (1 to 12) for 1.2μ CMOS, V_ref = 5 V, with curves for V_t = 0.8 V, V_t = 0.4 V and V_t = 0 V (extreme case).]

Extreme Cases: Vt = 0
Delay, $T \propto 1/V_{ref}$
For N processing elements, the allowed delay is $NT$, so $V_N = V_{ref}/N$
$\frac{P_N}{P_1} = [1 + \delta (N - 1)] \frac{1}{N^2}$, which scales as $1/N$ for large N
For negligible overhead, $\delta \to 0$: $\frac{P_N}{P_1} \approx \frac{1}{N^2}$
For $V_t > 0$, power reduction is less and there will be an optimum value of N.
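The optimum N for V_t > 0 can be explored numerically. The sketch below assumes the gate-delay model from the Voltage vs. Speed slide, T ∝ V/(V - V_t)², and solves T(V_N) = N·T(V_ref) in closed form (a quadratic in V_N) before applying the power ratio. The function names, the overhead δ = 0.1 and the search limit of 32 copies are assumptions for illustration. The curves in the slides may come from measured delay data, so the numbers need not match them.

```python
import math


def scaled_voltage(n, v_ref, v_t):
    """Supply voltage V_N at which gate delay is n times the delay at v_ref.

    Assumes T proportional to V / (V - Vt)^2, so V_N solves
    V_N / (V_N - Vt)^2 = a with a = n * v_ref / (v_ref - v_t)^2.
    The larger root of the quadratic is the one above Vt.
    """
    a = n * v_ref / (v_ref - v_t) ** 2
    return (2 * a * v_t + 1 + math.sqrt(4 * a * v_t + 1)) / (2 * a)


def power_ratio(n, delta, v_ref, v_t):
    """P_N / P_1 = [1 + delta*(N - 1)] * (V_N / V_ref)^2."""
    v_n = scaled_voltage(n, v_ref, v_t)
    return (1 + delta * (n - 1)) * (v_n / v_ref) ** 2


def best_n(delta, v_ref, v_t, n_max=32):
    """Degree of parallelism in 1..n_max with the lowest power ratio."""
    return min(range(1, n_max + 1),
               key=lambda n: power_ratio(n, delta, v_ref, v_t))


# Threshold voltages taken from the Increasing Multiprocessing slide;
# delta = 0.1 is a hypothetical overhead.
for v_t in (0.8, 0.4, 0.0):
    n = best_n(delta=0.1, v_ref=5.0, v_t=v_t)
    print(f"Vt = {v_t} V: lowest power ratio in 1..32 at N = {n}, "
          f"P_N/P_1 = {power_ratio(n, 0.1, 5.0, v_t):.3f}")
```

For V_t = 0 the solver reduces to V_N = V_ref/N and the ratio keeps falling with N, so the search ends at its upper limit. For V_t > 0 the supply cannot drop below V_t, so the overhead term eventually dominates and a finite optimum appears.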

Example: Multiplier Core
• Specification:
  – 200 MHz clock
  – 15 W dissipation @ 5 V
  – Low voltage operation, $V_{DD} \ge 1.5$ volts
  – Relative clock rate = $\frac{(V_{DD} - 0.5)^2}{20.25}$
• Problem:
  – Integrate multiplier core on a SOC
  – Power budget for multiplier ~ 5 W

A Multicore Design
[Figure: the 200 MHz input feeds five multiplier cores (Core 1 to Core 5), each behind a register clocked at 40 MHz. A 5-to-1 mux drives an output register clocked at 200 MHz. A multiphase clock generator and mux control block is driven by the 200 MHz CK.]
Core clock frequency = 200/N MHz; N should divide 200.

How Many Cores?
• For N cores:
  – Clock frequency = 200/N MHz
  – Supply voltage, $V_{DDN} = 0.5 + (20.25/N)^{1/2}$ volts
  – Assuming 10% overhead per core, power dissipation = $15 [1 + 0.1 (N - 1)] \left(\frac{V_{DDN}}{5}\right)^2$ watts

Design Tradeoffs
Number of cores N   Clock (MHz)   Core supply VDDN (Volts)   Total power (Watts)
1                   200           5.00                       15.0
2                   100           3.68                       8.94
4                   50            2.75                       5.90
5                   40            2.51                       5.29
8                   25            2.10                       4.50
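The tradeoff table follows from the multiplier example's formulas, and a short script can regenerate it. This is a sketch under the stated specification (200 MHz, 15 W at 5 V, 0.5 V offset in the clock-rate model, 10% overhead per core). The constant and function names are chosen here for illustration. The slide's power figures appear to be computed from voltages rounded to two decimals, so the last digit can differ slightly.

```python
import math

F_REF_MHZ = 200.0   # clock of the single-core design
P_REF_W = 15.0      # power of the single-core design at V_REF
V_REF = 5.0         # reference supply voltage (volts)
V_OFFSET = 0.5      # offset in the relative clock rate model (volts)
OVERHEAD = 0.1      # 10% overhead per additional core


def core_voltage(n):
    """Supply voltage at which a core runs at 1/n of the reference clock.

    Inverts: relative clock rate = (VDD - 0.5)^2 / 20.25 = 1/n,
    where 20.25 = (V_REF - V_OFFSET)^2.
    """
    return V_OFFSET + math.sqrt((V_REF - V_OFFSET) ** 2 / n)


def total_power(n):
    """Total power in watts for n cores, including per-core overhead."""
    return P_REF_W * (1 + OVERHEAD * (n - 1)) * (core_voltage(n) / V_REF) ** 2


print("N  clock(MHz)  VDDN(V)  power(W)")
for n in (1, 2, 4, 5, 8):
    print(f"{n:<2} {F_REF_MHZ / n:<11.0f} {core_voltage(n):<8.2f} "
          f"{total_power(n):.2f}")
```

All five voltages stay above the 1.5 V lower limit in the specification, so the power budget rather than the voltage floor decides the core count in this range.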

Power Reduction in Processors
• Just about everything is used.
• Hardware methods:
  – Voltage reduction for dynamic power
  – Dual-threshold devices for leakage reduction
  – Clock gating, frequency reduction
  – Sleep mode
• Architecture:
  – Instruction set
  – Hardware organization
• Software methods

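Parallel Architecture
[Figure: a single processor with input and output at f, compared with two processors each running at f/2 while input and output remain at f.]
Single processor: Capacitance = $C$, Voltage = $V$, Frequency = $f$, Power = $C V^2 f$
Two parallel processors: Capacitance = $2.2C$, Voltage = $0.6V$, Frequency = $0.5f$, Power = $0.396\, C V^2 f$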

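Pipeline Architecture
[Figure: a single processor between input and output registers, compared with a two-stage pipeline of two half-processors (½ Proc.) separated by a register, both clocked at f.]
Single processor: Capacitance = $C$, Voltage = $V$, Frequency = $f$, Power = $C V^2 f$
Two-stage pipeline: Capacitance = $1.2C$, Voltage = $0.6V$, Frequency = $f$, Power = $0.432\, C V^2 f$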
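The power figures for the parallel and pipelined variants are the product of the scaled capacitance, the squared scaled voltage and the scaled frequency. The sketch below checks that arithmetic with all quantities normalized to the single-processor values C, V and f. The helper name `normalized_power` is an assumption for illustration.

```python
def normalized_power(cap_scale, volt_scale, freq_scale):
    """Dynamic power relative to C*V^2*f for scaled C, V and f."""
    return cap_scale * volt_scale ** 2 * freq_scale


# Two parallel processors: 2.2C, 0.6V, 0.5f
parallel = normalized_power(2.2, 0.6, 0.5)
# Two-stage pipeline: 1.2C, 0.6V, f
pipeline = normalized_power(1.2, 0.6, 1.0)

print(f"parallel: {parallel:.3f} CV^2f")
print(f"pipeline: {pipeline:.3f} CV^2f")
```

Both variants rely on the same voltage reduction to 0.6V. The parallel design pays for it with more than twice the capacitance and recovers it through the halved clock, while the pipeline keeps the clock at f and adds only the register overhead.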

Approximate Trend
              n-parallel proc.   n-stage pipeline proc.
Capacitance   nC                 C
Voltage       V/n                V/n
Frequency     f/n                f
Power         CV²f/n²            CV²f/n²
Chip area     n times            10-20% increase
G. K. Yeap, Practical Low Power Digital VLSI Design, Boston: Kluwer Academic Publishers, 1998.

Multicore Processors
[Figure: performance of single-core and multicore processors, 2000 to 2008, based on SPECint2000 and SPECfp2000 benchmarks. Source: Computer, May 2005, p. 12.]

Multicore Processors
• D. Geer, “Chip Makers Turn to Multicore Processors,” Computer, vol. 38, no. 5, pp. 11-13, May 2005.
• A. Jerraya, H. Tenhunen and W. Wolf, “Multiprocessor Systems-on-Chips,” Computer, vol. 38, no. 7, pp. 36-40, July 2005; this special issue contains three more articles on multicore processors.
• S. K. Moore, “Winner: Multimedia Monster – Cell’s Nine Processors Make It a Supercomputer on a Chip,” IEEE Spectrum, vol. 43, no. 1, pp. 20-23, January 2006.

Cell - Cell Broadband Engine Architecture
Nine-processor chip: 192 Gflops
[Photo, left to right: Atsushi Kameyama, Toshiba; James Kahle, IBM; Masakazu Suzoki, Sony. © IEEE Spectrum, January 2006]

Cell’s Nine-Processor Chip
[Figure: chip photo. © IEEE Spectrum, January 2006]
Eight identical processors
f = 5.6 GHz (max)
44.8 Gflops

?

Amdahl’s Law
[Figure: execution time normalized from 0 to 1, split into a serial fraction S and a parallel fraction P = 1 – S.]
Speedup = $\frac{1}{S + (1 - S)/N}$, where N = number of parallel processors
Example:
• S = 0.6, N = 10, Speedup = 1.56
• S = 0.6, N = ∞, Speedup = 1.67
Gene Amdahl, “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities,” AFIPS Conference Proceedings, (30), pp. 483-485, 1967.
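The two example speedups can be reproduced directly from the formula. The sketch below is a minimal illustration; the function name `amdahl_speedup` is chosen here, and `math.inf` stands in for N = ∞.

```python
import math


def amdahl_speedup(serial_fraction, n_processors):
    """Speedup = 1 / (S + (1 - S) / N) for serial fraction S and N processors."""
    s = serial_fraction
    return 1.0 / (s + (1.0 - s) / n_processors)


print(f"S = 0.6, N = 10:  {amdahl_speedup(0.6, 10):.2f}")
print(f"S = 0.6, N = inf: {amdahl_speedup(0.6, math.inf):.2f}")
```

With S = 0.6, going from 10 processors to infinitely many adds only about 0.1 to the speedup, because the serial fraction sets the ceiling at 1/S.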

Question
• Can we find a multi-processing law
  – for power reduction, or
  – for performance per watt?