Multi-Core Parallelism for Low Power Design
- Slides: 29

Multi-Core Parallelism for Low Power Design
Vishwani D. Agrawal
James J. Danaher Professor
Department of Electrical and Computer Engineering
Auburn University
http://www.eng.auburn.edu/~vagrawal
vagrawal@eng.auburn.edu
D&T Seminar, 2/8/06

Power Consumption of VLSI Chips: Why is it a concern?

SIA Roadmap for Processors (1999)

Year                       1999   2002   2005   2008   2011   2014
Feature size (nm)           180    130    100     70     50     35
Logic transistors/cm²     6.2 M   18 M   39 M   84 M  180 M  390 M
Clock (GHz)                1.25    2.1    3.5    6.0   10.0   16.9
Chip size (mm²)             340    430    520    620    750    900
Power supply (V)            1.8    1.5    1.2    0.9    0.6    0.5

High-perf. power (W): 90, 130, 160, 175, 183 (only five values appear in the source, so they are not aligned to years here)

Source: http://www.semichips.org

ISSCC, Feb. 2001, Keynote
Patrick P. Gelsinger, Senior Vice President, General Manager, Digital Enterprise Group, Intel Corp.

“Ten years from now, microprocessors will run at 10 GHz to 30 GHz and be capable of processing 1 trillion operations per second -- about the same number of calculations that the world's fastest supercomputer can perform now.

“Unfortunately, if nothing changes these chips will produce as much heat, for their proportional size, as a nuclear reactor.”

VLSI Chip Power Density

[Figure: power density (W/cm²), log scale from 1 to 10000, versus year, 1970 to 2010. Processors plotted: 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium®, P6. Reference levels: hot plate, nuclear reactor, rocket nozzle, Sun’s surface. Source: Intel.]

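Power Dissipation in CMOS Logic (0.25µ)

[Figure: CMOS gate with supply $V_{DD}$ and load capacitance $C_L$.]

$P_{total}(0 \rightarrow 1) = C_L V_{DD}^2 + t_{sc} V_{DD} I_{peak} + V_{DD} I_{leakage}$

Share of total power, term by term: $C_L V_{DD}^2$ 75%, $t_{sc} V_{DD} I_{peak}$ 20%, $V_{DD} I_{leakage}$ 5%.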

Low-Power Datapath Architecture

• Lower supply voltage
  – This slows down circuit speed
  – Use parallel computing to gain the speed back
• Works well when threshold voltage is also lowered.
• About 60% reduction in power obtainable.
• Reference: A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Boston: Kluwer Academic Publishers (now Springer), 1995.

A Reference Datapath

[Figure: input, register, combinational logic, register, output; the registers are clocked by CK.]

Supply voltage = $V_{ref}$
Total capacitance switched per cycle = $C_{ref}$
Clock frequency = $f$
Power consumption: $P_{ref} = C_{ref} V_{ref}^2 f$

A Parallel Architecture

[Figure: the input feeds N copies of the combinational logic (Copy 1 to Copy N), each with its own input register clocked at f/N. An N to 1 multiplexer selects among the copy outputs and drives the output register, clocked at f. A multiphase clock generator and mux control block is driven by CK.]

N = degree of parallelism
Each copy processes every Nth input and operates at reduced voltage.
Supply voltage: $V_N \le V_1 = V_{ref}$

Control Signals, N = 4

[Figure: timing diagram of CK and the four derived clock phases, Phase 1 to Phase 4.]

Power

$P_N = P_{proc} + P_{overhead}$

$P_{proc} = N (C_{inreg} + C_{comb}) V_N^2 \frac{f}{N} + C_{outreg} V_N^2 f = (C_{inreg} + C_{comb} + C_{outreg}) V_N^2 f = C_{ref} V_N^2 f$

$P_{overhead} = C_{overhead} V_N^2 f \approx \delta C_{ref} (N - 1) V_N^2 f$

$P_N = [1 + \delta(N - 1)] C_{ref} V_N^2 f$

$\frac{P_N}{P_1} = [1 + \delta(N - 1)] \frac{V_N^2}{V_{ref}^2}$

Voltage vs. Speed

Delay of a gate:

$T \approx \frac{C_L V_{ref}}{I} = \frac{C_L V_{ref}}{k (W/L) (V_{ref} - V_t)^2}$

where
$I$ is the saturation current,
$k$ is a technology parameter,
$W/L$ is the width-to-length ratio of the transistor,
$V_t$ is the threshold voltage.

[Figure: normalized gate delay T (0.0 to 4.0) versus supply voltage for 1.2μ CMOS. Marked points: N = 1 at $V_{ref}$ = 5 V (delay 1.0), N = 2 at $V_2$ = 2.9 V (delay 2.0), N = 3 at $V_3$ (delay 3.0). $V_t$ is marked on the voltage axis.]

Voltage reduction slows down as we get closer to $V_t$.
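The delay expression above can be inverted numerically to find the supply voltage at which a gate is N times slower than at $V_{ref}$. The following is a minimal sketch, not the method used to draw the plotted curve: the function names, the normalization k = C_L = 1, and the use of bisection are all assumptions made for illustration, and the voltages this simplified model yields need not match the values read off the figure.

```python
def gate_delay(v, vt, k=1.0, c_load=1.0):
    """Delay model from the slide: T ~ C_L * V / (k * (V - Vt)^2).

    W/L is folded into k; k and c_load are normalized to 1 (assumption),
    since only delay ratios matter here.
    """
    return c_load * v / (k * (v - vt) ** 2)


def supply_for_slowdown(n, v_ref=5.0, vt=0.8, tol=1e-9):
    """Return V_N in (vt, v_ref] whose delay is n times the delay at v_ref.

    The delay falls monotonically as the voltage rises above vt,
    so plain bisection is sufficient.
    """
    target = n * gate_delay(v_ref, vt)
    lo, hi = vt + 1e-9, v_ref
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gate_delay(mid, vt) > target:
            lo = mid  # still too slow: the answer lies at a higher voltage
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, round(supply_for_slowdown(n), 3))
```

With `vt=0.0` the function returns `v_ref / n`. For a nonzero threshold, each further increase in N buys a smaller voltage reduction, which is the effect the slide describes.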

Increasing Multiprocessing

[Figure: $P_N/P_1$ (0.0 to 1.0) versus N (1 to 12) for 1.2μ CMOS, $V_{ref}$ = 5 V, with curves for $V_t$ = 0.8 V, $V_t$ = 0.4 V, and $V_t$ = 0 V (extreme case).]

Extreme Cases: $V_t = 0$

Delay, $T \propto 1/V_{ref}$
For N processing elements, delay = NT → $V_N = V_{ref}/N$

$\frac{P_N}{P_1} = [1 + \delta(N - 1)] \frac{1}{N^2} \rightarrow \frac{1}{N}$

For negligible overhead, $\delta \rightarrow 0$:

$\frac{P_N}{P_1} \approx \frac{1}{N^2}$

For $V_t > 0$, power reduction is less and there will be an optimum value of N.
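The $V_t = 0$ result can be traced step by step from the gate delay model and the power ratio given on the earlier slides. The intermediate steps below are filled in for illustration and are not on the slide itself.

```latex
\begin{align*}
T(V) \approx \frac{C_L V}{k\,(W/L)\,(V - V_t)^2}
  \quad &\xrightarrow{\;V_t = 0\;} \quad T(V) \propto \frac{1}{V} \\
T(V_N) = N\,T(V_{ref})
  \quad &\Rightarrow \quad \frac{1}{V_N} = \frac{N}{V_{ref}}
  \quad \Rightarrow \quad V_N = \frac{V_{ref}}{N} \\
\frac{P_N}{P_1} = [1 + \delta(N - 1)]\,\frac{V_N^2}{V_{ref}^2}
  \quad &\Rightarrow \quad \frac{P_N}{P_1} = \frac{1 + \delta(N - 1)}{N^2}
\end{align*}
```

For example, with $\delta = 0.1$ and $N = 4$ the ratio is $1.3/16 \approx 0.081$.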

Example: Multiplier Core

• Specification:
  • 200 MHz clock
  • 15 W dissipation @ 5 V
  • Low voltage operation, $V_{DD} \ge 1.5$ volts
  • Relative clock rate = $\frac{(V_{DD} - 0.5)^2}{20.25}$
• Problem:
  • Integrate multiplier core on an SoC
  • Power budget for multiplier ~ 5 W

A Multicore Design

[Figure: the input feeds Multiplier Core 1 to Multiplier Core 5, each with its own register clocked at 40 MHz. A 5 to 1 mux drives the output register, clocked at 200 MHz. A multiphase clock generator and mux control block is driven by the 200 MHz clock CK.]

Core clock frequency = 200/N MHz; N should divide 200.

How Many Cores?

• For N cores:
  • Clock frequency = 200/N MHz
  • Supply voltage, $V_{DDN} = 0.5 + (20.25/N)^{1/2}$ volts
  • Assuming 10% overhead per core, power dissipation = $15\,[1 + 0.1(N - 1)] \left(\frac{V_{DDN}}{5}\right)^2$ watts

Design Tradeoffs

Number of cores N   Clock (MHz)   Core supply V_DDN (volts)   Total power (watts)
        1               200                 5.00                     15.0
        2               100                 3.68                      8.94
        4                50                 2.75                      5.90
        5                40                 2.51                      5.29
        8                25                 2.10                      4.50
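The tradeoff figures follow directly from the multiplier specification: a 200 MHz, 15 W, 5 V core whose relative clock rate is $(V_{DD} - 0.5)^2/20.25$, with 10% overhead per added core. The sketch below evaluates those formulas; the constant and function names are chosen here for illustration. Its results agree with the tabulated values to within a few hundredths, presumably because the table rounds the voltage before computing power.

```python
import math

F_REF_MHZ = 200.0  # single-core clock from the specification
P_REF_W = 15.0     # single-core power at 5 V
V_REF = 5.0        # reference supply voltage
V_OFFSET = 0.5     # voltage offset in the relative clock rate model
OVERHEAD = 0.1     # 10% overhead per added core


def core_voltage(n):
    """Supply voltage at which one core runs at 1/n of the reference clock."""
    return V_OFFSET + math.sqrt((V_REF - V_OFFSET) ** 2 / n)


def total_power(n):
    """Total power in watts for n cores, including the overhead term."""
    v = core_voltage(n)
    return P_REF_W * (1 + OVERHEAD * (n - 1)) * (v / V_REF) ** 2


def design_table(core_counts=(1, 2, 4, 5, 8)):
    """Rows of (N, clock in MHz, core supply in volts, total power in watts)."""
    return [(n, F_REF_MHZ / n, core_voltage(n), total_power(n))
            for n in core_counts]


if __name__ == "__main__":
    for n, clk, v, p in design_table():
        print(f"{n:2d}  {clk:6.1f} MHz  {v:5.2f} V  {p:6.2f} W")
```

Among the core counts listed, N = 8 is the first whose total power falls below the 5 W budget while keeping the supply above the 1.5 V minimum.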

Power Reduction in Processors

• Just about everything is used.
• Hardware methods:
  • Voltage reduction for dynamic power
  • Dual-threshold devices for leakage reduction
  • Clock gating, frequency reduction
  • Sleep mode
• Architecture:
  • Instruction set
  • Hardware organization
• Software methods

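Parallel Architecture

[Figure: a single processor with input and output at rate f, compared with two processors in parallel, each running at f/2, with input and output at rate f.]

Single processor: Capacitance = C, Voltage = V, Frequency = f, Power = $CV^2 f$
Two processors in parallel: Capacitance = 2.2C, Voltage = 0.6V, Frequency = 0.5f, Power = $0.396\,CV^2 f$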

Pipeline Architecture

[Figure: a single processor between input and output registers, compared with a two-stage pipeline of two half processors (½ Proc.) separated by a register; both are clocked at f.]

Single processor: Capacitance = C, Voltage = V, Frequency = f, Power = $CV^2 f$
Two-stage pipeline: Capacitance = 1.2C, Voltage = 0.6V, Frequency = f, Power = $0.432\,CV^2 f$
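The power figures quoted for the parallel and pipelined processors are the product of the capacitance, squared voltage, and frequency scale factors. A small sketch of that arithmetic follows; the helper name `relative_power` is introduced here for illustration.

```python
def relative_power(cap_factor, volt_factor, freq_factor):
    """Dynamic power relative to the reference C * V^2 * f."""
    return cap_factor * volt_factor ** 2 * freq_factor


# Two processors in parallel: 2.2C, 0.6V, each running at 0.5f
parallel = relative_power(2.2, 0.6, 0.5)

# Two-stage pipeline: 1.2C, 0.6V, running at the full rate f
pipeline = relative_power(1.2, 0.6, 1.0)

if __name__ == "__main__":
    print(f"parallel: {parallel:.3f} CV^2f")
    print(f"pipeline: {pipeline:.3f} CV^2f")
```

Both options rely on the same voltage reduction to 0.6V. The parallel version pays for it with more switched capacitance at half the clock rate, the pipelined version with a smaller capacitance increase at the full clock rate.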

Approximate Trend

                n-parallel proc.    n-stage pipeline proc.
Capacitance     nC                  C
Voltage         V/n                 V/n
Frequency       f/n                 f
Power           CV²f/n²             CV²f/n²
Chip area       n times             10-20% increase

G. K. Yeap, Practical Low Power Digital VLSI Design, Boston: Kluwer Academic Publishers, 1998.

Multicore Processors

[Figure: performance, based on SPECint2000 and SPECfp2000 benchmarks, versus year (2000 to 2008) for single-core and multicore processors. Source: Computer, May 2005, p. 12.]

Multicore Processors

• D. Geer, “Chip Makers Turn to Multicore Processors,” Computer, vol. 38, no. 5, pp. 11-13, May 2005.
• A. Jerraya, H. Tenhunen and W. Wolf, “Multiprocessor Systems-on-Chips,” Computer, vol. 38, no. 7, pp. 36-40, July 2005; this special issue contains three more articles on multicore processors.
• S. K. Moore, “Winner Multimedia Monster – Cell’s Nine Processors Make It a Supercomputer on a Chip,” IEEE Spectrum, vol. 43, no. 1, pp. 20-23, January 2006.

Cell - Cell Broadband Engine Architecture

Nine-processor chip: 192 Gflops

[Photo, © IEEE Spectrum, January 2006. L to R: Atsushi Kameyama, Toshiba; James Kahle, IBM; Masakazu Suzoki, Sony.]

Cell’s Nine-Processor Chip

[Figure: chip photo, © IEEE Spectrum, January 2006.]

Eight identical processors, f = 5.6 GHz (max), 44.8 Gflops

?

Amdahl’s Law

[Figure: execution time normalized to the interval 0 to 1, split into a serial part S and a parallel part P = 1 - S.]

$\text{Speedup} = \frac{1}{S + (1 - S)/N}$

where N = number of parallel processors.

Example:
S = 0.6, N = 10, Speedup = 1.56
S = 0.6, N = ∞, Speedup = 1.67

Gene Amdahl, “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities,” AFIPS Conference Proceedings, (30), pp. 483-485, 1967.
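The two example speedups can be checked with a few lines of code. This is a minimal sketch; the function name and the handling of the fully parallel limit are choices made here, not part of the slide.

```python
import math


def amdahl_speedup(serial_fraction, n):
    """Speedup = 1 / (S + (1 - S) / N) for serial fraction S and N processors.

    n may be math.inf to obtain the limiting speedup 1 / S.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    denominator = serial_fraction + (1.0 - serial_fraction) / n
    if denominator == 0.0:
        return math.inf  # S = 0 with unlimited processors
    return 1.0 / denominator


if __name__ == "__main__":
    print(round(amdahl_speedup(0.6, 10), 2))
    print(round(amdahl_speedup(0.6, math.inf), 2))
```

With S = 0.6, even an unlimited number of processors cannot push the speedup past 1/S, so the serial fraction bounds the benefit of adding cores.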

Question

• Can we find a multi-processing law
  – for power reduction, or
  – for performance per watt?