ECE 552 / CPS 550 Advanced Computer Architecture I
Lecture 1: Metrics and Early Machines
Benjamin Lee, Electrical and Computer Engineering, Duke University
www.duke.edu/~bcl15/class_ece552fall16.html

Computing Devices (Then)
- Mark I, Harvard University, 1944
- EDSAC, University of Cambridge, 1949

Computing Devices (Now)
- iPad, Apple/ARM, 2010
- Blue Gene/P, IBM, 2007

Computer Architecture
- The gap between application and physics is too large to bridge in one step.
- Computer architecture is the design of abstraction layers, which allow efficient implementations of computational applications on available technologies.

Abstraction Layers
- Application
- Algorithm
- Programming Language
- Operating System/Virtual Machines
- Instruction Set Architecture (ISA)
- Microarchitecture
- Gates/Register-Transfer Level (RTL)
- Circuits
- Devices
- Physics
Domain of early computer architecture: '50s-'80s
Domain of recent computer architecture: since the '90s

An Integrated Approach
Architect Systems
- Coordinate system across hardware-software interface (technology, hardware, run-time software, compilers, apps)
- Responsible for end-to-end functionality
Design and Analyze
- Search design space of computer systems
- Evaluate designs with quantitative metrics (performance, power, cost)
Navigate Computing Landscape
- Technologies are emerging
- Applications are demanding
- Systems are scaling

ECE 552 Executive Summary
- In-order Datapath (understand, ECE 250) (built, ECE 350)
- Chip Multiprocessors (understand, experiment, ECE 552)

ECE 552 Administrivia
Instructor: Prof. Benjamin Lee, benjamin.c.lee@duke.edu
- Office Hours: TuTh 4-5 pm, Hudson 210
Teaching Assistants
- Ramin Bashizade, ramin.bashizade@duke.edu. Office Hours: WF 3:30-4:30 pm, LSRC D301
- Tamara Lehman, tamara.silbergleit@duke.edu. Office Hours: TuTh 11:30-12:30 pm, TBD
Lectures: Tu/Th 10:05-11:20 AM, Teer 203
Text: Computer Architecture: A Quantitative Approach, 5th Edition (2012). Do not use earlier editions.
Web: http://www.duke.edu/~BCL15/class_ece552fall16.html

ECE 552 Prerequisites
Participation
- Electrical and Computer Engineering, Computer Science
- Ph.D., MS, Undergraduates
Prerequisites
- Introduction to computer architecture (CPS 104, ECE 152, or equiv.)
- Programming (homework/projects in C, C++)
Background Knowledge
- Instruction sets, computer arithmetic, assembly programming
- D. A. Patterson and J. L. Hennessy. Computer Organization and Design: The Hardware/Software Interface, 5th Edition.

ECE 552 Lectures
1. Design Metrics
   a) Performance
   b) Power
   c) Early machines
2. Simple Pipelining
   a) Multi-cycle machines
   b) Branch prediction
   c) In-order superscalar
   d) Optimizations
3. Complex Pipelining
   a) Score-boarding, Tomasulo algorithm
   b) Out-of-order superscalar
4. Explicitly Parallel Architectures
   a) VLIW
   b) Vector machines
   c) Multi-threading
5. Memory Systems
   a) Caches
   b) DRAM
   c) Virtual memory
6. Multiprocessors
   a) Memory models
   b) Coherence protocols
Midterm Exam, Fall Break

ECE 552 Readings
1. Technology
   a) Moore's Law
   b) Technology scaling
2. History
   a) Classic machines
   b) The 801 minicomputer
3. Pipelining
   a) Power as a design constraint
   b) Optimizing pipeline depth
4. Microarchitecture
   a) Branch prediction
   b) Complexity and superscalar design
5. Parallelism I
   a) Data flow processors
   b) Simultaneous multi-threading
6. Memory
   a) Victim cache
   b) Phase change memory
7. Parallelism II
   a) Consistency
   b) Coherence

ECE 552 Components
30% Homework and Readings
- Homework done in teams of 3
- 5 classes dedicated to paper discussions
20% Midterm exam
- 75 minutes (in class), closed book
20% Final exam
- 3 hours, closed book
30% Term project/paper
- Project done in teams of 3
Academic Policy
- University policy as codified by the Duke Undergraduate Honor Code will be strictly enforced. Zero tolerance for cheating and/or plagiarism.

ECE 552 Academic Policy
University policy as codified by the Duke Undergraduate Honor Code will be strictly enforced. Zero tolerance for cheating and/or plagiarism.
If a student is suspected of academic dishonesty (e.g., cheating on an exam, copying a lab report, collaborating inappropriately on an assignment), faculty are required to report the matter to the Office of Student Conduct.
A student found responsible for academic dishonesty faces formal disciplinary action, which may include suspension. A student suspended twice for academic dishonesty automatically faces a minimum 5-year separation from Duke University.

ECE 552 Term Project
Scope
- Semester-long research project
- Teams of 3
- Students propose project ideas (Oct 14)
Final Paper
- 6-12 page research paper
- Evaluate research idea quantitatively
- Survey and cite related work

ECE 552 Upcoming Deadlines
1 September: Reading #1 Due
- Readings are available on Sakai. Submit reading responses on Sakai.
  1. Moore. "Cramming more components onto integrated circuits"
  2. Horowitz et al. "Scaling, power, and the future of CMOS"
15 September: Homework #1 Due
- Homework will be available on Sakai. Submit homework on Sakai in teams of two.

Performance

Latency versus Throughput
Definitions
- Latency: time to finish a given task (a.k.a. execution time)
- Throughput: number of tasks in a given time (a.k.a. bandwidth)
- Throughput exploits parallelism. Latency cannot.
Example: Move people from Duke to UNC, 10 miles
- Car: capacity = 5, speed = 60 miles/hour
  - Latency = (10 miles @ 60 miles/hour) = 10 minutes
  - Throughput = (3 round trips per hour @ 60 miles/hour) x 5 people = 15 people/hour
- Bus: capacity = 60, speed = 20 miles/hour
  - Latency = (10 miles @ 20 miles/hour) = 30 minutes
  - Throughput = (1 round trip per hour @ 20 miles/hour) x 60 people = 60 people/hour
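
The car and bus numbers above can be checked with a short sketch. It assumes, as the "3 trips" figure suggests, that a vehicle must drive back empty before carrying the next group; the function names are illustrative and not part of the course material.

```python
def one_way_latency_minutes(distance_miles, speed_mph):
    """Time for one vehicle to finish one trip (the latency of the task)."""
    return distance_miles * 60 / speed_mph


def throughput_people_per_hour(capacity, distance_miles, speed_mph):
    """People delivered per hour, assuming the vehicle returns empty (round trips)."""
    round_trip_miles = 2 * distance_miles
    return capacity * speed_mph / round_trip_miles


car_latency = one_way_latency_minutes(10, 60)            # minutes
bus_latency = one_way_latency_minutes(10, 20)            # minutes
car_throughput = throughput_people_per_hour(5, 10, 60)   # people/hour
bus_throughput = throughput_people_per_hour(60, 10, 20)  # people/hour

print(car_latency, car_throughput)  # 10.0 15.0
print(bus_latency, bus_throughput)  # 30.0 60.0
```

The car wins on latency and the bus wins on throughput, which is the point of the example.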

Aggregating Performance
Addition
- Latency is additive. Throughput is not.
- Example: Consider applications A1 and A2 on processor P
  - Latency(A1, A2) = Latency(A1) + Latency(A2)
  - Throughput(A1, A2) = 1 / [1/Throughput(A1) + 1/Throughput(A2)]
Averages
- Arithmetic Mean: (1/N) x Σ(P=1..N) Latency(P)
  - For measures that are proportional to time (e.g., latency)
- Harmonic Mean: N / Σ(P=1..N) [1/Throughput(P)]
  - For measures that are inversely proportional to time (e.g., throughput)
- Geometric Mean: [Π(P=1..N) Speedup(P)]^(1/N)
  - For ratios (e.g., speed-ups)
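
A minimal sketch of the aggregation rules and the three means follows. The function names and the sample numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def combined_throughput(t1, t2):
    """Throughput of running A1 then A2: latencies add, so rates combine reciprocally."""
    return 1.0 / (1.0 / t1 + 1.0 / t2)


def arithmetic_mean(latencies):
    """For measures proportional to time."""
    return sum(latencies) / len(latencies)


def harmonic_mean(throughputs):
    """For measures inversely proportional to time."""
    return len(throughputs) / sum(1.0 / t for t in throughputs)


def geometric_mean(speedups):
    """For ratios such as speed-ups."""
    return math.prod(speedups) ** (1.0 / len(speedups))


# Hypothetical measurements for two programs.
mean_latency = arithmetic_mean([2.0, 4.0])       # seconds
mean_throughput = harmonic_mean([30.0, 60.0])    # tasks/hour
mean_speedup = geometric_mean([2.0, 8.0])        # ratio
both_throughput = combined_throughput(30.0, 60.0)
```

With these inputs the harmonic mean is about 40 tasks/hour, noticeably below the arithmetic mean of 45, because the slower program dominates total time.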

Processor Performance (vs. VAX-11/780)
[Figure: SPECint benchmark performance relative to the VAX-11/780, 1978-2006, on a log scale from 1 to 10000. Annotated growth rates: 25%/year, then 52%/year, then ??%/year.]
Source: Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th Edition, 2006.

Performance Factors
Latency = (Seconds / Cycle) x (Cycles / Instruction) x (Instructions / Program)
Seconds / Cycle
- Technology and architecture
- Transistor scaling
- Processor microarchitecture
Cycles / Instruction (CPI)
- Architecture and systems
- Processor microarchitecture
- System balance (processor, memory, network, storage)
Instructions / Program
- Algorithm and applications
- Compiler transformations, optimizations
- Instruction set architecture
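
The latency equation above is a product of three factors, which a few lines of code make concrete. The instruction count, CPI, and clock rate below are hypothetical values, not measurements from the lecture.

```python
def latency_seconds(instructions_per_program, cycles_per_instruction, clock_hz):
    """Latency = (Seconds / Cycle) x (Cycles / Instruction) x (Instructions / Program)."""
    seconds_per_cycle = 1.0 / clock_hz
    return seconds_per_cycle * cycles_per_instruction * instructions_per_program


# Hypothetical program: one billion dynamic instructions, CPI of 2, 500 MHz clock.
baseline = latency_seconds(1e9, 2.0, 500e6)

# Halving any single factor halves latency; here the compiler removes half the instructions.
fewer_instructions = latency_seconds(0.5e9, 2.0, 500e6)
```

Because the factors multiply, an improvement in one layer (technology, microarchitecture, or compiler) can be cancelled by a regression in another.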

Moore's Law
- Moore. "Cramming more components onto integrated circuits." Electronics, Vol. 38, No. 8, 1965.
- As integration increases, packaging cost decreases
- How does Moore's Law impact performance?

Field-Effect Transistors
- MOS: metal-oxide semiconductor
- FET: field-effect transistor
- Charge carriers flow between source and drain, controlled by gate voltage
- Abstract MOSFET as electrical switch
[Figure: MOSFET structure showing source, gate, drain, channel, and bulk, with channel width and length labeled]

Complementary MOS (CMOS)
- Map voltages to logical values (Vdd = 1, Gnd = 0)
- Implement complementary Boolean logic
  - nFET: conducts charge when Vg = Vdd, used in pull-down network
  - pFET: conducts charge when Vg = Gnd, used in pull-up network
- Examples: Inverter, NAND
[Figure: CMOS inverter (input A, output !A) and two-input NAND (inputs A and B, output !(AB)), each built from pFETs connected to Vdd and nFETs connected to Gnd]

Transistor Dimensions
- Process defined by feature size (F), layout design (λ = F/2)
- Example: F = 2λ = 45 nm process technology
- Transistor dimensions determine technology performance
- Transistor drive strength (i.e., speed) increases as channel length shrinks
[Figure: transistor layout with source, gate, drain, and bulk; minimum length = 2λ, width = 4λ]

Dennard Scaling
- Dennard et al. "Design of ion-implanted MOSFETs with very small physical dimensions," IEEE Journal of Solid-State Circuits, 1974.
- Scale not only dimensions but also doping concentration and voltage
- Transistors become faster (1.4x)
- Applied to Moore's Law: k = 1.4, 1/k = 0.7 every 18-24 months
[Figure: transistor with gate, drain, source, and bulk; width and length labeled]

Dennard Scaling Limits
- Horowitz et al. "Scaling, power, and the future of CMOS." IEDM, 2005.
- Classical Dennard scaling ended at 130 nm in 2000-2001.
  - Oxide Thickness: How to manage increasing leakage? Use high-K dielectrics
  - Channel Length: How to manage increasing leakage? Stop scaling L
  - Doping Concentration: How to handle imprecise doping? Manage variability
  - Voltage: How to manage increasing leakage? Stop scaling V
  - Current: How to increase current with shrinking channels? Stress silicon
- Example: Intel 22 nm process technology with FinFET
[Image: Courtesy Intel Corp.]

Cycles per Instruction (CPI)
Average Instruction Latency
- Different instructions require different numbers of cycles
- Examine instruction frequency
- CPI is slightly easier to calculate than IPC (time versus rate)
Example
- Instruction frequency: 1/3 INT, 1/3 FP, 1/3 MEM operations
- Instruction cycles: 1 cy INT, 3 cy FP, 2 cy MEM
- CPI = (1/3 x 1) + (1/3 x 3) + (1/3 x 2) = 2
Caveat
- CPI provides high-level, quick estimates of performance
- Does not account for details (e.g., instruction dependences)

CPI and Design
Baseline Processor + Application
- Integer ALU: 50%, 1 cycle
- Load: 20%, 5 cycles
- Store: 10%, 1 cycle
- Branch: 20%, 2 cycles
Possible Enhancements
- Option 1: Branch prediction for 1-cycle branch
- Option 2: Bigger data cache for 3-cycle load
- Which enhancement is preferred?
Cycles Per Instruction
- Base = (0.5 x 1) + (0.2 x 5) + (0.1 x 1) + (0.2 x 2) = 2 cycles
- Option 1 = (0.5 x 1) + (0.2 x 5) + (0.1 x 1) + (0.2 x 1) = 1.8 cycles
- Option 2 = (0.5 x 1) + (0.2 x 3) + (0.1 x 1) + (0.2 x 2) = 1.6 cycles
- Option 2 gives the lower CPI, so it is preferred if clock rate and instruction count are unchanged
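
The CPI comparison above is a weighted sum over the instruction mix, sketched below. The helper name `average_cpi` and the list-of-pairs representation are illustrative choices, not part of the lecture.

```python
def average_cpi(mix):
    """mix is a list of (fraction_of_instructions, cycles) pairs; fractions sum to 1."""
    return sum(fraction * cycles for fraction, cycles in mix)


# Order of entries: integer ALU, load, store, branch.
base_mix = [(0.5, 1), (0.2, 5), (0.1, 1), (0.2, 2)]
option1_mix = [(0.5, 1), (0.2, 5), (0.1, 1), (0.2, 1)]  # 1-cycle branch
option2_mix = [(0.5, 1), (0.2, 3), (0.1, 1), (0.2, 2)]  # 3-cycle load

base_cpi = average_cpi(base_mix)
option1_cpi = average_cpi(option1_mix)
option2_cpi = average_cpi(option2_mix)

for name, cpi in [("base", base_cpi), ("option 1", option1_cpi), ("option 2", option2_cpi)]:
    print(name, round(cpi, 2))
```

Loads are less frequent than ALU operations but cost the most cycles, so shortening them moves the average further than shortening branches.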

Measuring CPI
Physical Measurements
- Measure wall clock time as application runs
- Multiply time by clock frequency to get cycles
- Profile application with hardware counters (e.g., Intel VTune)
Simulated Measurements
- Cycle-level, microarchitectural simulation (e.g., SimpleScalar)
- Run applications on simulated hardware
- Track instructions as they progress through the design

Benchmarking
Measuring Performance
- Target Workload: accurate but not portable
- Representative Benchmark: portable but not accurate
- Microbenchmark: small, fast code sequences but incomplete
Representative Benchmarks
- SPEC (Standard Performance Evaluation Corporation, www.spec.org)
  - Collects, standardizes, distributes benchmark programs
- Scientific and commercial computing
  - SPLASH-2, NAS, SPEC OpenMP, SPECjbb
- Online transaction processing (OLTP) with heavy I/O, memory
  - TPC-C, TPC-H, TPC-W
- Datacenter workloads
  - Search (e.g., Nutch/Lucene), analytics (e.g., Hadoop, Spark)

Single Accumulator
- Carry-over from calculators, typically fewer than two dozen instructions
- Single operand (AC)
  LOAD x                     AC ← M[x]
  STORE x                    M[x] ← (AC)
  ADD x, SUB x               AC ← (AC) + M[x]
  SHIFT LEFT, SHIFT RIGHT    AC ← 2 x (AC)
  JUMP x                     PC ← x
  JGE x                      if (AC) ≥ 0 then PC ← x
  LOAD ADR x, STORE ADR x    AC ← Extract address field (M[x])
  (where two instructions share a row, the semantics shown are for the first)

Using Accumulator
Ci ← Ai + Bi, 1 ≤ i ≤ n
  LOOP  LOAD   N       # AC ← M[N]
        JGE    DONE    # if (AC) ≥ 0, PC ← DONE
        ADD    ONE     # AC ← (AC) + 1
        STORE  N       # M[N] ← (AC)
  F1    LOAD   A       # AC ← M[A]
  F2    ADD    B       # AC ← (AC) + M[B]
  F3    STORE  C       # M[C] ← (AC)
        JUMP   LOOP
  DONE  HLT
Notice M[N] is a counter, not an index. How to modify the addresses A, B and C?

Self-Modifying Code
Ci ← Ai + Bi, 1 ≤ i ≤ n
  LOOP  LOAD       N      # AC ← M[N]
        JGE        DONE   # if (AC) ≥ 0, PC ← DONE
        ADD        ONE    # AC ← (AC) + M[ONE]
        STORE      N      # M[N] ← (AC)
  F1    LOAD       A      # AC ← M[A]
  F2    ADD        B      # AC ← (AC) + M[B]
  F3    STORE      C      # M[C] ← (AC)
        LOAD ADR   F1     # AC ← address field (M[F1])
        ADD        ONE    # AC ← (AC) + M[ONE]
        STORE ADR  F1     # changes address of A
        LOAD ADR   F2
        ADD        ONE
        STORE ADR  F2     # changes address of B
        LOAD ADR   F3
        ADD        ONE
        STORE ADR  F3     # changes address of C
        JUMP       LOOP
  DONE  HLT
Each iteration requires:
                total   book-keeping
  Inst fetch    17      14
  Stores        5       4

Index Registers
Specialized registers to simplify address calculations
- T. Kilburn, Manchester University, 1950s
- Instead of single AC register, use AC and IX registers
Modify Existing Instructions
- Load x, IX     AC ← M[x + (IX)]
- Add x, IX      AC ← (AC) + M[x + (IX)]
Add New Instructions
- Jzi x, IX      if (IX) = 0 then PC ← x, else IX ← (IX) + 1
- Loadi x, IX    IX ← M[x] (truncated to fit IX)
Index registers have accumulator-like characteristics

Using Index Registers
Ci ← Ai + Bi, 1 ≤ i ≤ n
        LOADi  -n, IX       # load -n into IX
  LOOP  JZi    DONE, IX     # if (IX) = 0, PC ← DONE
        LOAD   LASTA, IX    # AC ← M[LASTA + (IX)]
        ADD    LASTB, IX    # note: LASTA is address
        STORE  LASTC, IX    # of last element in A
        JUMP   LOOP
  DONE  HALT
- Longer instructions (1-2 bits), index registers with ALU circuitry
- Does not require self-modifying code, modify IX instead
- Improved program efficiency (operations per iteration)
                total   book-keeping
  Inst fetch    5       2
  Stores        1       0
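
The index-register loop above can be traced with a small simulation. This is a sketch, not a model of any real machine: it assumes Python lists stand in for memory, that `last` plays the role of LASTA, LASTB, and LASTC, and that JZi increments IX when it does not branch, as defined on the Index Registers slide.

```python
def vector_add_with_index_register(a, b):
    """Mimic the LOADi / JZi / LOAD / ADD / STORE / JUMP loop for C[i] = A[i] + B[i]."""
    n = len(a)
    assert len(b) == n
    c = [0] * n
    last = n - 1                  # offset of the last element in each array
    ix = -n                       # LOADi -n, IX
    instruction_fetches = 1
    while True:
        instruction_fetches += 1  # JZi DONE, IX
        if ix == 0:
            break                 # PC <- DONE
        ix += 1                   # otherwise IX <- (IX) + 1
        ac = a[last + ix]         # LOAD  LASTA, IX
        ac = ac + b[last + ix]    # ADD   LASTB, IX
        c[last + ix] = ac         # STORE LASTC, IX
        instruction_fetches += 4  # LOAD, ADD, STORE, JUMP LOOP
    return c, instruction_fetches


result, fetches = vector_add_with_index_register([1, 2, 3], [10, 20, 30])
print(result)  # [11, 22, 33]
```

Each pass through the loop fetches five instructions (JZi, LOAD, ADD, STORE, JUMP) and performs one store, matching the per-iteration totals in the table; the instruction words themselves are never rewritten.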

Modifying Index Registers
Option 1: Increment index register by k
  AC ← (IX)         (new instruction)
  AC ← (AC) + k
  IX ← (AC)         (new instruction)
- Also, the AC must be saved and restored
Option 2: Manipulate index register directly
  INCi k, IX        IX ← (IX) + k
  STOREi x, IX      M[x] ← (IX) (extended to fit a word)
IX begins to resemble AC
- Several index registers, accumulators
- Motivates general-purpose registers (e.g., MIPS ISA R0-R31)

Evolution of Addressing Modes
1. Single accumulator, absolute address
   Load x                 AC ← M[x]
2. Single accumulator, index registers
   Load x, IX             AC ← M[x + (IX)]
3. Single accumulator, indirection
   Load (x)               AC ← M[M[x]]
4. Multiple accumulators, index registers, indirection
   Load Ri, IX, (x)       Ri ← M[M[x] + (IX)]
5. Indirection through registers
   Load Ri, (Rj)          Ri ← M[M[(Rj)]]
6. The Works
   Load Ri, Rj, (Rk)      Ri ← M[Rj + (Rk)]; Rj = index; Rk = base address

Evolution of Instruction Formats
Zero-address Formats
- Instructions have zero operands
- Operands on a stack
  add     M[sp] + M[sp-1]
  load    M[sp] ← M[M[sp]]
- Stack can be registers or memory
- Top of stack usually cached in registers
[Figure: stack holding A, B, C, with register SP]
One-address Formats
- Instructions have one operand
- Accumulator is always the other, implicit operand

Evolution of Instruction Formats
Two-address Formats
- Destination is same as one of the operand sources
  Ri ← (Ri) + (Rj)     # (Reg x Reg) to Reg
  Ri ← (Ri) + M[x]     # (Reg x Mem) to Reg
- x can be specified directly or via register
- x address calculation could include indexing, indirection, etc.
Three-address Formats
- One destination and up to two operand sources
  Ri ← (Rj) + (Rk)     # (Reg x Reg) to Reg
  Ri ← (Rj) + M[x]     # (Reg x Mem) to Reg

Data Formats
Data Sizes
- Bytes, half-words, double words
Byte Addressing
- Location of most-, least-significant bytes (MSB, LSB)
- Big Endian: MSB at the lowest byte address
- Little Endian: LSB at the lowest byte address
Word Alignment
- Suppose memory is organized into 32-bit words (i.e., 4 bytes).
- Word-aligned addresses begin only at 0, 4, 8, ... bytes
[Figure: byte addresses 0 through 7]
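
Byte order and alignment can be illustrated with Python's built-in integer conversion. The value 0x0A0B0C0D and the helper `is_word_aligned` are hypothetical, chosen for illustration only.

```python
value = 0x0A0B0C0D  # a 32-bit word; 0x0A is the most-significant byte

big_endian = value.to_bytes(4, byteorder="big")        # MSB stored at the lowest address
little_endian = value.to_bytes(4, byteorder="little")  # LSB stored at the lowest address

print(big_endian.hex())     # 0a0b0c0d
print(little_endian.hex())  # 0d0c0b0a


def is_word_aligned(address, word_bytes=4):
    """A word-aligned byte address is a multiple of the word size."""
    return address % word_bytes == 0


aligned = [addr for addr in range(8) if is_word_aligned(addr)]
print(aligned)  # [0, 4]
```

The same four bytes are stored in both cases; only their order in memory differs, which matters when data moves between machines with different conventions.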

Software Developments
Numerical Libraries (up to 1955)
- Floating-point operations
- Transcendental functions
- Matrix multiplication, equation solvers, etc.
High-level Languages (1955-1960)
- Fortran, 1956
- Assemblers, loaders, linkers, compilers
Operating Systems (1955-1960)
- Accounting programs to track usage and charges

Pitfall: Incomplete Metrics
Ignoring Instructions per Program
- Neglects dynamic instruction count
- Misleading if working in algorithms, compilers, or ISA
Using Instructions per Second
- MIPS = (Instructions / Cycle) x (Cycles / Second) x 1E-6
- FLOPS: considers only floating-point instructions
- Example: CPI = 2, clock frequency = 500 MHz, 250 MIPS
- Example: compiler removes instructions, latency falls, MIPS increases
Using Clock Frequency
- Cannot equate clock frequency with performance
- Proc A: CPI = 2, f = 500 MHz
- Proc B: CPI = 1, f = 300 MHz
- Given the same ISA and compiler, B is faster (about 3.3 ns versus 4 ns per instruction)
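
The Proc A versus Proc B comparison above can be worked through numerically. This sketch assumes both processors execute the same dynamic instruction count, as the "same ISA and compiler" condition implies; the function names are illustrative.

```python
def mips(cpi, clock_hz):
    """MIPS = (Instructions / Cycle) x (Cycles / Second) x 1E-6."""
    return clock_hz / cpi / 1e6


def nanoseconds_per_instruction(cpi, clock_hz):
    """Average time per instruction: CPI x cycle time."""
    return cpi / clock_hz * 1e9


proc_a_mips = mips(cpi=2, clock_hz=500e6)
proc_b_mips = mips(cpi=1, clock_hz=300e6)
proc_a_ns = nanoseconds_per_instruction(cpi=2, clock_hz=500e6)
proc_b_ns = nanoseconds_per_instruction(cpi=1, clock_hz=300e6)

print(proc_a_mips, proc_b_mips)  # 250.0 300.0
```

Proc A has the higher clock frequency, yet Proc B spends less time per instruction, so frequency alone ranks the two processors incorrectly.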

Pitfall: Diminishing Returns
- Amdahl. "Validity of the single-processor approach..." AFIPS, 1967.
Amdahl's Law (Make Common Case Fast)
- Consider improving fraction F of system with a speedup S.
- T(new) = T(base) x (1-F) + T(base) x F/S = T(base) x [(1-F) + F/S]
- Speedup = T(base)/T(new) = 1 / [(1-F) + F/S]
- Max Speedup = 1 / (1-F)
Example
- Suppose FP computation is 1/4 of an application's execution time
- Maximum benefit from optimizing FP unit is 1.33x (= 1/0.75)
- Multiprocessor systems were original application of this law
- Accounts for diminishing marginal returns
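
Amdahl's Law as stated above translates directly into code. The function names are illustrative, and the finite speedups tried on the FP unit are hypothetical values used to show the diminishing returns.

```python
def amdahl_speedup(fraction, speedup):
    """Overall speedup when `fraction` of execution time is sped up by `speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / speedup)


def max_speedup(fraction):
    """Limit as the improved part becomes infinitely fast."""
    return 1.0 / (1.0 - fraction)


fp_fraction = 0.25  # FP computation is 1/4 of execution time
fp_limit = max_speedup(fp_fraction)

# Hypothetical FP unit improvements of 2x, 10x, and 100x.
fp_speedups = [amdahl_speedup(fp_fraction, s) for s in (2, 10, 100)]
print(round(fp_limit, 2))  # 1.33
```

Going from a 10x to a 100x faster FP unit barely changes the overall result, since the untouched 75% of execution time dominates.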

Power

Power Factors
Definitions
- Energy (Joules) = a x C x V^2
- Power (Watts) = a x C x V^2 x f
Power Factors and Trends
- Activity (a): function of application resource usage
- Capacitance (C): function of design; scales with area
- Voltage (V): constrained by leakage, which increases as V falls
- Frequency (f): varies with pipelining and transistor speeds
- Models in cycle-accurate simulators (e.g., Princeton Wattch)
Dynamic Voltage and Frequency Scaling (DVFS)
- P-states: move between operational modes with different V, f
- Intel Turbo Boost: increase V, f for short durations without violating thermal design point (TDP)
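
The two definitions above are easy to evaluate directly. The activity factor, capacitance, voltage, and frequency below are hypothetical round numbers, not data for any real processor.

```python
def dynamic_energy_joules(activity, capacitance_farads, voltage):
    """Energy = a x C x V^2."""
    return activity * capacitance_farads * voltage ** 2


def dynamic_power_watts(activity, capacitance_farads, voltage, frequency_hz):
    """Power = a x C x V^2 x f."""
    return dynamic_energy_joules(activity, capacitance_farads, voltage) * frequency_hz


# Hypothetical design point: a = 0.5, C = 1 nF, V = 1.0 V, f = 2 GHz.
nominal_power = dynamic_power_watts(0.5, 1e-9, 1.0, 2e9)

# A DVFS step that lowers both V and f by 10% cuts power by roughly 27%.
scaled_power = dynamic_power_watts(0.5, 1e-9, 0.9, 1.8e9)
power_ratio = scaled_power / nominal_power
```

Because voltage enters as a square and frequency scales alongside it, modest DVFS steps produce disproportionately large power savings.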

Power and Temperature
- Power density (Watts / sq-mm) is proxy for thermal effects
- Estimate thermal conductivity, resistance to identify processor hot spots (e.g., HotSpot simulator)
Power Budgets
- Power, package, cost
- 130 W servers, 65 W desktops, 10-30 W laptops, 1-2 W hand-held

Power and Multiprocessors
- Chip multiprocessors (CMPs) integrate multiple cores on die
Efficiency
- Reduce power with simpler cores
- Recover lost performance with many-core parallelism

Power and Multiprocessors
Lower voltage, frequency
- Voltage, frequency scale together: V ∝ f
- Power proportional to V^2 and f: Power ∝ V^2 f
- Performance proportional to f: Perf ∝ f
Example
- Baseline: 1 core at V, f
- Multiprocessor: 4 cores at 0.85V, 0.85f; program is 75% parallel
  Core Power @ lower V, f              0.61x = 0.85^3
  Core Performance @ lower V, f        0.85x
  Multicore Power @ lower V, f         2.44x = 0.61 x 4
  Multicore Performance @ 4 cores      2.28x = 1/[0.25 + (0.75/4)]
  Multicore Performance @ lower f      1.94x = 2.28 x 0.85
- Multiprocessor: 1.5% power per 1% performance [+144% power, +94% perf]
- Boosting V, f: 3% power per 1% performance [+(1.01^3 - 1) power, +(1.01 - 1) perf]
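
The multiprocessor example above can be reproduced step by step. This sketch uses unrounded intermediate values, so its multicore power comes out near 2.46x rather than the 2.44x obtained by first rounding 0.85^3 to 0.61; variable names are illustrative.

```python
scale = 0.85           # both V and f scaled to 0.85 of baseline
cores = 4
parallel_fraction = 0.75

core_power = scale ** 2 * scale   # Power ∝ V^2 f, about 0.61x
core_performance = scale          # Perf ∝ f, 0.85x

multicore_power = cores * core_power
# Amdahl's Law for 4 cores, then derated by the lower frequency.
multicore_speedup = 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)
multicore_performance = multicore_speedup * core_performance

# Percent extra power paid per percent extra performance.
multicore_cost = (multicore_power - 1.0) / (multicore_performance - 1.0)
boost_cost = (1.01 ** 3 - 1.0) / (1.01 - 1.0)

print(round(multicore_performance, 2))               # 1.94
print(round(multicore_cost, 1), round(boost_cost, 1))
```

Under these assumptions the four slower cores buy performance at roughly half the power cost of boosting voltage and frequency on a single core.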

Cost

Cost
Non-recurring Engineering (NRE)
- Dominated by engineer-years ($200K per engineer-year)
- Mask costs (>$1M per spin)
Chip Cost
- Depends on wafer and chip size, process maturity
Packaging Cost
- Depends on number of pins (e.g., signal + power/ground)
- Depends on thermal design point (e.g., heat sink)
Total Cost of Ownership
- Capital costs (e.g., server procurement cost)
- Operating costs (e.g., electricity)

Yield
Wafers
- Integrated circuits built with multi-step chemical process on wafers
- Cost per wafer depends on wafer size, number of steps
Chip (a.k.a. Die)
- If chips are large, fewer chips per wafer
- Larger chips have lower yield
- Uniform defect density
- Chip cost is proportional to area^2 to area^3
Process Variability
- Yield is non-binary
- Binning for speed grades
- Binning for core count
- Post-fabrication tuning with spares

Compatibility

Compatibility
In the early 1960s, IBM had 4 incompatible computers
- IBM 701, 650, 702, 1401
- Different instruction set architectures
- Different I/O systems, secondary storage (magnetic tapes, drums, disks)
- Different assemblers, compilers, libraries
- Different markets (e.g., business, scientific, real-time)
The need for compatibility motivated IBM 360.

IBM 360: Design Principles
Amdahl, Blaauw and Brooks, "Architecture of the IBM System/360," 1964
1. Support growth and successor machines
2. Connect I/O devices with general method
3. Emphasize total performance
   - Evaluate programmability, answers per month not bits per second
4. Eliminate manual intervention
   - Machine must be capable of supervising itself
5. Reduce down time
   - Build hardware fault checking and fault location support
6. Facilitate assembly
   - Redundant I/O devices, memories for fault tolerance
7. Support flexibility
   - Some problems required floating-point words > 36 bits

IBM 360: General Purpose Registers
Processor State
- 16 32-bit general-purpose registers, used as index and base registers
- 4 64-bit floating-point registers
- Program status word (PSW) with program counter (PC)
- Condition codes, control flags
Data Formats
- 8-bit bytes: the IBM 360 is why bytes are 8 bits long today!
- 16-bit half-words
- 32-bit words
- 64-bit double-words

IBM 360: Initial Implementation
                   Model 30               Model 70
  Storage          8K - 64 KB             256K - 512 KB
  Datapath         8-bit                  64-bit
  Circuit Delay    30 nsec/level          5 nsec/level
  Local Store      Main Store             Transistor Registers
  Control Store    1 microsecond read     Conventional circuits
Abstraction
- IBM 360 ISA hid technologies across models
Milestone
- The first true ISA designed as portable hardware-software interface
- With minor modifications, ISA still survives today

IBM z11: 47 Years Later
Technology (seconds / cycle)
- 5.2 GHz in IBM 45 nm CMOS technology
- 1.4 billion transistors in 512 sq-mm
Microarchitecture (cycles / instruction)
- 64-bit virtual addressing
- Out-of-order, 3-way superscalar pipeline
- Redundant datapaths
- L1 i-cache (64 KB); L1 d-cache (128 KB)
- L2 cache (1.5 MB), private, per-core
- L3 cache (24 MB), eDRAM
Power and Parallelism
- Quad-core design
- Scales to 96 cores in one machine
Source: IBM, Hot Chips 2010

Acknowledgements
These slides contain material developed and copyrighted by
- Arvind (MIT)
- Krste Asanovic (MIT/UCB)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- Alvin Lebeck (Duke)
- David Patterson (UCB)
- Daniel Sorin (Duke)