John L. Hennessy, David A. Patterson: Computer Architecture

  • Slides: 101

Textbooks and Main References
• Zhang Chenxi (张晨曦) et al., Computer Architecture (计算机体系结构), Higher Education Press (高等教育出版社)
• John L. Hennessy, David A. Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition. San Francisco: Morgan Kaufmann Publishers, Inc., 2006
• David A. Patterson, John L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, Third Edition. San Francisco: Morgan Kaufmann Publishers, Inc., 2005
• Berkeley CS 152, CS 252
• The University of North Carolina: COMP 206
With thanks (感谢)
2021/10/16, University of Science and Technology of China (中国科学技术大学)

Computing Devices Then…
EDSAC, University of Cambridge, UK, 1949

Computing Systems Today
• The world is a large parallel system
  – Microprocessors in everything
  – Vast infrastructure behind them
[Figure labels: Massive Cluster; Gigabit Ethernet; Clusters; Internet Connectivity; Scalable, Reliable, Secure Services (Databases, Information Collection, Remote Storage, Online Games, Commerce, …); Refrigerators; Sensor Nets; MEMS for Sensor Nets; Cars; Routers; Robots]

What is Computer Architecture?
[Figure: Application at the top, Physics at the bottom. The gap between them is too large to bridge in one step (but there are exceptions, e.g. the magnetic compass).]
In its broadest definition, computer architecture is the design of the abstraction layers that allow us to implement information processing applications efficiently using available manufacturing technologies.

Abstraction Layers in Modern Systems
Layers, from top to bottom:
• Application
• Algorithm
• Programming Language
• Operating System/Virtual Machine
• Instruction Set Architecture (ISA)
• Microarchitecture
• Gates/Register-Transfer Level (RTL)
• Circuits
• Devices
• Physics
Annotations on the figure:
• Original domain of the computer architect ('50s-'80s)
• Domain of recent computer architecture ('90s)
• Parallel computing, security, …
• Reliability, power, …
• Reinvigoration of computer architecture, mid-2000s onward.

Definition of Computer Architecture
• Computer Architecture = Instruction Set Architecture + Machine Organization + …
• Instruction Set Architecture: "... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation." – Amdahl, Blaauw, and Brooks, 1964

ISA: a Critical Interface
[Figure: the instruction set drawn as the interface between software and hardware]

Seven Main Aspects of an ISA
• Class of ISA
• Memory addressing
• Addressing modes
• Types and sizes of operands
• Operations
• Control flow instructions
• Encoding an ISA

Examples of Instruction Set Architectures
• Digital Alpha (v1, v3), 1992-97
• HP PA-RISC (v1.1, v2.0), 1986-96
• Sun SPARC (v8, v9), 1987-95
• SGI MIPS (MIPS I, III, IV, V), 1986-96
• Intel (8086, 80286, 80386, 80486, Pentium, MMX, ...), 1978-96

MIPS R3000 Instruction Set Architecture (Summary)
• Instruction categories
  – Load/Store
  – Computational
  – Jump and Branch
  – Floating Point
    » coprocessor
  – Memory Management
  – Special
• Registers: R0 - R31, PC, HI, LO
• 3 instruction formats, all 32 bits wide
  R-type: OP | rs | rt | rd | sa | funct
  I-type: OP | rs | rt | immediate
  J-type: OP | jump target
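The three formats can be illustrated with a small decoder. This is only a sketch: the slide names the fields but not their widths, so the widths used here (6-bit OP, 5-bit register fields, 5-bit sa, 6-bit funct, 16-bit immediate, 26-bit jump target) are the standard MIPS I layout and are stated as an assumption. The function name `decode_mips` is hypothetical, and treating every opcode other than 0, 2 and 3 as I-type is a simplification.

```python
def decode_mips(word):
    """Split a 32-bit MIPS instruction word into the fields of its format."""
    op = (word >> 26) & 0x3F              # top 6 bits: opcode, selects the format
    if op == 0:                           # R-type: OP rs rt rd sa funct
        return {
            "format": "R",
            "op": op,
            "rs": (word >> 21) & 0x1F,
            "rt": (word >> 16) & 0x1F,
            "rd": (word >> 11) & 0x1F,
            "sa": (word >> 6) & 0x1F,
            "funct": word & 0x3F,
        }
    if op in (2, 3):                      # J-type (j, jal): OP + 26-bit target
        return {"format": "J", "op": op, "target": word & 0x3FFFFFF}
    # Assumption: all remaining opcodes are decoded as I-type (OP rs rt immediate).
    return {
        "format": "I",
        "op": op,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "immediate": word & 0xFFFF,
    }


add_fields = decode_mips(0x012A4020)      # add $t0, $t1, $t2
lw_fields = decode_mips(0x8D280004)       # lw  $t0, 4($t1)
j_fields = decode_mips(0x08000010)        # j   with target field 0x10
print(add_fields, lw_fields, j_fields)
```

Because every instruction is 32 bits wide and the opcode always sits in the same place, decoding needs only shifts and masks, which is one reason fixed-width formats are attractive for simple pipelines.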

Example Organization
• TI SuperSPARC™ TMS390Z50 in Sun SPARCstation 20
[Block diagram labels: MBus Module containing the SuperSPARC (Floating-point Unit, Integer Unit, Inst Cache, Ref MMU, Data Cache, Store Buffer, Bus Interface), L2 $, CC; MBus; L64852 MBus control; M-S Adapter; SBus; SBus DMA; SBus Cards; SCSI; Ethernet; DRAM Controller; STDIO (serial, kbd, mouse, audio, RTC, Boot PROM, Floppy)]

Topics Studied in Computer Architecture
[Figure labels:]
• Input/Output and Storage: Disks, WORM, Tape; RAID
• Emerging Technologies, Interleaving, Bus protocols
• Memory Hierarchy: DRAM, L2 Cache, L1 Cache; Coherence, Bandwidth, Latency
• VLSI
• Instruction Set Architecture: Addressing, Protection, Exception Handling
• Pipelining and Instruction Level Parallelism: Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, VLIW, DSP, Reconfiguration

Topics Studied in Computer Architecture (continued)
[Figure: processor (P), memory (M) and switch (S) nodes attached to an Interconnection Network; caption: Processor-Memory-Switch]
• Multiprocessors, Networks and Interconnections
• Shared Memory, Message Passing, Data Parallelism
• Network Interfaces
• Topologies, Routing, Bandwidth, Latency, Reliability

The Computer Architecture Design Process
Architecture design is an iterative process:
• Search the possible design space
• Make selections
• Evaluate the selections made
[Figure: Creativity feeds Cost / Performance Analysis, which sorts the results into Good Ideas, Mediocre Ideas and Bad Ideas]
Good measurement tools are required to accurately evaluate the selection.

Computer Engineering Methodology
[Figure: a design cycle]
• Evaluate Existing Systems for Bottlenecks (input: Benchmarks)
• Simulate New Designs and Organizations (input: Workloads)
• Implement Next Generation System (input: Implementation Complexity)
• Technology Trends
• Stages of the cycle: Analysis, Design, Implementation

Embedded System Solutions
• Software Embedded Systems = CPU + RTOS
• Hardware Embedded Systems = ASIC
• Modern Embedded Systems

Software Embedded Systems = CPU + RTOS

Hardware Embedded Systems = ASIC
• A direct sequence spread spectrum (DSSS) receiver ASIC (UCLA)
Features
• Area: 4.6 mm x 5.1 mm
• Speed: 20 MHz @ 10 Mcps
• Technology: HP 0.5 µm
• Power: 16 mW - 120 mW (mode dependent) @ 20 MHz, 3.3 V
• Avg. Acquisition Time: 10 s to 300 s

Modern Embedded Systems?
[Figure labels: DSP Code, Application Specific Gates, Analog I/O, Processor Cores, Memory]
• Embedded systems consist of:
  – Application-specific hardware (boards, ASICs, FPGAs etc.)
    » performance, low power
  – Software running on microprocessors: DSPs, controllers etc.
    » flexibility, complexity
  – Sensors and actuators of various types

Complexity and Heterogeneity
[Figure labels: control panel; controller processes; UI processes; Real-time OS; controller; ASIC; two Programmable DSPs, each running DSP Assembly Code; Dual-ported RAM; CODEC]
• Heterogeneity within H/W & S/W parts as well
  – S/W: control oriented, DSP oriented
  – H/W: ASICs, COTS (commercial off-the-shelf) ICs

System-on-Chip (SoC)
• SC3001 DIRAC chip (Sirius Communications)

Moore’s Law
• “Cramming More Components onto Integrated Circuits” – Gordon Moore, Electronics, 1965
• The number of transistors on a cost-effective integrated circuit doubles every 18 months

Uniprocessor Performance
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October, 2006
[Figure annotation: What happened????]
• VAX: 25%/year 1978 to 1986
• RISC + x86: 52%/year 1986 to 2002
• RISC + x86: ??%/year 2002 to present

Power & Energy

Limiting Force: Power Density

Technology Trends: Memory Capacity
[Figure: Bits versus Year. Annotation: 1950: Alan Turing predicted ~1 G by 2000]
year   size (Mbit)
1986   1
1988   4
1991   16
1995   64
1997   128
1999   256
2001   512
2003   1024 (1 Gbit)
2005   2048 (2 Gbit)
2007   4096 (4 Gbit)
2009   8192 (8 Gbit)
• Now 1.4X/yr, or 2X every 2 years.
• Over 10,000X since 1980!

Computer Technology - Dramatic Change!
• Memory
  – DRAM capacity: 2x / 2 years (since '96); 64x size improvement in last decade.
• Processor
  – Speed 2x / 1.5 years (since '85); [slowing!] 100X performance in last decade.
• Disk
  – Capacity: 1.8x / 1 year (since '97); 250X size in last decade.
• Network Bandwidth
  – Bandwidth increasing more than 100% per year!
  – 10 Mb -> 100 Mb Ethernet: 10 years; 100 Mb -> 1 Gb: 5 years
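Rates quoted as "N x every Y years" are easier to compare once converted to a per-year factor or compounded over a fixed span. The sketch below does that conversion; the helper names `annual_factor` and `growth_over` are hypothetical, and steady exponential growth is assumed, which the slide itself warns is no longer true for processors.

```python
def annual_factor(factor, years):
    """Per-year growth factor equivalent to `factor` x every `years` years."""
    return factor ** (1.0 / years)


def growth_over(factor, years, span):
    """Total growth after `span` years at `factor` x every `years` years."""
    return factor ** (span / years)


dram_per_year = annual_factor(2, 2)       # DRAM: 2x every 2 years
cpu_decade = growth_over(2, 1.5, 10)      # processor: 2x every 1.5 years, over a decade
slow_decade = growth_over(2, 5, 10)       # 2x every 5 years, over a decade

print(f"DRAM: {dram_per_year:.2f}x per year")
print(f"Processor at 2x / 1.5 years: {cpu_decade:.0f}x per decade")
print(f"Processor at 2x / 5 years: {slow_decade:.0f}x per decade")
```

Doubling every 2 years works out to about 1.41x per year, and doubling every 1.5 years compounds to roughly 100x per decade, in line with the "100X performance in last decade" figure. At one doubling every 5 years the same decade yields only 4x.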

Ways to Improve Performance
• Technology
  – Higher chip integration density
  – Higher chip speed
• Machine Organization/Implementation
  – Deeper pipelines
  – More instruction-level parallelism
• Instruction Set Architecture
  – Reduced Instruction Set Computers (RISC)
  – Multimedia instructions (MMX)
  – Explicit instruction-level parallelism
• Compiler technology
  – Exploiting the parallelism in code
  – Deeper optimization
• Multi-core processor

Conventional Wisdom in Computer Architecture
• Old Conventional Wisdom: Power is free, Transistors expensive
• New Conventional Wisdom: “Power wall” Power expensive, Transistors free (Can put more on chip than can afford to turn on)
• Old CW: Sufficiently increasing Instruction-Level Parallelism via compilers, innovation (Out-of-order, speculation, VLIW, …)
• New CW: “ILP wall” law of diminishing returns on more HW for ILP
• Old CW: Multiplies are slow, Memory access is fast
• New CW: “Memory wall” Memory slow, multiplies fast (200 clock cycles to DRAM memory, 4 clocks for multiply)
• Old CW: Uniprocessor performance 2X / 1.5 yrs
• New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
  – Uniprocessor performance now 2X / 5(?) yrs
  – Sea change in chip design: multiple “cores” (2X processors per chip / ~2 years)
    » More, simpler processors are more power efficient

Sea Change in Chip Design
• Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm² chip
• RISC II (1983): 32-bit, 5 stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm² chip
• 125 mm² chip, 0.065 micron CMOS = 2312 RISC II + FPU + Icache + Dcache
  – RISC II shrinks to ~0.02 mm² at 65 nm
  – Caches via DRAM or 1 transistor SRAM?
• Processor is the new transistor?

• “We are dedicating all of our future product development to multicore designs. … This is a sea change in computing” – Paul Otellini, President, Intel (2004)
• Difference is all microprocessor companies have switched to multiprocessors (AMD, Intel, IBM, Sun; all new Apples 2+ CPUs)
  – Procrastination penalized: 2X sequential perf. / 5 yrs
  – Biggest programming challenge: from 1 to 2 CPUs

ManyCore Chips: The future is here!
• Intel 80-core multicore chip (Feb 2007)
  – 80 simple cores
  – Two floating point engines / core
  – Mesh-like "network-on-a-chip"
  – 100 million transistors
  – 65 nm feature size
Frequency   Voltage   Power   Bandwidth         Performance
3.16 GHz    0.95 V    62 W    1.62 Terabits/s   1.01 Teraflops
5.1 GHz     1.2 V     175 W   2.61 Terabits/s   1.63 Teraflops
5.7 GHz     1.35 V    265 W   2.92 Terabits/s   1.81 Teraflops
• “ManyCore” refers to many processors/chip
  – 64? 128? Hard to say exact boundary
• How to program these?
  – Use 2 CPUs for video/audio
  – Use 1 for word processor, 1 for browser
  – 76 for virus checking???
• Something new is clearly needed here…
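The operating points in the table can be read in terms of energy efficiency, which ties them back to the "power wall". The sketch below derives GFLOPS per watt from the table's own numbers; the efficiency metric and the name `gflops_per_watt` are additions for illustration and do not appear on the slide.

```python
# (frequency in GHz, voltage in V, power in W, performance in Teraflops), from the table
operating_points = [
    (3.16, 0.95, 62, 1.01),
    (5.1, 1.2, 175, 1.63),
    (5.7, 1.35, 265, 1.81),
]


def gflops_per_watt(teraflops, watts):
    """Energy efficiency of one operating point in GFLOPS per watt."""
    return teraflops * 1000.0 / watts


efficiency = [round(gflops_per_watt(tf, w), 1) for (_, _, w, tf) in operating_points]

for (ghz, volts, watts, tf), eff in zip(operating_points, efficiency):
    print(f"{ghz} GHz @ {volts} V: {tf} TFLOPS, {watts} W, {eff} GFLOPS/W")
```

Raising the clock from 3.16 GHz to 5.7 GHz buys about 1.8x the performance for more than 4x the power, so efficiency falls from about 16 to about 7 GFLOPS/W.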

The End of the Uniprocessor Era
Single biggest change in the history of computing systems
(from Berkeley CS 252)

Problems with Sea Change
• Algorithms, Programming Languages, Compilers, Operating Systems, Architectures, Libraries, … not ready to supply Thread Level Parallelism or Data Level Parallelism for 1000 CPUs / chip
• Need whole new approach
• People have been working on parallelism for over 50 years without general success
• Architectures not ready for 1000 CPUs / chip
• Unlike Instruction Level Parallelism, cannot be solved by computer architects and compiler writers alone, but also cannot be solved without participation of computer architects
• PARLab: Berkeley researchers from many backgrounds meeting since 2005 to discuss parallelism
  – Krste Asanovic, Ras Bodik, Jim Demmel, Kurt Keutzer, John Kubiatowicz, Edward Lee, George Necula, Dave Patterson, Koushik Sen, John Shalf, John Wawrzynek, Kathy Yelick, …
  – Circuit design, computer architecture, massively parallel computing, computer-aided design, embedded hardware and software, programming languages, compilers, scientific programming, and numerical analysis

One of the Future Hot Topics
Source: Richard Newton

Some Interesting Numbers
• The Intel 4004 was designed for an embedded application (a calculator)
• Today's microprocessors
  – 95% are used in embedded systems
    » SH3/4 (Hitachi): best selling RISC microprocessor
  – 50% of revenue comes from embedded systems
• Application-specific microprocessors
  – Microcontrollers
  – DSPs
  – Media Processors
  – Graphics Processors
  – Network and Communication Processors

It’s not just about bigger and faster!
• Complete computing systems can be tiny and cheap
• System on a chip
• Resource efficiency
  – Real-estate, power, pins, …

Why Study This Course
A deep understanding of computer architecture helps you:
• Write better programs
  – Understand the performance implications of algorithms, data structures, and programming language choices
• Write better compilers
  – Modern computers need better optimizing compilers and better programming languages
• Write better operating systems
  – Need to re-evaluate the current assumptions and tradeoffs
  – Example: gigabit networks
• Design better computer architectures
  – There are still many challenges left
  – Example: the CPU-memory gap
  – …

Computer System Components
Performance evaluation, performance analysis
[Figure labels: Application; OS; Network, Storage; Memory; CPU]

Two Meanings of Performance
Plane              DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747         6.5 hours     610 mph    470          286,700
BAC/Sud Concorde   3 hours       1350 mph   132          178,200
Which has higher performance?
° Time to do the task (Execution Time)
  – execution time, response time, latency
° Tasks per day, hour, week, sec, ns ... (Performance)
  – throughput, bandwidth
These two goals often conflict.

Example
• Time of Concorde vs. Boeing 747?
  – Concorde is 1350 mph / 610 mph = 2.2 times faster (= 6.5 hours / 3 hours)
• Throughput of Concorde vs. Boeing 747?
  – Concorde is 178,200 pmph / 286,700 pmph = 0.62 “times faster”
  – Boeing is 286,700 pmph / 178,200 pmph = 1.61 “times faster”
• Boeing is 1.6 times (“60%”) faster in terms of throughput
• Concorde is 2.2 times (“120%”) faster in terms of flying time
We focus mainly on the execution time of a single task.
A program consists of a set of instructions, so instruction throughput is very important!
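The two comparisons can be reproduced directly from the table values. In this sketch throughput is taken to be passengers x speed (passenger-miles per hour, "pmph"), which is how the table's figures appear to be derived; the variable and function names are illustrative.

```python
planes = {
    "Boeing 747": {"hours": 6.5, "mph": 610, "passengers": 470},
    "Concorde": {"hours": 3.0, "mph": 1350, "passengers": 132},
}


def throughput_pmph(plane):
    """Passenger-miles per hour: passengers carried times cruising speed."""
    return plane["passengers"] * plane["mph"]


boeing, concorde = planes["Boeing 747"], planes["Concorde"]

# Latency view: how many times faster Concorde completes one trip
time_ratio = boeing["hours"] / concorde["hours"]

# Throughput view: how many times more passenger-miles per hour the 747 delivers
throughput_ratio = throughput_pmph(boeing) / throughput_pmph(concorde)

print(f"Concorde is {time_ratio:.1f}x faster per trip")
print(f"Boeing 747 has {throughput_ratio:.1f}x the throughput")
```

The same two machines rank in opposite order depending on whether the metric is time per task or tasks per unit time, which is why a performance claim has to say which one it means.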

CPU Performance Metrics
• Response time (elapsed time): includes all the time needed to complete a task
• User CPU Time (90.7)
• System CPU Time (12.9)
• Elapsed Time (2:39)
Example: output of the Unix time command
  90.7u 12.9s 2:39 65%
Here 65% = (90.7 + 12.9) / 159, the share of elapsed time spent on the CPU.
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
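A short sketch of both ideas on this slide: the CPU share reported by `time`, and CPU time as the product of instruction count, CPI and clock cycle time. The timing values come from the slide. The instruction count, CPI and cycle time in the second part are made-up illustrative numbers, not measurements, and the function name `cpu_time` is hypothetical.

```python
# Values from the `time` output: 90.7u 12.9s 2:39
user_s = 90.7
system_s = 12.9
elapsed_s = 2 * 60 + 39                   # 2:39 is 159 seconds

# Share of elapsed time the program spent on the CPU (user + system)
cpu_share = (user_s + system_s) / elapsed_s
print(f"CPU share of elapsed time: {cpu_share:.0%}")


def cpu_time(instruction_count, cpi, cycle_time_s):
    """CPU time = instructions/program x cycles/instruction x seconds/cycle."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical program: 1e9 instructions, CPI of 1.5, 1 ns clock cycle
example_time = cpu_time(1e9, 1.5, 1e-9)
print(f"Example CPU time: {example_time:.2f} s")
```

The remaining 35% of the elapsed time is spent waiting (I/O, other processes), which is why elapsed time and CPU time have to be reported separately.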

CPU Performance Metrics: CPI
“Average cycles per instruction”
$$\text{CPI}_{ave} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Clock Cycles}}{\text{Instruction Count}}$$
$$\text{CPU time} = \text{ClockCycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$$
$$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i, \quad \text{where } F_i = \frac{I_i}{\text{Instruction Count}} \text{ is the "instruction frequency"}$$

CPI Calculation Example
Base Machine (Reg / Reg)
Op       Freq   CPI_i   CPI_i*F_i   (% Time)
ALU      50%    1       0.5         (33%)
Load     20%    2       0.4         (27%)
Store    10%    2       0.2         (13%)
Branch   20%    2       0.4         (27%)
Total                   1.5
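The table can be checked with a few lines of code. This sketch applies the weighted-sum definition CPI = sum of CPI_i x F_i to the instruction mix above and derives each class's share of execution time as (CPI_i x F_i) / CPI; the variable names are illustrative.

```python
# Instruction mix of the base machine: op -> (frequency F_i, cycles CPI_i)
mix = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}

# Average CPI is the frequency-weighted sum of the per-class CPIs
cpi = sum(freq * cycles for freq, cycles in mix.values())

# Each class's share of execution time is its contribution divided by the total
time_share = {op: freq * cycles / cpi for op, (freq, cycles) in mix.items()}

print(f"CPI = {cpi:.2f}")
for op, share in time_share.items():
    print(f"{op:<7}{share:.0%}")
```

ALU operations are half of all instructions but only a third of the time, so the frequency of an instruction class alone does not tell you where the cycles go.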

               Inst Count   CPI   Clock Rate
Program        X            X
Compiler       X            (X)
Inst. Set.     X            X     (X)
Organization                X     X
Technology                        X

Computer Performance
Name         FLOPS
yottaFLOPS   10^24
zettaFLOPS   10^21
exaFLOPS     10^18
petaFLOPS    10^15
teraFLOPS    10^12
gigaFLOPS    10^9
megaFLOPS    10^6
kiloFLOPS    10^3

Benchmark Suites
• Embedded Microprocessor Benchmark Consortium (EEMBC)
• Desktop Benchmarks
  – SPEC 2006
  – SPEC 2000
  – SPEC 95
  – SPEC 92
  – SPEC 89
• Server Benchmarks
  – Processor throughput-oriented benchmarks (based on SPEC CPU benchmarks -> SPECrate)
  – SPECSFS, SPECWeb
  – Transaction-processing (TP) benchmarks (TPC-A, TPC-C, …)

Chapter Summary
• Design and technology trends
          Capacity        Speed
  Logic   2x in 3 years
  DRAM    4x in 3 years   2x in 10 years
  Disk    4x in 3 years   2x in 10 years
• Time to run a task
  – Execution time, response time, latency
• Tasks completed per unit time
  – Throughput, bandwidth
• “X is n times as fast as Y”:
$$n = \frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$$

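Chapter Summary (continued)
• Amdahl’s Law:
$$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$
• CPI Law:
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
• Execution time is the most practical and most reliable measure of a computer system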
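Amdahl's Law is easy to experiment with numerically. The sketch below implements the formula as written; the function name `amdahl_speedup` and the example values (40% of execution time enhanced by a factor of 10) are illustrative choices, not figures from the slides.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when `fraction_enhanced` of the original execution
    time is accelerated by a factor of `speedup_enhanced`."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Illustrative case: 40% of the time is sped up 10x
overall = amdahl_speedup(0.4, 10)

# Even an enormous speedup of that 40% is bounded by 1 / (1 - 0.4)
bound = amdahl_speedup(0.4, 1e12)

print(f"Overall speedup: {overall:.4f}")
print(f"Upper bound as the enhancement grows: {bound:.4f}")
```

The unenhanced fraction sets a hard ceiling of 1 / (1 - Fraction_enhanced) on the overall speedup, about 1.67 in this example however fast the enhanced part becomes.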

Review
• Amdahl's Law: Speedup(with E) = 1 / ((1 - F) + F / S)
• CPU time = CPI * IC * T, where IC is the instruction count and T is the clock cycle time
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
• Basic evaluation method: benchmark testing