CPSC 614: Graduate Computer Architecture 2006 Spring Introduction Based on Lectures by: Prof. David E Culler Prof. David Patterson UC Berkeley cpsc 614 Lec 1. 1

Outline
• Why Take CPSC 614?
• Fundamental Abstractions & Concepts
• Instruction Set Architecture & Organization
• Pipelined Instruction Processing
• Performance
• The Memory Abstraction
• Summary

What is “Computer Architecture”?
[Figure: layers of abstraction, top to bottom: Application; Operating System; Compiler, Firmware; Instruction Set Architecture; Instr. Set Proc., I/O system; Datapath & Control; Digital Design; Circuit Design; Layout]
• Coordination of many levels of abstraction
• Under a rapidly changing set of forces
• Design, Measurement, and Evaluation

Why take CPSC 614?
• To design the next great instruction set? ... well ...
• Tremendous organizational innovation relative to established ISA abstractions
• Many new instruction sets or equivalent
– embedded space, controllers, specialized devices, ...
• Design, analysis, and implementation concepts vital to all aspects of EE & CS
– systems, PL, theory, circuit design, VLSI, comm.
• Equip you with an intellectual toolbox for dealing with a host of systems design challenges

Coping with CPSC 614
• Students with too varied a background?
• Review: CPSC 321 and/or “Computer Organization and Design (COD), 2/e”
– Chapters 1 to 8 of COD if you never took the prerequisite
– If you took a class, be sure COD Chapters 2, 6, 7 are familiar
• FAST review this week of basic concepts

Review of Fundamental Concepts
• Instruction Set Architecture
• Machine Organization
• Instruction Execution Cycle
• Pipelining
• Memory
• Bus (Peripheral Hierarchy)
• Performance Iron Triangle

Example Hot Developments ca. 2006
• Manipulating the instruction set abstraction
– Itanium: translate IA-64 -> micro-op sequences
– Transmeta: continuous dynamic translation of IA-32
– Tensilica: synthesize the ISA from the application
– reconfigurable HW
• Virtualization
– VMware: emulate full virtual machine
– JIT: compile to abstract virtual machine, dynamically compile to host
• Parallelism
– wide issue, dynamic instruction scheduling, EPIC
– multithreading (SMT)
– chip multiprocessors
• Communication
– network processors, network interfaces
• Exotic explorations
– nanotechnology, quantum computing
+ Energy Efficiency
+ Reliability
!! High performance

Forces on Computer Architecture
[Figure: Computer Architecture at the center, acted on by Technology, Programming Languages, Applications, Operating Systems, and History]
(A = F / M)

Amazing Underlying Technology Change cpsc 614 Lec 1. 9

A take on Moore’s Law cpsc 614 Lec 1. 10

Technology Trends
• Clock Rate: ~30% per year
• Transistor Density: ~35%
• Chip Area: ~15%
• Transistors per chip: ~55%
• Total Performance Capability: ~100%
• by the time you graduate...
– 3x clock rate (3-4 GHz)
– 10x transistor count (1 billion transistors)
– 30x raw capability
• plus 16x DRAM density, 32x disk density

Performance Trends cpsc 614 Lec 1. 12

Measurement and Evaluation
Architecture is an iterative process
-- searching the space of possible designs
-- at all levels of computer systems
[Figure: Design/Analysis loop: Creativity feeds in ideas; Cost / Performance Analysis sorts them into Good Ideas, Mediocre Ideas, and Bad Ideas]

The Instruction Set: a Critical Interface software instruction set hardware cpsc 614 Lec 1. 14

Levels of Representation
[Figure: translation chain from source code down to control signals]
• High Level Language Program: temp = v[k]; v[k] = v[k+1]; v[k+1] = temp;
• Compiler
• Assembly Language Program: a short lw/sw sequence on $15 and $16 using addresses 0($2) and 4($2)
• Assembler
• Machine Language Program: the same instructions as 32-bit binary words
• Machine Interpretation
• Control Signal Specification: ALUOP[0:3] <= InstReg[9:11] & MASK

Instruction Set Architecture
"... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation."
– Amdahl, Blaauw, and Brooks, 1964
-- Organization of Programmable Storage
-- Data Types & Data Structures: Encodings & Representations
-- Instruction Formats
-- Instruction (or Operation Code) Set
-- Modes of Addressing and Accessing Data Items and Instructions
-- Exceptional Conditions

Organization (Logic Designer's View: from the ISA level down to FUs & interconnect)
• Capabilities & performance characteristics of principal functional units
– (e.g., Registers, ALU, Shifters, Logic Units, ...)
• Ways in which these components are interconnected
• Information flows between components
• Logic and means by which such information flow is controlled
• Choreography of FUs to realize the ISA
• Register Transfer Level (RTL) description

Review: MIPS R3000 (core)
[Figure: register file r0 (= 0), r1, ..., r31, plus PC, lo, hi]
• Programmable storage
– 2^32 x bytes
– 31 x 32-bit GPRs (R0 = 0)
– 32 x 32-bit FP regs (paired DP)
– HI, LO, PC
• Data types? Format? Addressing modes?
• Arithmetic logical
– Add, AddU, SubU, And, Or, Xor, Nor, SLTU
– AddIU, SLTIU, AndI, OrI, XorI, LUI
– SLL, SRA, SLLV, SRAV
• Memory access
– LB, LBU, LHU, LWL, LWR
– SB, SH, SWL, SWR
• Control (32-bit instructions on word boundary)
– J, JAL, JR, JALR
– BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL

Types of Internal Storage • Stack, Accumulator, A Set of Registers cpsc 614 Lec 1. 19

Review: Basic ISA Classes
• Accumulator:
– 1 address:    add A         acc ← acc + mem[A]
– 1+x address:  addx A        acc ← acc + mem[A + x]
• Stack:
– 0 address:    add           tos ← tos + next
• General Purpose Register:
– 2 address:    add A B       EA(A) ← EA(A) + EA(B)
– 3 address:    add A B C     EA(A) ← EA(B) + EA(C)
• Load/Store:
– load Ra Rb                  Ra ← mem[Rb]
– store Ra Rb                 mem[Rb] ← Ra
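To make the difference between the classes concrete, here is a minimal sketch that computes C = A + B in accumulator, stack, and load/store style. The memory cells `A`, `B`, `C`, the register names, and the function names are hypothetical, chosen only for illustration; the comments show the instruction each Python statement stands in for.

```python
# Each function mimics one ISA class computing C = A + B.
# Memory is modeled as a dict; all names here are illustrative assumptions.

def run_accumulator(mem):
    acc = mem["A"]            # load A   : acc <- mem[A]
    acc = acc + mem["B"]      # add B    : acc <- acc + mem[B]
    mem["C"] = acc            # store C  : mem[C] <- acc
    return mem

def run_stack(mem):
    stack = []
    stack.append(mem["A"])    # push A
    stack.append(mem["B"])    # push B
    tos = stack.pop()         # add      : tos <- tos + next
    nxt = stack.pop()
    stack.append(tos + nxt)
    mem["C"] = stack.pop()    # pop C
    return mem

def run_load_store(mem):
    regs = {}
    regs["r1"] = mem["A"]                 # load r1, A
    regs["r2"] = mem["B"]                 # load r2, B
    regs["r3"] = regs["r1"] + regs["r2"]  # add r3, r1, r2 (registers only)
    mem["C"] = regs["r3"]                 # store r3, C
    return mem

result = run_load_store({"A": 2, "B": 3})["C"]
```

All three produce the same value; what changes is where operands are named: implicitly (accumulator, top of stack) or explicitly (registers), and whether the ALU operation may touch memory.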

Instruction Formats
[Figure: variable, fixed, and hybrid instruction layouts]
• Addressing modes
– each operand requires an address specifier => variable format
• Code size => variable-length instructions
• Performance => fixed-length instructions
– simple decoding, predictable operations
• With a load/store instruction architecture, only one memory address and few addressing modes
• => simple format, address mode given by opcode

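MIPS Addressing Modes & Formats
• Simple addressing modes
• All instructions 32 bits wide
[Figure: four addressing modes and their formats]
– Register (direct): op rs rt rd; operand is in a register
– Immediate: op rs rt immed; operand is the immed field
– Base+index: op rs rt immed; register + immed addresses Memory
– PC-relative: op rs rt immed; PC + immed addresses Memory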

Cray-1: the original RISC
[Figure: instruction formats, bit positions as marked on the slide]
• Register-Register: Op [15:9], Rd [8:6], Rs1 [5:3], R2 [2:0]
• Load, Store and Branch: Op [15:9], Rd [8:6], Rs1 [5:3], [2:0], followed by Immediate [15:0]

VAX-11: the canonical CISC
Variable format, 2- and 3-address instructions
• Rich set of orthogonal address modes
– immediate, offset, indexed, autoinc/dec, indirect, indirect+offset
– applied to any operand
• Simple and complex instructions
– synchronization instructions
– data structure operations (queues)
– polynomial evaluation

RISC vs. CISC
RISC:
• Pipelining
• Ease of hardware implementation
• Simple instructions
• Simple addressing modes
• Fixed-length formats
• Large number of registers
• MIPS, …
CISC:
• Simple compilers
• Powerful addressing modes
• Powerful instructions
• Efficient instruction encoding
• Few registers
• VAX

Review: Load/Store Architectures ° no memory reference per ALU instruction MEM reg - 3 address GPR ° Register to register arithmetic ° Load and store with simple addressing modes (reg + immediate) ° Simple conditionals op r r r compare ops + branch z compare&branch op r r immed condition code + branch on condition op offset ° Simple fixed-format encoding cpsc 614 Lec 1. 26

MIPS R3000 Instruction Set Architecture
• Machine Environment Target
• Registers: R0 - R31, PC, HI, LO
• Instruction Categories
– Load/Store
– Computational
– Jump and Branch
– Floating Point (coprocessor)
• 3 Instruction Formats: all 32 bits wide
– R: OP | Rs | Rt | Rd | sa | funct
– I: OP | Rs | Rt | Immediate
– J: OP | jump target
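As a small illustration of why fixed field positions make decoding simple, the sketch below splits a 32-bit R-format word into its fields with shifts and masks. The 6/5/5/5/5/6 field widths come from the standard MIPS encoding rather than from the slide text, and the function name `decode_r_format` is a hypothetical choice for this example.

```python
def decode_r_format(word):
    """Split a 32-bit MIPS R-format word into its six fields.

    Assumed layout (standard MIPS): op[31:26] rs[25:21] rt[20:16]
    rd[15:11] sa[10:6] funct[5:0].
    """
    return {
        "op":    (word >> 26) & 0x3F,
        "rs":    (word >> 21) & 0x1F,
        "rt":    (word >> 16) & 0x1F,
        "rd":    (word >> 11) & 0x1F,
        "sa":    (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }

# 0x02324020 encodes "add $8, $17, $18" (add $t0, $s1, $s2).
fields = decode_r_format(0x02324020)
```

Because every format keeps OP in the same six bits and the register fields in the same places, hardware can start reading registers before it has finished decoding the opcode.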

Execution Cycle
• Instruction Fetch: obtain instruction from program storage
• Instruction Decode: determine required actions and instruction size
• Operand Fetch: locate and obtain operand data
• Execute: compute result value or status
• Result Store: deposit results in storage for later use
• Next Instruction: determine successor instruction

What’s a Clock Cycle?
[Figure: combinational logic between latches or registers]
• Old days: 10 levels of gates
• Today: determined by numerous time-of-flight issues + gate delays
– clock propagation, wire lengths, drivers

Fast, Pipelined Instruction Interpretation
[Figure: pipeline stages and a timing diagram of overlapped instructions]
• Next Instruction Address (NI)
• Instruction Fetch (IF), into the Instruction Register
• Decode & Operand Fetch (D), into the Operand Registers
• Execute (E), into the Result Registers
• Store Results (W), into Registers or Mem

Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, each taking stages of 30, 40, and 20 minutes back to back]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to midnight; loads A, B, C, D overlap, with stage times 30, 40, 40, 40, 40, 20 minutes along the time axis]
• Pipelined laundry takes 3.5 hours for 4 loads
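The 6-hour and 3.5-hour figures can be checked with a short sketch. It assumes three stages of 30, 40, and 20 minutes (the durations on the slide), four loads, and unlimited waiting space between stages; the function names are hypothetical.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_minutes)

def pipelined_minutes(stage_minutes, loads):
    """A load enters a stage as soon as both the load and the stage are free."""
    stage_free = [0] * len(stage_minutes)   # time at which each stage becomes idle
    finish = 0
    for _ in range(loads):
        t = 0                               # time this load is ready for its next stage
        for i, duration in enumerate(stage_minutes):
            start = max(t, stage_free[i])
            t = start + duration
            stage_free[i] = t
        finish = t
    return finish

stages = [30, 40, 20]
seq = sequential_minutes(stages, 4)    # 360 minutes = 6 hours
pipe = pipelined_minutes(stages, 4)    # 210 minutes = 3.5 hours
```

The pipelined total is 30 + 4 x 40 + 20 minutes: once the pipeline is full, one load completes every 40 minutes, the time of the slowest stage.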

Pipelining Lessons
[Figure: pipelined laundry timeline, 6 PM to 9 PM, loads A, B, C, D]
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup

Recap: A Single Cycle Datapath
• Rs, Rt, Rd and Imm16 hardwired into datapath from Fetch Unit
• We have everything except control signals (underlined)
• Today’s lecture will show you how to generate the control signals
[Figure: single-cycle datapath. Instruction<31:0> from the Instruction Fetch Unit supplies Rs <21:25>, Rt <16:20>, Rd <11:15>, and Imm16 <0:15>. Components: 32 x 32-bit register file (Rw, Ra, Rb; busA, busB, busW), Extender, ALU (with Zero output), Data Memory, and muxes. Control signals: nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg]

5 Steps of MIPS Datapath
[Figure 3.1, page 130, CA:AQA 2e: unpipelined datapath with stages Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Components: Next PC mux, adder (+4), instruction memory, register file (RS1, RS2, RD), sign extend (Imm), ALU with Zero? test, data memory, LMD, WB Data mux]

5 Steps of MIPS Datapath
[Figure 3.4, page 134, CA:AQA 2e: the same datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the stages; RD is carried along through the pipeline registers to Write Back]

Visualizing Pipelining
[Figure 3.3, page 133, CA:AQA 2e: instructions in program order plotted against time in clock cycles 1 to 7; each instruction passes through Ifetch, Reg, ALU, DMem, Reg]

It's Not That Easy for Computers
• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
– Data hazards: instruction depends on the result of a prior instruction still in the pipeline (missing sock)
– Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

Review of Performance cpsc 614 Lec 1. 39

Which is faster?

Plane              DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747         6.5 hours     610 mph    470          286,700
BAC/Sud Concorde   3 hours       1350 mph   132          178,200

• Time to run the task (ExTime)
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns … (Performance)
– Throughput, bandwidth

Definitions
• Performance is in units of things per sec
– bigger is better
• If we are primarily concerned with response time

$$\text{performance}(x) = \frac{1}{\text{execution\_time}(x)}$$

• "X is n times faster than Y" means

$$n = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = \frac{\text{Execution\_time}(Y)}{\text{Execution\_time}(X)}$$
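As a quick check of these definitions, the sketch below applies them to the flight data from the "Which is faster?" slide. The function name `times_faster` is a hypothetical choice; throughput is computed as speed times passengers, in passenger-miles per hour.

```python
def times_faster(exec_time_x, exec_time_y):
    """Return n such that X is n times faster than Y: n = time(Y) / time(X)."""
    return exec_time_y / exec_time_x

# Latency view: Concorde (3 hours) vs. Boeing 747 (6.5 hours).
latency_ratio = times_faster(3.0, 6.5)

# Throughput view: speed (mph) x passengers = passenger-miles per hour.
throughput_747 = 610 * 470
throughput_concorde = 1350 * 132
throughput_ratio = throughput_747 / throughput_concorde
```

The Concorde wins on latency by a bit more than 2x, while the 747 wins on throughput by roughly 1.6x, which is why "faster" has to name the metric.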

Computer Performance

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

               Inst Count   CPI   Clock Rate
Program        X
Compiler       X            (X)
Inst. Set      X            X
Organization                X     X
Technology                        X
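The product form is easy to evaluate directly. The sketch below uses made-up numbers (1 billion instructions, CPI of 1.5, 500 MHz clock) purely for illustration; none of them come from the lecture.

```python
def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    """CPU time = instruction count x cycles per instruction x seconds per cycle."""
    cycle_time = 1.0 / clock_rate_hz
    return instruction_count * cpi * cycle_time

# Hypothetical workload: 1e9 instructions, CPI 1.5, 500 MHz clock.
baseline = cpu_time_seconds(1e9, 1.5, 500e6)

# Doubling the clock rate alone halves CPU time only if CPI is unchanged.
faster_clock = cpu_time_seconds(1e9, 1.5, 1000e6)
```

Each factor is influenced by a different level of the system, so an improvement in one (for example clock rate) can be cancelled by a regression in another (for example CPI).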

Cycles Per Instruction (Throughput) “Average Cycles per Instruction” CPI = (CPU Time * Clock Rate) / Instruction Count = Cycles / Instruction Count “Instruction Frequency” cpsc 614 Lec 1. 43

Example: Calculating CPI bottom up
Base Machine (Reg / Reg), typical mix of instruction types in program:

Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        0.5      (33%)
Load     20%    2        0.4      (27%)
Store    10%    2        0.2      (13%)
Branch   20%    2        0.4      (27%)
Total                    1.5
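The table can be reproduced with a few lines: average CPI is the frequency-weighted sum of per-class cycle counts, and each class's share of time is its contribution divided by that sum. The dictionary layout and function names below are hypothetical.

```python
# Instruction mix from the table: op -> (frequency, cycles).
mix = {
    "ALU":    (0.50, 1),
    "Load":   (0.20, 2),
    "Store":  (0.10, 2),
    "Branch": (0.20, 2),
}

def average_cpi(mix):
    """Sum of frequency x cycles over all instruction classes."""
    return sum(freq * cycles for freq, cycles in mix.values())

def time_share(mix):
    """Fraction of execution time spent in each instruction class."""
    total = average_cpi(mix)
    return {op: freq * cycles / total for op, (freq, cycles) in mix.items()}

cpi = average_cpi(mix)
shares = time_share(mix)
```

Note that ALU operations are half of the instructions but only a third of the time, which is the kind of gap that guides where to spend design effort.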

Example: Branch Stall Impact
• Assume CPI = 1.0 ignoring branches (ideal)
• Assume solution was stalling for 3 cycles
• If 30% branch, stall 3 cycles on 30%

Op       Freq   Cycles   CPI(i)   (% Time)
Other    70%    1        0.7      (37%)
Branch   30%    4        1.2      (63%)

• => new CPI = 1.9
• New machine is 1/1.9 ≈ 0.53 times as fast (i.e. slow!)
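A minimal sketch of the same arithmetic, assuming the ideal CPI of 1.0, a 3-cycle stall, and a 30% branch frequency as stated on the slide (the function name is hypothetical):

```python
def cpi_with_branch_stalls(base_cpi, branch_freq, stall_cycles):
    """Non-branches cost base_cpi; each branch costs base_cpi plus the stall."""
    return (1.0 - branch_freq) * base_cpi + branch_freq * (base_cpi + stall_cycles)

new_cpi = cpi_with_branch_stalls(1.0, 0.30, 3)   # 0.7 * 1 + 0.3 * 4
relative_speed = 1.0 / new_cpi                   # vs. the ideal CPI of 1.0
branch_time_share = 0.30 * 4 / new_cpi
```

With instruction count and clock rate held fixed, relative speed is just the ratio of CPIs, so nearly half the ideal performance is lost to branch stalls.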

Speed Up Equation for Pipelining For simple RISC pipeline, CPI = 1: cpsc 614 Lec 1. 46

• Ex) Suppose we enhance a machine to make all floating-point instructions run five times faster. If the execution time of some benchmark before the floating-point enhancement is 20 seconds, what will speedup be if half of the 20 seconds is spent executing floating-point instructions? cpsc 614 Lec 1. 47
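One way to work this exercise is with Amdahl's Law: only the enhanced fraction of the time shrinks. The sketch below plugs in the numbers from the problem statement (20 seconds total, half of it floating point, 5x enhancement); the function name is a hypothetical choice.

```python
def amdahl_speedup(enhanced_fraction, enhancement_factor):
    """Overall speedup when only a fraction of execution time is sped up."""
    return 1.0 / ((1.0 - enhanced_fraction) + enhanced_fraction / enhancement_factor)

old_time = 20.0                               # seconds, from the problem statement
speedup = amdahl_speedup(0.5, 5.0)            # half the time is FP, made 5x faster
new_time = old_time / speedup                 # 10 s unchanged + 10 s / 5
```

The new time is 10 + 2 = 12 seconds, so the overall speedup is 20/12, about 1.67, far below the 5x applied to the floating-point part.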

Summary
• Modern Computer Architecture is about managing and optimizing across several levels of abstraction with respect to dramatically changing technology and application load
• Key abstractions
– instruction set architecture
– memory
– bus
• Key concepts
– HW/SW boundary
– Compile Time / Run Time
– Pipelining
– Caching
• Performance Iron Triangle relates combined effects
– Total Time = Inst. Count x CPI x Cycle Time

Instruction Pipelining • Execute billions of instructions, so throughput is what matters – except when? • What is desirable in instruction sets for pipelining? – Variable length instructions vs. all instructions same length? – Memory operands part of any operation vs. memory operands only in loads or stores? – Register operand many places in instruction format vs. registers located in same place? cpsc 614 Lec 1. 49

MIPS R3000 ISA (Summary)
• Registers: R0 - R31, PC, HI, LO
• Instruction Categories
– Load/Store
– Computational
– Jump and Branch
– Floating Point
» coprocessor
– Memory Management
– Special
• 3 Instruction Formats: all 32 bits wide
– OP | rs | rt | rd | sa | funct
– OP | rs | rt | immediate
– OP | jump target

Example: MIPS (Note register location)
[Figure: four 32-bit formats; bit positions as marked on the slide]
• Register-Register: Op [31:26], Rs1 [25:21], Rs2 [20:16], Rd [15:11], Opx [10:0] (with a tick mark between bits 6 and 5)
• Register-Immediate: Op [31:26], Rs1 [25:21], Rd [20:16], immediate [15:0]
• Branch: Op [31:26], Rs1 [25:21], Rs2/Opx [20:16], immediate [15:0]
• Jump / Call: Op [31:26], target [25:0]