CPU Performance, Pipelined CPU (Hakim Weatherspoon, CS 3410)

CPU Performance, Pipelined CPU. Hakim Weatherspoon, CS 3410, Spring 2012, Computer Science, Cornell University. See P&H Chapters 1.4 and 4.5

A Simple CPU: remaining branch instructions 2

Memory Layout Examples (big/little endian): # r5 contains 5 (0x00000005)
sb r5, 2(r0)
lb r6, 2(r0)
sw r5, 8(r0)
lb r7, 8(r0)
lb r8, 11(r0)
(figure: byte-addressed memory diagram, addresses 0x00000001 through 0x0000000b, …, 0xffff) 3
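
To make the endianness point concrete, here is a minimal C sketch (assumed, not from the slides) that mimics the sb/sw/lb sequence on a small byte array; which value lb sees at address 11 depends on the host's byte order.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        uint8_t mem[16] = {0};          /* tiny byte-addressable "memory"        */
        uint32_t r5 = 0x00000005;

        mem[2] = (uint8_t)r5;           /* sb r5, 2(r0): store low byte at 2     */
        memcpy(&mem[8], &r5, 4);        /* sw r5, 8(r0): store word at 8; byte   */
                                        /* order follows this host's endianness  */

        int8_t r6 = (int8_t)mem[2];     /* lb r6, 2(r0): sign-extended byte (5)  */
        int8_t r8 = (int8_t)mem[11];    /* lb r8, 11(r0): 0x05 on a big-endian   */
                                        /* host, 0x00 on a little-endian host    */

        printf("r6 = %d, r8 = %d\n", r6, r8);
        return 0;
    }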

Control Flow: More Branches. Conditional Jumps (cont.)
Encoding (almost I-Type, signed offsets): op (6 bits) | rs (5 bits) | subop (5 bits) | offset (16 bits), e.g. 0000010010100000000010
op   subop  mnemonic          description
0x1  0x0    BLTZ rs, offset   if R[rs] < 0 then PC = PC+4+(offset<<2)
0x1  0x1    BGEZ rs, offset   if R[rs] ≥ 0 then PC = PC+4+(offset<<2)
0x6  0x0    BLEZ rs, offset   if R[rs] ≤ 0 then PC = PC+4+(offset<<2)
0x7  0x0    BGTZ rs, offset   if R[rs] > 0 then PC = PC+4+(offset<<2)
4
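
As a quick illustration of the PC+4+(offset<<2) arithmetic in the table above, here is a small C sketch (the PC and offsets are arbitrary example values):

    #include <stdio.h>
    #include <stdint.h>

    /* Branch target for BLTZ/BGEZ/BLEZ/BGTZ: PC + 4 + (sign-extended offset << 2) */
    static uint32_t branch_target(uint32_t pc, int16_t offset) {
        /* multiply by 4 instead of shifting to keep signed offsets well defined */
        return pc + 4 + (uint32_t)((int32_t)offset * 4);
    }

    int main(void) {
        uint32_t pc = 0x00400010;                              /* example PC     */
        printf("forward:  0x%08x\n", branch_target(pc, 3));    /* PC + 4 + 12    */
        printf("backward: 0x%08x\n", branch_target(pc, -2));   /* PC + 4 - 8     */
        return 0;
    }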

Absolute Jump
(datapath figure: Prog. Mem, inst, +4, PC, Reg. File, ALU, control, offset, imm, +, ||, tgt, =?, cmp, ext, addr, Data Mem)
Could have used ALU for branch cmp
op   subop  mnemonic          description
0x1  0x0    BLTZ rs, offset   if R[rs] < 0 then PC = PC+4+(offset<<2)
0x1  0x1    BGEZ rs, offset   if R[rs] ≥ 0 then PC = PC+4+(offset<<2)
0x6  0x0    BLEZ rs, offset   if R[rs] ≤ 0 then PC = PC+4+(offset<<2)
5

Control Flow: Jump and Link. Function/procedure calls.
Encoding (J-Type): op (6 bits) | immediate (26 bits), e.g. 00001100000001001000011000000010
op   mnemonic     description
0x3  JAL target   r31 = PC+8 (+8 due to branch delay slot); PC = (PC+4)[31..28] || (target << 2)
0x2  J target     PC = (PC+4)[31..28] || (target << 2)
6
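
A small C sketch of the J/JAL target arithmetic (the PC and target field are made-up example values): the top 4 bits of PC+4 are concatenated with the 26-bit target shifted left by 2.

    #include <stdio.h>
    #include <stdint.h>

    /* J/JAL target: top 4 bits of PC+4 concatenated with (26-bit target << 2) */
    static uint32_t jump_target(uint32_t pc, uint32_t target26) {
        return ((pc + 4) & 0xF0000000u) | ((target26 & 0x03FFFFFFu) << 2);
    }

    int main(void) {
        uint32_t pc     = 0x00400018;   /* example PC                            */
        uint32_t target = 0x0012345;    /* example 26-bit target field           */
        uint32_t r31    = pc + 8;       /* JAL link value (+8: branch delay slot)*/
        printf("jump to 0x%08x, r31 = 0x%08x\n", jump_target(pc, target), r31);
        return 0;
    }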

Absolute Jump
(datapath figure: Prog. Mem, inst, +4, PC, control, offset, +, ||, Reg. File, ALU, imm, tgt, =?, cmp, ext, addr, Data Mem)
Could have used ALU for link add
op   mnemonic     description
0x3  JAL target   r31 = PC+8 (+8 due to branch delay slot); PC = (PC+4)[31..28] || (target << 2)
7

Performance (See P&H 1.4) 8

Design Goals. What to look for in a computer system?
• Correctness: negotiable?
• Cost
  – purchase cost = f(silicon size = gate count, economics)
  – operating cost = f(energy, cooling)
  – operating cost >= purchase cost
• Efficiency
  – power = f(transistor usage, voltage, wire size, clock rate, …)
  – heat = f(power)
  • Intel Core i7 Bloomfield: 130 Watts
  • AMD Turion: 35 Watts
  • Intel Core 2 Solo: 5.5 Watts
  • Cortex-A9 Dual Core @ 800 MHz: 0.4 Watts
• Performance
• Other: availability, size, greenness, features, …
9

Performance. How to measure performance?
• GHz (billions of cycles per second)
• MIPS (millions of instructions per second)
• MFLOPS (millions of floating point operations per second)
• benchmarks (SPEC, TPC, …)
Metrics:
• latency: how long to finish my program
• throughput: how much work finished per unit time
10

How Fast?
(single-cycle datapath figure: Prog. Mem, PC, new pc, Reg. File, ALU, control ~3 gates)
Assumptions:
• alu: 32-bit ripple carry + some muxes
• next PC: 30-bit ripple carry
• control: minimized for delay (~3 gates)
• transistors: 2 ns per gate
• prog. memory: 16 ns (as much as 8 gates)
• register file: 2 ns access
• ignore wires, register setup time
All signals are stable after 80 gate delays => clock period of at least 160 ns, max frequency ~6 MHz
Better:
• alu: 32-bit carry lookahead + some muxes (~9 gates)
• next PC: 30-bit carry lookahead (~6 gates)
• 21 gate delays => clock period of at least 42 ns, max frequency ~24 MHz
Better still:
• next PC: cheapest adder faster than 21 gate delays
11
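
A small C sketch of the arithmetic above (the gate counts and 2 ns/gate delay are the slide's assumptions, not measured values):

    #include <stdio.h>

    int main(void) {
        double ns_per_gate = 2.0;               /* slide's assumption: 2 ns per gate     */
        int critical_path_gates = 80;           /* ripple-carry datapath, per the slide  */

        double period_ns = critical_path_gates * ns_per_gate;   /* 160 ns                */
        double max_mhz   = 1000.0 / period_ns;                  /* 1000 MHz*ns / period  */
        printf("ripple carry: %.0f ns period, ~%.1f MHz\n", period_ns, max_mhz);

        critical_path_gates = 21;               /* carry-lookahead version               */
        period_ns = critical_path_gates * ns_per_gate;          /* 42 ns                 */
        max_mhz   = 1000.0 / period_ns;
        printf("lookahead:    %.0f ns period, ~%.1f MHz\n", period_ns, max_mhz);
        return 0;
    }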

Adder Performance. 32-bit adder designs (space, time):
• Ripple Carry: ≈ 300 gates, ≈ 64 gate delays
• 2-Way Carry-Skip: ≈ 360 gates, ≈ 35 gate delays
• 3-Way Carry-Skip: ≈ 500 gates, ≈ 22 gate delays
• 4-Way Carry-Skip: ≈ 600 gates, ≈ 18 gate delays
• 2-Way Look-Ahead: ≈ 550 gates, ≈ 16 gate delays
• Split Look-Ahead: ≈ 800 gates, ≈ 10 gate delays
• Full Look-Ahead: ≈ 1200 gates, ≈ 5 gate delays
12

Optimization: Summary
Critical Path
• Longest path from a register output to a register input
• Determines minimum cycle, maximum clock frequency
Strategy 1 (we just employed)
• Optimize for delay on the critical path
• Optimize for size / power / simplicity elsewhere
  – next PC
13

Processor Clock Cycle
(datapath figure: memory, register file, PC, control, offset, new pc, alu, target, imm, =?, cmp, addr, din, dout, memory, extend)
op    mnemonic            description
0x20  LB rd, offset(rs)   R[rd] = sign_ext(Mem[offset+R[rs]])
0x23  LW rd, offset(rs)   R[rd] = Mem[offset+R[rs]]
0x28  SB rd, offset(rs)   Mem[offset+R[rs]] = R[rd]
0x2b  SW rd, offset(rs)   Mem[offset+R[rs]] = R[rd]
14
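
A tiny C sketch of those load/store semantics (a toy model, not the actual datapath), showing in particular the sign extension done by LB:

    #include <stdio.h>
    #include <stdint.h>

    static uint8_t mem[64];           /* toy byte-addressable memory               */

    /* LB: sign-extend the loaded byte; SB: store the low byte of the register     */
    static int32_t lb(uint32_t addr)              { return (int8_t)mem[addr]; }
    static void    sb(uint32_t addr, int32_t val) { mem[addr] = (uint8_t)val;  }

    int main(void) {
        sb(2, 0xFF);                   /* store byte 0xFF at address 2             */
        printf("lb(2) = %d\n", lb(2)); /* prints -1: the byte was sign-extended    */
        return 0;
    }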

Processor Clock Cycle
(datapath figure: memory, register file, alu, PC, control, offset, new pc, target, imm, =?, cmp, addr, din, dout, memory, extend)
op   func   mnemonic   description
0x0  0x08   JR rs      PC = R[rs]
op          mnemonic   description
0x2         J target   PC = (PC+4)[31..28] || (target << 2)
15

Strategy 2: Multi-Cycle Instructions
• Multiple cycles to complete a single instruction
E.g., assume:
• load/store: 100 ns
• arithmetic: 50 ns
• branches: 33 ns
Multi-Cycle CPU: 30 MHz (33 ns cycle) with
– 3 cycles per load/store
– 2 cycles per arithmetic
– 1 cycle per branch
Faster than Single-Cycle CPU? 10 MHz (100 ns cycle) with
– 1 cycle per instruction
16

CPI. Instruction mix for some program P, assume:
• 25% load/store (3 cycles / instruction)
• 60% arithmetic (2 cycles / instruction)
• 15% branches (1 cycle / instruction)
Multi-Cycle performance for program P:
3 * 0.25 + 2 * 0.60 + 1 * 0.15 = 2.1
average cycles per instruction (CPI) = 2.1
Multi-Cycle @ 30 MHz ≈ Single-Cycle @ 15 MHz
800 MHz PIII “faster” than 1 GHz P4
17
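
A small C sketch of the arithmetic above (my worked calculation, not from the slides): at 30 MHz with CPI 2.1 the multi-cycle CPU averages about 14.3 MIPS, roughly what a 15 MHz single-cycle CPU would deliver, and ahead of the 10 MHz single-cycle design from the previous slide.

    #include <stdio.h>

    int main(void) {
        /* Instruction mix for program P and cycles per class (multi-cycle CPU) */
        double cpi = 0.25 * 3 + 0.60 * 2 + 0.15 * 1;   /* = 2.1                  */

        double multi_mips  = 30.0 / cpi;   /* 30 MHz, CPI 2.1 -> ~14.3 MIPS       */
        double single_mips = 10.0 / 1.0;   /* 10 MHz, CPI 1   ->  10 MIPS         */

        printf("CPI = %.2f\n", cpi);
        printf("multi-cycle:  %.1f MIPS\n", multi_mips);
        printf("single-cycle: %.1f MIPS\n", single_mips);
        return 0;
    }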

Example. Goal: Make the Multi-Cycle @ 30 MHz CPU (15 MIPS) run 2x faster by making arithmetic instructions faster.
Instruction mix (for P):
• 25% load/store, CPI = 3
• 60% arithmetic, CPI = 2
• 15% branches, CPI = 1
18
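
A short C sketch of the arithmetic this example sets up (my working, not the slide's): halving the average CPI from 2.1 to 1.05 while leaving loads/stores and branches alone would require the arithmetic instructions to average 0.25 cycles each, which is below one cycle per instruction and therefore out of reach for a multi-cycle machine.

    #include <stdio.h>

    int main(void) {
        double cpi_old  = 0.25 * 3 + 0.60 * 2 + 0.15 * 1;   /* 2.1                 */
        double cpi_goal = cpi_old / 2.0;                     /* 2x faster -> 1.05   */

        /* Cycles contributed by the classes we are not allowed to touch */
        double fixed = 0.25 * 3 + 0.15 * 1;                  /* 0.90                */

        /* Required CPI for the 60% arithmetic instructions */
        double arith_cpi_needed = (cpi_goal - fixed) / 0.60;
        printf("arithmetic CPI needed: %.2f\n", arith_cpi_needed);  /* 0.25: infeasible */
        return 0;
    }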

Amdahl’s Law
Execution time after improvement = (execution time affected by improvement / amount of improvement) + execution time unaffected
Or: Speedup is limited by popularity of improved feature
Corollary: Make the common case fast
Caveat: Law of diminishing returns
19
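
A minimal C sketch of that formula, applied (as an assumed example) to the arithmetic-only speedup from the previous slide: even with infinitely fast arithmetic the program only gets about 2.1/0.9 ≈ 2.3x faster, which is why the 2x goal is so hard to reach through that one class of instructions.

    #include <stdio.h>

    /* Amdahl's Law: t_new = t_affected / improvement + t_unaffected */
    static double amdahl_time(double t_affected, double t_unaffected, double improvement) {
        return t_affected / improvement + t_unaffected;
    }

    int main(void) {
        /* Per-instruction cycle counts from the CPI example (program P) */
        double t_arith = 0.60 * 2;                /* 1.2 cycles affected by the change */
        double t_rest  = 0.25 * 3 + 0.15 * 1;     /* 0.9 cycles unaffected             */
        double t_old   = t_arith + t_rest;        /* 2.1                               */

        for (double imp = 2.0; imp <= 8.0; imp *= 2.0) {
            double t_new = amdahl_time(t_arith, t_rest, imp);
            printf("arithmetic %.0fx faster -> overall speedup %.2fx\n",
                   imp, t_old / t_new);
        }
        return 0;
    }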

Pipelining (See P&H Chapter 4.5) 20

The Kids: Alice and Bob. They don’t always get along… 21

The Bicycle 22

The Materials: Drill, Saw, Glue, Paint 23

The Instructions. N pieces, each built following the same sequence: Saw, Drill, Glue, Paint 24

Design 1: Sequential Schedule
• Alice owns the room
• Bob can enter when Alice is finished
• Repeat for remaining tasks
• No possibility for conflicts
25

Sequential Performance
(timeline figure: hours 1 2 3 4 5 6 7 8 …)
Latency:
Throughput:
Concurrency:
Elapsed Time for Alice: 4
Elapsed Time for Bob: 4
Total elapsed time: 4*N
Can we do better?
26

Design 2: Pipelined Design
Partition room into stages of a pipeline: Dave, Carol, Bob, Alice
One person owns a stage at a time
4 stages, 4 people working simultaneously
Everyone moves right in lockstep
27

Pipelined Performance
(timeline figure: hours 1 2 3 4 5 6 7 …)
Latency:
Throughput:
Concurrency:
28
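
A small C sketch of the elapsed-time comparison (assuming each of the 4 stages takes one hour, as in the sequential schedule; N = 10 kits is an arbitrary example):

    #include <stdio.h>

    int main(void) {
        int stages = 4;
        int n = 10;   /* number of bicycle kits to build (arbitrary)               */

        int sequential_hours = stages * n;          /* 4*N                          */
        int pipelined_hours  = stages + (n - 1);    /* fill the pipe, then 1/hour   */

        printf("sequential: %d hours\n", sequential_hours);   /* 40                 */
        printf("pipelined:  %d hours\n", pipelined_hours);    /* 13                 */
        return 0;
    }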

Unequal Pipeline Stages
(timeline figure: stage annotations 30 min, 15 min, 45 min, 90 min; hours 0 1 2 3 …)
Latency:
Throughput:
Concurrency:
29

Poorly-balanced Pipeline Stages
(timeline figure: stage annotations 30 min, 15 min, 45 min, 90 min; hours 0 1 2 3 …)
Latency:
Throughput:
Concurrency:
30
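
A short C sketch of why balance matters, using the stage times annotated on the figure and the lockstep rule from the pipelined design (my working, not the slide's answers): in lockstep, every stage advances at the pace of the slowest one.

    #include <stdio.h>

    int main(void) {
        int stage_min[4] = {30, 15, 45, 90};   /* stage times from the figure, minutes */

        int slowest = 0;
        for (int i = 0; i < 4; i++)
            if (stage_min[i] > slowest) slowest = stage_min[i];

        printf("cycle time: %d min\n", slowest);                 /* 90 min            */
        printf("throughput: 1 task per %d min\n", slowest);
        printf("latency:    %d min per task (strict lockstep)\n", 4 * slowest);
        return 0;
    }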

Re-Balanced Pipeline Stages
(timeline figure: stage annotations 30 min, 15 min, 45 min, 90 min; hours 0 1 2 3 …)
Latency:
Throughput:
Concurrency:
31

Splitting Pipeline Stages
(timeline figure: stage annotations 30 min, 15 min, 45 min, 45+45 min; hours 0 1 2 3 …)
Latency:
Throughput:
Concurrency:
32

Pipeline Hazards
Q: What if glue step of task 3 depends on output of task 1?
(timeline figure: hours 0 1 2 3 …)
Latency:
Throughput:
Concurrency:
33

Lessons
Principle: Throughput increased by parallel execution
Pipelining:
• Identify pipeline stages
• Isolate stages from each other
• Resolve pipeline hazards
34

A Processor
(single-cycle datapath figure: inst memory, +4, PC, control, register file, offset, imm, extend, target, =?, cmp, alu, addr, din, dout, data memory, new pc)
35

A Processor
(same datapath, now grouped into stages: Instruction Fetch, Instruction Decode, Execute (compute jump/branch targets), Memory, Writeback)
36

Basic Pipeline
Five stage “RISC” load-store architecture:
1. Instruction Fetch (IF) – get instruction from memory, increment PC
2. Instruction Decode (ID) – translate opcode into control signals and read registers
3. Execute (EX) – perform ALU operation, compute jump/branch targets
4. Memory (MEM) – access memory if needed
5. Writeback (WB) – update register file
37

Pipelined Implementation
Break instructions across multiple clock cycles (five, in this case)
Design a separate stage for the execution performed during each clock cycle
Add pipeline registers (flip-flops) to isolate signals between different stages
Slides thanks to Kevin Walsh, Sally McKee, and Kavita Bala
38
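
To illustrate the idea of pipeline registers, here is a conceptual C sketch only (the real design is hardware, and these struct and field names are made up): each struct stands for the bank of flip-flops between two stages, and updating them from the last stage back to the first mimics all of them latching on the same clock edge.

    #include <stdio.h>
    #include <stdint.h>

    /* One struct per pipeline register; field names are illustrative only */
    typedef struct { uint32_t pc, inst; }                 IF_ID;
    typedef struct { uint32_t pc, rs_val, rt_val, imm; }  ID_EX;
    typedef struct { uint32_t alu_out, store_val; }       EX_MEM;
    typedef struct { uint32_t result; int dest_reg; }     MEM_WB;

    int main(void) {
        IF_ID  if_id  = {0};
        ID_EX  id_ex  = {0};
        EX_MEM ex_mem = {0};
        MEM_WB mem_wb = {0};

        /* One clock edge: update WB-side registers first so every register
         * takes the value its upstream stage produced in the previous cycle,
         * i.e. all four latch "simultaneously" and up to five instructions
         * are in flight at once, one per stage.                             */
        mem_wb = (MEM_WB){ .result = ex_mem.alu_out, .dest_reg = 2 };
        ex_mem = (EX_MEM){ .alu_out = id_ex.rs_val + id_ex.imm };
        id_ex  = (ID_EX) { .pc = if_id.pc, .imm = 4 };
        if_id  = (IF_ID) { .pc = 0x00400000, .inst = 0 };

        printf("WB result = %u\n", (unsigned)mem_wb.result);
        return 0;
    }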