CPU Performance Pipelined CPU Hakim Weatherspoon CS 3410

CPU Performance
Pipelined CPU
Hakim Weatherspoon
CS 3410, Spring 2013
Computer Science, Cornell University
See P&H Chapters 1.4 and 4.5

“In a major matter, no details are small” French Proverb

MIPS Design Principles
Simplicity favors regularity
• 32-bit instructions
Smaller is faster
• Small register file
Make the common case fast
• Include support for constants
Good design demands good compromises
• Support for different types of interpretations/classes

Big Picture: Building a Processor
[Figure: datapath with PC and +4 adder, instruction memory, register file, ALU, branch compare (=?), immediate extend, control, and data memory (addr, din, dout)]
A single-cycle processor

Goals for today
MIPS Datapath
• Memory layout
• Control instructions
Performance
• CPI (Cycles Per Instruction)
• MIPS (Millions of Instructions Per Second)
• Clock frequency
Pipelining
• Latency vs. throughput

Memory Layout and Control instructions

MIPS instruction formats
All MIPS instructions are 32 bits long and use one of 3 formats.

R-type:  op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | func (6 bits)
I-type:  op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
J-type:  op (6 bits) | immediate (target address) (26 bits)
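The three formats can be pulled apart with shifts and masks. The sketch below is illustrative only: the function name `decode` is hypothetical, and treating every opcode other than 0x0, 0x2 and 0x3 as I-type is a simplifying assumption that ignores special cases.

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into the fields of its format."""
    op = (word >> 26) & 0x3F              # top 6 bits select the format
    if op == 0x0:                         # R-type
        return {
            "format": "R", "op": op,
            "rs": (word >> 21) & 0x1F,
            "rt": (word >> 16) & 0x1F,
            "rd": (word >> 11) & 0x1F,
            "shamt": (word >> 6) & 0x1F,
            "func": word & 0x3F,
        }
    if op in (0x2, 0x3):                  # J-type (J and JAL)
        return {"format": "J", "op": op, "target": word & 0x3FFFFFF}
    # Assumption: everything else is handled as I-type.
    return {
        "format": "I", "op": op,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "immediate": word & 0xFFFF,
    }


# 0xACA10004 encodes op 0x2b, rs 5, rt 1, immediate 4.
print(decode(0xACA10004))
```

The field widths sum to 32 in each format, which is why a fixed set of shift amounts suffices.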

MIPS Instruction Types
Arithmetic/Logical
• R-type: result and two source registers, shift amount
• I-type: 16-bit immediate with sign/zero extension
Memory Access
• load/store between registers and memory
• word, half-word and byte operations
Control flow
• conditional branches: pc-relative addresses
• jumps: fixed offsets, register absolute

Memory Instructions
1010110010100000000100
op (6 bits) | rs (5 bits) | rd (5 bits) | offset (16 bits)    I-Type
base + offset addressing, signed offsets

op     mnemonic             description
0x20   LB rd, offset(rs)    R[rd] = sign_ext(Mem[offset+R[rs]])
0x24   LBU rd, offset(rs)   R[rd] = zero_ext(Mem[offset+R[rs]])
0x21   LH rd, offset(rs)    R[rd] = sign_ext(Mem[offset+R[rs]])
0x25   LHU rd, offset(rs)   R[rd] = zero_ext(Mem[offset+R[rs]])
0x23   LW rd, offset(rs)    R[rd] = Mem[offset+R[rs]]
0x28   SB rd, offset(rs)    Mem[offset+R[rs]] = R[rd]
0x29   SH rd, offset(rs)    Mem[offset+R[rs]] = R[rd]
0x2b   SW rd, offset(rs)    Mem[offset+R[rs]] = R[rd]

ex: Mem[4+r5] = r1    # SW r1, 4(r5)
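The difference between the sign_ext and zero_ext loads shows up only when the top bit of the loaded byte or half-word is set. A minimal sketch, with hypothetical helper names `sign_ext` and `zero_ext` that return the value as a 32-bit pattern:

```python
def zero_ext(value, bits):
    """Keep the low `bits` bits and fill the upper bits of the register with 0."""
    return value & ((1 << bits) - 1)


def sign_ext(value, bits):
    """Copy bit (bits-1) of the value into all upper bits of a 32-bit register."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):         # top bit set: value is negative
        value -= 1 << bits
    return value & 0xFFFFFFFF             # express the result as a 32-bit pattern


byte = 0x80                               # a byte read from memory
print(hex(sign_ext(byte, 8)))             # what LB would place in the register
print(hex(zero_ext(byte, 8)))             # what LBU would place in the register
```

For a byte below 0x80 both helpers return the same value.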

Endianness: Ordering of bytes within a memory word

Little Endian = least significant part first (MIPS, x86)
address:          1000   1001   1002   1003
as 4 bytes:       0x78   0x56   0x34   0x12
as 2 halfwords:   0x5678        0x1234
as 1 word:        0x12345678

Big Endian = most significant part first (MIPS, networks)
address:          1000   1001   1002   1003
as 4 bytes:       0x12   0x34   0x56   0x78
as 2 halfwords:   0x1234        0x5678
as 1 word:        0x12345678
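The two byte orders can be checked with Python's built-in `int.to_bytes`, which lays out an integer in memory order for a given endianness. This is only an illustration of the layout, not a model of a MIPS memory system.

```python
word = 0x12345678

little = word.to_bytes(4, "little")   # bytes at addresses 1000..1003, little endian
big = word.to_bytes(4, "big")         # bytes at addresses 1000..1003, big endian

print([hex(b) for b in little])
print([hex(b) for b in big])

# Reading the same four bytes back with the matching byte order recovers the word.
assert int.from_bytes(little, "little") == word
assert int.from_bytes(big, "big") == word
```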

Memory Layout
Examples (big/little endian):
# r5 contains 5 (0x00000005)

SB r5, 2(r0)
LB r6, 2(r0)     # R[r6] = 0x05

SW r5, 8(r0)
LB r7, 8(r0)     # R[r7] = 0x00 (big endian)
LB r8, 11(r0)    # R[r8] = 0x05 (big endian)

[Figure: byte-addressed memory from 0x00000000 to 0xffff..., holding 0x05 at address 0x00000002 and the bytes 0x00 ... 0x05 at addresses 0x00000008 through 0x0000000b]
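The example can be replayed against a small byte array to see how the results depend on byte order. This is a sketch under stated assumptions: `run_example` is a hypothetical helper, memory is 16 bytes, and sign extension in LB is omitted because every loaded byte here is below 0x80.

```python
def run_example(byteorder):
    """Replay the SB/LB/SW/LB sequence; byteorder is 'big' or 'little'."""
    mem = bytearray(16)                       # byte-addressed memory, all zero
    r5 = 5

    mem[2] = r5 & 0xFF                        # SB r5, 2(r0): store the low byte
    r6 = mem[2]                               # LB r6, 2(r0)

    mem[8:12] = r5.to_bytes(4, byteorder)     # SW r5, 8(r0): store the whole word
    r7 = mem[8]                               # LB r7, 8(r0)
    r8 = mem[11]                              # LB r8, 11(r0)
    return r6, r7, r8


print(run_example("big"))
print(run_example("little"))
```

With big-endian order the word's least significant byte lands at address 11; with little-endian order it lands at address 8, so r7 and r8 swap values.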

MIPS Instruction Types
Arithmetic/Logical
• R-type: result and two source registers, shift amount
• I-type: 16-bit immediate with sign/zero extension
Memory Access
• load/store between registers and memory
• word, half-word and byte operations
Control flow
• conditional branches: pc-relative addresses
• jumps: fixed offsets, register absolute

Control Flow: Absolute Jump
00001010100001001000011000000011
op (6 bits) | immediate (26 bits)    J-Type

op    mnemonic   description
0x2   J target   PC = (PC+4)31..28 || target || 00

ex: j 0xa12180c (== j 1010 0001 0010 0001 1000 0000 11 || 00)
    PC = (PC+4)31..28 || 0xa12180c

Absolute addressing for jumps
• (PC+4)31..28 will be the same
• Jump from 0x30000000 to 0x20000000? NO. Reverse? NO
  – But: jumps from 0x2FFFFFFF to 0x3xxxxxxx are possible, but not the reverse
• Trade-off: out-of-region jumps vs. 32-bit instruction encoding

MIPS Quirk:
• jump targets computed using already incremented PC
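The target computation can be written out directly from the description PC = (PC+4)31..28 || target || 00. A minimal sketch; `jump_target` is a hypothetical name and the starting PC values are chosen only for illustration.

```python
def jump_target(pc, target26):
    """Next PC for J: top 4 bits of (PC+4), then the 26-bit target, then 00."""
    upper = (pc + 4) & 0xF0000000             # bits 31..28 of the incremented PC
    return upper | ((target26 & 0x3FFFFFF) << 2)


word = 0x0A848603                 # the slide's bit pattern: op 0x2 plus 26-bit target
target26 = word & 0x3FFFFFF

print(hex(jump_target(0x20000000, target26)))   # stays in the 0x2... region
print(hex(jump_target(0x2FFFFFFC, target26)))   # PC+4 is 0x30000000, so 0x3... region
```

The second call shows the quirk: because the incremented PC supplies the upper bits, a jump in the last word of one 256 MB region lands in the next region.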

Absolute Jump
[Figure: single-cycle datapath for J: program memory, register file, ALU, data memory, PC with +4 adder, control, immediate extend, and the || unit that forms the jump target]

op    mnemonic   description
0x2   J target   PC = (PC+4)31..28 || target || 00

Control Flow: Jump Register
00000110000000001000
op (6 bits) | rs (5 bits) | - (5 bits) | - (5 bits) | - (5 bits) | func (6 bits)    R-Type

op    func   mnemonic   description
0x0   0x08   JR rs      PC = R[rs]

ex: JR r3

Jump Register
[Figure: single-cycle datapath for JR: R[r3] read from the register file is routed to the PC]

op    func   mnemonic   description
0x0   0x08   JR rs      PC = R[rs]

Examples
E.g. Use Jump or Jump Register instruction to jump to 0xabcd1234

But, what about a jump based on a condition?
# assume 0 <= r3 <= 1
if (r3 == 0) jump to 0xdecafe00
else jump to 0xabcd1234

Control Flow: Branches
00010100000000011
op (6 bits) | rs (5 bits) | rd (5 bits) | offset (16 bits)    I-Type
signed offsets

op    mnemonic             description
0x4   BEQ rs, rd, offset   if R[rs] == R[rd] then PC = PC+4 + (offset<<2)
0x5   BNE rs, rd, offset   if R[rs] != R[rd] then PC = PC+4 + (offset<<2)

ex: BEQ r5, r1, 3
    if (R[r5] == R[r1]) then PC = PC+4 + 12    (i.e. 12 == 3<<2)
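Because the offset is signed and counted in words, a branch can move the PC backward as well as forward. The sketch below follows the BEQ description; `sign_ext16` and `beq_next_pc` are hypothetical names, and the branch delay slot is ignored.

```python
def sign_ext16(imm):
    """Interpret a 16-bit field as a signed two's-complement integer."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def beq_next_pc(pc, rs_val, rd_val, offset):
    """Next PC for BEQ: PC+4 + (offset<<2) if the registers match, else PC+4."""
    if rs_val == rd_val:
        return (pc + 4 + (sign_ext16(offset) << 2)) & 0xFFFFFFFF
    return (pc + 4) & 0xFFFFFFFF


# BEQ r5, r1, 3 at address 0x1000, with equal register values: PC+4 + 12.
print(hex(beq_next_pc(0x1000, 7, 7, 3)))
# An offset field of 0xFFFF is -1, which branches back to the BEQ itself.
print(hex(beq_next_pc(0x1000, 7, 7, 0xFFFF)))
```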

Examples (2)
if (i == j) { i = i * 4; }
else { j = i - j; }

Absolute Jump
[Figure: single-cycle datapath for BEQ: R[r5] and R[r1] from the register file feed the =? comparator; the PC+4 adder and a second adder for the extended offset produce the branch target]
Could have used ALU for branch cmp
Could have used ALU for branch add

op    mnemonic             description
0x4   BEQ rs, rd, offset   if R[rs] == R[rd] then PC = PC+4 + (offset<<2)
0x5   BNE rs, rd, offset   if R[rs] != R[rd] then PC = PC+4 + (offset<<2)

Control Flow: More Branches
Conditional Jumps (cont.)
0000010010100000000010
op (6 bits) | rs (5 bits) | subop (5 bits) | offset (16 bits)    almost I-Type
signed offsets

op    subop   mnemonic          description
0x1   0x0     BLTZ rs, offset   if R[rs] < 0 then PC = PC+4 + (offset<<2)
0x1   0x1     BGEZ rs, offset   if R[rs] ≥ 0 then PC = PC+4 + (offset<<2)
0x6   0x0     BLEZ rs, offset   if R[rs] ≤ 0 then PC = PC+4 + (offset<<2)
0x7   0x0     BGTZ rs, offset   if R[rs] > 0 then PC = PC+4 + (offset<<2)

ex: BGEZ r5, 2
    if (R[r5] ≥ 0) then PC = PC+4 + 8    (i.e. 8 == 2<<2)

Absolute Jump
[Figure: single-cycle datapath for the compare-with-zero branches: R[r5] from the register file feeds the cmp unit; the PC+4 adder and a second adder for the extended offset produce the branch target]
Could have used ALU for branch cmp

op    subop   mnemonic          description
0x1   0x0     BLTZ rs, offset   if R[rs] < 0 then PC = PC+4 + (offset<<2)
0x1   0x1     BGEZ rs, offset   if R[rs] ≥ 0 then PC = PC+4 + (offset<<2)
0x6   0x0     BLEZ rs, offset   if R[rs] ≤ 0 then PC = PC+4 + (offset<<2)

Control Flow: Jump and Link
Function/procedure calls
00001100000001001000011000000010
op (6 bits) | immediate (26 bits)    J-Type

op    mnemonic     description
0x3   JAL target   r31 = PC+8 (+8 due to branch delay slot; discuss later)
                   PC = (PC+4)31..28 || (target << 2)

ex: JAL 0x0121808 (== JAL 0000 0001 0010 0001 1000 0000 10 << 2)
    r31 = PC+8
    PC = (PC+4)31..28 || 0x0121808

op    mnemonic     description
0x2   J target     PC = (PC+4)31..28 || (target << 2)
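JAL and JR together form a call and return: JAL saves the return address in r31, and a later JR r31 restores it to the PC. A minimal sketch with hypothetical helper names `jal` and `jr`; the call site address 0x00400000 is an assumption chosen for illustration.

```python
def jal(pc, target26):
    """Return (new_pc, r31) for JAL, following the slide's description."""
    r31 = (pc + 8) & 0xFFFFFFFF                       # +8 skips the branch delay slot
    new_pc = ((pc + 4) & 0xF0000000) | ((target26 & 0x3FFFFFF) << 2)
    return new_pc, r31


def jr(rs_val):
    """JR rs: the register value becomes the PC."""
    return rs_val & 0xFFFFFFFF


target26 = 0x0C048602 & 0x3FFFFFF     # low 26 bits of the slide's JAL bit pattern
callee_pc, r31 = jal(0x00400000, target26)
print(hex(callee_pc), hex(r31))       # where the call lands, and the saved link
print(hex(jr(r31)))                   # JR r31 returns past the call and its delay slot
```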

Absolute Jump
[Figure: single-cycle datapath for JAL: a second +4 adder produces PC+8, which is written to R[r31]; the || unit forms the jump target]
Could have used ALU for link add

op    mnemonic     description
0x3   JAL target   r31 = PC+8 (+8 due to branch delay slot)
                   PC = (PC+4)31..28 || (target << 2)

Goals for today
MIPS Datapath
• Memory layout
• Control instructions
Performance
• CPI (Cycles Per Instruction)
• MIPS (Millions of Instructions Per Second)
• Clock frequency
Pipelining
• Latency vs. throughput

Next Goal
How do we measure performance?
What is the performance of a single-cycle CPU?
See: P&H 1.4

Performance
How to measure performance?
• GHz (billions of cycles per second)
• MIPS (millions of instructions per second)
• MFLOPS (millions of floating point operations per second)
• Benchmarks (SPEC, TPC, …)
Metrics
• latency: how long to finish my program
• throughput: how much work finished per unit time

Which instruction has the longest path?
A) LW
B) SW
C) ADD/SUB/AND/OR/etc
D) BEQ
E) J

Latency: Processor Clock Cycle
[Figure: single-cycle datapath with PC, instruction memory, register file, ALU, branch compare (=?), immediate extend, control, and data memory (addr, din, dout)]

op     mnemonic            description
0x20   LB rd, offset(rs)   R[rd] = sign_ext(Mem[offset+R[rs]])
0x23   LW rd, offset(rs)   R[rd] = Mem[offset+R[rs]]
0x28   SB rd, offset(rs)   Mem[offset+R[rs]] = R[rd]
0x2b   SW rd, offset(rs)   Mem[offset+R[rs]] = R[rd]

Latency: Processor Clock Cycle
Critical Path
• Longest path from a register output to a register input
• Determines minimum cycle, maximum clock frequency
How do we make the CPU perform better (e.g. cheaper, cooler, go “faster”, …)?
• Optimize for delay on the critical path
• Optimize for size / power / simplicity elsewhere

Latency: Optimize Delay on Critical Path
E.g. Adder performance

32 Bit Adder Design   Space          Time
Ripple Carry          ≈ 300 gates    ≈ 64 gate delays
2-Way Carry-Skip      ≈ 360 gates    ≈ 35 gate delays
3-Way Carry-Skip      ≈ 500 gates    ≈ 22 gate delays
4-Way Carry-Skip      ≈ 600 gates    ≈ 18 gate delays
2-Way Look-Ahead      ≈ 550 gates    ≈ 16 gate delays
Split Look-Ahead      ≈ 800 gates    ≈ 10 gate delays
Full Look-Ahead       ≈ 1200 gates   ≈ 5 gate delays

Throughput: Multi-Cycle Instructions
Strategy 2
• Multiple cycles to complete a single instruction

E.g. Assume:
• load/store: 100 ns (10 MHz)
• arithmetic: 50 ns (20 MHz)
• branches: 33 ns (30 MHz)

ms = 10^-3 seconds, us = 10^-6 seconds, ns = 10^-9 seconds

Multi-Cycle CPU
30 MHz (33 ns cycle) with
– 3 cycles per load/store
– 2 cycles per arithmetic
– 1 cycle per branch

Faster than Single-Cycle CPU?
10 MHz (100 ns cycle) with
– 1 cycle per instruction

Cycles Per Instruction (CPI)
Instruction mix for some program P, assume:
• 25% load/store (3 cycles / instruction)
• 60% arithmetic (2 cycles / instruction)
• 15% branches (1 cycle / instruction)

Multi-Cycle performance for program P:
3 * .25 + 2 * .60 + 1 * .15 = 2.1
average cycles per instruction (CPI) = 2.1

Multi-Cycle @ 30 MHz vs. Single-Cycle @ 10 MHz
(30 M cycles/sec) / (2.1 cycles/instr) ≈ 15 MIPS vs. 10 MIPS
MIPS = millions of instructions per second

800 MHz PIII “faster” than 1 GHz P4
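The CPI and MIPS figures follow from a weighted average over the instruction mix. A short sketch of that arithmetic; the names `MIX`, `average_cpi` and `mips_rating` are hypothetical. The computed rating is about 14.3, which the slide rounds to roughly 15.

```python
# Instruction mix for program P: fraction of instructions and cycles each takes.
MIX = {
    "load/store": (0.25, 3),
    "arithmetic": (0.60, 2),
    "branch":     (0.15, 1),
}


def average_cpi(mix):
    """Weighted average of cycles per instruction over the mix."""
    return sum(fraction * cycles for fraction, cycles in mix.values())


def mips_rating(clock_hz, cpi):
    """Millions of instructions per second at a given clock rate and CPI."""
    return clock_hz / cpi / 1e6


cpi = average_cpi(MIX)
print(round(cpi, 2))                            # multi-cycle CPI
print(round(mips_rating(30e6, cpi), 1))         # multi-cycle CPU at 30 MHz
print(round(mips_rating(10e6, 1.0), 1))         # single-cycle CPU at 10 MHz, CPI = 1
```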

Example
Goal: Make Multi-Cycle @ 30 MHz CPU (15 MIPS) run 2x faster by making arithmetic instructions faster

Instruction mix (for P):
• 25% load/store, CPI = 3
• 60% arithmetic, CPI = 2
• 15% branches, CPI = 1
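One way to work the example is to solve for the arithmetic CPI that halves the average CPI. The sketch below assumes the clock stays at 30 MHz and the instruction count is unchanged, so a 2x speedup means halving average CPI; `required_cpi` is a hypothetical helper.

```python
MIX = {
    "load/store": (0.25, 3),
    "arithmetic": (0.60, 2),
    "branch":     (0.15, 1),
}


def required_cpi(mix, improved, speedup):
    """CPI the `improved` class must reach for the whole program to run `speedup`x faster."""
    current = sum(f * c for f, c in mix.values())
    target = current / speedup                       # average CPI we need to reach
    rest = sum(f * c for name, (f, c) in mix.items() if name != improved)
    fraction, _ = mix[improved]
    return (target - rest) / fraction                # negative means unreachable


print(round(required_cpi(MIX, "arithmetic", 2), 3))  # arithmetic CPI needed for 2x
print(round(required_cpi(MIX, "arithmetic", 3), 3))  # negative: 3x is not achievable
```

Under these assumptions arithmetic must drop from 2 cycles to 0.25 cycles per instruction, an 8x improvement in one class to gain 2x overall.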

Amdahl’s Law
Execution time after improvement =
    (execution time affected by improvement / amount of improvement) + execution time unaffected

Or: Speedup is limited by popularity of improved feature
Corollary: Make the common case fast
Caveat: Law of diminishing returns
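The law translates into a two-line calculation. In this sketch, time is measured in average cycles per instruction for the lecture's program P (1.2 cycles in arithmetic, 0.9 cycles elsewhere); using those numbers is an illustrative assumption, and the function names are hypothetical.

```python
def time_after(affected, unaffected, improvement):
    """Amdahl's Law: affected time shrinks by `improvement`, the rest is unchanged."""
    return affected / improvement + unaffected


def speedup(affected, unaffected, improvement):
    """Overall speedup from improving only the affected part."""
    return (affected + unaffected) / time_after(affected, unaffected, improvement)


affected, unaffected = 0.60 * 2, 0.25 * 3 + 0.15 * 1   # arithmetic vs. everything else

for k in (2, 8, 1000):
    print(k, round(speedup(affected, unaffected, k), 3))

# Upper bound as the improvement grows without limit: only the unaffected time remains.
print(round((affected + unaffected) / unaffected, 3))
```

The diminishing returns are visible in the output: going from an 8x to a 1000x improvement in arithmetic adds only about a third to the overall speedup.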

Administrivia
Required: partner for group project
Project 1 (PA1) and Homework 2 (HW2) are both out
PA1 Design Doc and HW2 due in one week, start early
Work alone on HW2, but in group for PA1
Save your work!
• Save often. Verify file is non-zero. Periodically save to Dropbox, email.
• Beware of Mac OS X 10.5 (Leopard) and 10.6 (Snow Leopard)
Use your resources
• Lab Section, Piazza.com, Office Hours, Homework Help Session,
• Class notes, book, Sections, CSUGLab

Administrivia
Check online syllabus/schedule
• http://www.cs.cornell.edu/Courses/CS3410/2013sp/schedule.html
Slides and Reading for lectures
Office Hours
Homework and Programming Assignments
Prelims (in evenings):
• Tuesday, February 26th
• Thursday, March 28th
• Thursday, April 25th
Schedule is subject to change

Collaboration, Late, Re-grading Policies
“Black Board” Collaboration Policy
• Can discuss approach together on a “black board”
• Leave and write up solution independently
• Do not copy solutions
Late Policy
• Each person has a total of four “slip days”
• Max of two slip days for any individual assignment
• Slip days deducted first for any late assignment, cannot selectively apply slip days
• For projects, slip days are deducted from all partners
• 20% deducted per day late after slip days are exhausted
Regrade policy
• Submit written request to lead TA, and lead TA will pick a different grader
• Submit another written request, lead TA will regrade directly
• Submit yet another written request for professor to regrade.

Pipelining
See: P&H Chapter 4.5

A Processor
[Figure: single-cycle datapath with PC and +4 adder, instruction memory, register file, ALU, branch compare (=?), immediate extend, control, and data memory (addr, din, dout)]

A Processor
[Figure: the same datapath divided into five stages: Instruction Fetch, Instruction Decode, Execute (compute jump/branch targets), Memory, Write-Back]

Basic Pipeline
Five stage “RISC” load-store architecture
1. Instruction Fetch (IF) – get instruction from memory, increment PC
2. Instruction Decode (ID) – translate opcode into control signals and read registers
3. Execute (EX) – perform ALU operation, compute jump/branch targets
4. Memory (MEM) – access memory if needed
5. Writeback (WB) – update register file
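The overlap between instructions in the five stages can be tabulated cycle by cycle. The sketch assumes an ideal pipeline (one instruction enters per cycle, no stalls or hazards); `schedule` is a hypothetical helper.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def schedule(n_instructions):
    """Row i lists the stage instruction i occupies in each cycle ('' when idle)."""
    total_cycles = len(STAGES) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name             # instruction i enters IF in cycle i
        rows.append(row)
    return rows


for i, row in enumerate(schedule(3)):
    print(f"inst {i}: " + " ".join(f"{stage:>3}" for stage in row))
```

Each instruction still takes five cycles from fetch to writeback (latency), but under these assumptions n instructions finish in 5 + n - 1 cycles, approaching one completed instruction per cycle (throughput).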

Principles of Pipelined Implementation
• Break instructions across multiple clock cycles (five, in this case)
• Design a separate stage for the execution performed during each clock cycle
• Add pipeline registers (flip-flops) to isolate signals between different stages