CSCE 513 Computer Architecture
Lecture 3: Instruction Level Parallelism (Pipelining)
Topics
- Execution time
- ILP
Readings: Appendix C
September 6, 2017

Overview
Last Time
- Overview: Speed-up; Power wall, ILP wall, to multicore
- Def Computer Architecture
- Lecture 1 slides 1-29?
New
- Syllabus and other course pragmatics
  - Website (not shown)
  - Dates
- Figure 1.9 Trends: CPUs, Memory, Network, Disk
- Why geometric mean?
- Speed-up again
- Amdahl’s Law
CSCE 513 Fall 2017

Finish up Slides from Lecture 2
Slides 18
- CPU Performance Equation
- Fallacies and Pitfalls
- List of Appendices

Patterson’s 5 steps to design a processor
1. Analyze instruction set => datapath requirements
2. Select set of data-path components & establish clock methodology
3. Assemble data-path meeting the requirements
4. Analyze implementation of each instruction to determine the setting of control points that effect the register transfer
5. Assemble the control logic

Components we are assuming you know
Basic Gates
- ANDs, ORs, XORs, NANDs, NORs
Combinational components
- Adders
- ALU
- Multiplexers (MUXs)
- Decoders
- Not going to need: PLAs, PALs, FPGAs …
Sequential Components
- Registers
- Register file: banks of registers, pair of Mux, decoder for load lines
- Memories
Register Transfer Language
- Non-branch: PC <- PC + 4 (control signal: register transfer)

MIPS Simplifies Processor Design
- Instructions same size
- Source registers always in same place
- Immediates are always same size, same location
- Operations always on registers or immediates
Single cycle data path means …
- CPI is …
- CCT is …
Reference: http://en.wikipedia.org/wiki/MIPS_architecture

Register File
[Figure: register file. Labels: Data in, R0 through R31, Rd, 5x32 decoder, 32:1 Mux, Rs, Rt, Bus A, Bus B]
Notes
1. How big are the lines?
   - Some 5
   - Some 32
   - Some 1
2. Data-In goes to every register
3. R0 = 0
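The notes above can be read as a small behavioral model. The sketch below is illustrative only: the class name `RegisterFile` and its methods are hypothetical, and it assumes 32 registers of 32 bits, two read ports (Bus A and Bus B), and one write port.

```python
class RegisterFile:
    """Behavioral sketch of a 32 x 32-bit register file (names are hypothetical)."""

    NUM_REGS = 32          # 5-bit register specifiers select one of 32 registers
    MASK = 0xFFFFFFFF      # assumption: 32-bit data lines

    def __init__(self):
        self.regs = [0] * self.NUM_REGS

    def read(self, rs, rt):
        """Two read ports: returns the values driven onto (Bus A, Bus B)."""
        return self.regs[rs], self.regs[rt]

    def write(self, rd, data, write_enable=True):
        """One write port; R0 is hard-wired to 0, so writes to it are dropped."""
        if write_enable and rd != 0:
            self.regs[rd] = data & self.MASK


rf = RegisterFile()
rf.write(5, 42)
rf.write(0, 99)                 # ignored: R0 stays 0
bus_a, bus_b = rf.read(5, 0)
print(bus_a, bus_b)             # 42 0
```

The 5-bit lines are the register specifiers (Rs, Rt, Rd), the 32-bit lines are the data buses, and the 1-bit line corresponds to `write_enable` in this sketch.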

Instruction Fetch (non-branch)

High Level for Register-Register Instruct.

Stores: Store Rs, disp(Rb)
Notes
- Sign extend for 16 bit immediates
- Write trace
- Read trace

Loads: LD rd, disp(Rr)
Notes
- Sign extend for 16 bit (disp) to calculate address = disp + Rr

Branches
Notes
- Sign extend for backwards branches
- Note: shift left 2 = multiply by 4, which means the displacement is in words
Register Transfer Language
- Cond <- R[rs] == R[rt]
- if (COND eq 0)
  - PC <- PC + 4 + (SE(imm16) x 4)
- else
  - PC <- PC + 4
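The next-PC computation can be sketched in a few lines. This is a minimal illustration, not the slide's hardware: the function names are hypothetical, a 32-bit PC is assumed, and how the branch condition is evaluated is left to the caller through the `taken` flag.

```python
def sign_extend16(imm16):
    """Interpret a 16-bit field as a two's-complement signed integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc(pc, taken, imm16):
    """Next PC for a conditional branch; `taken` is the already-evaluated condition."""
    if taken:
        # The displacement counts words, so shift left 2 (multiply by 4) to get bytes.
        return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF
    return (pc + 4) & 0xFFFFFFFF


print(hex(next_pc(0x1000, True, 0x0003)))    # taken, forward:  0x1010
print(hex(next_pc(0x1000, True, 0xFFFE)))    # taken, backward: 0xffc
print(hex(next_pc(0x1000, False, 0x0003)))   # not taken:       0x1004
```

The backward case shows why sign extension matters: without it, 0xFFFE would be treated as a large positive displacement.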

Branch Hardware
[Figure: branch hardware. Labels: Inst Address, nPC_sel, 4, Adder, Adder, Mux, PC, PC Ext, imm16, Clk]

Adding Instruction Fetch / PC Increment

Simple Data Path for All Instructions

Pulling it All Together
Notes
- Note PC = PC + 4 (all MIPS instructions are 4 bytes)

Adding Control

Non-pipelined RISC operations (Fig C.21)
- Stores: 4 cycles (10%)
- Branches: 2 cycles (12%)
- Others: 5 cycles (78%)
CPI?
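The CPI question is a weighted average of the cycle counts by instruction frequency. A short sketch using only the numbers on the slide (the dictionary layout is just for illustration):

```python
# (cycles per instruction, fraction of the instruction mix), as listed on the slide
mix = {
    "store": (4, 0.10),
    "branch": (2, 0.12),
    "other": (5, 0.78),
}

# CPI = sum over classes of (cycles * frequency)
cpi = sum(cycles * fraction for cycles, fraction in mix.values())
print(f"CPI = {cpi:.2f}")   # CPI = 4.54
```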

Multicycle Data Path (Appendix C)
Multicycle Data Path
- Execute instructions in stages
- Shorter Clock Cycle Time (CCT)
- Executing an instruction takes a few cycles
  - (how many stages we have)
- We can execute different things in each stage at the same time; precursor to the pipelined version
Stages
- Fetch
- Decode
- Execute
- Memory
- Write Back

Stages of Classical 5-stage pipeline
Instruction Fetch Cycle
- IR <- Mem[PC]
- NPC <- PC + 4
Decode
- A <- Regs[rs]
- B <- Regs[rt]
- Imm <- sign-extended immediate field of IR
Execute, Memory, Write Back
- Detailed on the following slides

Execute
Based on type of instruction
Memory Reference: calculate effective address d(rb)
- ALUOutput <- A + Imm
Register-Register ALU instruction
- ALUOutput <- A func B
Register-Immediate ALU instruction
- ALUOutput <- A op Imm
Branch
- ALUOutput <- NPC + (Imm << 2)
- Cond <- (A == 0)

Memory
- PC <- NPC
Memory reference
- LMD <- Mem[ALUOutput], or
- Mem[ALUOutput] <- B
Branch
- if (cond) PC <- ALUOutput

Write-back (WB) cycle
Register-Register ALU instruction
- Regs[rd] <- ALUOutput
Register-Immediate ALU instruction
- Regs[rt] <- ALUOutput
Load Instruction
- Regs[rt] <- LMD
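The register transfers for the five stages can be strung together into a tiny interpreter. This is a sketch under several assumptions: instructions arrive pre-decoded as dictionaries (a hypothetical encoding, so the fetch step only computes NPC), memory is a plain dictionary keyed by address, and R0 is not forced to zero.

```python
import operator


def run_instruction(instr, regs, mem, pc):
    """Walk one pre-decoded instruction through IF, ID, EX, MEM, WB; return the next PC."""
    # IF: IR <- Mem[PC]; NPC <- PC + 4
    npc = pc + 4
    # ID: A <- Regs[rs]; B <- Regs[rt]; Imm <- sign-extended immediate
    a = regs[instr.get("rs", 0)]
    b = regs[instr.get("rt", 0)]
    imm = instr.get("imm", 0)
    kind = instr["kind"]
    # EX: ALUOutput depends on the instruction type
    if kind in ("load", "store"):
        alu_output = a + imm                      # effective address
    elif kind == "alu_rr":
        alu_output = instr["func"](a, b)
    elif kind == "alu_ri":
        alu_output = instr["func"](a, imm)
    else:                                         # branch
        alu_output, cond = npc + (imm << 2), a == 0
    # MEM: PC <- NPC, memory access, or branch resolution
    next_pc = npc
    if kind == "load":
        lmd = mem[alu_output]
    elif kind == "store":
        mem[alu_output] = b
    elif kind == "branch" and cond:
        next_pc = alu_output
    # WB: write the result back to the register file
    if kind == "alu_rr":
        regs[instr["rd"]] = alu_output
    elif kind == "alu_ri":
        regs[instr["rt"]] = alu_output
    elif kind == "load":
        regs[instr["rt"]] = lmd
    return next_pc


regs = [0] * 32
regs[2] = 100
mem = {108: 7}
pc = run_instruction({"kind": "load", "rs": 2, "rt": 1, "imm": 8}, regs, mem, 0)
pc = run_instruction({"kind": "alu_rr", "rs": 1, "rt": 1, "rd": 3, "func": operator.add}, regs, mem, pc)
pc = run_instruction({"kind": "branch", "rs": 0, "imm": -2}, regs, mem, pc)
print(regs[1], regs[3], pc)   # 7 14 4
```

Each instruction here runs to completion before the next starts, which is the multicycle (non-pipelined) view; pipelining overlaps these same steps across instructions.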

Simple RISC Pipeline
Clock cycle number (time)
Instruction        1    2    3    4    5    6    7    8    9
Instruction n      IF   ID   EX   MEM  WB
Instruction n+1         IF   ID   EX   MEM  WB
Instruction n+2
Instruction n+3
Instruction n+4

Performance Analysis in Perfect World
Assuming S stages in the pipeline. At each cycle a new instruction is initiated.
To execute N instructions takes:
- N cycles to start up instructions
- (S-1) cycles to flush the pipeline
- TotalTime = N + (S-1)
Example for S=5 from previous slide, N=100 instructions
- Time to execute in non-pipelined = 100 * 5 = 500 cycles
- Time to execute in pipelined version = 100 + (5-1) = 104 cycles
- SpeedUp = …
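The cycle counts above follow directly from the two formulas, and the speedup is their ratio. A small sketch (function names are hypothetical) that reproduces the example and shows how the speedup approaches S as N grows:

```python
def unpipelined_cycles(n, stages):
    """Each instruction occupies the whole datapath for `stages` cycles."""
    return n * stages


def pipelined_cycles(n, stages):
    """One instruction starts per cycle, plus (stages - 1) cycles to drain the pipeline."""
    return n + (stages - 1)


n, s = 100, 5
speedup = unpipelined_cycles(n, s) / pipelined_cycles(n, s)
print(unpipelined_cycles(n, s), pipelined_cycles(n, s), round(speedup, 2))   # 500 104 4.81

# With many more instructions the drain cost is amortized and the speedup nears S.
print(round(unpipelined_cycles(10**6, s) / pipelined_cycles(10**6, s), 3))   # 5.0
```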

Implement Pipelines Supp. Fig C.4

Pipeline Example with a problem (A.5 like)
Instruction          1    2    3    4    5    6    7    8    9
DADD R1, R2, R3      IM   ID   EX   DM   WB
DSUB R4, R1, R5           IM   ID   EX   DM   WB
AND  R6, R1, R7                IM   ID   EX   DM   WB
OR   R8, R1, R9
XOR  R10, R1, R11

Inserting Pipeline Registers into Data Path (fig A’.18)

Major Hurdle of Pipelining
Consider executing the code below
DADD R1, R2, R3     /* R1 <- R2 + R3 */
DSUB R4, R1, R5     /* R4 <- R1 - R5 */
AND  R6, R1, R7     /* R6 <- R1 & R7 */
OR   R8, R1, R9     /* R8 <- R1 | R9 */
XOR  R10, R1, R11   /* R10 <- R1 ^ R11 */

RISC Pipeline Problems
Clock cycle number (time)
Instruction          1    2    3    4    5    6    7    8    9
DADD R1, R2, R3      IM   ID   EX   DM   WB
DSUB R4, R1, R5           IM   ID   EX   DM   WB
AND  R6, R1, R7                IM   ID   EX   DM   WB
OR   R8, R1, R9
XOR  R10, R1, R11
So what’s the problem?
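One way to see the problem is to list which instructions read R1 before DADD has written it. The sketch below is illustrative, not the slide's method: the tuple encoding and function names are hypothetical, the XOR operands follow the slide's comment (R10 <- R1 ^ R11), and it assumes no forwarding, register reads in ID, and a register file that writes in the first half of WB and reads in the second half.

```python
# Each instruction: (opcode, destination, source1, source2)
program = [
    ("DADD", "R1", "R2", "R3"),
    ("DSUB", "R4", "R1", "R5"),
    ("AND", "R6", "R1", "R7"),
    ("OR", "R8", "R1", "R9"),
    ("XOR", "R10", "R1", "R11"),
]


def raw_dependences(prog):
    """Return (producer index, consumer index, register) for each read-after-write pair."""
    deps = []
    for i, (_, dest, _, _) in enumerate(prog):
        for j in range(i + 1, len(prog)):
            _, later_dest, src1, src2 = prog[j]
            if dest in (src1, src2):
                deps.append((i, j, dest))
            if later_dest == dest:
                break   # register redefined; later readers see the newer value
    return deps


def needs_stall(producer, consumer):
    """Producer writes in cycle producer+5 (WB); consumer reads in cycle consumer+2 (ID)."""
    return consumer - producer < 3


for i, j, reg in raw_dependences(program):
    status = "hazard" if needs_stall(i, j) else "ok"
    print(f"{program[j][0]} reads {reg} written by {program[i][0]}: {status}")
```

Under these assumptions DSUB and AND read R1 too early, while OR and XOR read it after it has been written.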

Hazards
Data Hazards: a data value computed in one stage is not ready when it is needed in another stage of the pipeline
- Simple solution: stall until it is ready, but we can do better
Control or Branch Hazards
Structural Hazards: arise when resources are not sufficient to completely overlap an instruction sequence
- e.g. two floating point add units, then having to do three adds simultaneously

Performance of Pipelines with Stalls
Thus pipelining can be thought of as improving CPI or improving CCT.
Relationships
Equations
[equations missing from the extracted slide]
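The equations on this slide were images and did not survive extraction. As a hedged reconstruction, the standard relationships from Hennessy and Patterson's Appendix C (the assigned reading) are given below. CCT is the clock cycle time, and the ideal pipelined CPI is assumed to be 1.

```latex
\begin{align*}
\text{Speedup}
  &= \frac{\text{Average instruction time}_{\text{unpipelined}}}
          {\text{Average instruction time}_{\text{pipelined}}}
   = \frac{\text{CPI}_{\text{unpipelined}} \times \text{CCT}_{\text{unpipelined}}}
          {\text{CPI}_{\text{pipelined}} \times \text{CCT}_{\text{pipelined}}} \\[4pt]
\text{CPI}_{\text{pipelined}}
  &= \text{Ideal CPI} + \text{Pipeline stall cycles per instruction}
   = 1 + \text{Pipeline stall cycles per instruction}
\end{align*}
```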

Performance Equations with Stalls
If we ignore overhead of pipelining
[equation missing from the extracted slide]
Special Case: If we assume every instruction takes the same number of cycles, i.e., CPI = constant, and assume this constant is the depth of the pipeline, then
[equation missing from the extracted slide]
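The equations for this slide did not survive extraction. Under the slide's stated assumptions (no pipelining overhead, so the clock cycle times are equal, and an ideal pipelined CPI of 1), the usual Appendix C results are sketched below.

```latex
\begin{align*}
\text{Speedup}
  &= \frac{\text{CPI}_{\text{unpipelined}}}
          {1 + \text{Pipeline stall cycles per instruction}} \\[4pt]
\text{Special case } (\text{CPI}_{\text{unpipelined}} = \text{Pipeline depth}):\quad
\text{Speedup}
  &= \frac{\text{Pipeline depth}}
          {1 + \text{Pipeline stall cycles per instruction}}
\end{align*}
```

With no stalls the speedup equals the pipeline depth, which matches the perfect-world analysis for large N.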

Performance Equations with Stalls
Alternatively, focusing on improvement in CCT
[equation missing from the extracted slide]
Then simplifying using formulas for CCT_pipelined
[equation missing from the extracted slide]

Performance Equations with Stalls
Then simplifying using formulas for CCT_pipelined and
[equation missing from the extracted slide]
We obtain
[equation missing from the extracted slide]
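The derivation on these two slides was lost in extraction. A hedged reconstruction of the clock-cycle-time view, following the standard Appendix C treatment, assumes the unpipelined CPI is also 1 and that the pipeline stages are perfectly balanced with no overhead:

```latex
\begin{align*}
\text{Speedup}
  &= \frac{1}{1 + \text{Pipeline stall cycles per instruction}}
     \times \frac{\text{CCT}_{\text{unpipelined}}}{\text{CCT}_{\text{pipelined}}} \\[4pt]
\text{CCT}_{\text{pipelined}}
  &= \frac{\text{CCT}_{\text{unpipelined}}}{\text{Pipeline depth}}
  \quad\Longrightarrow\quad
  \text{Pipeline depth} = \frac{\text{CCT}_{\text{unpipelined}}}{\text{CCT}_{\text{pipelined}}} \\[4pt]
\text{Speedup}
  &= \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \text{Pipeline depth}
\end{align*}
```

Both views (improving CPI or improving CCT) arrive at the same expression: pipeline depth divided by one plus the stall cycles per instruction.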

Structural Hazards
If a combination of instructions cannot be accommodated because of resource conflicts, the situation is called a structural hazard.
Examples
- Single port memory (what is a dual port memory anyway?)
- One write port on register file
- Single floating point adder
- …
A stall in the pipeline is frequently called a pipeline bubble, or just a bubble. A bubble floats through the pipeline occupying space.

Example Structural Hazard (Fig C.4)

Pipeline Stalled for Structural Hazard
[Table: clock cycle number (time) 1 through 9 across; rows Instruction n through Instruction n+4; cells IF, ID, EX, MEM, MEM*, WB, and stall]
MEM: a memory cycle that is a load or store
MEM*: a memory cycle that is not a load or store

Data Hazards

Data Hazard
Clock cycle number (time)
Instruction          1    2    3    4    5    6    7    8    9
DADD R1, R2, R3      IM   ID   EX   DM   WB
DSUB R4, R1, R5           IM   ID   EX   DM   WB
AND  R6, R1, R7                IM   ID   EX   DM   WB
OR   R8, R1, R9
XOR  R10, R1, R11
IM: instruction memory, DM: data memory

Figure C.6

Minimizing Data Hazard Stalls by Forwarding

Fig C.7 Forwarding

Forwarding of operands for Stores (Fig C.8)

Data Forwarding (Figure C.9, new slide)
Figure C.9 The load instruction can bypass its results to the AND and OR instructions, but not to the DSUB, since that would mean forwarding the result in “negative time.”
Copyright © 2011, Elsevier Inc. All rights reserved.
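The caption describes the one case forwarding cannot fix: a load followed immediately by an instruction that uses the loaded value. A minimal detection sketch follows; the tuple encoding and function name are hypothetical, the instruction sequence is the one this figure is usually drawn with (an assumption here, since the figure itself was not extracted), and full forwarding is assumed for every other case.

```python
def load_use_stall(prev, curr):
    """True when `curr` must stall one cycle behind the load `prev`.

    Instructions are (opcode, destination, sources) tuples (hypothetical encoding).
    With full forwarding assumed, only a load followed immediately by a consumer stalls.
    """
    prev_op, prev_dest, _ = prev
    _, _, curr_sources = curr
    return prev_op == "LD" and prev_dest in curr_sources


program = [
    ("LD", "R1", ("R2",)),           # LD   R1, 0(R2)
    ("DSUB", "R4", ("R1", "R5")),    # needs R1 before the load has fetched it: stall
    ("AND", "R6", ("R1", "R7")),     # one cycle later: forwarding suffices
    ("OR", "R8", ("R1", "R9")),
]

stalls = [load_use_stall(a, b) for a, b in zip(program, program[1:])]
print(stalls)   # [True, False, False]
```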

Logic to detect Hazards

Figure C.23 Forwarding Paths

Forwarding (Figure C.26)
[Table columns: Pipeline Reg. Source | Opcode of Source | Pipeline Reg. Destination | Opcode of Destination | Destination of forwarding | Comparison (if equal then forward)]

[Table, continued, same columns: Pipeline Reg. Source | Opcode of Source | Pipeline Reg. Destination | Opcode of Destination | Destination of forwarding | Comparison (if equal then forward)]

Figure C.23 Forwarding Paths

Load/Use Hazard

Control Hazards

Figure C.18 The states in a 2-bit prediction scheme. By using 2 bits rather than 1, a branch that strongly favors taken or not taken, as many branches do, will be mispredicted less often than with a 1-bit predictor. The 2 bits are used to encode the four states in the system. The 2-bit scheme is actually a specialization of a more general scheme that has an n-bit saturating counter for each entry in the prediction buffer. With an n-bit counter, the counter can take on values between 0 and 2^n - 1: when the counter is greater than or equal to one-half of its maximum value (2^n - 1), the branch is predicted as taken; otherwise, it is predicted as untaken. Studies of n-bit predictors have shown that the 2-bit predictors do almost as well, thus most systems rely on 2-bit branch predictors rather than the more general n-bit predictors.
Copyright © 2011, Elsevier Inc. All rights reserved.
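The n-bit saturating counter described in the caption can be sketched as follows. This models a single prediction-buffer entry with hypothetical names, starts each counter at its maximum value (an arbitrary choice), and implements the counter scheme from the caption text, which may differ in individual transitions from the state diagram in the figure.

```python
class SaturatingCounterPredictor:
    """One prediction-buffer entry: an n-bit counter that saturates at 0 and 2^n - 1."""

    def __init__(self, bits=2, initial=0):
        self.max_value = (1 << bits) - 1
        self.threshold = 1 << (bits - 1)     # at least half of the maximum value
        self.counter = initial

    def predict(self):
        """Predict taken when the counter is in the upper half of its range."""
        return self.counter >= self.threshold

    def update(self, taken):
        """Count up on taken, down on not taken, without wrapping around."""
        if taken:
            self.counter = min(self.counter + 1, self.max_value)
        else:
            self.counter = max(self.counter - 1, 0)


def mispredictions(outcomes, bits):
    predictor = SaturatingCounterPredictor(bits, initial=(1 << bits) - 1)
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop branch: taken 9 times, then not taken once, repeated 3 times.
outcomes = ([True] * 9 + [False]) * 3
print(mispredictions(outcomes, bits=1), mispredictions(outcomes, bits=2))   # 5 3
```

The 1-bit predictor misses twice per loop exit (once on the exit and once on re-entry), while the 2-bit predictor misses only on the exit.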

Pop Quiz
Suppose that your application is 60% parallelizable. What is the overall speedup in going from 1 core to 2?
Assuming power and frequency are linearly related, how is the dynamic power affected by the improvement?
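The first question is a direct application of Amdahl's Law from the previous lecture. A small sketch (the function name is hypothetical) that assumes the parallel portion scales perfectly across cores; it does not attempt the dynamic power question.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Overall speedup when only `parallel_fraction` of the work is spread over `cores`."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


print(round(amdahl_speedup(0.60, 2), 3))   # 1.429
```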

Plan of attack
Chapter reading plan
- Chapter 1
- Appendix C (pipeline review)
- Appendix B (Cache review)
- Chapter 2 (Memory Hierarchy)
- Appendix A (ISA review not really)
- Chapter 3 (Instruction Level Parallelism ILP)
- Chapter 4 (Data level parallelism)
- Chapter 5 (Thread level parallelism)
- Chapter 6 (Warehouse-scale computing)
- Sprinkle in other appendices
Website
- Lectures
- HW
- Links
- Errata ??
- moodle
- https://dropbox.cse.sc.edu/
- CEC login/password
Systems
- Simplescalar - pipeline
- Beowulf cluster - MPI
- GTX - multithreaded