CSCE 513 Computer Architecture Lecture 3 Instruction Level



















































- Slides: 55
CSCE 513 Computer Architecture, Lecture 3: Instruction Level Parallelism (Pipelining)
- Topics: execution time, ILP
- Readings: Appendix C
- September 6, 2017
Overview (CSCE 513 Fall 2017)
- Last time
  - Overview: power wall, ILP wall, move to multicore
  - Definition of computer architecture
  - Lecture 1 slides 1-29?
- New
  - Syllabus and other course pragmatics: website (not shown), dates
  - Figure 1.9 Trends: CPUs, memory, network, disk
  - Why geometric mean?
  - Speed-up again
  - Amdahl’s Law
Finish Up Slides from Lecture 2
- Slides 18
- CPU Performance Equation
- Fallacies and Pitfalls
- List of Appendices
Patterson’s 5 Steps to Design a Processor
1. Analyze the instruction set => datapath requirements
2. Select a set of datapath components and establish a clock methodology
3. Assemble a datapath meeting the requirements
4. Analyze the implementation of each instruction to determine the setting of control points that effect the register transfer
5. Assemble the control logic
Components We Are Assuming You Know
- Basic gates: ANDs, ORs, XORs, NANDs, NORs
- Combinational components: adders, ALU, multiplexers (MUXs), decoders
- Not going to need: PLAs, PALs, FPGAs …
- Sequential components: registers; register file (banks of registers, pair of MUXs, decoder for load lines); memories
- Register Transfer Language, non-branch: PC ← PC + 4 (control signal: register transfer)
MIPS Simplifies Processor Design
- Instructions are all the same size
- Source registers are always in the same place
- Immediates are always the same size and in the same location
- Operations are always on registers or immediates
- Single cycle data path means …
  - CPI is …
  - CCT is …
- Reference: http://en.wikipedia.org/wiki/MIPS_architecture
Register File
- Figure labels: Data in, R0 … R31, Rd, 5x32 decoder, 32:1 Mux, Rs, Rt, Bus A, Bus B
- Notes
  1. How big are the lines? Some 5, some 32, some 1
  2. Data-In goes to every register
  3. R0 = 0
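
The three notes on the register file slide can be made concrete with a small behavioral model. This is only a sketch under assumptions: the class name `RegisterFile`, the 32-bit data width, and the `write_enable` flag are illustrative choices, not taken from the slides.

```python
class RegisterFile:
    """Behavioral sketch of a 32-entry register file (names are hypothetical)."""

    def __init__(self):
        # 32 registers, R0 through R31.
        self.regs = [0] * 32

    def read(self, rs, rt):
        # Two read ports: the 5-bit selectors Rs and Rt each drive a 32:1 mux
        # whose outputs are Bus A and Bus B.
        return self.regs[rs], self.regs[rt]

    def write(self, rd, data, write_enable=True):
        # Data-In is wired to every register; the 5x32 decoder on Rd picks
        # which one actually loads. R0 is kept at 0 (note 3 on the slide).
        if write_enable and rd != 0:
            self.regs[rd] = data & 0xFFFFFFFF  # assumed 32-bit data lines


rf = RegisterFile()
rf.write(5, 42)
rf.write(0, 7)          # ignored: R0 stays 0
print(rf.read(5, 0))    # bus A, bus B
```

The selector lines are 5 bits wide because 2^5 = 32 registers; the 1-bit line in note 1 corresponds to a control signal such as the write enable modeled here.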
Instruction Fetch (non-branch)
High Level for Register-Register Instructions
Stores: Store Rs, disp(Rb)
- Sign extend the 16-bit immediate
- Write trace
- Read trace
Loads: LD rd, disp(Rr)
- Sign extend the 16-bit displacement (disp) to calculate address = disp + Rr
Branches
- Sign extend for backwards branches
- Shift left 2 = multiply by 4, which means the displacement is in words
- Register Transfer Language
  - Cond ← (R[rs] == R[rt])
  - if (Cond) PC ← PC + 4 + (SE(imm16) x 4)
  - else PC ← PC + 4
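
The branch register transfers can be sketched in a few lines of Python. This assumes a beq-style branch that is taken when the two registers are equal; the function names `sign_extend16` and `next_pc` and the 32-bit wraparound are illustrative assumptions.

```python
def sign_extend16(imm16):
    """Interpret the low 16 bits of imm16 as a two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc(pc, rs_val, rt_val, imm16):
    """Next PC for a beq-style branch (assumed: taken when rs_val == rt_val)."""
    cond = rs_val == rt_val
    if cond:
        # The displacement is in words, so shift left by 2 (multiply by 4).
        return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF
    return (pc + 4) & 0xFFFFFFFF


print(hex(next_pc(0x1000, 5, 5, 3)))       # taken, forward by 3 words
print(hex(next_pc(0x1000, 5, 5, 0xFFFF)))  # taken, backward by 1 word
print(hex(next_pc(0x1000, 5, 6, 3)))       # not taken
```

The backward case shows why sign extension matters: 0xFFFF must be read as -1 word, not as 65535 words.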
Branch Hardware
- Figure labels: Inst Address, nPC_sel, 4, Adder, PC, Mux, Adder, PC Ext, imm16, Clk
Adding Instruction Fetch / PC Increment
Simple Data Path for All Instructions
Pulling it All Together
- Note: PC = PC + 4 (all MIPS instructions are 4 bytes)
Adding Control
Non-pipelined RISC Operations (Fig C.21)
- Stores: 4 cycles (10%)
- Branches: 2 cycles (12%)
- Others: 5 cycles (78%)
- CPI?
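
One way to answer the CPI question is to take the frequency-weighted average of the cycle counts given on the slide. The dictionary layout and variable names below are illustrative.

```python
# (frequency, cycles) per instruction class, as listed on the slide.
mix = {
    "store": (0.10, 4),
    "branch": (0.12, 2),
    "other": (0.78, 5),
}

# Average CPI is the sum over classes of frequency * cycles.
cpi = sum(freq * cycles for freq, cycles in mix.values())
print(round(cpi, 2))  # 0.40 + 0.24 + 3.90 = 4.54
```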
Multicycle Data Path (Appendix C)
- Execute instructions in stages
- Shorter Clock Cycle Time (CCT)
- Executing an instruction takes a few cycles (as many as there are stages)
- We can execute different things in each stage at the same time; this is the precursor to the pipelined version
- Stages: Fetch, Decode, Execute, Memory, Write Back
Stages of Classical 5-stage Pipeline
- Instruction Fetch cycle: IR ← Mem[PC]; NPC ← PC + 4
- Decode: A ← Regs[rs]; B ← Regs[rt]; Imm ← sign-extended immediate field of IR
- Execute
- Memory
- Write Back
Execute (based on type of instruction)
- Memory reference, calculate effective address d(rb): ALUOutput ← A + Imm
- Register-Register ALU instruction: ALUOutput ← A func B
- Register-Immediate ALU instruction: ALUOutput ← A op Imm
- Branch: ALUOutput ← NPC + (Imm << 2); Cond ← (A == 0)
Memory
- PC ← NPC
- Memory reference: LMD ← Mem[ALUOutput] (load), or Mem[ALUOutput] ← B (store)
- Branch: if (Cond) PC ← ALUOutput
Write-back (WB) Cycle
- Register-Register ALU instruction: Regs[rd] ← ALUOutput
- Register-Immediate ALU instruction: Regs[rt] ← ALUOutput
- Load instruction: Regs[rt] ← LMD
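
The register transfers for the five stages can be traced with a short, unpipelined Python sketch. Everything about the encoding is an assumption made for illustration: instructions are plain dictionaries with hypothetical keys (`kind`, `rs`, `rt`, `rd`, `imm`, `func`), memory is a dictionary keyed by address, and the immediate is taken to be already sign-extended.

```python
def step(instr, pc, regs, mem):
    """Run one instruction through IF, ID, EX, MEM, WB and return the next PC."""
    # IF: IR <- Mem[PC]; NPC <- PC + 4 (instr stands in for the fetched IR).
    npc = pc + 4

    # ID: A <- Regs[rs]; B <- Regs[rt]; Imm <- sign-extended immediate.
    a = regs[instr.get("rs", 0)]
    b = regs[instr.get("rt", 0)]
    imm = instr.get("imm", 0)
    kind = instr["kind"]

    # EX: the ALU operation depends on the instruction type.
    cond = False
    if kind in ("load", "store"):
        alu_out = a + imm                 # effective address
    elif kind == "alu_rr":
        alu_out = instr["func"](a, b)     # A func B
    elif kind == "alu_ri":
        alu_out = instr["func"](a, imm)   # A op Imm
    elif kind == "branch":
        alu_out = npc + (imm << 2)        # branch target
        cond = (a == 0)
    else:
        raise ValueError(f"unknown instruction kind: {kind}")

    # MEM: PC <- NPC, then load, store, or taken-branch PC update.
    next_pc = npc
    lmd = None
    if kind == "load":
        lmd = mem[alu_out]
    elif kind == "store":
        mem[alu_out] = b
    elif kind == "branch" and cond:
        next_pc = alu_out

    # WB: write the result back to the register file.
    if kind == "alu_rr":
        regs[instr["rd"]] = alu_out
    elif kind == "alu_ri":
        regs[instr["rt"]] = alu_out
    elif kind == "load":
        regs[instr["rt"]] = lmd
    return next_pc


regs = [0] * 32
regs[2] = 10
mem = {18: 99}
pc = step({"kind": "load", "rs": 2, "rt": 1, "imm": 8}, 0, regs, mem)
pc = step({"kind": "alu_rr", "rs": 1, "rt": 2, "rd": 3,
           "func": lambda x, y: x + y}, pc, regs, mem)
pc = step({"kind": "branch", "rs": 0, "imm": 3}, pc, regs, mem)
print(regs[1], regs[3], pc)
```

The sketch runs each instruction to completion before starting the next, so it shows what each stage computes but none of the overlap or hazards that pipelining introduces.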
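Simple RISC Pipeline (clock cycle number, time →)

| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Instruction n | IF | ID | EX | MEM | WB | | | | |
| Instruction n+1 | | IF | ID | EX | MEM | WB | | | |
| Instruction n+2 | | | | | | | | | |
| Instruction n+3 | | | | | | | | | |
| Instruction n+4 | | | | | | | | | |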
Performance Analysis in a Perfect World
- Assume S stages in the pipeline, with a new instruction initiated at each cycle.
- To execute N instructions takes N cycles to start up the instructions plus (S-1) cycles to flush the pipeline: TotalTime = N + (S-1)
- Example for S=5 (from the previous slide) and N=100 instructions
  - Time to execute non-pipelined = 100 * 5 = 500 cycles
  - Time to execute pipelined = 100 + (5-1) = 104 cycles
  - SpeedUp = …
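
The perfect-world cycle counts are easy to check numerically. The helper names below are illustrative; the formulas are the ones on the slide, with speedup taken as the ratio of non-pipelined to pipelined cycles at equal cycle time.

```python
def unpipelined_cycles(n, stages):
    # Each instruction occupies all stages before the next one starts.
    return n * stages


def pipelined_cycles(n, stages):
    # N cycles to start the instructions plus (S - 1) to drain the pipeline.
    return n + (stages - 1)


def speedup(n, stages):
    return unpipelined_cycles(n, stages) / pipelined_cycles(n, stages)


print(unpipelined_cycles(100, 5), pipelined_cycles(100, 5))  # 500 104
print(round(speedup(100, 5), 2))
print(round(speedup(1_000_000, 5), 4))
```

As N grows, the (S-1) drain cycles become negligible and the speedup approaches the number of stages S.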
Implement Pipelines (Supp. Fig C.4)
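Pipeline Example with a Problem (A.5 like)

| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| DADD R1, R2, R3 | IM | ID | EX | DM | WB | | | | |
| DSUB R4, R1, R5 | | IM | ID | EX | DM | WB | | | |
| AND R6, R1, R7 | | | IM | ID | EX | DM | WB | | |
| OR R8, R1, R9 | | | | | | | | | |
| XOR R10, R1, R11 | | | | | | | | | |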
Inserting Pipeline Registers into Data Path (fig A’.18)
Major Hurdle of Pipelining
Consider executing the code below:
- DADD R1, R2, R3 /* R1 ← R2 + R3 */
- DSUB R4, R1, R5 /* R4 ← R1 - R5 */
- AND R6, R1, R7 /* R6 ← R1 & R7 */
- OR R8, R1, R9 /* R8 ← R1 | R9 */
- XOR R10, R1, R11 /* R10 ← R1 ^ R11 */
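RISC Pipeline Problems (clock cycle number, time →)

| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| DADD R1, R2, R3 | IM | ID | EX | DM | WB | | | | |
| DSUB R4, R1, R5 | | IM | ID | EX | DM | WB | | | |
| AND R6, R1, R7 | | | IM | ID | EX | DM | WB | | |
| OR R8, R1, R9 | | | | | | | | | |
| XOR R10, R1, R11 | | | | | | | | | |

So what’s the problem?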
Hazards
- Data hazards: a data value computed in one stage is not ready when it is needed in another stage of the pipeline. Simple solution: stall until it is ready, but we can do better.
- Control or branch hazards
- Structural hazards: arise when resources are not sufficient to completely overlap an instruction sequence, e.g. having two floating point add units and needing to perform three adds simultaneously.
Performance of Pipelines with Stalls
- Thus pipelining can be thought of as improving CPI or improving CCT.
- Relationships, equations
Performance Equations with Stalls
- If we ignore the overhead of pipelining
- Special case: if we assume every instruction takes the same number of cycles, i.e., CPI = constant, and assume this constant is the depth of the pipeline, then:
Performance Equations with Stalls
- Alternatively, focusing on the improvement in CCT
- Then simplifying using formulas for CCT_pipelined
Performance Equations with Stalls
- Then simplifying using formulas for CCT_pipelined, we obtain:
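
The equations themselves did not survive in this transcript. As a hedged reconstruction, the standard forms from the textbook's Appendix C that match the wording of these four slides are sketched below, assuming an ideal pipelined CPI of 1; the symbol names are chosen here for readability.

```latex
\begin{align*}
\text{Speedup}
  &= \frac{\text{CPI}_{\text{unpipelined}} \times \text{CCT}_{\text{unpipelined}}}
          {\text{CPI}_{\text{pipelined}} \times \text{CCT}_{\text{pipelined}}},
\qquad
\text{CPI}_{\text{pipelined}} = 1 + \text{stall cycles per instruction} \\[4pt]
% Ignoring pipelining overhead, the two cycle times are equal:
\text{Speedup}
  &= \frac{\text{CPI}_{\text{unpipelined}}}{1 + \text{stall cycles per instruction}} \\[4pt]
% Special case: every instruction takes (pipeline depth) cycles unpipelined:
\text{Speedup}
  &= \frac{\text{Pipeline depth}}{1 + \text{stall cycles per instruction}} \\[4pt]
% Alternative view: unpipelined CPI = 1 and pipelining shortens the cycle time,
% with CCT_pipelined = CCT_unpipelined / Pipeline depth:
\text{Speedup}
  &= \frac{1}{1 + \text{stall cycles per instruction}}
     \times \frac{\text{CCT}_{\text{unpipelined}}}{\text{CCT}_{\text{pipelined}}}
   = \frac{\text{Pipeline depth}}{1 + \text{stall cycles per instruction}}
\end{align*}
```

Both views arrive at the same result: with no stalls the speedup equals the pipeline depth, and every stall cycle per instruction erodes it.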
Structural Hazards
- When a combination of instructions cannot be accommodated because of resource conflicts, the result is called a structural hazard.
- Examples
  - Single port memory (what is a dual port memory anyway?)
  - One write port on the register file
  - Single floating point adder
  - …
- A stall in the pipeline is frequently called a pipeline bubble or just a bubble. A bubble floats through the pipeline occupying space.
Example Structural Hazard (Fig C.4)
Pipeline Stalled for Structural Hazard
- Timing table: clock cycles 1 to 9 for Instruction n through Instruction n+4, with stall cycles inserted
- MEM: a memory cycle that is a load or store
- MEM*: a memory cycle that is not a load or store
Data Hazards
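Data Hazard (clock cycle number, time →)

| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| DADD R1, R2, R3 | IM | ID | EX | DM | WB | | | | |
| DSUB R4, R1, R5 | | IM | ID | EX | DM | WB | | | |
| AND R6, R1, R7 | | | IM | ID | EX | DM | WB | | |
| OR R8, R1, R9 | | | | | | | | | |
| XOR R10, R1, R11 | | | | | | | | | |

IM: instruction memory, DM: data memory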
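
A read-after-write check over the example sequence can be sketched as follows. This assumes a 5-stage pipeline without forwarding in which the register file is written in the first half of a cycle and read in the second half, so only the two instructions immediately after a producer can read a stale value; the tuple encoding and the `window` parameter are illustrative.

```python
# Each entry: (mnemonic, destination register, source registers).
program = [
    ("DADD", "R1", ("R2", "R3")),
    ("DSUB", "R4", ("R1", "R5")),
    ("AND", "R6", ("R1", "R7")),
    ("OR", "R8", ("R1", "R9")),
    ("XOR", "R10", ("R1", "R11")),
]


def raw_hazards(instrs, window=2):
    """Return (producer, consumer, register) triples for RAW hazards.

    window is the number of following instructions that would read the
    register before the producer's write-back has taken effect (assumed 2).
    """
    hazards = []
    for i, (op, dest, _) in enumerate(instrs):
        for later_op, _, later_srcs in instrs[i + 1:i + 1 + window]:
            if dest in later_srcs:
                hazards.append((op, later_op, dest))
    return hazards


for producer, consumer, reg in raw_hazards(program):
    print(f"{consumer} needs {reg} before {producer} has written it back")
```

Under these assumptions DSUB and AND are flagged, while OR and XOR read R1 late enough to see the value written by DADD.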
Figure C.6
Minimizing Data Hazard Stalls by Forwarding
Fig C.7 Forwarding
Forwarding of Operands for Stores (Fig C.8)
Figure C.9 (new slide) Data Forwarding
- Figure C.9 The load instruction can bypass its results to the AND and OR instructions, but not to the DSUB, since that would mean forwarding the result in “negative time.” Copyright © 2011, Elsevier Inc. All rights reserved.
Logic to Detect Hazards
Figure C.23 Forwarding Paths
Forwarding (Figure C.26)
- Table columns: Pipeline Reg. Source; Opcode of Source; Pipeline Reg. Destination; Opcode of Destination; Destination of forwarding; Comparison (if equal then forward)
Figure C.23 Forwarding Paths
Load/Use Hazard
Control Hazards
Figure C.18 The states in a 2-bit prediction scheme. By using 2 bits rather than 1, a branch that strongly favors taken or not taken, as many branches do, will be mispredicted less often than with a 1-bit predictor. The 2 bits are used to encode the four states in the system. The 2-bit scheme is actually a specialization of a more general scheme that has an n-bit saturating counter for each entry in the prediction buffer. With an n-bit counter, the counter can take on values between 0 and 2^n - 1: when the counter is greater than or equal to one-half of its maximum value (2^n - 1), the branch is predicted as taken; otherwise, it is predicted as untaken. Studies of n-bit predictors have shown that the 2-bit predictors do almost as well, thus most systems rely on 2-bit branch predictors rather than the more general n-bit predictors. Copyright © 2011, Elsevier Inc. All rights reserved.
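
The n-bit saturating counter described in the caption can be sketched in Python and used to compare a 1-bit and a 2-bit predictor on a loop-style branch. The class and function names, the initial counter values, and the example branch pattern are illustrative assumptions; only a single counter is modeled, not a full prediction buffer.

```python
class SaturatingCounter:
    """One n-bit saturating counter, as used per prediction-buffer entry."""

    def __init__(self, bits=2, initial=0):
        self.max_value = (1 << bits) - 1   # 2^n - 1
        self.threshold = 1 << (bits - 1)   # at least half the maximum value
        self.value = initial

    def predict(self):
        # Predict taken when the counter is in the upper half of its range.
        return self.value >= self.threshold

    def update(self, taken):
        # Count up on taken, down on not taken, saturating at both ends.
        if taken:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = max(self.value - 1, 0)


def mispredictions(outcomes, bits, initial):
    counter = SaturatingCounter(bits, initial)
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# A loop branch: taken 9 times, then not taken at loop exit, run 3 times.
pattern = ([True] * 9 + [False]) * 3
print(mispredictions(pattern, bits=1, initial=1))  # 1-bit predictor
print(mispredictions(pattern, bits=2, initial=3))  # 2-bit predictor
```

On this pattern the 1-bit predictor misses at every loop exit and again on re-entry, while the 2-bit predictor misses only at the exits, which is the effect the caption describes.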
Pop Quiz
- Suppose that your application is 60% parallelizable. What is the overall speedup in going from 1 core to 2?
- Assuming power and frequency are linearly related, how is the dynamic power affected by the improvement?
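
For the first quiz question, Amdahl's Law can be evaluated directly. This sketch assumes the parallelizable 60% scales perfectly across cores and the remaining 40% does not speed up at all; the function name `amdahl_speedup` is illustrative.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Overall speedup when only parallel_fraction of the work uses all cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# 1 / (0.4 + 0.6 / 2) = 1 / 0.7
print(round(amdahl_speedup(0.60, 2), 3))
```

Even with unlimited cores, the serial 40% caps the speedup at 1 / 0.4 = 2.5.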
Plan of Attack
- Chapter reading plan
  - Chapter 1
  - Appendix C (pipeline review)
  - Appendix B (cache review)
  - Chapter 2 (Memory Hierarchy)
  - Appendix A (ISA review, not really)
  - Chapter 3 (Instruction Level Parallelism, ILP)
  - Chapter 4 (Data level parallelism)
  - Chapter 5 (Thread level parallelism)
  - Chapter 6 (Warehouse-scale computing)
  - Sprinkle in other appendices
- Website: Lectures, HW, Links, Errata ??
- moodle: https://dropbox.cse.sc.edu/ (CEC login/password)
- Systems
  - Simplescalar: pipeline
  - Beowulf cluster: MPI
  - GTX: multithreaded