Lecture on High Performance Processor Architecture CS 05162


Lecture on High Performance Processor Architecture (CS 05162)
Review of Instruction Sets, Pipelines
An Hong, han@ustc.edu.cn
Fall 2009
School of Computer Science and Technology, University of Science and Technology of China
2021/12/19 USTC CS AN Hong

Outline
• Quick review of everything you should have learned
  − Instruction Sets, Pipelines

Quick review of everything you should have learned

Definition of computer architecture: the classic, narrow definition
High Level Language Program:
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp;
Compiler
Assembly Language Program:
  lw $15, 0($2)
  lw $16, 4($2)
  sw $16, 0($2)
  sw $15, 4($2)
Assembler
Machine Language Program:
  0000 1010 1100 0101 1001 1111 0110 1000 1100 0101 1010 0000
  0110 1000 1111 1001 1010 0000 0101 1100 1111 1000 0110 0101 1100 0000 1010 1000 0110 1001 1111
Machine Interpretation
Control Signal Specification:
  ALUOP[0:3] <= InstReg[9:11] & MASK

Definition of computer architecture: the classic, narrow definition
"... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation."
(Amdahl, Blaauw, and Brooks, 1964)

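The broad definition of modern computer architecture
• Design task
  − Features (intended use)
  − Cost (price)
  − Performance
• Design problems
  − ISA design
  − Logical implementation
    • CPU design, memory system design, bus design
  − Physical implementation
    • Integrated circuit design, packaging, power supply design, cooling
[Figure: layers of a computer system, top to bottom: Application, Operating System, Compiler, Firmware, Instruction Set Architecture, Instr. Set Proc. and I/O system, Datapath & Control, Logic Design, Circuit Design, Layout. The layers below the Instruction Set Architecture are labeled logical implementation (organization design) and physical implementation (hardware design).]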

The computer designer's task

Understanding Program Performance
Hardware or software component | How this component affects performance | Where is this topic covered?
Algorithm | Determines both the number of source-level statements and the number of I/O operations executed | Book 1
Programming language, compiler, and architecture | Determines the number of machine instructions for each source-level statement | Book 2: chapters 2 and 3
Processor and memory system | Determines how fast instructions can be executed | Book 2: chapters 5, 6, 7
I/O system (hardware and operating system) | Determines how fast I/O operations may be executed | Book 2: chapter 8

Text Books
• Book 1: Chen Guoliang (陈国良), 并行算法分析与设计 (Analysis and Design of Parallel Algorithms), Higher Education Press (高等教育出版社).
• Book 2: David A. Patterson and John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Third Edition. China Machine Press (机械工业出版社).

Computer Performance
CPU time = Seconds / Program = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
Which design level affects which factor:
               Inst Count   CPI   Clock Rate
Program            X
Compiler           X        (X)
Inst. Set          X         X
Organization                 X        X
Technology                            X
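As a minimal sketch of the CPU time equation above, the following Python function multiplies the three factors. The function name and the workload numbers are hypothetical and chosen only for illustration; they do not come from the lecture.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds = instructions x cycles/instruction x seconds/cycle."""
    cycle_time = 1.0 / clock_rate_hz  # seconds per cycle
    return instruction_count * cpi * cycle_time


# Hypothetical workload: 1e9 instructions, CPI of 1.5, 1 GHz clock.
base = cpu_time(1_000_000_000, 1.5, 1e9)

# Hypothetical compiler change: 10% fewer instructions, slightly worse CPI.
tuned = cpu_time(900_000_000, 1.6, 1e9)

print(f"base  = {base:.2f} s")
print(f"tuned = {tuned:.2f} s")
```

The second call illustrates why the table matters: a change at one level (here the compiler) can move two factors at once, and only the product decides whether the program got faster.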

Instruction Set Design
[Figure: the instruction set as the interface between software (above) and hardware (below)]
Which is easier to change/design?

Evolution of instruction sets

Instruction Set Architecture: what is done in each cycle?
Stage 1, Instruction Fetch: fetch the instruction from the memory system
Stage 2a, Instruction Decode: determine what action to take
Stage 2b, Operand Fetch: obtain the operands
Stage 3, Execute: produce the result or status
Stage 4, Result Store: store the result in the memory system
Stage 5, Next Instruction: determine the next instruction to execute

Instruction Set Architecture: what must be defined?
• Instruction format or encoding
  − How is it decoded?
• Location of operands and result (addressing modes)
  − Where other than memory?
  − How many explicit operands?
  − How are memory operands located?
  − Which can or cannot be in memory?
• Data type and size
• Operations
  − What are supported?
• Successor instruction (explicit representation of control flow)
  − Jumps, conditions, branches
Instruction processing must go through fetch-decode-execute!
[Figure: instruction cycle: Instruction Fetch, Instruction Decode, Operand Fetch, Execute, Result Store, Next Instruction]

Instruction set operations: Top 10 80x86 Instructions
Rank | Instruction | Integer average (percent of total executed)
1 | load | 22%
2 | conditional branch | 20%
3 | compare | 16%
4 | store | 12%
5 | add | 8%
6 | and | 6%
7 | sub | 5%
8 | move register | 4%
9 | call | 1%
10 | return | 1%
Total | | 96%
Simple instructions dominate instruction frequency.

MIPS I Instruction Set Architecture
• Registers
  − GPRs: 32 bits; R0 (=0), R1, …, R31
  − FPRs:
    • 32 bits: F0, …, F31 (single precision)
    • 64 bits: F0 (F0, F1), …, F30 (F30, F31) (double precision)
  − Multiply and Divide Registers, PC
  − Other registers
[Figure: register files, each register 32 bits wide (bits 31 to 0): General Purpose Registers r0 (= 0) to r31, Floating Point Registers f0 to f31, Multiply and Divide Registers HI and LO, Program Counter PC]

Example: MIPS (DLX)
Register-Register:  Op [31:26] | Rs1 [25:21] | Rs2 [20:16] | Rd [15:11] | Opx [10:0]
Register-Immediate: Op [31:26] | Rs1 [25:21] | Rd [20:16] | immediate [15:0]
Branch:             Op [31:26] | Rs1 [25:21] | Rs2/Opx [20:16] | immediate [15:0]
Jump / Call:        Op [31:26] | target [25:0]

MIPS I Instruction Set Architecture: Addressing Modes
All instructions are 32 bits wide.
• Register (direct) addressing: op | rs | rt | rd; the operand is in a register
• Immediate addressing: op | rs | rt | immed; the operand is the immed field
• Base+index (displacement) addressing: op | rs | rt | immed; memory address = register + immed
• PC-relative addressing: op | rs | rt | immed; memory address = PC + immed

MIPS Addressing Modes/Instruction Formats: Examples
• ADD and SUB
  − addu rd, rs, rt
  − subu rd, rs, rt
  − Format: op [31:26], 6 bits | rs [25:21], 5 bits | rt [20:16], 5 bits | rd [15:11], 5 bits | shamt [10:6], 5 bits | funct [5:0], 6 bits
• OR Immediate
  − ori rt, rs, imm16
  − Format: op [31:26], 6 bits | rs [25:21], 5 bits | rt [20:16], 5 bits | imm16 [15:0], 16 bits
• LOAD and STORE Word
  − lw rt, offset(rs)
  − sw rt, offset(rs)
  − Format: op [31:26], 6 bits | rs [25:21], 5 bits | rt [20:16], 5 bits | offset [15:0], 16 bits
• BRANCH
  − beq rs, rt, offset
  − Format: op [31:26], 6 bits | rs [25:21], 5 bits | rt [20:16], 5 bits | offset [15:0], 16 bits
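The field layouts above can be sketched as shift-and-mask operations. This Python sketch is illustrative only: the helper names are hypothetical, and the numeric opcode and funct values (addu funct 0x21, lw opcode 0x23) come from the standard MIPS32 encoding rather than from the slides.

```python
def encode_r(op, rs, rt, rd, shamt, funct):
    """Pack an R-format word: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(op, rs, rt, imm16):
    """Pack an I-format word: op(6) rs(5) rt(5) imm16(16)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm16 & 0xFFFF)


def decode(word):
    """Extract every field; which ones are meaningful depends on the format."""
    return {
        "op": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
        "imm16": word & 0xFFFF,
    }


ADDU_FUNCT = 0x21  # standard MIPS32 value, not taken from the slides
LW_OP = 0x23       # standard MIPS32 value, not taken from the slides

addu_word = encode_r(0, rs=2, rt=3, rd=1, shamt=0, funct=ADDU_FUNCT)  # addu $1, $2, $3
lw_word = encode_i(LW_OP, rs=2, rt=15, imm16=0)                       # lw $15, 0($2)

print(f"{addu_word:#010x} {lw_word:#010x}")
print(decode(lw_word))
```

Because every format keeps op, rs and rt in the same bit positions, the decoder can read the register file before it knows which format it is looking at, which is one reason fixed-width formats suit pipelining.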

Datapath vs Control
[Figure: the Controller drives the Datapath through control points; the Datapath returns signals to the Controller]
• Datapath: storage, functional units (FU), and interconnect sufficient to perform the desired functions
  − Inputs are control points
  − Outputs are signals
• Controller: state machine to orchestrate operation on the datapath
  − Based on desired function and signals

Datapath and Control: split the state diagram into 5 pieces
• Instruction fetch (IFetch): IR <- Mem[PC]; NPC <- PC+4; PC <- PC+4
• Register read/decode (Reg/ID): A <- R[rs]; B <- R[rt]; V <- imm or offset <- imm
• Execute (Exec)
  − R-type operation: S <- A + B
  − Immediate-type operation: S <- A or V
  − Immediate-type Load: S <- A + offset
  − Immediate-type Store: S <- A + offset
  − Branch: S <- NPC + offset; Cond <- A op 0
• Memory access (Mem)
  − Load: M <- Mem[S]
  − Store: Mem[S] <- B
  − Branch: if Cond then PC <- S else PC <- NPC
• Write back (WB)
  − R-type operation: R[rd] <- S
  − Immediate-type operation: R[rt] <- S
  − Load: R[rt] <- M

5 Steps of DLX Datapath
Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back
[Figure: unpipelined DLX datapath: Next PC and Next SEQ PC with a +4 adder, instruction memory, register file (RS1, RS2, RD), sign extend for Imm, muxes, ALU with Zero? test, data memory, LMD, and WB Data]
IR <= mem[PC]; PC <= PC + 4
Reg[IRrd] <= Reg[IRrs] op(IRop) Reg[IRrt]

5 Steps of DLX Datapath
Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back
[Figure: the DLX datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB between the stages; the RD field is carried along through the pipeline registers]
IR <= mem[PC]; PC <= PC + 4
A <= Reg[IRrs]; B <= Reg[IRrt]
rslt <= A op(IRop) B
WB <= rslt
Reg[IRrd] <= WB

5 Steps of DLX Datapath
[Figure: pipelined DLX datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB]
• Data stationary control: local decode for each instruction phase / pipeline stage

Visualizing Pipelining
[Figure: instructions in program order (vertical axis) against time in clock cycles, Cycle 1 to Cycle 7 (horizontal axis); each instruction passes through Ifetch, Reg, ALU, DMem, Reg]
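A text version of this kind of diagram can be generated with a few lines of Python. This is a sketch under the assumption of an ideal five-stage pipeline with no hazards, where each instruction enters one cycle after its predecessor; the stage abbreviations and the function name are chosen here for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_chart(n_instructions):
    """Return one row per instruction; column c holds the stage occupied in cycle c+1."""
    total_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = ["."] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name  # instruction i reaches stage s in cycle i + s + 1
        rows.append(row)
    return rows


chart = pipeline_chart(3)
for row in chart:
    print(" ".join(f"{cell:>3}" for cell in row))
```

With three instructions the chart spans seven cycles, matching the Cycle 1 to Cycle 7 axis of the figure: n instructions need n + 4 cycles instead of 5n.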

Pipelining is not quite that easy!
• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  − Structural hazards: the hardware cannot support this combination of instructions (a single person to fold and put clothes away)
  − Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (missing sock)
  − Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

Can pipelining get us into trouble?
• Yes: Pipeline Hazards
  − Structural hazards: attempt to use the same resource in two different ways at the same time
    • E.g., a combined washer/dryer would be a structural hazard, or the folder busy doing something else (watching TV)
  − Data hazards: attempt to use an item before it is ready
    • E.g., one sock of a pair in the dryer and one in the washer; can't fold until the sock gets from the washer through the dryer
    • An instruction depends on the result of a prior instruction still in the pipeline
  − Control hazards: attempt to make a decision before the condition is evaluated
    • E.g., washing football uniforms and needing to get the proper detergent level; need to see the result after the dryer before putting the next load in
    • Branch instructions
• Can hazards always be resolved by waiting?
  − Pipeline control must detect the hazard
  − Take action (or delay action) to resolve hazards

Structural hazards: a structural hazard caused by memory access
[Figure: pipeline diagram over time (clock cycles) for Load, Instr 1, Instr 2, Instr 3, Instr 4 in program order; each instruction uses Mem, Reg, ALU, Mem, Reg]
Instruction fetch and data access both need to access the same memory.
Detection is easy in this case! (Right half highlighted means read, left half write.)

Solution to structural hazards: stall
[Figure: pipeline diagram over time (clock cycles) for Load, Instr 1 to Instr 4 in program order, with a Stall inserted]
The instruction fetch is delayed by one cycle.

Control hazards: What's the Problem?
Example: BEQ rs, rt, offset
  if R[rs] == R[rt] then PC <- PC + offset
[Figure: bne r2, #0, r3 with a not-taken (NT) successor add r4, r5, r6 and a taken (T) successor sub r7, r8, r9, drawn against the stages Instruction Fetch, Decode, Execute, Memory Access, Writeback; annotations: "Need address here", "Compute address here", "Branch Delay"]
• Branch handling can be divided into two subproblems
  − Determining the branch direction (branch condition dependence)
  − For taken branches, minimizing the execution delay, i.e. obtaining the branch target address as early as possible (branch address dependence)

Control Hazard Solution #1: Stall
[Figure: pipeline diagram over time (clock cycles) for Add, Beq, Load in program order; the Load after the branch is delayed, marked "Lost potential"]
• Stall: wait until the decision is clear
• Impact: 2 lost cycles (i.e. 3 clock cycles per branch instruction) => slow
• Move the decision to the end of decode
  − Saves 1 cycle per branch
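The cost of stalling can be estimated as extra cycles per instruction. The sketch below assumes an ideal base CPI of 1 and a branch frequency of 20%; the 20% figure is borrowed from the 80x86 instruction mix earlier in the deck purely as an illustrative input, and the function name is hypothetical.

```python
def effective_cpi(base_cpi, branch_fraction, stall_cycles_per_branch):
    """Average CPI when every branch adds a fixed number of stall cycles."""
    return base_cpi + branch_fraction * stall_cycles_per_branch


# Assumed inputs: ideal CPI of 1, 20% of instructions are branches.
cpi_decide_in_execute = effective_cpi(1.0, 0.20, 2)  # 2 lost cycles per branch
cpi_decide_in_decode = effective_cpi(1.0, 0.20, 1)   # decision moved to end of decode

print(cpi_decide_in_execute, cpi_decide_in_decode)
```

Under these assumptions, moving the decision one stage earlier recovers half of the branch penalty, which is why the slide suggests it.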

Pipelined DLX Datapath
Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back
[Figure: pipelined DLX datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB]
• Interplay of instruction set design and cycle time.

Control Hazard Solution #2: Predict
[Figure: pipeline diagram over time (clock cycles) for Add, Beq, Load in program order]
• Predict: guess one direction, then back up if wrong
• Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right 50% of the time)
• More dynamic scheme: history of 1 branch (about 90%)
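The "history of 1 branch" scheme can be sketched as a one-bit predictor that repeats the most recent outcome. The branch pattern below is synthetic (a loop branch taken nine times and then not taken, repeated ten times) and the function name is hypothetical; the accuracy quoted on the slide refers to real programs, not to this toy pattern.

```python
def one_bit_predictor_accuracy(outcomes, initial_prediction=False):
    """Predict that each branch behaves like the previous one; return the hit rate."""
    prediction = initial_prediction
    correct = 0
    for taken in outcomes:
        if prediction == taken:
            correct += 1
        prediction = taken  # remember only the most recent outcome
    return correct / len(outcomes)


# Synthetic loop branch: taken 9 times, then falls through, repeated 10 times.
outcomes = ([True] * 9 + [False]) * 10
accuracy = one_bit_predictor_accuracy(outcomes)
print(accuracy)
```

On this pattern the predictor misses twice per loop execution, once on loop exit and once on re-entry, so a single bit of history pays for every change of direction with two mispredictions.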

Control Hazard Solution #3: Delayed Branch
[Figure: pipeline diagram over time (clock cycles) for Add, Beq, Misc, Load in program order; Misc occupies the slot after the branch]
• Delayed branch: redefine branch behavior (the branch takes effect after the next instruction)
• Impact: 0 clock cycles per branch instruction if an instruction can be found to put in the "slot" (about 50% of the time)
• As more instructions are launched per clock cycle, this becomes less useful

Data Hazard on R1
• Read After Write (RAW): Instr. J tries to read an operand before Instr. I writes it
    I: add r1, r2, r3
    J: sub r4, r1, r3
• Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

Data Hazard on r1: Read after write hazard (RAW)
  add r1, r2, r3
  sub r4, r1, r3
  and r6, r1, r7
  or  r8, r1, r9
  xor r10, r1, r11

Data Hazard on r1: Read after write hazard (RAW)
• Dependencies backwards in time are hazards
[Figure: pipeline diagram over time (clock cycles) with stages IF, ID/RF, EX, MEM, WB for add r1, r2, r3; sub r4, r1, r3; and r6, r1, r7; or r8, r1, r9; xor r10, r1, r11 in program order]

Data Hazard Solution: Forwarding
• "Forward" the result from one stage to another
[Figure: pipeline diagram over time (clock cycles) with stages IF, ID/RF, EX, MEM, WB for add r1, r2, r3; sub r4, r1, r3; and r6, r1, r7; or r8, r1, r9; xor r10, r1, r11 in program order]

HW Change for Forwarding
[Figure: datapath with NextPC, Registers, Immediate, the ID/EX, EX/MEM and MEM/WR pipeline registers, the ALU with muxes on its inputs, and Data Memory with an output mux]
What circuit detects and resolves this hazard?
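One common answer is a forwarding unit that compares the source register of the instruction in EX against the destination registers held in the EX/MEM and MEM/WR pipeline registers. The Python sketch below models that comparison; it follows the usual textbook conditions rather than anything stated on the slide, and the function and label names are hypothetical.

```python
def forward_source(src_reg, ex_mem_dest, mem_wb_dest):
    """Choose where an ALU operand comes from.

    ex_mem_dest / mem_wb_dest are the destination register numbers of the
    instructions one and two stages ahead, or None if they write no register.
    """
    if src_reg == 0:
        return "REGFILE"  # r0 always reads as zero, never forwarded
    if ex_mem_dest == src_reg:
        return "EX/MEM"   # newest value wins
    if mem_wb_dest == src_reg:
        return "MEM/WB"
    return "REGFILE"


# add r1, r2, r3 ; sub r4, r1, r3 ; and r6, r1, r7
# While sub is in EX, add is in MEM:
sub_operand = forward_source(1, ex_mem_dest=1, mem_wb_dest=None)
# While and is in EX, sub is in MEM and add is in WB:
and_operand = forward_source(1, ex_mem_dest=4, mem_wb_dest=1)
print(sub_operand, and_operand)
```

Checking EX/MEM before MEM/WB matters when both stages hold a write to the same register: the younger instruction's value is the one the program order requires.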

Forwarding (or Bypassing): What about Loads?
• Dependencies backwards in time are hazards
[Figure: pipeline diagram over time (clock cycles) with stages IF, ID/RF, EX, MEM, WB for lw r1, 0(r2) followed by sub r4, r1, r3]
• Data hazard even with forwarding
• Can't be solved with forwarding; must delay/stall the instruction that depends on the load

Forwarding (or Bypassing): What about Loads?
• Dependencies backwards in time are hazards
[Figure: pipeline diagram over time (clock cycles) with stages IF, ID/RF, EX, MEM, WB for lw r1, 0(r2) followed by sub r4, r1, r3, with a Stall inserted before the sub executes]
• Data hazard even with forwarding
• Can't be solved with forwarding; must delay/stall the instruction that depends on the load

Software Scheduling to Avoid Load Hazards
Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code:
  LW  Rb, b
  LW  Rc, c
  ADD Ra, Rb, Rc
  SW  a, Ra
  LW  Re, e
  LW  Rf, f
  SUB Rd, Re, Rf
  SW  d, Rd
Fast code:
  LW  Rb, b
  LW  Rc, c
  LW  Re, e
  ADD Ra, Rb, Rc
  LW  Rf, f
  SW  a, Ra
  SUB Rd, Re, Rf
  SW  d, Rd
Compiler optimizes for performance. Hardware checks for safety.
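To see why the second ordering is faster, one can count load-use stalls in each sequence. The sketch below assumes a five-stage pipeline with full forwarding, where the only remaining stall is one cycle when an instruction uses the result of the load immediately before it. The tuple representation and the function name are hypothetical.

```python
def load_use_stalls(program):
    """Count one-cycle stalls: an instruction reading the register loaded just before it.

    program is a list of (opcode, dest_register_or_None, source_registers).
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        if prev_op == "LW" and prev_dest in cur[2]:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("SW", None, ("Ra",)), ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()), ("SW", None, ("Ra",)),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

slow_stalls = load_use_stalls(slow)
fast_stalls = load_use_stalls(fast)
print(slow_stalls, fast_stalls)
```

Both sequences contain the same eight instructions; the fast version simply places an independent instruction after each load whose result is needed next.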

Three Generic Data Hazards
• Write After Read (WAR): Instr. J writes an operand before Instr. I reads it
    I: sub r4, r1, r3
    J: add r1, r2, r3
    K: mul r6, r1, r7
• Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
• Can't happen in the DLX 5-stage pipeline because:
  − All instructions take 5 stages, and
  − Reads are always in stage 2, and
  − Writes are always in stage 5

Three Generic Data Hazards
• Write After Write (WAW): Instr. J writes an operand before Instr. I writes it
    I: sub r1, r4, r3
    J: add r1, r2, r3
    K: mul r6, r1, r7
• Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1".
• Can't happen in the DLX 5-stage pipeline because:
  − All instructions take 5 stages, and
  − Writes are always in stage 5
• We will see WAR and WAW in more complicated pipes
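The three dependence types can be told apart by comparing the destination and source registers of an earlier and a later instruction. The Python sketch below is illustrative only: the (dest, sources) representation and the function name are assumptions, and the function reports potential hazards without modelling whether a particular pipeline can actually exhibit them.

```python
def classify_hazards(first, second):
    """Return the dependence types from `first` (earlier) to `second` (later).

    Each instruction is (dest_register_or_None, tuple_of_source_registers).
    """
    d1, s1 = first
    d2, s2 = second
    kinds = set()
    if d1 is not None and d1 in s2:
        kinds.add("RAW")  # later instruction reads what the earlier one writes
    if d2 is not None and d2 in s1:
        kinds.add("WAR")  # later instruction overwrites what the earlier one reads
    if d1 is not None and d1 == d2:
        kinds.add("WAW")  # both write the same register
    return kinds


# add r1, r2, r3 ; sub r4, r1, r3
raw = classify_hazards(("r1", ("r2", "r3")), ("r4", ("r1", "r3")))
# sub r4, r1, r3 ; add r1, r2, r3
war = classify_hazards(("r4", ("r1", "r3")), ("r1", ("r2", "r3")))
# sub r1, r4, r3 ; add r1, r2, r3
waw = classify_hazards(("r1", ("r4", "r3")), ("r1", ("r2", "r3")))
print(raw, war, waw)
```

The three calls use the I and J instruction pairs from the RAW, WAR and WAW slides respectively.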