14:332:331 Computer Architecture and Assembly Language

Spring 2006, Week 12: Introduction to Pipelined Datapath
[Adapted from Dave Patterson’s UCB CS 152 slides and Mary Jane Irwin’s PSU CSE 331 slides]

Review: Multicycle Data and Control Path
[Figure: multicycle datapath. PC, a single Memory (instructions or data), IR, MDR, Register File, Sign Extend, two Shift-left-2 units, A and B registers, ALU with ALU control, and ALUout. An FSM Control unit, driven by Instr[31-26], generates PCWriteCond, PCWrite, IorD, MemRead, MemWrite, MemtoReg, IRWrite, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, and RegDst.]

Review: RTL Summary
• Instr fetch (all instructions): IR = Memory[PC]; PC = PC + 4;
• Decode (all instructions): A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2);
• Execute
  - R-type: ALUOut = A op B;
  - Mem Ref: ALUOut = A + sign-extend(IR[15-0]);
  - Branch: if (A==B) PC = ALUOut;
  - Jump: PC = PC[31-28] || (IR[25-0] << 2);
• Memory access
  - R-type: Reg[IR[15-11]] = ALUOut;
  - Mem Ref: MDR = Memory[ALUOut]; or Memory[ALUOut] = B;
• Writeback
  - Mem Ref: Reg[IR[20-16]] = MDR;
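The bit-field notation in the RTL can be made concrete with a few lines of Python. This is only a sketch: the function names are hypothetical, and the example word 0x8C410064 is an assumed encoding of lw r1, 100(r2) using the standard MIPS lw opcode (35).

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def decode_fields(ir):
    """Split a 32-bit instruction word into the fields the RTL refers to."""
    return {
        "op": (ir >> 26) & 0x3F,       # IR[31-26]
        "rs": (ir >> 21) & 0x1F,       # IR[25-21]
        "rt": (ir >> 16) & 0x1F,       # IR[20-16]
        "rd": (ir >> 11) & 0x1F,       # IR[15-11]
        "imm": ir & 0xFFFF,            # IR[15-0]
        "target": ir & 0x03FFFFFF,     # IR[25-0]
    }

def branch_target(pc, ir):
    """ALUOut = PC + (sign-extend(IR[15-0]) << 2); pc is the already incremented PC."""
    return (pc + (sign_extend16(ir) << 2)) & 0xFFFFFFFF

def jump_target(pc, ir):
    """PC = PC[31-28] || (IR[25-0] << 2)."""
    return (pc & 0xF0000000) | ((ir & 0x03FFFFFF) << 2)

fields = decode_fields(0x8C410064)  # assumed encoding of lw r1, 100(r2)
```

Here fields["rs"] selects the base register read into A, and fields["rt"] is the register written from MDR in the writeback step.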

Review: Multicycle Datapath FSM
[Figure: ten-state control FSM. Unless otherwise assigned: PCWrite, IRWrite, MemWrite, RegWrite = 0; others = X.]
• State 0, Instr Fetch (Start): IorD=0, MemRead, IRWrite, ALUSrcA=0, ALUSrcB=01, PCSource, ALUOp=00, PCWrite
• State 1, Decode: ALUSrcA=0, ALUSrcB=11, ALUOp=00, PCWriteCond=0
• Execute
  - State 2 (Op = lw or sw): ALUSrcA=1, ALUSrcB=10, ALUOp=00, PCWriteCond=0
  - State 6 (Op = R-type): ALUSrcA=1, ALUSrcB=00, ALUOp=10, PCWriteCond=0
  - State 8 (Op = beq): ALUSrcA=1, ALUSrcB=00, ALUOp=01, PCSource=01, PCWriteCond
  - State 9 (Op = j): PCSource=10, PCWrite
• Memory Access
  - State 3 (Op = lw): MemRead, IorD=1, PCWriteCond=0
  - State 5 (Op = sw): MemWrite, IorD=1, PCWriteCond=0
  - State 7 (R-type): RegDst=1, RegWrite, MemtoReg=0, PCWriteCond=0
• Write Back
  - State 4 (lw): RegDst=0, RegWrite, MemtoReg=1, PCWriteCond=0
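The state transitions of this FSM can be written as a small next-state table. The sketch below uses the state numbers from the diagram; the Python names (NEXT_STATE, cycles_for) and the use of opcode strings instead of the 6-bit Op field are illustrative assumptions.

```python
# Next-state function per state; the argument is the instruction class.
NEXT_STATE = {
    0: lambda op: 1,                                   # Instr Fetch -> Decode
    1: lambda op: {"lw": 2, "sw": 2, "R-type": 6,      # Decode -> Execute state
                   "beq": 8, "j": 9}[op],
    2: lambda op: 3 if op == "lw" else 5,              # address computed -> memory access
    3: lambda op: 4,                                   # lw: memory read -> write back
    4: lambda op: 0,                                   # lw write back -> fetch
    5: lambda op: 0,                                   # sw memory write -> fetch
    6: lambda op: 7,                                   # R-type execute -> register write
    7: lambda op: 0,
    8: lambda op: 0,                                   # beq completes in Execute
    9: lambda op: 0,                                   # j completes in Execute
}

def cycles_for(op):
    """Count the clock cycles one instruction spends in the FSM."""
    state, cycles = 0, 0
    while True:
        state = NEXT_STATE[state](op)
        cycles += 1
        if state == 0:          # back at Instr Fetch: instruction finished
            return cycles
```

Walking the table gives 5 cycles for lw, 4 for sw and R-type, and 3 for beq and j, which is why the multicycle design lets different instructions take different numbers of clock cycles.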

Review: FSM Implementation
[Figure: combinational control logic fed by a State Reg that is clocked by the System Clock. Inputs: Op5-Op0 from Inst[31-26] and the current state. Outputs: PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSourceB, ALUSourceA, RegWrite, RegDst, and Next State.]

Single Cycle Disadvantages & Advantages
• Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction
[Figure: Single Cycle Implementation. lw fills Cycle 1; sw finishes early in Cycle 2 and the rest of the cycle is waste.]
• Is wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared during a clock cycle
but
• Is simple and easy to understand

Multicycle Advantages & Disadvantages
• Uses the clock cycle efficiently: the clock cycle is timed to accommodate the slowest instruction step
  - balance the amount of work to be done in each step
  - restrict each step to use only one major functional unit
• Multicycle implementations allow
  - functional units to be used more than once per instruction as long as they are used on different clock cycles
  - faster clock rates
  - different instructions to take a different number of clock cycles
but
• Requires additional internal state registers, muxes, and more complicated (FSM) control

The Five Stages of Load Instruction
[Figure: lw across Cycle 1 to Cycle 5: IFetch, Dec, Exec, Mem, WB]
• IFetch: Instruction Fetch and Update PC
• Dec: Registers Fetch and Instruction Decode
• Exec: Execute R-type; calculate memory address
• Mem: Read/write the data from/to the Data Memory
• WB: Write the data back to the register file

Single Cycle vs. Multiple Cycle Timing
[Figure: Single Cycle Implementation: lw occupies all of Cycle 1; sw finishes before the end of Cycle 2 and the remainder is waste. Multiple Cycle Implementation: lw takes Cycles 1-5 (IFetch, Dec, Exec, Mem, WB), sw takes Cycles 6-9 (IFetch, Dec, Exec, Mem), and an R-type instruction starts with IFetch in Cycle 10.]
• The multicycle clock period is longer than 1/5 of the single cycle clock period because of stage flip-flop overhead

Pipelined MIPS Processor
• Start the next instruction while still working on the current one
  - improves throughput: total amount of work done in a given time
  - instruction latency (execution time, delay time, response time) is not reduced: time from the start of an instruction to its completion
[Figure: lw, sw, and an R-type instruction overlapped across Cycles 1-8, each starting one cycle after the previous one and passing through IFetch, Dec, Exec, Mem, WB]

Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: timing of the three implementations]
• Single Cycle Implementation: Load fills Cycle 1; Store finishes early in Cycle 2 and the rest of the cycle is waste
• Multiple Cycle Implementation: lw in Cycles 1-5 (IFetch, Dec, Exec, Mem, WB), sw in Cycles 6-9 (IFetch, Dec, Exec, Mem), R-type begins IFetch in Cycle 10
• Pipeline Implementation: lw, sw, and R-type each pass through IFetch, Dec, Exec, Mem, WB, starting one cycle apart; the WB slot of sw is a wasted cycle
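The three timelines can be compared numerically. The sketch below counts time in units of one stage delay and assumes, for illustration only, that all five stages take equal time, that flip-flop overhead is zero, and that no hazards occur; the function names are hypothetical.

```python
STAGES = 5
# Multicycle steps per instruction, as drawn in the timing diagram
# (the R-type count of 4 is taken from the RTL summary).
STEPS = {"lw": 5, "sw": 4, "R-type": 4}

def single_cycle_time(program, stage_time=1):
    """Every instruction takes one long cycle sized for the slowest (5 stages)."""
    return len(program) * STAGES * stage_time

def multicycle_time(program, stage_time=1):
    """Each instruction takes only as many short cycles as it needs."""
    return sum(STEPS[instr] for instr in program) * stage_time

def pipelined_time(program, stage_time=1):
    """Fill the pipeline once, then finish one instruction per cycle."""
    return (STAGES + len(program) - 1) * stage_time

program = ["lw", "sw", "R-type"]
times = (single_cycle_time(program), multicycle_time(program), pipelined_time(program))
```

For this three-instruction sequence the totals are 15, 13, and 7 stage delays. A real multicycle or pipelined clock would be somewhat slower than one fifth of the single cycle clock, so the gap is smaller in practice.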

Pipelining the MIPS ISA
• What makes it easy
  - all instructions are the same length (32 bits)
  - few instruction formats (three) with symmetry across formats
  - memory operations can occur only in loads and stores
  - operands must be aligned in memory so a single data transfer requires only one memory access
• What makes it hard
  - structural hazards: what if we had only one memory
  - control hazards: what about branches
  - data hazards: what if an instruction’s input operands depend on the output of a previous instruction

MIPS Pipeline Datapath Modifications
• What do we need to add/modify in our MIPS datapath?
  - State registers between pipeline stages to isolate them
[Figure: five-stage pipelined datapath (IFetch, Dec, Exec, Mem, WB) with PC, Instruction Memory, Register File, Sign Extend, ALU, and Data Memory, separated by the IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB state registers, all driven by the System Clock]

MIPS Pipeline Control Path Modifications
• All control signals are determined during Decode
  - and held in the state registers between pipeline stages
[Figure: the five-stage pipelined datapath (IFetch, Dec, Exec, Mem, WB) with a Control unit added in the Dec stage; its outputs travel through the Dec/Exec, Exec/Mem, and Mem/WB state registers]

Graphically Representing MIPS Pipeline
[Figure: one instruction drawn as IM, Reg, ALU, DM, Reg]
• Can help with answering questions like:
  - how many cycles does it take to execute this code?
  - what is the ALU doing during cycle 4?
  - is there a hazard, why does it occur, and how can it be fixed?

Why Pipeline? For Throughput!
[Figure: Inst 0 through Inst 4 in instruction order against time (clock cycles), each passing through IM, Reg, ALU, DM, Reg and starting one cycle after the previous one. The first cycles are the time to fill the pipeline.]
• Once the pipeline is full, one instruction is completed every cycle
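The pipeline diagram can be queried with simple arithmetic. This sketch assumes an ideal pipeline with no stalls, 0-based instruction indices (Inst 0, Inst 1, ...), and 1-based cycle numbers; the function names are hypothetical.

```python
# Stage labels as drawn in the diagram; "Reg" appears twice
# (register read in decode, register write in write-back).
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]

def stage_in_cycle(instr_index, cycle):
    """Stage occupied by instruction instr_index during the given cycle, or None."""
    position = cycle - 1 - instr_index
    return STAGES[position] if 0 <= position < len(STAGES) else None

def completed_by(cycle):
    """How many instructions have finished write-back by the end of the cycle."""
    return max(0, cycle - len(STAGES) + 1)
```

For example, stage_in_cycle(1, 4) shows that Inst 1 is using the ALU in cycle 4, and completed_by rises by exactly one per cycle once the first four fill cycles have passed.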

Can pipelining get us into trouble?
• Yes: Pipeline Hazards
  - structural hazards: attempt to use the same resource by two different instructions at the same time
  - data hazards: attempt to use an item before it is ready (an instruction depends on the result of a prior instruction still in the pipeline)
  - control hazards: attempt to make a decision before the condition is evaluated (branch instructions)
• Can always resolve hazards by waiting
  - pipeline control must detect the hazard
  - take action (or delay action) to resolve hazards

A Unified Memory Would Be a Structural Hazard
[Figure: lw followed by Inst 1 through Inst 4 in instruction order against time (clock cycles), each passing through Mem, Reg, ALU, Mem, Reg. In the cycle where lw is reading data from memory, a later instruction is reading its instruction from the same memory.]

How About Register File Access?
[Figure: add, Inst 1, Inst 2, add, Inst 4 in instruction order against time (clock cycles), each passing through IM, Reg, ALU, DM, Reg]
• Can fix register file access hazard by doing reads in the second half of the cycle and writes in the first half.

Register Usage Can Cause Data Hazards
• Dependencies backward in time cause hazards
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction) for the sequence:]
add r1, r2, r3
sub r4, r1, r5
and r6, r1, r7
or r8, r1, r9
xor r4, r1, r5
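A read-after-write check over this sequence can be sketched as follows. The Instr tuple and raw_hazards function are hypothetical names. The window of 2 is an assumption: it relies on the register file writing in the first half of a cycle and reading in the second half, so a consumer three slots behind the producer already sees the new value. The sketch also ignores the case where an intervening instruction overwrites the register.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "op dest srcs")

def raw_hazards(program, window=2):
    """Return (producer, consumer, register) index triples that would read a stale value."""
    hazards = []
    for i, producer in enumerate(program):
        # Only the next `window` instructions read before the producer writes back.
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if producer.dest is not None and producer.dest in program[j].srcs:
                hazards.append((i, j, producer.dest))
    return hazards

program = [
    Instr("add", "r1", ("r2", "r3")),
    Instr("sub", "r4", ("r1", "r5")),
    Instr("and", "r6", ("r1", "r7")),
    Instr("or",  "r8", ("r1", "r9")),
    Instr("xor", "r4", ("r1", "r5")),
]
```

Under these assumptions only sub and and conflict with add; or and xor read r1 late enough to get the updated value.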

One Way to “Fix” a Data Hazard
• Can fix data hazard by waiting (stall), but this affects throughput
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction) with the dependent instructions delayed:]
add r1, r2, r3
stall
sub r4, r1, r5
and r6, r1, r7
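The cost of stalling can be estimated with a small calculation. This is a sketch under stated assumptions: a five-stage pipeline, no forwarding, and a register file that writes in the first half of a cycle and reads in the second half, so a consumer may decode in the same cycle its producer writes back. Both function names are hypothetical.

```python
def stalls_without_forwarding(distance):
    """Stall cycles for a consumer `distance` instructions after its producer.

    The producer writes back 4 cycles after its fetch; the consumer decodes
    distance + 1 cycles after the producer's fetch, so it must wait
    4 - (distance + 1) = 3 - distance cycles when that is positive.
    """
    return max(0, 3 - distance)

def total_cycles(num_instructions, stall_cycles, stages=5):
    """Cycles to run a sequence: fill the pipeline, then one per cycle, plus stalls."""
    return stages + num_instructions - 1 + stall_cycles
```

For add followed immediately by the dependent sub, this gives two stall cycles, so the three-instruction sequence add, sub, and takes 9 cycles instead of 7.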

Another Way to “Fix” a Data Hazard
• Can fix data hazard by forwarding results as soon as they are available to where they are needed
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction) with forwarding paths from add to the later instructions:]
add r1, r2, r3
sub r4, r1, r5
and r6, r1, r7
or r8, r1, r9
xor r4, r1, r5
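The decision a forwarding unit makes for one ALU source operand can be sketched as below. The conditions follow the usual textbook formulation rather than anything spelled out on the slide, and the function name, parameter names, and return labels are hypothetical. Register 0 is excluded because it is hardwired to zero in MIPS.

```python
def forward_select(src_reg, exec_mem_regwrite, exec_mem_rd,
                   mem_wb_regwrite, mem_wb_rd):
    """Choose where one ALU source operand should come from.

    Returns "Exec/Mem" or "Mem/WB" when a newer value is waiting in that
    pipeline register, otherwise "RegFile".
    """
    # The most recent producer is one stage ahead, so check Exec/Mem first.
    if exec_mem_regwrite and exec_mem_rd != 0 and exec_mem_rd == src_reg:
        return "Exec/Mem"
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "Mem/WB"
    return "RegFile"
```

With add r1, r2, r3 one stage ahead of sub r4, r1, r5, the call forward_select(1, True, 1, False, 0) selects the Exec/Mem value, so sub does not need to stall.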

Loads Can Cause Data Hazards
• Dependencies backward in time cause hazards
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction) for the sequence:]
lw r1, 100(r2)
sub r4, r1, r5
and r6, r1, r7
or r8, r1, r9
xor r4, r1, r5

Stores Can Cause Data Hazards
• Dependencies backward in time cause hazards
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction) for the sequence:]
add r1, r2, r3
sw r1, 100(r5)
and r6, r1, r7
or r8, r1, r9
xor r4, r1, r5

Forwarding with Load-use Data Hazards
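Even with forwarding, a load followed immediately by an instruction that uses the loaded register needs one stall cycle, because the data is not available until the end of the Mem stage. A hazard detection check for that case can be sketched as below. It follows the usual textbook condition; the function and parameter names are hypothetical, with pipeline register names borrowed from the datapath figure.

```python
def load_use_stall(dec_exec_memread, dec_exec_rt, ifetch_dec_rs, ifetch_dec_rt):
    """True when the instruction in Dec must wait for a load that is in Exec.

    dec_exec_memread: the instruction ahead is a load (it will read data memory)
    dec_exec_rt:      destination register of that load
    ifetch_dec_rs/rt: source registers of the instruction being decoded
    """
    return bool(dec_exec_memread and dec_exec_rt in (ifetch_dec_rs, ifetch_dec_rt))
```

For lw r1, 100(r2) followed by sub r4, r1, r5, the check load_use_stall(True, 1, 1, 5) reports that one bubble is needed; after that bubble the loaded value can be forwarded from Mem/WB.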

Branch Instructions Cause Control Hazards
• Dependencies backward in time cause hazards
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction) for the sequence add, beq, lw, Inst 3, Inst 4]

One Way to “Fix” a Control Hazard
• Can fix branch hazard by waiting (stall), but this affects throughput
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction) for the sequence add, beq, stall, lw, Inst 3]
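The throughput cost of stalling on every branch can be estimated with a one-line model. This is a sketch: the function name is hypothetical, the base CPI of 1 assumes an otherwise ideal pipeline, and the number of stall cycles per branch depends on which stage resolves the branch, so it is left as a parameter.

```python
def cpi_with_branch_stalls(branch_frequency, stall_cycles, base_cpi=1.0):
    """Average cycles per instruction when every branch inserts stall_cycles bubbles."""
    return base_cpi + branch_frequency * stall_cycles
```

As an illustration, if 20% of instructions were branches and each cost one stall cycle, the average CPI would rise from 1.0 to 1.2.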

Other Pipeline Structures Are Possible
• What about (slow) multiply operation?
  - let it take two cycles
  [Figure: IM, Reg, ALU with a MUL unit, DM, Reg]
• What if the data memory access is twice as slow as the instruction memory?
  - make the clock twice as slow or …
  - let data memory access take two cycles (and keep the same clock rate)
  [Figure: IM, Reg, ALU, DM1, DM2, Reg]

Sample Pipeline Alternatives
• ARM7: IM, Reg, EX
  - IM: PC update, IM access
  - Reg: decode, reg access
  - EX: ALU op, DM access, shift/rotate, commit result (write back)
• StrongARM-1: IM, Reg, ALU, DM, Reg
• XScale: IM1, IM2, Reg, SHFT, ALU, DM1, DM2
  - IM1: PC update, BTB access, start IM access
  - IM2: IM access
  - Reg: decode, reg 1 access
  - SHFT: shift/rotate, reg 2 access
  - ALU: ALU op
  - DM1: start DM access, exception
  - DM2: DM write, reg write

Summary
• All modern day processors use pipelining
• Pipelining doesn’t help latency of a single task; it helps throughput of the entire workload
  - Multiple tasks operating simultaneously using different resources
• Potential speedup = Number of pipe stages
• Pipeline rate limited by slowest pipeline stage
  - Unbalanced lengths of pipe stages reduce speedup
  - Time to “fill” pipeline and time to “drain” it reduce speedup
• Must detect and resolve hazards
  - Stalling negatively affects throughput
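The effect of fill and drain time on speedup can be seen with a short calculation. The sketch assumes perfectly balanced stages, no hazards, and an unpipelined machine that spends one stage delay per stage per instruction; pipeline_speedup is a hypothetical name.

```python
def pipeline_speedup(num_instructions, stages=5):
    """Speedup of an ideal pipeline over running each instruction to completion."""
    unpipelined = num_instructions * stages
    pipelined = stages + num_instructions - 1   # fill once, then one per cycle
    return unpipelined / pipelined
```

A single instruction gains nothing (speedup 1.0), while a long run approaches but never reaches the number of pipe stages: pipeline_speedup(1000) is just under 5.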

Performance
• Purchasing perspective: given a collection of machines, which has the
  - best performance?
  - least cost?
  - best performance / cost?
• Design perspective: faced with design options, which has the
  - best performance improvement?
  - least cost?
  - best performance / cost?
• Both require
  - basis for comparison
  - metric for evaluation
• Our goal is to understand cost & performance implications of architectural choices

Two notions of “performance”

Plane              DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747         6.5 hours     610 mph    470          286,700
BAC/Sud Concorde   3 hours       1350 mph   132          178,200

Which has higher performance?
• Time to do the task (Execution Time): execution time, response time, latency
• Tasks per day, hour, week, sec, ns... (Performance): throughput, bandwidth
Response time and throughput often are in opposition

Definitions
• Performance is in units of things-per-second
  - bigger is better
• If we are primarily concerned with response time
  - $\text{performance}(X) = \dfrac{1}{\text{execution\_time}(X)}$
• "X is n times faster than Y" means
  $$n = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$$

Example
• Time of Concorde vs. Boeing 747?
  - Concorde is 1350 mph / 610 mph = 2.2 times faster
  - = 6.5 hours / 3 hours
• Throughput of Concorde vs. Boeing 747?
  - Concorde is 178,200 pmph / 286,700 pmph = 0.62 “times faster”
  - Boeing is 286,700 pmph / 178,200 pmph = 1.6 “times faster”
• Boeing is 1.6 times (“60%”) faster in terms of throughput
• Concorde is 2.2 times (“120%”) faster in terms of flying time
• We will focus primarily on execution time for a single job
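The ratios in this example can be reproduced directly from the table figures. The dictionary layout and function names below are illustrative; pmph is passengers times miles per hour.

```python
planes = {
    "Boeing 747": {"hours": 6.5, "mph": 610, "passengers": 470},
    "Concorde":   {"hours": 3.0, "mph": 1350, "passengers": 132},
}

def throughput_pmph(plane):
    """Passenger-miles per hour."""
    return plane["passengers"] * plane["mph"]

def times_faster(performance_x, performance_y):
    """n such that X is n times faster than Y."""
    return performance_x / performance_y

boeing, concorde = planes["Boeing 747"], planes["Concorde"]

# Response time view: performance is 1 / execution time.
by_time = times_faster(1 / concorde["hours"], 1 / boeing["hours"])
# Throughput view: performance is passenger-miles per hour.
by_throughput = times_faster(throughput_pmph(boeing), throughput_pmph(concorde))
```

by_time rounds to 2.2 in favor of the Concorde and by_throughput rounds to 1.6 in favor of the Boeing 747, which is the opposition between response time and throughput that the example illustrates.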

Basis of Evaluation
• Actual Target Workload
  - Pros: representative
  - Cons: very specific; non-portable; difficult to run, or measure; hard to identify cause
• Full Application Benchmarks
  - Pros: portable; widely used; improvements useful in reality
  - Cons: less representative
• Small “Kernel” Benchmarks
  - Pros: easy to run, early in design cycle
  - Cons: easy to “fool”
• Microbenchmarks
  - Pros: identify peak capability and potential bottlenecks
  - Cons: “peak” may be a long way from application performance

SPEC95
• Eighteen application benchmarks (with inputs) reflecting a technical computing workload
• Eight integer
  - go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
• Ten floating-point intensive
  - tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
• Must run with standard compiler flags
  - eliminate special undocumented incantations that may not even generate working code for real programs

Metrics of performance
[Figure: levels of the system stack, each paired with its typical metric]
• Application: answers per month, useful operations per second
• Programming Language, Compiler, ISA: (millions) of instructions per second (MIPS); (millions) of (F.P.) operations per second (MFLOP/s)
• Datapath, Control: megabytes per second
• Function Units, Transistors, Wires, Pins: cycles per second (clock rate)
Each metric has a place and a purpose, and each can be misused

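Aspects of CPU Performance
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
[Table relating Program, Compiler, Instr. Set Arch., Organization, and Technology (rows) to the factors instr. count, CPI, and clock rate (columns)]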

CPI
“Average cycles per instruction”
$$\text{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Clock Cycles}}{\text{Instruction Count}}$$
$$\text{CPU time} = \text{ClockCycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$$
$$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i \quad \text{where } F_i = \frac{I_i}{\text{Instruction Count}} \text{ ("instruction frequency")}$$
Invest Resources where time is Spent!

Example (RISC processor)
Base Machine (Reg / Reg), Typical Mix

Op       Freq   Cycles   CPI(i)   % Time
ALU      50%    1        0.5      23%
Load     20%    5        1.0      45%
Store    10%    3        0.3      14%
Branch   20%    2        0.4      18%
Total                    2.2

• How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
• How does this compare with using branch prediction to shave a cycle off the branch time?
• What if two ALU instructions could be executed at once?
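The three questions can be worked through with the CPI formula. The sketch below assumes the clock rate and instruction count stay unchanged, so speedup is simply the ratio of CPIs. Modeling "two ALU instructions at once" as halving the ALU cycle count is an optimistic assumption (every ALU instruction finds a partner). All names are hypothetical.

```python
# Instruction mix from the table: op -> (frequency, cycles)
BASE_MIX = {
    "ALU":    (0.50, 1),
    "Load":   (0.20, 5),
    "Store":  (0.10, 3),
    "Branch": (0.20, 2),
}

def cpi(mix):
    """CPI = sum over instruction classes of CPI_i * F_i."""
    return sum(freq * cycles for freq, cycles in mix.values())

def with_cycles(mix, op, cycles):
    """Copy of the mix with one class given a new cycle count."""
    changed = dict(mix)
    changed[op] = (mix[op][0], cycles)
    return changed

def speedup(old_mix, new_mix):
    """Old CPI over new CPI, assuming the same clock and instruction count."""
    return cpi(old_mix) / cpi(new_mix)

better_cache = speedup(BASE_MIX, with_cycles(BASE_MIX, "Load", 2))
branch_pred = speedup(BASE_MIX, with_cycles(BASE_MIX, "Branch", 1))
dual_alu = speedup(BASE_MIX, with_cycles(BASE_MIX, "ALU", 0.5))
```

Under these assumptions the better data cache lowers CPI from 2.2 to 1.6 (speedup 1.375), branch prediction lowers it to 2.0 (speedup 1.1), and dual ALU issue lowers it to 1.95 (speedup about 1.13). Loads account for the largest share of time, so improving them pays off most.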

Amdahl's Law
Speedup due to enhancement E:
$$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then
$$\text{ExTime}(\text{with } E) = \left((1 - F) + \frac{F}{S}\right) \times \text{ExTime}(\text{without } E)$$
$$\text{Speedup}(\text{with } E) = \frac{1}{(1 - F) + F/S}$$
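The formula translates directly into code. amdahl_speedup is a hypothetical name; fraction is F (the share of original execution time that is affected) and factor is S.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the time is accelerated by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)
```

As a usage example, speeding up half of a task by a factor of 2 gives amdahl_speedup(0.5, 2), about 1.33. Even an enormous factor cannot push that case past 2, because the unaffected half still takes its full time.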

Summary: Evaluating Instruction Sets?
Design-time metrics:
• Can it be implemented, in how long, at what cost?
• Can it be programmed? Ease of compilation?
Static Metrics:
• How many bytes does the program occupy in memory?
Dynamic Metrics:
• How many instructions are executed?
• How many bytes does the processor fetch to execute the program?
• How many clocks are required per instruction?
• How "lean" a clock is practical?
Best Metric: Time to execute the program!
[Figure: triangle with corners Inst. Count, CPI, and Cycle Time]
NOTE: this depends on instruction set, processor organization, and compilation techniques.