CS 162 Computer Architecture Lecture 2: Introduction & Pipelining

CS 162 Computer Architecture, Lecture 2: Introduction & Pipelining
Instructor: L. N. Bhuyan
www.cs.ucr.edu/~bhuyan/cs162
1999 ©UCB

Review of Last Class
° MIPS Datapath
° Introduction to Pipelining
° Introduction to Instruction-Level Parallelism (ILP)
° Introduction to VLIW

What is Multiprocessing?
° Parallelism at the instruction level is limited by data dependences => speedup is limited!!
° Program-level parallelism is abundantly available, e.g., loop-level parallelism such as Do I = 1, 1000. How about employing multiple processors to execute the loop iterations => parallel processing or multiprocessing
° With a billion transistors on a chip, we can put a few CPUs in one chip => chip multiprocessor
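The loop-level parallelism described above can be sketched in Python: each iteration of the Do loop is independent, so the iterations can be farmed out to multiple worker processes. This is a minimal illustration, not the slide's method; the loop body `body` is a hypothetical stand-in for real work.

```python
from multiprocessing import Pool

def body(i):
    # Hypothetical loop body: each iteration is independent of the others,
    # so iterations may run on different processors (loop-level parallelism).
    return i * i

if __name__ == "__main__":
    # Sequential version: Do I = 1, 1000
    sequential = [body(i) for i in range(1, 1001)]

    # Parallel version: the same iterations distributed over 4 worker processes.
    with Pool(processes=4) as pool:
        parallel = pool.map(body, range(1, 1001))

    assert parallel == sequential  # same result, potentially less wall time
```

Instruction-level parallelism inside `body` would be limited by its data dependences; the speedup here comes from running whole iterations side by side.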

Memory Latency Problem
Even if we increase CPU power, memory is the real bottleneck. Techniques to alleviate the memory latency problem:
1. Memory hierarchy – program locality, cache memory, multiple levels, pages and context switching
2. Prefetching – get the instruction/data before the CPU needs it. Good for instructions because of sequential locality, so all modern processors use prefetch buffers for instructions. What to do about data?
3. Multithreading – can the CPU jump to another program while accessing memory? It's like multiprogramming!!

Hardware Multithreading
° We need to develop a hardware multithreading technique because switching between threads in software is very time-consuming (why?), so it is not suitable for hiding main-memory (as opposed to I/O) latency. Ex: multitasking
° Provide multiple PCs and register sets on the CPU so that a thread switch can occur without having to save the register contents to main memory (on the stack, as is done for context switching).
° Several threads reside in the CPU simultaneously, and execution switches between the threads on main-memory accesses.
° How about both multiprocessing and multithreading on a chip? => Network Processor
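The switch-on-memory-access policy above can be sketched as a toy simulator: each hardware context keeps its own PC, and the "CPU" moves to the next context only when the current thread issues a memory access. This is a hedged illustration of coarse-grained multithreading; the instruction encoding (`"compute"` / `"mem"`) and function names are hypothetical.

```python
def run(threads):
    """threads: one instruction list per hardware context.
    Returns the interleaved execution order as (thread_id, insn) pairs."""
    pcs = [0] * len(threads)                  # one PC per hardware context
    order = []
    current = 0
    while any(pc < len(t) for pc, t in zip(pcs, threads)):
        if pcs[current] >= len(threads[current]):
            current = (current + 1) % len(threads)   # this context is done
            continue
        insn = threads[current][pcs[current]]
        order.append((current, insn))
        pcs[current] += 1
        if insn == "mem":
            # Long-latency memory access: switch contexts instead of stalling.
            current = (current + 1) % len(threads)
    return order

print(run([["compute", "mem", "compute"], ["compute", "compute"]]))
```

Because every context's PC and registers live on the CPU, the switch costs nothing beyond selecting another context, unlike a software context switch that must spill state to memory.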

Architectural Comparisons (cont.)
[Figure: issue slots per processor cycle for superscalar, fine-grained multithreading, coarse-grained multithreading, multiprocessing, and simultaneous multithreading; shading distinguishes Threads 1–5 and idle slots.]

Intel IXP 1200 Network Processor
n Initial component of the Intel Exchange Architecture (IXA)
n Each microengine is a 5-stage pipeline – no ILP, 4-way multithreaded
n 7-core multiprocessing – 6 microengines and a StrongARM core
n 166 MHz fundamental clock rate
q Intel claims 2.5 Mpps IP routing for 64-byte packets
n Already the most widely used NPU
q Or, more accurately, the most widely admitted use

IXP 1200 Chip Layout
n StrongARM core processing
n Microengines introduce a new ISA
n I/O
q PCI
q SDRAM
q SRAM
q IX: PCI-like packet bus
n On-chip FIFOs
q 16 entries, 64 B each

IXP 1200 Microengine
n 4 hardware contexts
q Single-issue processor
q Explicit optional context switch on SRAM access
n Registers
q All are single-ported
q Separate GPRs
q 1536 registers total
n 32-bit ALU
q Can access GPR or XFER registers
n Standard 5-stage pipe
n 4 KB SRAM instruction store – not a cache!

Intel IXP 2400 Microengine (New)
n XScale core replaces StrongARM
n 1.4 GHz target in 0.13-micron
n Nearest-neighbor routes added between microengines
n Hardware to accelerate CRC operations and random-number generation
n 16-entry CAM

MIPS Pipeline (Chapter 6, CS 161 Text)

Review: Single-cycle Datapath for MIPS
[Figure: single-cycle datapath – PC and Instruction Memory (Imem, Stage 1), Registers (Stage 2), ALU (Stage 3), Data Memory (Dmem, Stage 4), and register write-back (Stage 5).]
° Use the datapath figure to represent the pipeline: IFtch, Dcd, Exec, Mem, WB (IM, Reg, ALU, DM, Reg)

Stages of Execution in Pipelined MIPS
5-stage instruction pipeline:
1) I-fetch: fetch instruction, increment PC
2) Decode: decode instruction, read registers
3) Execute: Mem-reference: calculate address; R-format: perform ALU operation
4) Memory: Load: read data from data memory; Store: write data to data memory
5) Write Back: write data to register
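With one instruction entering the pipe per cycle, the five stages above imply a simple cycle count for a hazard-free pipeline: the first instruction takes five cycles, and each later one finishes a cycle after its predecessor. A small sketch:

```python
def pipeline_cycles(n_insns, n_stages=5):
    # The first instruction needs n_stages cycles to traverse the pipe;
    # each subsequent instruction completes one cycle later.
    return n_stages + (n_insns - 1)

print(pipeline_cycles(1))    # 5 cycles just to fill the pipe
print(pipeline_cycles(100))  # 104 cycles, versus 500 without pipelining
```

This is the idealized count; hazards and stalls, covered later, only add cycles.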

Pipelined Execution Representation
[Figure: instructions flowing through IFtch, Dcd, Exec, Mem, WB, each staggered by one cycle along the time axis, with program flow proceeding downward.]
° To simplify the pipeline, every instruction takes the same number of steps, called stages
° One clock cycle per stage

Datapath Timing: Single-cycle vs. Pipelined
° Assume the following delays for major functional units:
• 2 ns for a memory access or ALU operation
• 1 ns for a register file read or write
° Total datapath delay for single-cycle:

Insn Type  Insn Fetch  Reg Read  ALU Oper  Data Access  Reg Write  Total Time
beq        2 ns        1 ns      2 ns                              5 ns
R-form     2 ns        1 ns      2 ns                   1 ns       6 ns
sw         2 ns        1 ns      2 ns      2 ns                    7 ns
lw         2 ns        1 ns      2 ns      2 ns         1 ns       8 ns

° In a pipelined machine, each stage = length of the longest delay = 2 ns; 5 stages = 10 ns
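The per-instruction totals in the table follow by summing only the units each instruction type uses; a quick check of the slide's numbers:

```python
# Per-unit delays from the slide: 2 ns for a memory access or ALU operation,
# 1 ns for a register file read or write.
DELAYS = {"fetch": 2, "reg_read": 1, "alu": 2, "data_access": 2, "reg_write": 1}

USES = {
    "beq":    ["fetch", "reg_read", "alu"],
    "R-form": ["fetch", "reg_read", "alu", "reg_write"],
    "sw":     ["fetch", "reg_read", "alu", "data_access"],
    "lw":     ["fetch", "reg_read", "alu", "data_access", "reg_write"],
}

for insn, units in USES.items():
    print(insn, sum(DELAYS[u] for u in units), "ns")
# The single-cycle clock must accommodate the slowest instruction (lw, 8 ns),
# while the pipelined clock is set by the slowest stage (2 ns).
```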

Pipelining Lessons
° Pipelining doesn't help latency (execution time) of a single task; it helps throughput of the entire workload
° Multiple tasks operate simultaneously using different resources
° Potential speedup = number of pipe stages
° Time to "fill" the pipeline and time to "drain" it reduce speedup
° Pipeline rate is limited by the slowest pipeline stage
° Unbalanced lengths of pipe stages also reduce speedup
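The fill/drain lesson above can be made concrete: measuring time in units of one stage delay, a non-pipelined machine takes n_stages per instruction, while the pipeline takes n_stages + (n_insns - 1) in total, so the speedup only approaches the stage count for long instruction streams. A sketch:

```python
def speedup(n_insns, n_stages):
    # Time in units of one (balanced) stage delay.
    non_pipelined_time = n_insns * n_stages
    pipelined_time = n_stages + (n_insns - 1)   # fill + one per instruction
    return non_pipelined_time / pipelined_time

print(speedup(4, 5))      # short workload: fill/drain overhead dominates
print(speedup(10000, 5))  # long workload: approaches the 5-stage ideal
```

Unbalanced stages would shrink this further, since the clock must fit the slowest stage.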

Single Cycle Datapath (From Ch 5)
[Figure: single-cycle datapath – PC and +4 adder feeding Imem (Read Addr, Instruction 31:0); fields 25:21 and 20:16 read the register file, with 15:11 selected by the RegDst mux as the write register; the ALU (ALUsrc mux, ALUcon/ALUOp, Zero output); the sign-extended 15:0 offset, shifted left 2, feeding the branch adder and PCSrc mux; Dmem (Address, Write Data, MemWrite, MemRead) with the MemToReg mux routing data back to the register file.]

Required Changes to Datapath
° Introduce registers to separate the 5 stages by putting IF/ID, ID/EX, EX/MEM, and MEM/WB registers in the datapath.
° The next PC value is computed in the 3rd stage, but we need to bring in the next instruction in the next cycle – move the PCSrc mux to the 1st stage. The PC is incremented unless there is a new branch address.
° The branch address is computed in the 3rd stage. With a pipeline, the PC value has changed by then! We must carry the PC value along with the instruction. Width of the IF/ID register = (IR) + (PC) = 64 bits.

Changes to Datapath Contd.
° For the lw instruction, we need the write-register address at stage 5. But the IR is now occupied by another instruction! So we must carry the IR destination field along as we move through the stages. See the connection in the figure.
° Length of the ID/EX register = (Reg 1: 32) + (Reg 2: 32) + (offset: 32) + (PC: 32) + (destination register: 5) = 133 bits
° Assignment: What are the lengths of the EX/MEM and MEM/WB registers?
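The register widths above follow from summing the fields each pipeline register must carry; a quick check of the two widths given on the slides (the EX/MEM and MEM/WB widths are left for the assignment):

```python
# Field widths in bits, as listed on the slides.
IF_ID = {"IR": 32, "PC": 32}
ID_EX = {"Reg1": 32, "Reg2": 32, "offset": 32, "PC": 32, "dest_reg": 5}

print(sum(IF_ID.values()))  # 64 bits
print(sum(ID_EX.values()))  # 133 bits
```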

Pipelined Datapath (with Pipeline Regs) (6.2)
[Figure: the datapath divided into Fetch, Decode, Execute, Memory, and Write Back stages by the IF/ID (64-bit), ID/EX (133-bit), EX/MEM (102-bit), and MEM/WB (69-bit) pipeline registers; PC, Imem, register file, sign-extend/shift-left-2 and branch adder, ALU, and Dmem as in the single-cycle datapath, with the 5-bit destination-register field carried through to the write-back mux.]