18 447 Computer Architecture Lecture 8 Microprogramming and

Reminder: Homeworks n Homework 2 q q n Due today ISA concepts, ISA vs.

Homework 1 Grades Number of Students 35 30 25 20 15 10 5 0

Reminder: Lab Assignments n Getting your Lab 1 fully correct q n We will

Reminder: Extra Credit for Lab Assignment 2 n Complete your normal (single-cycle) implementation first,

Readings for Today n Pipelining q P&H Chapter 4. 5 -4. 8 6

Readings for Next Lecture n Required q n Pipelined LC-3 b Microarchitecture Handout Optional

Announcement: Discussion Sessions n Lab sessions are really discussion sessions n TAs will lead

State Machine for LDW State 18 (010010) State 33 (100001) State 35 (100011) State

The Microsequencer: Some Questions n When is the IRD signal asserted? n What happens

The Control Store: Some Questions n What control signals can be stored in the

Variable-Latency Memory n The ready signal (R) enables memory read/write to execute correctly q

The Microsequencer: Advanced Questions n What happens if the machine is interrupted? n n

The Power of Abstraction n The concept of a control store of microinstructions enables

Let’s Do Some Microprogramming n Implement REP MOVS in the LC-3 b microarchitecture n

Aside: Alignment Correction in Memory n Remember unaligned accesses n LC-3 b has byte

Aside: Memory Mapped I/O n n n Address control logic determines whether the specified

Advantages of Microprogrammed Control n Allows a very simple datapath to do powerful computation

Update of Machine Behavior n The ability to update/patch microcode in the field (after

Microcoded Multi-Cycle MIPS Design n P&H, Appendix D n Any ISA can be implemented

Microcoded Multi-Cycle MIPS Design [Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier.

Control Logic for MIPS FSM [Based on original figure from P&H CO&D, COPYRIGHT 2004

Microprogrammed Control for MIPS FSM [Based on original figure from P&H CO&D, COPYRIGHT 2004

n-bit m. PC input [Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier.

Vertical Microcode 1 -bit signal means do this RT (or combination of RTs) n-bit

Nanocode and Millicode n Nanocode: a level below mcode q n Millicode: a level

Nanocode Concept Illustrated a “mcoded” processor implementation ROM m. PC processor datapath We refer

Multi-Cycle vs. Single-Cycle u. Arch n Advantages n Disadvantages n You should be very

Microprogrammed vs. Hardwired Control n Advantages n Disadvantages n You should be very familiar

Can We Do Better? n What limitations do you see with the multi-cycle design?

Can We Use the Idle Hardware to Improve Concurrency? n n Goal: Concurrency throughput

Pipelining: Basic Idea n More systematically: q q n Pipeline the execution of multiple

Example: Execution of Four Independent ADDs n Multi-cycle: 4 cycles per instruction F D

The Laundry Analogy n n “place one dirty load of clothes in the washer”

Pipelining Multiple Loads of Laundry - 4 loads of laundry in parallel - no

Pipelining Multiple Loads of Laundry: In Practice the slowest step decides throughput Based on

Pipelining Multiple Loads of Laundry: In Practice A B Throughput restored (2 loads per

We did not cover the following slides in lecture. These are for your preparation

An Ideal Pipeline n n Goal: Increase throughput with little increase in cost (hardware

Ideal Pipelining combinational logic (F, D, E, M, W) T psec T/2 ps (F,

More Realistic Pipeline: Throughput n Nonpipelined version with delay T BW = 1/(T+S) where

More Realistic Pipeline: Cost n Nonpipelined version with combinational cost G Cost = G+L

Slides: 52

Download presentation

18 -447 Computer Architecture Lecture 8: Microprogramming and Pipelined Microarchitectures Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 2/13/2012

Reminder: Homeworks n Homework 2 q q n Due today ISA concepts, ISA vs. microarchitecture, microcoded machines Homework 3 q Will be out tomorrow 2

Homework 1 Grades Number of Students 35 30 25 20 15 10 5 0 50 60 70 80 Grade 90 100 Average Median Max Min 100 103 110 55 Max Possible Points 110 Total number of students 110 56 3

Reminder: Lab Assignments n Getting your Lab 1 fully correct q n We will allow resubmission once, just for the purposes of testing the correctness of your revised code (no regrading) Lab Assignment 2 q q Due Friday, Feb 17, at the end of the lab Individual assignment n n No collaboration; please respect the honor code Lab Assignment 3 q Will be out Wednesday 4

Reminder: Extra Credit for Lab Assignment 2 n Complete your normal (single-cycle) implementation first, n n n and get it checked off in lab. Then, implement the MIPS core using a microcoded approach similar to what we are discussing in class. We are not specifying any particular details of the microcode format or the microarchitecture; you should be creative. For the extra credit, the microcoded implementation should execute the same programs that your ordinary implementation does, and you should demo it by the normal lab deadline. 5

Readings for Today n Pipelining q P&H Chapter 4. 5 -4. 8 6

Readings for Next Lecture n Required q n Pipelined LC-3 b Microarchitecture Handout Optional q Hamacher et al. book, Chapter 6, “Pipelining” 7

Announcement: Discussion Sessions n Lab sessions are really discussion sessions n TAs will lead recitations q q q n Go over past lectures Answer and ask questions Solve problems and homeworks Please attend any session you wish q q q Tue 10: 30 am-1: 20 pm (Chris) Thu 1: 30 -4: 20 pm (Lavanya) Fri 6: 30 -9: 20 pm (Abeer) 8

An Exercise in Microcoding 9

A Simple LC-3 b Control and Datapath 10

State Machine for LDW State 18 (010010) State 33 (100001) State 35 (100011) State 32 (100000) State 6 (000110) State 25 (011001) State 27 (011011) Microsequencer

The Microsequencer: Some Questions n When is the IRD signal asserted? n What happens if an illegal instruction is decoded? n What are condition (COND) bits for? n How is variable latency memory handled? n How do you do the state encoding? q q q Minimize number of state variables Start with the 16 -way branch Then determine constraint tables and states dependent on COND 20

The Control Store: Some Questions n What control signals can be stored in the control store? vs. n What control signals have to be generated in hardwired logic? q i. e. , what signal cannot be available without processing in the datapath? 21

Variable-Latency Memory n The ready signal (R) enables memory read/write to execute correctly q n Example: transition from state 33 to state 35 is controlled by the R bit asserted by memory when memory data is available Could we have done this in a single-cycle microarchitecture? 22

The Microsequencer: Advanced Questions n What happens if the machine is interrupted? n n What if an instruction generates an exception? How can you implement a complex instruction using this control structure? q Think REP MOVS 23

The Power of Abstraction n The concept of a control store of microinstructions enables the hardware designer with a new abstraction: microprogramming The designer can translate any desired operation to a sequence microinstructions All the designer needs to provide is q q q The sequence of microinstructions needed to implement the desired operation The ability for the control logic to correctly sequence through the microinstructions Any additional datapath control signals needed (no need if the operation can be “translated” into existing control signals) 24

Let’s Do Some Microprogramming n Implement REP MOVS in the LC-3 b microarchitecture n What changes, if any, do you make to the q q n n state machine? datapath? control store? microsequencer? Show all changes and microinstructions Coming up in Homework 3 25

Aside: Alignment Correction in Memory n Remember unaligned accesses n LC-3 b has byte load and byte store instructions that move data not aligned at the word-address boundary q n Convenience to the programmer/compiler How does the hardware ensure this works correctly? q q q Take a look at state 29 for LDB State 17 for STB Additional logic to handle unaligned accesses 26

Aside: Memory Mapped I/O n n n Address control logic determines whether the specified address of LDx and STx are to memory or I/O devices Correspondingly enables memory or I/O devices and sets up muxes Another instance where the final control signals (e. g. , MEM. EN or INMUX/2) cannot be stored in the control store q Dependent on address 27

Advantages of Microprogrammed Control n Allows a very simple datapath to do powerful computation by controlling the datapath (using a sequencer) q q q n Enables easy extensibility of the ISA q q n High-level ISA translated into microcode (sequence of microinstructions) Microcode enables a minimal datapath to emulate an ISA Microinstructions can be thought of a user-invisible ISA Can support a new instruction by changing the ucode Can support complex instructions as a sequence of simple microinstructions If I can sequence an arbitrary instruction then I can sequence an arbitrary “program” as a microprogram sequence q will need some new state (e. g. loop counters) in the microcode for sequencing more elaborate programs 28

Update of Machine Behavior n The ability to update/patch microcode in the field (after a processor is shipped) enables q q n Ability to add new instructions without changing the processor! Ability to “fix” buggy hardware implementations Examples q q IBM 370 Model 145: microcode stored in main memory, can be updated after a reboot B 1700 microcode can be updated while the processor is running n User-microprogrammable machine! 29

Microcoded Multi-Cycle MIPS Design n P&H, Appendix D n Any ISA can be implemented this way n n We will not cover this in class However, you can do an extra credit assignment for Lab 2 30

n-bit m. PC input [Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED. ] ALUSrc. A Ior. D IRWrite PCWrite. Cond …. Control Store: 2 n k bit (not including sequencing) k-bit “control” output Horizontal Microcode 34

Vertical Microcode 1 -bit signal means do this RT (or combination of RTs) n-bit m. PC input “PC PC+4” “PC ALUOut” “PC PC[ 31: 28 ], IR[ 25: 0 ], 2’b 00” “IR MEM[ PC ]” “A RF[ IR[ 25: 21 ] ]” “B RF[ IR[ 20: 16 ] ]” ……. …………. m-bit input ROM k-bit output ALUSrc. A Ior. D IRWrite PCWrite. Cond …. [Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED. ] If done right (i. e. , m<<n, and m<<k), two ROMs together (2 n m+2 m k bit) should be smaller than horizontal microcode ROM (2 n k bit) 35

Nanocode and Millicode n Nanocode: a level below mcode q n Millicode: a level above mcode q n mprogrammed control for sub-systems (e. g. , a complicated floatingpoint module) that acts as a slave in a mcontrolled datapath ISA-level subroutines hardcoded into a ROM that can be called by the mcontroller to handle complicated operations In both cases, we avoid complicating the main mcontroller 36

Nanocode Concept Illustrated a “mcoded” processor implementation ROM m. PC processor datapath We refer to this as “nanocode” when a mcoded subsystem is embedded in a mcoded system a “mcoded” FPU implementation ROM m. PC arithmetic datapath 37

Multi-Cycle vs. Single-Cycle u. Arch n Advantages n Disadvantages n You should be very familiar with this right now 38

Microprogrammed vs. Hardwired Control n Advantages n Disadvantages n You should be very familiar with this right now 39

Can We Do Better? n What limitations do you see with the multi-cycle design? n Limited concurrency q q q Some hardware resources are idle during different phases of instruction processing cycle “Fetch” logic is idle when an instruction is being “decoded” or “executed” Most of the datapath is idle when a memory access is happening 40

Can We Use the Idle Hardware to Improve Concurrency? n n Goal: Concurrency throughput (more “work” completed in one cycle) Idea: When an instruction is using some resources in its processing phase, process other instructions on idle resources not needed by that instruction q q E. g. , when an instruction is being decoded, fetch the next instruction E. g. , when an instruction is being executed, decode another instruction E. g. , when an instruction is accessing data memory (ld/st), execute the next instruction E. g. , when an instruction is writing its result into the register file, access data memory for the next instruction 41

Pipelining: Basic Idea n More systematically: q q n Pipeline the execution of multiple instructions Analogy: “Assembly line processing” of instructions Idea: q q q Divide the instruction processing cycle into distinct “stages” of processing Ensure there are enough hardware resources to process one instruction in each stage Process a different instruction in each stage n n n Instructions consecutive in program order are processed in consecutive stages Benefit: Increases instruction processing throughput (1/CPI) Downside: Start thinking about this… 42

Example: Execution of Four Independent ADDs n Multi-cycle: 4 cycles per instruction F D E W Time n Pipelined: 4 cycles per 4 instructions (steady state) F D E W Time 43

The Laundry Analogy n n “place one dirty load of clothes in the washer” “when the washer is finished, place the wet load in the dryer” “when the dryer is finished, take out the dry load and fold” “when folding is finished, ask your roommate (? ? ) to put the clothes away” - steps to do a load are sequentially dependent - no dependence between different loads - different steps do not share resources Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED. ] 44

Pipelining Multiple Loads of Laundry - 4 loads of laundry in parallel - no additional resources - throughput increased by 4 - latency per load is the same Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED. ] 45

We did not cover the following slides in lecture. These are for your preparation for the next lecture.

An Ideal Pipeline n n Goal: Increase throughput with little increase in cost (hardware cost, in case of instruction processing) Repetition of identical operations q n Repetition of independent operations q n No dependencies between repeated operations Uniformly partitionable suboperations q n The same operation is repeated on a large number of different inputs Processing an be evenly divided into uniform-latency suboperations (that do not share resources) Good examples: automobile assembly line, doing laundry q What about instruction processing pipeline? 49

Ideal Pipelining combinational logic (F, D, E, M, W) T psec T/2 ps (F, D, E) T/3 ps (F, D) BW=~(1/T) BW=~(2/T) T/2 ps (M, W) T/3 ps (E, M) T/3 ps (M, W) BW=~(3/T) 50

More Realistic Pipeline: Throughput n Nonpipelined version with delay T BW = 1/(T+S) where S = latch delay T ps n k-stage pipelined version BWk-stage = 1 / (T/k +S ) BWmax = 1 / (1 gate delay + S ) T/k ps 51

More Realistic Pipeline: Cost n Nonpipelined version with combinational cost G Cost = G+L where L = latch cost G gates n k-stage pipelined version Costk-stage = G + Lk G/K 52