Asanovic/Devadas Spring 2002 6.823 Microprogramming Krste Asanovic

6.823 Microprogramming
Krste Asanovic
Laboratory for Computer Science
Massachusetts Institute of Technology
Asanovic/Devadas, Spring 2002

Instruction Set Architecture (ISA) versus Implementation

• ISA is the hardware/software interface
    - Defines set of programmer-visible state
    - Defines instruction format (bit encoding) and instruction semantics
    - Examples: DLX, x86, IBM 360, JVM
• Many possible implementations of one ISA
    - 360 implementations: model 30 (c. 1964), z900 (c. 2001)
    - x86 implementations: 8086 (c. 1978), 80186, 286, 386, 486, Pentium Pro, Pentium-4 (c. 2000), AMD Athlon, Transmeta Crusoe, SoftPC
    - DLX implementations: microcoded, pipelined, superscalar
    - JVM: HotSpot, PicoJava, ARM Jazelle, ...

ISA to Microarchitecture Mapping

• ISA often designed for particular microarchitectural style, e.g.,
    - CISC ISAs designed for microcoded implementation
    - RISC ISAs designed for hardwired pipelined implementation
    - VLIW ISAs designed for fixed latency in-order pipelines
    - JVM ISA designed for software interpreter
• But ISA can be implemented in any microarchitectural style
    - Pentium-4: hardwired pipelined CISC (x86) machine (with some microcode support)
    - This lecture: a microcoded RISC (DLX) machine
    - Intel will probably eventually have a dynamically scheduled out-of-order VLIW (IA-64) processor
    - PicoJava: a hardware JVM processor

Microcoded Microarchitecture

[Figure: a μcontroller (ROM) controls a datapath that is connected to memory (RAM). Signals labelled in the figure: busy?, zero?, opcode, Data, Addr, enMem, MemWrt.]

• μcontroller: microcode instructions fixed in ROM inside the microcontroller
• Memory (RAM): holds user program written using macrocode instructions (e.g., DLX, x86, etc.)

A Bus-based Datapath for DLX

[Figure: bus-based datapath. Blocks on the bus: IR with immediate extender (Imm Ext), registers A and B feeding the ALU, a register file of 32 GPRs + PC (32-bit registers), and memory addressed through MA. Control signals shown: ldIR, ExtSel, enImm, ldA, ldB, OpSel, enALU, RegSel, RegWrt, enReg, ldMA, MemWrt, enMem. Signals to the controller: Opcode, busy, zero.]

Microinstruction: register to register transfer (17 control signals)

    MA ← PC        means  RegSel = PC;  enReg = yes;  ldMA = yes
    B ← Reg[rf2]   means  RegSel = rf2; enReg = yes;  ldB = yes
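
The two register transfers above can be written down as settings of the named control signals. The following is a minimal sketch, not the course's encoding: the dictionary layout and the helper names `transfer_from_register` and `bus_drivers` are assumptions made for illustration, and only the signal names come from the slide.

```python
# Hypothetical encoding of register-transfer microinstructions as control
# signal settings. Signal names follow the slide; everything else is assumed.

def transfer_from_register(reg_sel: str, load_signal: str) -> dict:
    """Drive Reg[reg_sel] onto the bus and latch the bus into one destination."""
    return {"RegSel": reg_sel, "enReg": True, load_signal: True}

MA_from_PC = transfer_from_register("PC", "ldMA")    # MA <- PC
B_from_rf2 = transfer_from_register("rf2", "ldB")    # B <- Reg[rf2]

def bus_drivers(signals: dict) -> list:
    """List the enabled bus drivers; a shared bus allows at most one per cycle."""
    return [name for name in ("enReg", "enImm", "enALU", "enMem")
            if signals.get(name)]
```

Signals that are absent from a dictionary are taken to be deasserted, which is why each transfer names only three of the 17 signals.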

Instruction Execution

Execution of a DLX instruction involves

1. instruction fetch
2. decode and register fetch
3. ALU operation
4. memory operation (optional)
5. write back to register file (optional)

and the computation of the address of the next instruction

Microprogram Fragments

instr fetch:  MA ← PC
              IR ← Memory
              A ← PC
              PC ← A + 4
              dispatch on OPcode
              (can be treated as a macro)

ALU:          A ← Reg[rf1]
              B ← Reg[rf2]
              Reg[rf3] ← func(A, B)
              do instruction fetch

ALUi:         A ← Reg[rf1]
              B ← Imm                    (sign extension ...)
              Reg[rf2] ← Opcode(A, B)
              do instruction fetch

Microprogram Fragments (cont.)

LW:           A ← Reg[rf1]
              B ← Imm
              MA ← A + B
              Reg[rf2] ← Memory
              do instruction fetch

J:            A ← PC
              B ← Imm
              PC ← A + B
              do instruction fetch

beqz:         A ← Reg[rf1]
              If zero?(A) then go to bz-taken
              do instruction fetch

bz-taken:     A ← PC
              B ← Imm
              PC ← A + B
              do instruction fetch

DLX Microcontroller: first attempt

[Figure: the μPC (state) together with the inputs Opcode, zero? and Busy (memory) forms the address of the μProgram ROM; the ROM data supplies the next state and the 17 control signals. Latching the inputs may cause a one-cycle delay.]

ROM size?
    μProgram ROM: 2^(opcode+status+s) words
    word = (control+s) bits
How big is "s"?

Microprogram in the ROM (worksheet, continued over three slides)

[Table columns: State | Op | zero? | busy | Control points | next-state]

Size of Control Store

[Figure: Control ROM addressed by status & opcode (w bits) and the μPC (s bits); its outputs are the control signals (c bits) and the next μPC (s bits).]

size = 2^(w+s) x (c + s)

DLX:  w = 6+2,  c = 17,  s = ?

no. of steps per opcode = 4 to 6 + fetch-sequence
no. of states ≈ (4 steps per op-group) x op-groups + common sequences
              = 4 x 8 + 10 states = 42 states  ⇒  s = 6

Control ROM = 2^(8+6) x 23 bits ≈ 48 Kbytes
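
The arithmetic on this slide can be checked with a few lines of Python. This is a sketch under the slide's own numbers; the function name `control_store_bits` is illustrative, and the 3-bytes-per-word padding is an assumption offered as one way to arrive at the slide's rounded figure.

```python
def control_store_bits(w: int, s: int, c: int) -> int:
    """Bits in a ROM with 2^(w+s) words of (c+s) bits each."""
    return 2 ** (w + s) * (c + s)

w, c = 6 + 2, 17                  # opcode + status input bits, control signals
states = 4 * 8 + 10               # 42 states
s = (states - 1).bit_length()     # bits needed to number 42 states

bits = control_store_bits(w, s, c)
bytes_exact = bits // 8                  # 23-bit words packed tightly
bytes_padded = 2 ** (w + s) * 3          # assumption: each word padded to 3 bytes

print(s, bits, bytes_exact, bytes_padded)
```

Packed tightly the ROM is 47,104 bytes (46 KiB); padding each 23-bit word to 3 bytes gives exactly 48 KiB, which is consistent with the "≈ 48 Kbytes" on the slide.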

Reducing Size of Control Store

Control store has to be fast ⇒ expensive

• Reduce the ROM height (= address bits)
    ⇒ reduce inputs by extra external logic
       each input bit doubles the size of the control store
    ⇒ reduce states by grouping opcodes
       find common sequences of actions
    ⇒ condense input status bits
       combine all exceptions into one, i.e., exception/no-exception
• Reduce the ROM width
    ⇒ restrict the next-state encoding
       Next, Dispatch on opcode, Wait for memory, ...

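DLX Controller V2

[Figure: the μPC (state) addresses the Control ROM, which outputs the 17 control signals and a μJumpType. Jump logic combines μJumpType with zero and busy to produce μPCSrc, which selects the next μPC. Candidate sources labelled in the figure: μPC+1, absolute, and the opcode (with ext).]

• Reduced ROM height by encoding inputs
• Reduced ROM width by encoding next-state
• μJumpType = next, spin, fetch, dispatch, feqz, fnez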

Jump Logic

μPCSrc = Case μJumpTypes

    next      ⇒  μPC+1
    spin      ⇒  if (busy) then μPC else μPC+1
    fetch     ⇒  absolute
    dispatch  ⇒  op-group
    feqz      ⇒  if (zero) then absolute else μPC+1
    fnez      ⇒  if (zero) then μPC+1 else absolute
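
The case statement translates directly into a small function. The sketch below assumes that `absolute` and `op_group` are already-resolved μPC values supplied by the caller; the function name `next_upc` and its parameter names are illustrative, not taken from the lecture.

```python
def next_upc(jump_type: str, upc: int, busy: bool, zero: bool,
             absolute: int, op_group: int) -> int:
    """Select the next micro-PC from the jump type and the status bits."""
    if jump_type == "next":
        return upc + 1
    if jump_type == "spin":
        return upc if busy else upc + 1       # wait while memory is busy
    if jump_type == "fetch":
        return absolute                       # jump to the fetch sequence
    if jump_type == "dispatch":
        return op_group                       # entry point for this opcode group
    if jump_type == "feqz":
        return absolute if zero else upc + 1
    if jump_type == "fnez":
        return upc + 1 if zero else absolute
    raise ValueError(f"unknown jump type: {jump_type}")
```

For example, `next_upc("spin", 5, busy=True, zero=False, absolute=0, op_group=12)` stays at 5, and the same call with `busy=False` advances to 6.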

Instruction Fetch & ALU: DLX-Controller-2

State     Control points              next-state
fetch0    MA ← PC                     next
fetch1    IR ← Memory                 spin
fetch2    A ← PC                      next
fetch3    PC ← A + 4                  dispatch
...
ALU0      A ← Reg[rf1]                next
ALU1      B ← Reg[rf2]                next
ALU2      Reg[rf3] ← func(A, B)       fetch

ALUi0     A ← Reg[rf1]                next
ALUi1     B ← sExt16(Imm)             next
ALUi2     Reg[rf2] ← Op(A, B)         fetch
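
To see how the controller walks through these states, here is a minimal sketch of a sequencer stepping through the fetch and ALU rows. The ROM addresses, the `DISPATCH` table and the `busy_cycles` memory model are assumptions for illustration; the datapath itself is not simulated, only the order in which states are visited.

```python
# (state name, control points, jump type); addresses are list indices (assumed).
ROM = [
    ("fetch0", "MA <- PC",              "next"),
    ("fetch1", "IR <- Memory",          "spin"),
    ("fetch2", "A <- PC",               "next"),
    ("fetch3", "PC <- A + 4",           "dispatch"),
    ("ALU0",   "A <- Reg[rf1]",         "next"),
    ("ALU1",   "B <- Reg[rf2]",         "next"),
    ("ALU2",   "Reg[rf3] <- func(A,B)", "fetch"),
]
FETCH = 0                  # address of fetch0
DISPATCH = {"ALU": 4}      # op-group -> first state of its microroutine

def run_one_instruction(op_group: str, busy_cycles: int) -> list:
    """Return the states visited while executing one macroinstruction."""
    upc, trace = FETCH, []
    while True:
        state, _control_points, jump = ROM[upc]
        trace.append(state)
        if jump == "next":
            upc += 1
        elif jump == "spin":
            if busy_cycles > 0:
                busy_cycles -= 1      # memory still busy: repeat this state
            else:
                upc += 1
        elif jump == "dispatch":
            upc = DISPATCH[op_group]
        elif jump == "fetch":
            return trace              # the next instruction would restart at FETCH

trace = run_one_instruction("ALU", busy_cycles=2)
print(trace)
```

With memory busy for two cycles, `fetch1` appears three times in the trace, so this register-register ALU instruction takes nine microcycles in total.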

Load & Store: DLX-Controller-2

State     Control points              next-state
LW0       A ← Reg[rf1]                next
LW1       B ← sExt16(Imm)             next
LW2       MA ← A+B                    next
LW3       Reg[rf2] ← Memory           spin
LW4                                   fetch

SW0       A ← Reg[rf1]                next
SW1       B ← sExt16(Imm)             next
SW2       MA ← A+B                    next
SW3       Memory ← Reg[rf2]           spin
SW4                                   fetch

Branches: DLX-Controller-2

State     Control points              next-state
BEQZ0     A ← Reg[rf1]                next
BEQZ1                                 fnez
BEQZ2     A ← PC                      next
BEQZ3     B ← sExt16(Imm)             next
BEQZ4     PC ← A+B                    fetch

BNEZ0     A ← Reg[rf1]                next
BNEZ1                                 feqz
BNEZ2     A ← PC                      next
BNEZ3     B ← sExt16(Imm)             next
BNEZ4     PC ← A+B                    fetch

Jumps: DLX-Controller-2

State     Control points              next-state
J0        A ← PC                      next
J1        B ← sExt26(Imm)             next
J2        PC ← A+B                    fetch

JR0       PC ← Reg[rf1]               fetch

JAL0      A ← PC                      next
JAL1      Reg[31] ← A                 next
JAL2      B ← sExt26(Imm)             next
JAL3      PC ← A+B                    fetch

JALR0     A ← PC                      next
JALR1     Reg[31] ← A                 next
JALR2     PC ← Reg[rf1]               fetch

Microprogramming in IBM 360

                          M30     M40     M50     M65
Datapath width (bits)     8       16      32      64
μinst width (bits)        50      52      85      87
μstore cycle (ns)         750     650     500     200
Rental fee ($K/month)     4       7       15      35

Rows for which only three values survive (column alignment unclear):
μcode size (K μinsts)     4, 4, 2.75
μstore technology         CCROS, TCROS, BCROS
memory cycle (ns)         1500, 2000, 750

• Only fastest models (75 and 95) were hardwired

Horizontal vs Vertical μCode

[Figure: axis label "Bits per μInstruction"]

• Horizontal μcode has longer μinstructions
    - Can specify multiple parallel operations per μinstruction
    - Needs fewer steps to complete each macroinstruction
    - Sparser encoding ⇒ more bits
• Vertical μcode has more, narrower μinstructions
    - In limit, only single datapath operation per μinstruction
    - μcode branches require separate μinstruction
    - More steps to complete each macroinstruction
    - More compact ⇒ fewer bits
• Nanocoding
    - Tries to combine best of horizontal and vertical μcode

Nanocoding

Exploits recurring control signal patterns in μcode, e.g.,

    ALU0     A ← Reg[rf1]
    ...
    ALUi0    A ← Reg[rf1]
    ...

[Figure: the μPC (state) supplies the μaddress to the μcode ROM, which outputs the next state and a nanoaddress; the nanoaddress indexes the nanoinstruction ROM, whose data are the 17 control signals.]

• MC68000 had 17-bit μcode containing either 10-bit μjump or 9-bit nanoinstruction pointer
    - Nanoinstructions were 68 bits wide, decoded to give 196 control signals
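
The idea of factoring out recurring control words can be sketched as a simple deduplication pass. This is an illustration only: the function `nanocode` and the use of strings in place of real control words are assumptions, and it does not model the MC68000's actual encoding.

```python
def nanocode(control_words: list) -> tuple:
    """Split wide control words into a nano ROM of unique words plus
    one pointer into that ROM per microinstruction."""
    nano_rom, pointers, index = [], [], {}
    for word in control_words:
        if word not in index:             # first time this pattern is seen
            index[word] = len(nano_rom)
            nano_rom.append(word)
        pointers.append(index[word])
    return nano_rom, pointers

# Toy microprogram: ALU0 and ALUi0 share the control word for A <- Reg[rf1].
words = ["A<-Reg[rf1]", "B<-Reg[rf2]", "A<-Reg[rf1]", "B<-sExt16(Imm)"]
nano_rom, pointers = nanocode(words)
print(nano_rom, pointers)
```

The μcode store then only needs narrow pointers, while each distinct wide control word is stored once in the nano ROM; the saving grows with the amount of repetition in the microprogram.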

Implementing Complex Instructions

[Figure: the same bus-based datapath as before, with IR and immediate extender, A and B registers, ALU, register file of 32 GPRs + PC, MA and memory, and control signals ldIR, ExtSel, enImm, ldA, ldB, OpSel, enALU, RegSel, RegWrt, enReg, ldMA, MemWrt, enMem.]

    rf3 ← M[(rf1)] op (rf2)               Memory-src ALU op
    M[(rf3)] ← M[(rf1)] op M[(rf2)]       Reg-

Mem-Mem ALU Instructions: DLX-Controller-2

Mem-Mem ALU op:   M[(rf3)] ← M[(rf1)] op M[(rf2)]

State     Control points              next-state
ALUMM0    MA ← Reg[rf1]               next
ALUMM1    A ← Memory                  spin
ALUMM2    MA ← Reg[rf2]               next
ALUMM3    B ← Memory                  spin
ALUMM4    MA ← Reg[rf3]               next
ALUMM5    Memory ← func(A, B)         spin
ALUMM6                                fetch

Complex instructions usually do not require datapath modifications in a microprogrammed implementation, only extra space for the control program.

Implementing these instructions using a hardwired controller is difficult without datapath modifications.

Microcode Emulation

• IBM initially miscalculated the importance of software compatibility when introducing the 360 series
• Honeywell started an effort to steal IBM 1401 customers by offering translation software ("Liberator") for the Honeywell H200 series machine
• IBM retaliated with optional additional microcode for the 360 series that could emulate the IBM 1401 ISA, later extended for the IBM 7000 series
    - One popular program on the 1401 was a 650 simulator, so some customers ran many 650 programs on emulated 1401s (650 -> 1401 -> 360)

Microprogramming in the Seventies

Thrived because:

• Significantly faster ROMs than DRAMs were available
• For complex instruction sets, datapath and controller were cheaper and simpler
• New instructions, e.g., floating point, could be supported without datapath modifications
• Fixing bugs in the controller was easier
• ISA compatibility across various models could be achieved easily and cheaply

Except for cheapest and fastest machines, all computers were microprogrammed

Writable Control Store (WCS)

Implement control store with SRAM not ROM
• MOS SRAM memories now almost as fast as control store (core memories/DRAMs were 10x slower)
• Bug-free microprograms difficult to write

User-WCS provided as option on several minicomputers
• Allowed users to change microcode for each process

User-WCS failed
• Little or no programming tools support
• Hard to fit software into small space
• Microcode control tailored to original ISA, less useful for others
• Large WCS part of processor state: expensive context switches
• Protection difficult if user can change microcode
• Virtual memory required restartable microcode

Performance Issues

Microprogrammed control ⇒ multiple cycles per instruction

Cycle time?

    t_C > max(t_reg-reg, t_ALU, t_μROM, t_RAM)

Given complex control, t_ALU & t_RAM can be broken into multiple cycles. However, t_μROM cannot be broken down. Hence

    t_C > max(t_reg-reg, t_μROM)

Suppose 10 * t_μROM < t_RAM

Good performance, relative to the single-cycle hardwired implementation, can be achieved even with a CPI of 10
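
A quick numeric check makes the last claim concrete. The timing values below are invented for illustration and chosen only so that 10 * t_μROM < t_RAM holds; the sketch also assumes the microcoded cycle time is set by t_μROM and that a single-cycle hardwired machine needs a cycle at least as long as one memory access.

```python
t_urom = 50      # ns, assumed control store access time
t_ram = 600      # ns, assumed main memory access time
cpi = 10         # microcycles per macroinstruction

assert 10 * t_urom < t_ram            # the slide's supposition

microcoded_time = cpi * t_urom        # time per instruction, microcoded
single_cycle_time = t_ram             # lower bound for a single-cycle design

print(microcoded_time, single_cycle_time)
```

Under these assumed numbers the microcoded machine spends 500 ns per instruction against at least 600 ns for the single-cycle design, so a CPI of 10 is not a handicap while the control store is that much faster than main memory.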

VLSI & Microprogramming

By late seventies
• technology assumption about ROM & RAM speed became invalid
• micromachines became more complicated
    - to overcome slower ROM, micromachines were pipelined
    - complex instruction sets led to the need for subroutine and call stacks in μcode
    - need for fixing bugs in control programs was in conflict with read-only nature of μROM ⇒ WCS (B1700, QMachine, Intel 432, …)
• introduction of caches and buffers, especially for instructions, made multiple-cycle execution of reg-reg instructions unattractive

Modern Usage

Microprogramming is far from extinct

Played a crucial role in micros of the Eighties:
    Motorola 68K series
    Intel 386 and 486

Microcode is present in most modern CISC micros in an assisting role (e.g., AMD Athlon, Intel Pentium-4)
• Most instructions are executed directly, i.e., with hard-wired control
• Infrequently-used and/or complicated instructions invoke the microcode engine

Patchable microcode common for post-fabrication bug fixes, e.g., Intel Pentiums load μcode patches at bootup