COSC 3P92: Week 5 Lecture slides
"Voters quickly forget what a man says." Richard M. Nixon (1913-1994), former U.S. President

Hardware components MIC (overview)
• MAR and MDR are registers which latch the addresses and data prior to processing

Hardware components MIC (overview)
• Translate word addresses 0, 1, 2, 3, ... into the byte addresses of 4-byte words.
– Shift the address 2 bits left.
– Causes words 0, 1, 2, 3, ... to be addressed (byte addresses 0, 4, 8, 12, ...).
– Alignment of words.
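The 2-bit left shift above can be sketched in a few lines. This is only an illustration: the function name is hypothetical and the 32-bit address width is an assumption.

```python
ADDRESS_MASK = 0xFFFFFFFF  # assumed 32-bit address bus

def word_to_byte_address(word_address: int) -> int:
    """Shift a word address left by 2 bits to get the byte address of that word."""
    return (word_address << 2) & ADDRESS_MASK

if __name__ == "__main__":
    for word in range(4):
        print(word, "->", word_to_byte_address(word))
```

Because the two low-order bits of the result are always zero, every access lands on a 4-byte boundary, which is the alignment property the slide mentions.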

Hardware components MIC (overview)
• Each microinstruction controls:
– register enables
– bus enables
– ALU
– memory
– next microinstruction address

[Figure: Hardware components MIC (overview)]

Memory control
• MAR - memory address register
– CPU writes addresses of memory to read, write
• MBR - memory buffer register
– contains data for write or read
• both act as 'latches' to hold addr, data until memory has finished using them.

Control unit
• main functions of a control unit:
– instruction interpretation
– instruction sequencing
[Figure labels: external command signals, master clock, control unit, control signals, status signals, execution unit, CPU]
• the control unit is a finite-state machine.

Execution unit
• An execution unit consists of:
– a register section
– an ALU
– some dedicated hardware or firmware
[Figure: typical CPU model. General purpose registers R0, R1, ..., Rn-1; dedicated registers SR (status reg), IR (instn reg), PC (prog cntr), SP (stack ptr), MAR (mem addr reg), MBR (mem buffer reg), etc.; ALU (arithmetic logic unit); control unit; dedicated multiply, division firmware (FP)]

Data transfer within a CPU
• A single-bus architecture:
[Figure labels: buffer reg. A, buffer reg. B, general purpose regs (R0, R1, etc.), special purpose regs (PC, etc.), ALU]
• To compute R2 <– R0 + R1:
– 1. A <– R0
– 2. B <– R1
– 3. R2 <– A+B

Data transfer within a CPU
• A two-bus architecture:
[Figure labels: BUS A, BUS B, general regs (R0, R1, etc.), Special I, Special II, PC, MBR, ALU, buffer reg.]
• To compute R2 <– R0 + R1:
– 1. Buffer <– R0 + R1 (via Bus A and Bus B)
– 2. R2 <– Buffer (via either Bus A or Bus B)

Data transfer within a CPU
• A three-bus architecture:
[Figure labels: BUS A, BUS B, BUS C, R0, R1, etc., Special I, Special II, PC, MBR, ALU]
• To compute R2 <– R0 + R1:
– 1. R2 <– R0 + R1 (via Bus A, Bus B and Bus C)

Design of control units
• Hardwired approach
• The control unit is treated as a synchronous (i.e., clocked) sequential circuit and is implemented as a hardwired state machine.
[Figure: Register Transfer Model of Finite State Machine. Labels: inputs, register, combinational logic, next state, feedback paths, outputs]
[Figure: PLA Implementation of a Finite State Machine. Labels: inputs, register, AND plane, OR plane, outputs]

Microprogramming
• Use of memory to implement the control unit
• Instructions are implemented as sequences of microinstructions stored in control memory
• Each machine language instruction is interpreted by circuitry, and executed using sequences of microprogram instructions
• Micro-programs are much like assembled code, except:
– direct mapping between instruction fields and hardware components of the CPU.
– control fields are specified.
– timing is critical; parallelism can be exploited.

Microprogramming
• What is being controlled?
[Figure labels: register, control values, combinational logic, register]
– data paths: inter-register connections
– control points: hardware enabling lines which govern register-to-register communications
• idea is that we can control the operation of the ALU and micro-control unit using combinations of control fields encoded in micro-instructions

Microprogramming
• Each control point specifies a micro-operation
– All micro-operations which may be executed in parallel can be specified in a single microinstruction.
• Factors which determine parallel operations:
– Buses must only have 1 input active at a time.
– Registers can be either read or written
» Not both at the same time.

Microprogramming
• Basic microinstruction formats: {Overheads}

Data path
• 32-bit registers (none are user-accessible)
• B bus: main one to ALU
• C bus: from ALU back to registers
• H reg: contains other operand for ALU
– loaded by performing null op on data, and sending it to H

Data path
• ALU control: 6 control lines
• shifter: 2 control lines
– 1. logical shift left 8 bits (SLL8)
– 2. arithmetic shift right 1 bit (SRA1)
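A minimal sketch of the two shifter operations on 32-bit values, assuming the Mic-1 shifter as described in Tanenbaum's text, where the operations are a logical shift left by 8 bits (SLL8) and an arithmetic shift right by 1 bit (SRA1). The function names are illustrative.

```python
MASK32 = 0xFFFFFFFF
SIGN_BIT = 0x80000000

def sll8(value: int) -> int:
    """Logical shift left by 8 bits; bits shifted past bit 31 are discarded."""
    return (value << 8) & MASK32

def sra1(value: int) -> int:
    """Arithmetic shift right by 1 bit; the sign bit (bit 31) is preserved."""
    value &= MASK32
    return (value >> 1) | (value & SIGN_BIT)

if __name__ == "__main__":
    print(hex(sll8(0x00ABCDEF)))
    print(hex(sra1(0x80000000)))
```

SLL8 is what lets the microcode assemble a 16-bit offset from two separately fetched bytes, as the goto example later in the slides does.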

Data path timing
• Four sub-cycles:
– 1. control signals set up (w)
– 2. registers loaded onto B bus (x)
– 3. ALU and shifter operate (y)
– 4. results available to registers on C bus (z)

Data path timing
• These are implicit sub-cycles: they rely on timing of previous steps
• Only real clock signals used:
– falling edge of clock (starts the cycle)
– rising edge (loading from C in step 4)
• ALU is continually processing all intermediate values it sees. Its output only makes sense at the appropriate time above (after 3)
• Can operate on and save a register in 1 clock cycle:
– load PC to B
– inc
– save to PC

Memory again
• 2 memory buffers:
– 32-bit port: MAR, MDR (read, write)
» word addresses
– 8-bit port: MBR
» low byte from PC (read only)
» byte addresses
» can be loaded signed or unsigned onto B bus
» call reads into MBR "fetches"
• control:
– black arrow: enable from C bus
– white arrow: enable onto B bus
• 2 bus control:
– out B
– in C
– out B / in C
– none
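The signed and unsigned ways of placing the 8-bit MBR on the 32-bit B bus differ only in how the upper 24 bits are filled. A small sketch, with hypothetical function names:

```python
def mbr_unsigned(mbr: int) -> int:
    """Zero-extend the 8-bit MBR value to 32 bits."""
    return mbr & 0xFF

def mbr_signed(mbr: int) -> int:
    """Sign-extend the 8-bit MBR value to 32 bits (bit 7 is copied upward)."""
    byte = mbr & 0xFF
    if byte & 0x80:
        return byte | 0xFFFFFF00
    return byte

if __name__ == "__main__":
    print(hex(mbr_unsigned(0xFE)), hex(mbr_signed(0xFE)))
```

The unsigned form suits bytes used as indices or opcodes; the signed form suits bytes that are signed constants or the high byte of a signed offset.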

Memory again
• MAR aligned to words (32 bits, 4 bytes): [4.4]
• Memory data is available 2 cycles from when the read was initiated
– available at end of 2nd cycle, so the 3rd cycle can use it

Microinstructions
• 29 signals for data path:
– 1. 9 signals to control C bus output into registers
– 2. 9 signals to enable registers onto B bus
– 3. 8 signals for ALU, shifter functions (6 ALU + 2 shifter)
– 4. 2 signals for memory W/R via MAR/MDR
– 5. 1 signal for memory fetch via PC/MBR
• Issues:
– may load more than 1 reg from C (9 bits)
– but never enable more than 1 reg onto B, so the B field is encoded in 4 bits (the encoding forces this): 4 signals instead of 9.
• Need 2 more fields for determining next m.i.:
– NextAddr (9 bits, addr space of 512)
– conditional jumps (3 bits)

Microinstructions
• Fields:
– Addr: address of next micro-instruction
– JAM: determines how next m.i. is selected
– ALU: ALU, shifter control
– C: which registers are written from C bus
– Mem: memory functions
– B: B source (encoded)
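As a rough sketch of how these six fields fit into one 36-bit word, the snippet below packs and unpacks them. The widths (9, 3, 8, 9, 3, 4) are assumptions derived from the counts on the previous slide (8 ALU/shifter bits = 6 ALU + 2 shifter lines, 3 memory bits = 2 + 1), and the most-significant-first field order simply follows the order listed above; `pack` and `unpack` are hypothetical helpers.

```python
# (name, width in bits), most significant field first; widths are assumed.
FIELDS = [("addr", 9), ("jam", 3), ("alu", 8), ("c", 9), ("mem", 3), ("b", 4)]

def pack(**values: int) -> int:
    """Pack named field values into one microinstruction word."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word

def unpack(word: int) -> dict:
    """Split a microinstruction word back into its named fields."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields

if __name__ == "__main__":
    mi = pack(addr=0x092, jam=0b001, b=5)
    print(f"{mi:036b}", unpack(mi))
```

Encoding the B source in 4 bits rather than 9 one-hot bits is what makes it impossible for a microinstruction to drive two registers onto the B bus at once.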

[Figure: Example microarchitecture: Mic-1]

Example microarchitecture: Mic-1
• sequencer: executes microinstructions
• Two tasks:
– set control signals for system
– determine next m.i. to execute
• control store: contains m.i.'s for interpreting ISA instns.
– each m.i. is a 36-bit word like [4.5]
– each m.i. specifies its successor
• MPC: MicroProgram Counter
– 9-bit address of next m.i. to execute
• MIR: MicroInstruction Register
– 36-bit m.i. being executed
• Note that bits in MIR may directly control other parts of the circuit
– eg. C

Mic-1 operation cycle
• Basic ALU cycle:
– 1. set up the inputs to the ALU
– 2. let the ALU do its computation
– 3. store the results
• Clock cycles for Mic-1
– 1. MIR enabled (during subcycle w)
– 2. MIR signals control data path (B bus; note H always enabled) (subcycle x)
– 3. B and H inputs are stable, and ALU computes its output; shifter finishes; N, Z bits stable (subcycle y)
– 4. shifter, N, Z outputs loaded from C bus into registers
» rising clock edge determines end
» MIR is reloaded and calculated at this point as well
» Memory read is initiated at end too
• Note that all the above will complete in 1 cycle
– microinstructions can specify all these operations in parallel

Mic-1 sequencing
• First, 9-bit next addr field copied into MPC
• JAM inspected:
– 000 = use MPC as it is
– if JAMN (or JAMZ) set, then N bit (or Z) is ORed with high bit of MPC
» hence next address is either: MPC, or MPC with high bit ORed with 1
– JMPC set: MBR byte ORed with low byte of NextAddr field
» permits multiway jumps
» can quickly branch to instn for just-loaded opcodes (ie. opcode number = address in control store!)
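The sequencing rules above can be expressed as a short function. This is a sketch rather than the actual hardware description: the function name and argument layout are hypothetical, and the three JAM bits are passed as separate 0/1 flags.

```python
def next_mpc(next_addr: int, jamn: int = 0, jamz: int = 0, jmpc: int = 0,
             n: int = 0, z: int = 0, mbr: int = 0) -> int:
    """Compute the 9-bit address of the next microinstruction."""
    mpc = next_addr & 0x1FF                 # copy the 9-bit NextAddr field
    high_bit = (jamn & n) | (jamz & z)      # conditional jump on N or Z
    mpc |= high_bit << 8                    # OR into the high bit of MPC
    if jmpc:
        mpc |= mbr & 0xFF                   # OR the MBR byte into the low 8 bits
    return mpc

if __name__ == "__main__":
    print(hex(next_mpc(0x092, jamz=1, z=1)))   # branch taken
    print(hex(next_mpc(0x092, jamz=1, z=0)))   # branch not taken
    print(hex(next_mpc(0x000, jmpc=1, mbr=0x60)))
```

With NextAddr set to 0 and JMPC set, the next address is simply the opcode byte in MBR, which is how the interpreter jumps straight to the microcode for the opcode it has just fetched.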

Microinstructions and notation
• As in assembler programming, it helps to use higher-level notation instead of raw numeric m.i. fields
• can specify everything that happens in 1 clock cycle:
– permits parallelism: eg. prefetch next instns
• Notation: high-level, but directly translatable to single m.i.'s
• Examples:
– SP=SP+1: incr SP by 1
– MDR = SP: copy SP into MDR
– MDR = SP+H; rd : add SP and H, save in MDR, and initiate a read
– SP=MDR=SP+1: incr SP, load into both MDR, SP

Microinstructions and notation
• Memory takes 2 cycles:
– MAR=SP; rd : starts a read that assigns a value into MDR
– (another instn)
– * memory ready now!
• next addresses: assume it is the labeled next m.i. after the current one (unless a conditional jump)
– if (Z) goto L1; else goto L2 : sets JAMZ
» L1 and L2 have the same low 8 bits (set by assembler)
• Summary of legal operations on operands: [figure]

Example M.I. implementation: IJVM
• A stack-based virtual machine which Mic-1 is designed to implement.
• All instructions access the stack: no general registers are used by compiler
– eg. parameter passing [4.8]
– eg. arithmetic [4.9]
• Recall:
– JVM instruction formats: [5.15]
– Java memory usage, registers: [4.10]
• Complete instruction set: [4.11]
• Example translated code: [4.14]

[Figure: JVM Instruction Formats]

[Figure: Memory area of IJVM]

[Figure: IJVM Instruction Set]

[Figure: Translating Java to IJVM]

Implementation (cont)
• See overheads (book pages 234-236)
• Note:
– each m.i. contains address of next instn
– micro-assembler labels all instns appropriately, and must put them in the right control store addresses (equiv. to opcode)
– the sequenced instns may reside in any free area of control store! Microassembler automatically sets 'next address fields'.
– only explicit 'goto's will override this sequencing
• Two parts:
– 1. fetch next byte for next instn (done at Main1)
– 2. branch to that opcode address and carry out instruction
• Fetching instructions (Main1)
– PC always points to next instruction in Java application program
– can be reset by branches (see goto5, T, F, ...)
– When Main1 is executed, the next opcode is assumed to be ready; the fetch at Main1 is for the opcode after it. Hence instns must fetch it themselves if necessary (eg. see bipush2)

Implementation (cont)
• Example 1: iadd ("pop 2 words from stack, push their sum")
– iadd1: reads next-to-top word in stack (TOS register already contains top of stack word); bumps down the SP for writing result
– iadd2: sets TOS ready for addition (put in H)
– iadd3: add next-to-top value (read in iadd1) to H, update TOS, save result in MDR for writing
• Example 2: dup ("copy top stack word and push it")
– dup1: incr SP pointer, copy to MAR
– dup2: save TOS (top stack word) to new SP, write it
– note: can't write it in dup1, because both SP and MDR must be updated through the data path, and not both at once
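The iadd and dup steps above can be traced with a small register-level model. This is a sketch under simplifying assumptions: memory is a dict keyed by word address, the stack grows toward higher addresses, the two-cycle read latency is collapsed into a single assignment, and the class and field names are illustrative.

```python
MASK32 = 0xFFFFFFFF

class Mic1State:
    """Just the registers and memory that iadd and dup touch."""
    def __init__(self, stack_words, base=0x100):
        self.mem = {base + i: w for i, w in enumerate(stack_words)}
        self.sp = base + len(stack_words) - 1   # address of top-of-stack word
        self.tos = stack_words[-1]              # cached copy of top-of-stack word
        self.mar = self.mdr = self.h = 0

def iadd(s: Mic1State) -> None:
    # iadd1: bump SP down, read next-to-top word (arrives in MDR)
    s.sp -= 1
    s.mar = s.sp
    s.mdr = s.mem[s.mar]
    # iadd2: put the old top of stack in H
    s.h = s.tos
    # iadd3: add, update TOS, write the sum back through MDR
    s.tos = s.mdr = (s.mdr + s.h) & MASK32
    s.mem[s.mar] = s.mdr

def dup(s: Mic1State) -> None:
    # dup1: increment SP and copy it to MAR
    s.sp += 1
    s.mar = s.sp
    # dup2: write TOS to the new top of stack
    s.mdr = s.tos
    s.mem[s.mar] = s.mdr

if __name__ == "__main__":
    state = Mic1State([5, 7])
    iadd(state)
    dup(state)
    print(state.tos, state.sp, state.mem)
```

Keeping the top word cached in TOS is what saves a memory read: iadd only has to fetch the next-to-top word.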

Implementation (cont)
• Example 3: goto offset ("unconditional branch")
– [Fig 4.22]
– goto1: save addr of opcode to OPC (old PC)
– goto2: get the 2nd byte of offset (1st byte already in MBR)
– goto3: shift 1st byte left 8 bits
– goto4: OR low byte into high byte
– goto5: add 16-bit offset to (old) PC; get next opcode
– goto6: goto Main1
– Note: pause needed in goto6 (must wait 2 extra cycles)
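The arithmetic behind the goto steps can be sketched as follows: the first offset byte is shifted left 8 bits, ORed with the second byte, interpreted as a signed 16-bit value, and added to the address of the goto opcode. The function name and the plain-integer model of OPC are assumptions for illustration.

```python
def goto_target(opc: int, offset_byte1: int, offset_byte2: int) -> int:
    """Return the new PC for `goto offset`, given the opcode address in OPC."""
    offset = ((offset_byte1 & 0xFF) << 8) | (offset_byte2 & 0xFF)
    if offset & 0x8000:          # the 16-bit offset is signed
        offset -= 0x10000
    return opc + offset

if __name__ == "__main__":
    print(goto_target(100, 0x00, 0x10))   # forward branch
    print(goto_target(100, 0xFF, 0xFC))   # backward branch
```

Note that the offset is relative to the address of the goto opcode itself, which is why the microcode saves that address in OPC before PC moves on.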

Improving performance
• 1. Faster clock, transistors, electrical circuits
• 2. simpler organization yields shorter clock cycles
– eg. get rid of (B bus) decoder
• 3. Merge interpreter loop with microcode (pt 2)
– [4.23], [4.24]
– saves extra cycles if done in all instns
– significant speedup!
• 4. Three buses
– [4.25], [4.26]
– reduces need for separate instns to load H reg

[Figure: 2 Bus vs. 3 Bus]

Improving performance
• 5. Instruction fetch unit [4.27]
– in Mic-1, ALU is used to increment PC and fetch instns
– this uses up instn. cycles
– IFU can be used:
» 1. pre-fetches all instns outside of main data path
» 2. pre-fetches operands: if they are required, they are there (else garbage, but ignored anyway)

[Figure: Fetch Unit]

Improving performance
• Instruction fetch unit (cont)
– shift register: always loaded with next bytes from memory
– MBR1 (1 byte, as before); and new MBR2 (2 bytes)
– values from shift reg dumped into both MBR1, MBR2 after every instn read; if needed, they are quickly put onto data path as required
– need some fetching logic to know when to read more bytes into shift register, when to refresh MBR1, MBR2
– IMAR: separate memory addr reg (separate from MAR)
» own dedicated incrementer (no need for ALU)
– IFU must keep PC incremented properly, depending on instn length (if MBR1, MBR2 used)
» branches may reset PC as well (from C)

Improving performance
• Mic-2:
– A, B buses
– IFU
– new IJVM [4.30, see overheads]
» smaller, faster
» MBR1 always has next opcode (due to IFU)

[Figure: Mic-2]

Improving performance: 6. Pipelining
• divide instn. execution into modular steps and carry out different steps for sequential instns simultaneously
• "instruction-level parallelism"
• superscalar: single pipeline with parallel functional units
• most instns take more than 1 cycle to complete
• with pipelining: n instns in about n cycles (once the pipeline is full)
• To implement it: [4.31]
– add latches to A, B, C buses
– they keep values stable during sub-cycles: can use values in 3 sections of the data path
» (i) loading before ALU (A, B)
» (ii) doing ALU, shift, and loading C latch
» (iii) storing C back into registers

[Figure: Mic-3]

Improving performance: 6. Pipelining
• need 3 cycles now to complete 1 instn
– but maximum delay between all components is shorter (1/3) so can speed up clock
– advantage: throughput -- 3 instns can be processed simultaneously
– all parts of data path are busy... none are idle (usually)
• best analogy: car factory assembly line
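The trade-off above (3 shorter cycles per instruction, but 3 instructions in flight) can be checked with a little arithmetic. The sketch below assumes an idealized pipeline: the stages split the original cycle time evenly and there are no stalls. Times are in units of the original, unpipelined cycle, and the function names are hypothetical.

```python
from fractions import Fraction

def unpipelined_time(n_instructions: int) -> Fraction:
    """One long cycle per instruction."""
    return Fraction(n_instructions)

def pipelined_time(n_instructions: int, stages: int = 3) -> Fraction:
    """Fill the pipeline once, then finish one instruction per short cycle."""
    if n_instructions == 0:
        return Fraction(0)
    short_cycles = stages + (n_instructions - 1)
    return Fraction(short_cycles, stages)

if __name__ == "__main__":
    for n in (1, 3, 100):
        print(n, unpipelined_time(n), pipelined_time(n))
```

A single instruction is no faster, since its latency is still three short cycles, but for long instruction sequences the total time approaches one third of the unpipelined time.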

Pipelining (cont)
• [4.32, 4.33, 4.44]
• interpreting instns in pipelined processor (Mic-4):
– new sub-cycles: microsteps
– takes 3 cycles to process instn (steps (i) to (iii) from earlier)
– call latches A, B, C (like registers)
– advantage [4.33] is that different stages can work independently of one another now
• more stages in pipeline means higher potential throughput, as long as the pipeline stays full

Pipelining (cont)
• One complication: memory reads
– takes 2 cycles to get word from memory
– hence a m.i. that uses a word in MDR must wait until it's available
– called a true or RAW (read after write) dependence
– pipeline must stall until it is ready
– ideally, fill the wait states with other m.i.'s
• Another complication: conditional branches
– cannot predict which instn to fetch/put into pipeline
– have to "squash" or "flush" pipeline when a jump ruins sequence of instns

Pipelines and branch prediction
• unconditional branches
– fetch unit needs to know in advance where to access instns
– a jump instn. isn't decoded right away, and so the fetch unit won't know the branch location until later: called the delay slot
– soln: compiler places other executable instns in the delay slot, ones that it knows can be executed
• conditional branches
– dynamic prediction: carried out during run time
– keep a running table of branch instn addresses, along with a "branch/no branch" bit
– if branch in table, and branch bit set, then predict it will be taken --> fetch it
– can use 2 prediction bits: the prediction changes only after the branch goes the other way twice in a row (extra logic)
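The 2-bit scheme mentioned above is commonly built as a saturating counter per branch address. The sketch below is one such design, not necessarily the one in the course text: the class name, the dict-based table, and the initial counter value of 1 ("weakly not taken") are all assumptions.

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken."""
    def __init__(self):
        self.table = {}                      # branch address -> counter (0..3)

    def predict(self, address: int) -> bool:
        return self.table.get(address, 1) >= 2

    def update(self, address: int, taken: bool) -> None:
        counter = self.table.get(address, 1)
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
        self.table[address] = counter

def count_mispredictions(predictor, address, outcomes) -> int:
    """Run a sequence of actual branch outcomes through the predictor."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(address) != taken:
            wrong += 1
        predictor.update(address, taken)
    return wrong

if __name__ == "__main__":
    # A loop branch taken 9 times, then not taken on exit; the loop runs twice.
    outcomes = ([True] * 9 + [False]) * 2
    print(count_mispredictions(TwoBitPredictor(), 0x40, outcomes))
```

The point of the second bit shows up on the second run of the loop: the single not-taken outcome at loop exit does not flip the prediction, so the first iteration of the next run is still predicted correctly.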

Pipelines and branch prediction
• static branch prediction: carried out during compile time
– if a loop branch is nearly always taken, then have a field in the instn. which tells the CPU that the branch target should be fetched (eg. UltraSPARC)
– can do simulations to determine how cond. branches are executed

Improving performance: out-of-order exec, reg renaming
• instruction ops can take varying # clock cycles
– superscalar systems mean those functional units need more time to process their instns
• problem: can't exec one instn that requires results of another
– means the pipeline stalls until register values are computed when subsequent instns require them.
• soln: move instruction order, so that there is no idle waiting
– overall exec must be identical to "linear" order
• dependencies:
– RAW (read after write): try to read a reg before another instn has written it.
– WAR (write after read): try to write a reg before another instn has read it
– WAW (write after write): two instns write the same reg, so the writes must complete in program order
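A minimal sketch of how the three dependence types can be detected between two instructions in program order. Each instruction is modeled as a hypothetical `(destination, sources)` pair of register names; real hardware tracks this with scoreboards or similar structures rather than sets of strings.

```python
def hazards(first, second) -> set:
    """Return the dependence types that `second` has on the earlier `first`."""
    dest1, sources1 = first
    dest2, sources2 = second
    found = set()
    if dest1 is not None and dest1 in sources2:
        found.add("RAW")   # second reads what first writes
    if dest2 is not None and dest2 in sources1:
        found.add("WAR")   # second writes what first reads
    if dest1 is not None and dest1 == dest2:
        found.add("WAW")   # both write the same register
    return found

if __name__ == "__main__":
    add = ("R3", {"R0", "R1"})            # R3 = R0 + R1
    print(hazards(add, ("R4", {"R3", "R2"})))
    print(hazards(add, ("R0", {"R5", "R6"})))
    print(hazards(add, ("R3", {"R5", "R6"})))
```

Only RAW is a true data dependence; WAR and WAW arise from reusing register names, which is why renaming can remove them.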

In-order exec, in-order completion
– decode in cycle n, exec in n+1, writeback in n+2 (except multiply, in n+3)
– 2 instns decoded simultaneously
– uses scoreboard: 1 counter per reg keeping track of # instns using it as a source or destination
– keeps track of max # regs that can be processed concurrently

Out-of-order exec, reg renaming (cont)
– idea: execute instns so long as resources are available, and there are no conflicts
– move order of instns to permit this
– registers are renamed automatically to reduce conflicts: "secret regs"
» eg. if a register is in conflict, rename it so the conflict is removed.
» copy values to the originally named reg later if required.
– result: huge performance gain (we're trying to make the pipeline maximally useful!)
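Renaming can be illustrated with a toy pass that gives every written register a fresh "secret" register and redirects later reads to it. This is a sketch only: the `S0, S1, ...` names are hypothetical, instructions are `(destination, sources)` pairs, and the pool of secret registers is treated as unlimited.

```python
def rename(instructions):
    """Rename destinations to fresh registers; sources follow the latest mapping."""
    mapping = {}          # architectural register -> current secret register
    renamed = []
    for index, (dest, sources) in enumerate(instructions):
        new_sources = tuple(mapping.get(src, src) for src in sources)
        new_dest = f"S{index}"            # fresh secret register for each write
        mapping[dest] = new_dest
        renamed.append((new_dest, new_sources))
    return renamed

if __name__ == "__main__":
    program = [
        ("R3", ("R0", "R1")),   # R3 = R0 + R1
        ("R0", ("R5", "R6")),   # WAR with the first instn (writes R0)
        ("R3", ("R0", "R2")),   # WAW with the first, RAW on the second
    ]
    for instn in rename(program):
        print(instn)
```

After renaming, the second instruction no longer writes a register the first one reads, and the first and third no longer share a destination; only the true RAW dependence (the third reading the second's result) remains.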

Improving performance: speculative exec
• block: a section of sequential code [4.45]
• Can increase throughput by moving instructions beyond their blocks
– hoisting: moving an instruction over a branch
• speculative execution: executing an instruction before it is known whether it will be needed
– OK to do it so long as there is no side effect (eg. write to memory, trap/interrupt)
– may sometimes cause slowdown if spec. exec fetches an instn from memory that isn't needed
– otherwise, idea is to move slower instructions up the queue so that their processing can occur in the interim
• some solns:
– speculative instns: only fetch/exec instructions that are in the cache
– poison bits: don't set traps automatically; wait until that instn is actually executed, and if a poison bit is set, then set the trap

[Figure: Speculative exec]

Example 1: Pentium II
• 1. Fetch/decode [4.46]
– fetches instns and breaks them into m.i.'s
• 2. dispatch/exec
– takes m.i.'s and execs them
• 3. retirement unit
– completes exec, stores reg values (speculative exec)
• 1, 2, 3 above act as high-level pipeline
• ROB (reorder buffer): table of m.i.'s to execute
• Fetch/decode [4.47]
– 7-stage pipeline
– multiple formats, sizes means instn decoding is involved
– analyzes instns to determine: size, branch-prediction
– usually between 1 and 4 m.i.'s per ISA instn.
– uses reg renaming
– both static, dynamic branch prediction used
• Dispatch/exec [4.48]
– 5 m.i.'s can be exec'd at once

[Figure: P2 microarchitecture]

Example 2: UltraSPARC II
• [4.49]
• RISC: all instns are 3-register microinstns already
• branch prediction: (i) cache flags; (ii) 2-bit prediction; (iii) compiler directions in instns
• tries to exec 4 instns in parallel all the time
– instns may be executed out of order
• 9-stage pipeline [4.50]
– split integer, float pipelines
– int adds 2 stages (N1, N2) to keep it the same as fp

[Figure: UltraSPARC]

[Figure: UltraSPARC Pipeline]

Example 3: picoJava II
• [4.51]
• instn, data caches are optional
• register file (64 entries)
– contains top 64 words of stack
– dribbling: reg file read/written to memory when it gets too empty/full
– "free" access, w/o accessing caches (which may not be used)

• 6-stage pipeline [4.52]
– CISC instns
– not superscalar: instns fetched, retired in-order (unlike Pentium II)
• no branch prediction alg (economy)

Folding
• Folding [4.53, 4.54, 4.55]
– replace a set of m.i.'s with one m.i.
– looks up patterns in a table [4.55], and replaces with equivalent m.i.
– only possible if operands are high in stack, in register file
– huge gain in speed, like RISC performance

Comparing these examples
• common features
– all m.i.'s contain opcode, 2 source regs, dest reg
– 1 m.i. per cycle
– deep pipelines
– split instn and data caches
• Pentium II: complexity is in deconstructing its CISC instns into micro-operations
• JVM: complexity is in folding sets of m.i.'s into single operations
• UltraSPARC is the most straightforward to implement, because instns require minimal decoding (all RISC instructions are micro-operations already!)

The end