Parallel architectures Computer Architectures M Parallelism 1

Architecture
• Architecture: the functional behaviour of a computer, for instance a processor that executes DLX code
• Implementation: a logical network implementing the architecture, also called microarchitecture. One architecture can have many implementations (example: the x86 family)
• Synthesis: a physical implementation. One implementation can have many syntheses (for instance in different technologies)
The architecture is defined by the machine language, that is, the instruction set (assembly language): the Instruction Set Architecture (ISA).
The ISA varies slowly while the implementations change rapidly (see for instance IA 8, IA 16, IA 32…). The longer an ISA survives, the more programs are written for it, and therefore compatibility becomes the main issue.

Parallelism
Instruction level parallelism
• Sequential: a single instruction executed at a time
• Pipelined: multiple instructions executed simultaneously
• Superpipelined: multiple stages for each operation (EX, MEM etc.) in order to increase the clock frequency (e.g. Pentium IV)
• Scalar: a single pipeline
• Superscalar: multiple pipelines; many instructions started at the same time. Possible out-of-order execution (run-time decision)
• Very Long Instruction Word (VLIW): multiple pipelines; many instructions started at the same time. Instruction order decided at compile time
• Superscalar superpipelined (e.g. Pentium IV, i5, i7 etc.)

Parallelism architectures
• Multicore (core level parallelism): many processors on the same chip (e.g. Core Duo, Nehalem, Sandy Bridge…)
• Multithread (thread level parallelism): the pipelines of the same processor are used by different processes at the same time (time sharing), as if it were a multicore (e.g. Pentium IV, Nehalem, Sandy Bridge etc.)
• Memory level parallelism: a memory able to provide multiple data at different addresses at the same time (outstanding requests: DDR2, DDR3 etc.)

Deep Pipeline (Superpipeline)
[Figure: Fetch, Decode, Execute, Memory and Writeback stages, with the branch penalty marked]
• Each stage is subdivided into three substages: higher clock frequency but higher branch penalty
• Higher power consumption!
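The trade-off between clock frequency and branch penalty can be put into numbers. The sketch below is only illustrative: the branch fraction, the cycle times and the penalties are assumed values, not figures from the slides, and `time_per_instruction` is a hypothetical helper.

```python
def time_per_instruction(cycle_time_ns, branch_fraction, branch_penalty_cycles):
    """Average time per instruction: ideal pipeline (CPI = 1) plus branch stalls."""
    cpi = 1.0 + branch_fraction * branch_penalty_cycles
    return cpi * cycle_time_ns

# Assumed numbers: 20% branches, 2 penalty cycles at 1.0 ns for the base pipeline.
base = time_per_instruction(1.0, 0.2, 2)
# Each stage split in three: cycle time divided by 3, penalty multiplied by 3.
deep = time_per_instruction(1.0 / 3, 0.2, 6)

speedup = base / deep
print(f"base {base:.2f} ns, deep {deep:.2f} ns, speedup {speedup:.2f}x")
```

Under these assumptions the deep pipeline is faster, but by clearly less than the factor 3 gained on the clock, because the branch stalls grow with the depth.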

Parallel pipelines
• Sequential
• Time parallelism: pipeline
• Space parallelism: VLIW
• Space-time parallelism (e.g. i5, i7…)

Diversified pipelines - 1
• Dedicated pipelines. The instruction sequence is defined at compile time. Careful compilation is fundamental in order to avoid underexploiting the pipelines.
[Figure: IF, ID, RD, then parallel EX pipelines (ALU; MEM1, MEM2; FP1, FP2, FP3, where F => floating point; BR), then WB]
• Different execution times problem
• Instruction interdependency problem
• Multi-instruction buffer to avoid blocking the pipelines

Diversified pipelines - 2
[Figure: IF, ID, RD ("in order" execution); Dispatch Buffer; EX pipelines ALU, MEM1-MEM2, FP1-FP2-FP3, BR ("out of order" execution); Reorder Buffer; WB ("in order" retirement)]

Floating Point DLX - F instructions
[Figure: IF and ID feed the execution units (Integer, FP multiplier, FP adder, FP/integer divider), which are multicycle stages, followed by MEM and WB]
Pipelined version:
• Integer: EX
• FP multiply: M1 M2 M3 M4 M5 M6 M7
• FP add: A1 A2 A3 A4
• FP/integer divide: e.g. 24 clock cycles, one instruction executed at a time
All units are preceded by IF and ID and followed by MEM and WB.

DLX revisited
• Very important structural change (more intermediate registers, a more complex ID stage that sends each instruction to the appropriate execution stage)
• Hazard problems: the instructions do not end in the same order as their issue
• Example (no interdependency between the instructions in this sequence):
    FMUL F1, F2
    FADD F3, F4, F5
    FLD  F6, 10(R8)
    FST  40(R10), F9
  Pipeline diagram:
    FMUL: IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
    FADD: IF ID A1 A2 A3 A4 MEM WB
    FLD:  IF ID EX MEM WB
    FST:  IF ID EX MEM (WB: nop)
  In the figure, violet marks the stages where the operands are needed, green the stages where new results are produced and red squares the execution stages. Figure annotations: "data required for computing the address", "data written".
• Since the divider is normally a single functional unit, stalls of up to 40 clocks may occur in this case
• Multiple instructions can be in the same stage at the same time (in particular in WB)
• Write After Write (WAW) hazards: e.g. if an FADD F6, F4, F5 (four EX cycles) directly preceded an FLD F6, 10(R8) (one EX cycle), both with the same destination register (although in this case the FADD would have been dropped by the compiler since it is useless). The instructions are not completed in order: write sequence error
• Because of the different instruction execution times, Read After Write (RAW) hazards, already present in DLX, are more frequent

DLX revisited
• To cope with multiple simultaneous write operations to different registers, either the number of input ports of the RF is increased (expensive) or stalls are introduced (normally in the MEM or WB stages, so as to choose the instructions to be stalled). More complex pipelines
• RAW hazards are solved through forwarding
• For WAW hazards consider the following example:
    FMUL F0, F4, F6
    …
    FADD F2, F4, F6
    …
    FLD  F2, 0(R2)
  Pipeline diagram (annotation on the WB stages: multiple RF write operations):
    FMUL: IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
    FADD: IF ID A1 A2 A3 A4 MEM WB
    FLD:  IF ID EX MEM WB
  If FADD had started one clock later, a Write After Write hazard would have taken place!
• Hazards normally occur among homogeneous registers (FP or integer), except for FLOAD and FSTOR, which use integer registers for address computation
• Hazards are normally detected in the ID stage, by considering the preceding and following instructions, so as to introduce the required stalls (in this case FLD would have been stalled for one clock)
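The write-port conflict in this example can be checked with a few lines of arithmetic. This is a minimal sketch: the number of execute stages is read off the pipeline rows (M1 to M7, A1 to A4, EX), while the fetch cycles 1, 4 and 7 are an assumption chosen so that the three instructions reach WB together; `wb_cycle` and `write_port_conflicts` are hypothetical helper names.

```python
# Number of execute stages per operation, as in the pipeline rows above.
EX_STAGES = {"FMUL": 7, "FADD": 4, "FLD": 1}

def wb_cycle(fetch_cycle, op):
    """Cycle in which `op` is in WB: IF, ID, the EX stages, MEM, then WB."""
    return fetch_cycle + 1 + EX_STAGES[op] + 1 + 1

def write_port_conflicts(program):
    """Group instructions by WB cycle and keep the cycles with several writers."""
    by_cycle = {}
    for fetch_cycle, op, dest in program:
        by_cycle.setdefault(wb_cycle(fetch_cycle, op), []).append((op, dest))
    return {c: ws for c, ws in by_cycle.items() if len(ws) > 1}

# Assumed fetch cycles: two other instructions between FMUL and FADD,
# and two more between FADD and FLD.
program = [(1, "FMUL", "F0"), (4, "FADD", "F2"), (7, "FLD", "F2")]
conflicts = write_port_conflicts(program)
print(conflicts)

# If FADD is fetched one clock later it writes F2 after FLD: a WAW hazard.
late_fadd_is_waw = wb_cycle(5, "FADD") > wb_cycle(7, "FLD")
print(late_fadd_is_waw)
```

With these fetch cycles all three instructions need the register file write port in the same clock, and delaying FADD by one clock reverses the order of the two writes to F2.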

DLX revisited
• In the previous case FLD F2, 0(R2) must be stalled until FADD F2, F4, F6 has reached the MEM stage. It must however be assumed that between the two instructions there is at least one that uses the result of FADD F2, F4, F6 through forwarding, otherwise the compiler would have dropped the FADD!
• The situation would have been even worse if FLD had completed before FADD
• In any case it is always possible that instructions complete in an order different from that of their issue
• How can we guarantee that the final result is the one intended by the program?

Compiler
Let's consider these high level language statements
    X=Y+Z
    A=B*C
to be executed on a processor with the following pipeline (in-order emission):
    Fetch (F), Decode (D), Issue (I), Execute (E), Writeback (W)
The issue of the addition (multiply) is possible only AFTER the execution of the previous instruction, which computes R2 (R5), that is after the last EX stage of R2 <= Z (R5 <= C), possibly with forwarding.
[Figure: pipeline timing of the compiled code. Annotations: addition result not yet ready (RAW); decoder busy (RAW); multiply waits for results; decoder busy; stalls (decoder occupied, data not available); D freed by the previous addition instruction; the issue is possible here since the data for R1 and R2 have already been produced; at the end of this stage the addition result is available]

Compiler
Pipeline: Fetch (F), Decode (D), Issue (I), Execute (E), Writeback (W)
But we can modify the emission order without modifying the result.
[Figure: instruction timing before and after the reordering. Annotations: emission possible since R1 and R2 are already available; waiting for R5; waiting for R6; decoder busy]
16 cycles instead of 22!

Multicycle hazards
    I1: F1 = F2 + F3
    I2: F2 = F4 x F5
    I3: F3 = F3 + F4
    I4: F6 = F6 x F6
    I5: F1 = F3 + F5
    I6: F2 = F3 + F4
[Figure: dependency graph with the edges WAW (F1) between I1 and I5, WAR (F2) between I1 and I2, WAR (F3) between I1 and I3, RAW (F3) between I3 and I5, RAW (F3) between I3 and I6, WAW (F2) between I2 and I6]
NB: in this graph the hazards are potential, since only the registers are considered, no matter how many cycles the executions require.
Let's suppose we have an FP adder (1 cycle, in red) and a multiplier (3 cycles, in green).
[Figure: timing of I1 to I6 over the clock cycles T1 to T10]
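The potential hazards of this sequence can be found mechanically by comparing register names pairwise. The sketch below is one simple way to do it, not the method used to draw the graph; `potential_hazards` is a hypothetical helper. Because it compares every pair of instructions, it also reports WAR (F2) between I1 and I6, a conflict that is not among the six listed for the graph.

```python
# (name, destination, sources) for I1..I6, as listed on the slide.
PROGRAM = [
    ("I1", "F1", ("F2", "F3")),
    ("I2", "F2", ("F4", "F5")),
    ("I3", "F3", ("F3", "F4")),
    ("I4", "F6", ("F6", "F6")),
    ("I5", "F1", ("F3", "F5")),
    ("I6", "F2", ("F3", "F4")),
]

def potential_hazards(program):
    """Return (kind, register, earlier, later) for every conflicting pair."""
    hazards = []
    for i, (early, e_dst, e_src) in enumerate(program):
        for late, l_dst, l_src in program[i + 1:]:
            if e_dst in l_src:       # later instruction reads what the earlier writes
                hazards.append(("RAW", e_dst, early, late))
            if l_dst in e_src:       # later instruction overwrites an earlier source
                hazards.append(("WAR", l_dst, early, late))
            if l_dst == e_dst:       # both write the same register
                hazards.append(("WAW", e_dst, early, late))
    return hazards

hazards = potential_hazards(PROGRAM)
for h in hazards:
    print(h)
```

I4 appears in no hazard, since it only uses F6, which no other instruction touches.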

Dynamic instruction scheduling
• Temporal dependencies (hazards) are not known at compile time
• It allows the code to be executed on different pipelines and on superscalar processors with no implications for the compiler
• It allows instructions to be executed ahead of their position (in the following case FSUB F12, F8, F14) if the conditions allow it
    FDIV F0, F2, F4
    FADD F10, F0, F8    (RAW: must wait for F0)
    FSUB F12, F8, F14   (can be executed anyway)
• Systems with out-of-order execution but commitment always in order
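The idea of letting FSUB overtake FADD can be sketched as a readiness test on the source registers. This is a minimal illustration, assuming (as the RAW note indicates) that FADD reads F0; `ready_to_execute` is a hypothetical helper and ignores structural hazards.

```python
def ready_to_execute(window, pending):
    """Instructions whose source registers are not waiting for a result."""
    return [name for name, dest, srcs in window if not (set(srcs) & pending)]

# FDIV F0, F2, F4 is still executing, so F0 is pending.
pending = {"F0"}
window = [
    ("FADD", "F10", ("F0", "F8")),    # RAW on F0: must wait
    ("FSUB", "F12", ("F8", "F14")),   # independent: can start
]
ready = ready_to_execute(window, pending)
print(ready)
```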

Scoreboard
• Consider the following sequence
    FDIV F0, F2, F4
    FADD F10, F0, F8
    FSUB F8, F14
  Figure annotations: Read After Write (RAW); Write After Read (WAR); "they must read the same value"
• There is an antidependency (WAR hazard) between FADD and FSUB: should FSUB end before FADD has read F8, an error would occur (F8 already updated)
• A Write After Write (WAW) hazard would occur if FSUB used F10 instead of F8 as its destination (in case FSUB ended before FADD; but FADD would probably be dropped by the compiler)
• "Scoreboard" technique: one instruction per clock should be completed, executing each instruction as soon as possible

Scoreboard
[Figure: the registers and the functional units (FP MUL, FP DIV, FP ADD, INTEG), all connected to the scoreboard]
The scoreboard is roughly equivalent to the ID stage (just after the fetch) and determines when an instruction can read its operands and start its execution. It tracks all system state changes and decides when the first instruction in the FIFO queue (as produced by the compiler) can be started.

Scoreboard
• The four stages equivalent to ID, EX and WB in DLX are:
  1. Emission: if a functional unit for the instruction is available (free), the instruction is issued, unless another functional unit already holds an instruction that must write into the same destination register. Therefore no WAW hazards. In this latter case the instruction is stalled, which blocks the emission of all the following instructions in the prefetch queue even when all other conditions for them are met!
  2. Operand read: the instruction has been emitted. If the operands are available and no already executing instruction must write them, the operands are read; otherwise the instruction stalls in the functional unit
  3. Execution: when the result has been computed and stored, the scoreboard is informed so as to unblock a possibly waiting instruction
  4. Writeback: in case of a possible WAR the instruction is stalled and does not write its result as long as a previous instruction has not yet read its operands and one (or both) of them is the destination register of the considered instruction. Once the operands have been read, the result can be written
• Note that with this organisation forwarding is avoided, since the results are written as soon as they are produced (except for the WAR wait, point 4)
• Obviously some stalls can be induced because the number of busses available for transfers is small
The scoreboard technique allows instructions to be transferred directly from the EX to the WB stage (reducing the RAW risks).
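The emission rule (point 1) reduces to two checks: the functional unit must be free, and no functional unit may already be due to write the same destination register. The following sketch shows only that rule; `can_issue` and the dictionary layout are illustrative assumptions, and the unit names are those used in the example of the next slides.

```python
def can_issue(unit, dest, busy, result_status):
    """Emission rule: free unit and no pending write to `dest`."""
    if busy[unit]:
        return False      # structural hazard: functional unit in use
    if result_status.get(dest) is not None:
        return False      # WAW hazard: another unit will write `dest`
    return True

# Cycle 2 of the example: the integer unit is executing LD F6, 34(R2).
busy = {"Integer": True, "Mult1": False, "Mult2": False,
        "Add": False, "Divide": False}
result_status = {"F6": "Integer"}
second_load = can_issue("Integer", "F2", busy, result_status)

# Hypothetical case: an instruction for the free adder that also writes F6.
waw_case = can_issue("Add", "F6", busy, result_status)

# Cycle 5: the load has written F6 and released the integer unit.
busy["Integer"] = False
del result_status["F6"]
second_load_later = can_issue("Integer", "F2", busy, result_status)
print(second_load, waw_case, second_load_later)
```

Since emission is in order, a `False` result for the instruction at the head of the queue also holds back every instruction behind it.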

An example
    LD   F6, 34(R2)
    LD   F2, 45(R3)
    FMUL F0, F2, F4     (MULTD)
    FSUB F8, F6, F2     (SUBD)
    FDIV F10, F0, F6    (DIVD)
    FADD F6, F8, F2     (ADDD)
The LD instructions use the integer unit. Hazards marked on the slide: two RAW and one WAR (on F6).
Do you find more hypothetical hazards? For instance what about F0?
Hypothetical timing for the different instructions (including operand read and execution):
    FLD          1 cycle
    FADD, FSUB   2 cycles
    FMUL         10 cycles
    FDIV         40 cycles

Scoreboard entities
• Instruction stages: emission, operand read, execution and writeback
• Status of the functional units (FU): 9 parameters
    Busy     unit busy
    Op       operation code presently executed
    Fi       destination (result) register of the instruction
    Fj, Fk   source registers of the operands
    Qj, Qk   functional units producing the required operands (if not yet ready) for the registers Fj and Fk
    Rj, Rk   flags (yes) indicating whether Fj, Fk have already been updated
• Register result status: indicates which functional unit will write each register. Void when no functional unit has to do with the specific register
N.B. Remember that in case of a possible WAW the emission of instructions is stalled (point 1 of the rules)
N.B. In the following example we suppose that two multiplication/division units are available
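The nine parameters map naturally onto a small record per functional unit, plus a dictionary for the register result status. The sketch below is an assumed encoding, not part of the slides: `FunctionalUnit` and `issue` are hypothetical names, and the flags are filled the way the example does at emission (a source is marked ready when no unit is due to write it).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FunctionalUnit:
    name: str
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # units producing Fj / Fk, None if ready
    qk: Optional[str] = None
    rj: bool = False           # Fj / Fk available
    rk: bool = False

def issue(unit, op, dest, src_j, src_k, result_status):
    """Fill the unit's entry at emission and book the destination register."""
    unit.busy, unit.op = True, op
    unit.fi, unit.fj, unit.fk = dest, src_j, src_k
    unit.qj = result_status.get(src_j)
    unit.qk = result_status.get(src_k)
    unit.rj = unit.qj is None
    unit.rk = unit.qk is None
    result_status[dest] = unit.name

# Start of cycle 6 of the example: the integer unit is still loading F2.
result_status = {"F2": "Integer"}
mult1 = FunctionalUnit("Mult1")
issue(mult1, "Mult", "F0", "F2", "F4", result_status)
print(mult1)
print(result_status)
```

The resulting entry has Qj = Integer, Rj = no and Rk = yes, which is the Mult1 row shown for cycle 6 of the example.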

Example (here we assume that F0 is a "normal" register and not always "0")
Instruction status (columns Issue, Read, Execution complete, Write result: the clock in which each instruction state is reached)
    Instruction  dest  j    k
    LD           F6    34   R2
    LD           F2    45   R3
    MULTD        F0    F2   F4
    SUBD         F8    F6   F2
    DIVD         F10   F0   F6
    ADDD         F6    F8   F2
    NB: LD = FLD, MULTD = FMUL, SUBD = FSUB, DIVD = FDIV, ADDD = FADD
Functional unit (FU) status
    Columns: Time (number of clock cycles of execution yet to elapse), Name, Busy, Op, Fi (dest), Fj (source 1), Fk (source 2), Qj (FU for j), Qk (FU for k), Rj (Fj ready?), Rk (Fk ready?)
    Units: Integer, Mult1, Mult2, Add, Divide (1 integer unit, 2 multiplication units, 1 add/sub unit, 1 division unit)
Register result status
    For each floating point result register F0, F2, F4, F6, F8, F10, F12, ..., F31: the functional unit producing the result for that register
Timing: FLD 1 cycle; FADD, FSUB 2 cycles; FMUL 10 cycles; FDIV 40 cycles
Rj and Rk indicate whether the data can be read (possibly in the next cycle if just produced) from the source registers of the instruction to be executed. Qj and Qk are the functional units that produce them (if not yet ready). Fj and Fk are the registers where the data produced by Qj and Qk are stored (or will be stored in the next clock cycle; the data are available if the corresponding R flag is yes) to be used by the executed instruction.
Clock 0

Cycle 1
• At clock 1 the state of instruction LD F6, 34(R2) is Issue
• R2 is supposed to be already available and can therefore be used in the next clock
• LD uses the integer unit
• Brown colour marks a state change
Instruction status
    Instruction  dest  j    k    Issue  Read  Exec complete  Write result
    LD           F6    34   R2   1      -     -              -
    LD           F2    45   R3   -      -     -              -
    MULTD        F0    F2   F4   -      -     -              -
    SUBD         F8    F6   F2   -      -     -              -
    DIVD         F10   F0   F6   -      -     -              -
    ADDD         F6    F8   F2   -      -     -              -
Functional unit status
    Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
    -     Integer  Yes   Load  F6   -    R2   -        -        -    Yes
    -     Mult1    No
    -     Mult2    No
    -     Add      No
    -     Divide   No
Register result status
    F6: Integer (functional unit used for producing the result in F6)

Cycle 2
• Data ready in R2: the instruction can proceed to execution
• NB: the second LD cannot be emitted because the only integer unit is busy, and the same applies to MULTD and the following instructions, because instructions must be emitted in order, although their functional units are free!
Instruction status
    Instruction  dest  j    k    Issue  Read  Exec complete  Write result
    LD           F6    34   R2   1      2     -              -
    LD           F2    45   R3   -      -     -              -
    MULTD        F0    F2   F4   -      -     -              -
    SUBD         F8    F6   F2   -      -     -              -
    DIVD         F10   F0   F6   -      -     -              -
    ADDD         F6    F8   F2   -      -     -              -
Functional unit status
    Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
    -     Integer  Yes   Load  F6   -    R2   -        -        -    -
    -     Mult1    No
    -     Mult2    No
    -     Add      No
    -     Divide   No
Register result status
    F6: Integer

Cycle 3
Timing: FLD 1 cycle; FADD, FSUB 2 cycles; FMUL 10 cycles; FDIV 40 cycles
Instruction status
    Instruction  dest  j    k    Issue  Read  Exec complete  Write result
    LD           F6    34   R2   1      2     3              -
    LD           F2    45   R3   -      -     -              -
    MULTD        F0    F2   F4   -      -     -              -
    SUBD         F8    F6   F2   -      -     -              -
    DIVD         F10   F0   F6   -      -     -              -
    ADDD         F6    F8   F2   -      -     -              -
Functional unit status
    Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
    -     Integer  Yes   Load  F6   -    R2   -        -        -    -
    -     Mult1    No
    -     Mult2    No
    -     Add      No
    -     Divide   No
Register result status
    F6: Integer

Cycle 4
• The change of status of the FUs indicates their value at the positive clock edge ending the current cycle (future status). For instance the integer functional unit is freed at the end of cycle 4 together with the result writeback
• LD F6, 34(R2) disappears totally from the scoreboard at the positive clock edge concluding the current cycle 4
Instruction status
    Instruction  dest  j    k    Issue  Read  Exec complete  Write result
    LD           F6    34   R2   1      2     3              4
    LD           F2    45   R3   -      -     -              -
    MULTD        F0    F2   F4   -      -     -              -
    SUBD         F8    F6   F2   -      -     -              -
    DIVD         F10   F0   F6   -      -     -              -
    ADDD         F6    F8   F2   -      -     -              -
Functional unit status
    Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
    -     Integer  Yes   Load  F6   -    R2   -        -        -    -
    -     Mult1    No
    -     Mult2    No
    -     Add      No
    -     Divide   No
    The integer functional unit is freed at the end of the period
Register result status
    F6: Integer (the register has been written at the end of the period)

Cycle 5
• At the beginning of cycle 5 the integer unit is already free, so LD F2, 45(R3) can be emitted and start
• R3 is supposed to be already available, as in the previous case
Instruction status
    Instruction  dest  j    k    Issue  Read  Exec complete  Write result
    LD           F6    34   R2   1      2     3              4
    LD           F2    45   R3   5      -     -              -
    MULTD        F0    F2   F4   -      -     -              -
    SUBD         F8    F6   F2   -      -     -              -
    DIVD         F10   F0   F6   -      -     -              -
    ADDD         F6    F8   F2   -      -     -              -
Functional unit status
    Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
    -     Integer  Yes   Load  F2   -    R3   -        -        -    Yes
    -     Mult1    No
    -     Mult2    No
    -     Add      No
    -     Divide   No
Register result status
    F2: Integer (the integer functional unit must produce a new value for F2)

Cycle 6
• MULTD F0, F2, F4 can start because its FU is free and the destination register is F0
• MULTD waits for F2 from the integer unit!
• F4 is supposed to be already present
Instruction status
    Instruction  dest  j    k    Issue  Read  Exec complete  Write result
    LD           F6    34   R2   1      2     3              4
    LD           F2    45   R3   5      6     -              -
    MULTD        F0    F2   F4   6      -     -              -
    SUBD         F8    F6   F2   -      -     -              -
    DIVD         F10   F0   F6   -      -     -              -
    ADDD         F6    F8   F2   -      -     -              -
Functional unit status
    Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
    -     Integer  Yes   Load  F2   -    R3   -        -        -    -
    -     Mult1    Yes   Mult  F0   F2   F4   Integer  -        No   Yes
    -     Mult2    No
    -     Add      No
    -     Divide   No
Register result status
    F0: Mult1   F2: Integer

Cycle 7
• SUBD F8, F6, F2 can start because the FP add/subtract unit is free (NB: the FP adder executes FP subtractions too)
• MULTD is stalled in the execution unit because F2 is not yet ready
• SUBD needs F2 too
Instruction status
    Instruction  dest  j    k    Issue  Read  Exec complete  Write result
    LD           F6    34   R2   1      2     3              4
    LD           F2    45   R3   5      6     7              -
    MULTD        F0    F2   F4   6      -     -              -
    SUBD         F8    F6   F2   7      -     -              -
    DIVD         F10   F0   F6   -      -     -              -
    ADDD         F6    F8   F2   -      -     -              -
Functional unit status
    Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
    -     Integer  Yes   Load  F2   -    R3   -        -        -    -
    -     Mult1    Yes   Mult  F0   F2   F4   Integer  -        No   Yes
    -     Mult2    No
    -     Add      Yes   Sub   F8   F6   F2   -        Integer  Yes  No
    -     Divide   No
Register result status
    F0: Mult1   F2: Integer   F8: Add

Cycle 8
• DIVD F10, F0, F6 can start because the FP divide FU is free; F0 is not yet available
• F2 is written (F2 available!), which allows MULTD and SUBD to read their operands during the next cycle
• F2 is written and therefore the integer unit is free (status updated at the end of the cycle)
Instruction status
    Instruction  dest  j    k    Issue  Read  Exec complete  Write result
    LD           F6    34   R2   1      2     3              4
    LD           F2    45   R3   5      6     7              8
    MULTD        F0    F2   F4   6      -     -              -
    SUBD         F8    F6   F2   7      -     -              -
    DIVD         F10   F0   F6   8      -     -              -
    ADDD         F6    F8   F2   -      -     -              -
Functional unit status
    Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
    -     Integer  Yes   Load  F2   -    R3   -        -        -    -
    -     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
    -     Mult2    No
    -     Add      Yes   Sub   F8   F6   F2   -        -        Yes  Yes
    -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status
    F0: Mult1   F8: Add   F10: Divide

Cycle 9 - 10
• N.B.: MULTD and SUBD can read their operands because F2 is available (see cycle 8)
• DIVD is still stalled because of F0
• ADDD cannot start because SUBD uses the adder FU
Instruction status
    Instruction  dest  j    k    Issue  Read  Exec complete  Write result
    LD           F6    34   R2   1      2     3              4
    LD           F2    45   R3   5      6     7              8
    MULTD        F0    F2   F4   6      9     -              -
    SUBD         F8    F6   F2   7      9     -              -
    DIVD         F10   F0   F6   8      -     -              -
    ADDD         F6    F8   F2   -      -     -              -
Functional unit status
    Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
    -     Integer  No
    10    Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
    -     Mult2    No
    2     Add      Yes   Sub   F8   F6   F2   -        -        -    -
    40    Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status
    F0: Mult1   F8: Add   F10: Divide

Cycle 11
• Note: the Add FU requires 2 cycles for the SUBD and therefore nothing happens in cycle 10, while MULTD still processes its data
• NB: ADDD will use the result of the SUBD but is not yet started because of SUBD (the FU is busy)
Instruction status
    Instruction  dest  j    k    Issue  Read  Exec complete  Write result
    LD           F6    34   R2   1      2     3              4
    LD           F2    45   R3   5      6     7              8
    MULTD        F0    F2   F4   6      9     -              -
    SUBD         F8    F6   F2   7      9     11             -
    DIVD         F10   F0   F6   8      -     -              -
    ADDD         F6    F8   F2   -      -     -              -
Functional unit status
    Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
    -     Integer  No
    8     Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
    -     Mult2    No
    0     Add      Yes   Sub   F8   F6   F2   -        -        -    -
    -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status
    F0: Mult1   F8: Add   F10: Divide

Cycle 12
• SUBD ends, freeing the FU: F8 is written and the ADD/SUB FU is freed. In the next period ADDD can start
Timing: FLD 1 cycle; FADD, FSUB 2 cycles; FMUL 10 cycles; FDIV 40 cycles
Instruction status
    Instruction  dest  j    k    Issue  Read  Exec complete  Write result
    LD           F6    34   R2   1      2     3              4
    LD           F2    45   R3   5      6     7              8
    MULTD        F0    F2   F4   6      9     -              -
    SUBD         F8    F6   F2   7      9     11             12
    DIVD         F10   F0   F6   8      -     -              -
    ADDD         F6    F8   F2   -      -     -              -
Functional unit status
    Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
    -     Integer  No
    7     Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
    -     Mult2    No
    -     Add      No
    -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status
    F0: Mult1   F10: Divide

Cycle 13
• Now ADDD can start because SUBD has finished its execution and has freed the FU
Timing: FLD 1 cycle; FADD, FSUB 2 cycles; FMUL 10 cycles; FDIV 40 cycles
Instruction status
    Instruction  dest  j    k    Issue  Read  Exec complete  Write result
    LD           F6    34   R2   1      2     3              4
    LD           F2    45   R3   5      6     7              8
    MULTD        F0    F2   F4   6      9     -              -
    SUBD         F8    F6   F2   7      9     11             12
    DIVD         F10   F0   F6   8      -     -              -
    ADDD         F6    F8   F2   13     -     -              -
Functional unit status
    Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
    -     Integer  No
    6     Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
    -     Mult2    No
    -     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
    -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status
    F0: Mult1   F6: Add   F10: Divide

Cycle 14
Timing: FLD 1 cycle; FADD, FSUB 2 cycles; FMUL 10 cycles; FDIV 40 cycles
Instruction status
    Instruction  dest  j    k    Issue  Read  Exec complete  Write result
    LD           F6    34   R2   1      2     3              4
    LD           F2    45   R3   5      6     7              8
    MULTD        F0    F2   F4   6      9     -              -
    SUBD         F8    F6   F2   7      9     11             12
    DIVD         F10   F0   F6   8      -     -              -
    ADDD         F6    F8   F2   13     14    -              -
Functional unit status
    Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
    -     Integer  No
    5     Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
    -     Mult2    No
    2     Add      Yes   Add   F6   F8   F2   -        -        -    -
    -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status
    F0: Mult1   F6: Add   F10: Divide

Cycle 15
• ADDD requires two cycles and therefore there is no system status change
Timing: FLD 1 cycle; FADD, FSUB 2 cycles; FMUL 10 cycles; FDIV 40 cycles
Instruction status
    Instruction  dest  j    k    Issue  Read  Exec complete  Write result
    LD           F6    34   R2   1      2     3              4
    LD           F2    45   R3   5      6     7              8
    MULTD        F0    F2   F4   6      9     -              -
    SUBD         F8    F6   F2   7      9     11             12
    DIVD         F10   F0   F6   8      -     -              -
    ADDD         F6    F8   F2   13     14    -              -
Functional unit status
    Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
    -     Integer  No
    4     Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
    -     Mult2    No
    1     Add      Yes   Add   F6   F8   F2   -        -        -    -
    -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status
    F0: Mult1   F6: Add   F10: Divide

Cycle 16
• ADDD has ended its EX stage, while MULTD keeps executing and DIVD keeps waiting for F0
Timing: FLD 1 cycle; FADD, FSUB 2 cycles; FMUL 10 cycles; FDIV 40 cycles
Instruction status
    Instruction  dest  j    k    Issue  Read  Exec complete  Write result
    LD           F6    34   R2   1      2     3              4
    LD           F2    45   R3   5      6     7              8
    MULTD        F0    F2   F4   6      9     -              -
    SUBD         F8    F6   F2   7      9     11             12
    DIVD         F10   F0   F6   8      -     -              -
    ADDD         F6    F8   F2   13     14    16             -
Functional unit status
    Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
    -     Integer  No
    3     Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
    -     Mult2    No
    -     Add      Yes   Add   F6   F8   F2   -        -        -    -
    -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status
    F0: Mult1   F6: Add   F10: Divide

Cycle 17
• NB: ADDD is stalled (it cannot write) because of a WAR with DIVD on F6
• DIVD does not read F6 because it waits for F0, produced by MULTD (operands are read in parallel)
• MULTD keeps executing; DIVD keeps waiting
Timing: FLD 1 cycle; FADD, FSUB 2 cycles; FMUL 10 cycles; FDIV 40 cycles
Instruction status
    Instruction  dest  j    k    Issue  Read  Exec complete  Write result
    LD           F6    34   R2   1      2     3              4
    LD           F2    45   R3   5      6     7              8
    MULTD        F0    F2   F4   6      9     -              -
    SUBD         F8    F6   F2   7      9     11             12
    DIVD         F10   F0   F6   8      -     -              -
    ADDD         F6    F8   F2   13     14    16             -
Functional unit status
    Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
    -     Integer  No
    2     Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
    -     Mult2    No
    -     Add      Yes   Add   F6   F8   F2   -        -        -    -     (stalled because of WAR on F6)
    -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status
    F0: Mult1   F6: Add   F10: Divide
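The WAR stall of ADDD can be expressed as a check on the functional unit table. The sketch below is an illustration under one assumption: an R flag set to yes means "operand available but not yet read", and the flags are cleared once the operands have been read (which is how the blank entries of the units that have already read are interpreted here). `can_write` is a hypothetical helper.

```python
def can_write(writer, units):
    """WAR rule: no other unit may still have to read the writer's destination."""
    dest = units[writer]["Fi"]
    for name, u in units.items():
        if name == writer:
            continue
        # A set R flag means the source is ready but has not been read yet.
        if (u["Fj"] == dest and u["Rj"]) or (u["Fk"] == dest and u["Rk"]):
            return False
    return True

# Busy units in cycle 17 (Mult1 and Add have already read their operands).
units = {
    "Mult1":  {"Fi": "F0",  "Fj": "F2", "Fk": "F4", "Rj": False, "Rk": False},
    "Add":    {"Fi": "F6",  "Fj": "F8", "Fk": "F2", "Rj": False, "Rk": False},
    "Divide": {"Fi": "F10", "Fj": "F0", "Fk": "F6", "Rj": False, "Rk": True},
}
addd_stalled = not can_write("Add", units)

# Once DIVD has read F0 and F6, its flags are cleared and ADDD may write F6.
units["Divide"]["Rk"] = False
addd_released = can_write("Add", units)
print(addd_stalled, addd_released)
```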

Cycle 18

MULTD is still executing; DIVD is still stalled.

Instruction status
Instruction  dest  j    k    Issue  Read Op  Ex complete  Write Result
LD           F6    34   R2   1      2        3            4
LD           F2    45   R3   5      6        7            8
MULTD        F0    F2   F4   6      9
SUBD         F8    F6   F2   7      9        11           12
DIVD         F10   F0   F6   8
ADDD         F6    F8   F2   13     14       16

Functional unit status
Clocks left  Name     Busy  Op    Fi   Fj   Fk   Qj     Qk   Rj   Rk
             Integer  No
1            Mult1    Yes   Mult  F0   F2   F4
             Mult2    No
             Add      Yes   Add   F6   F8   F2
             Divide   Yes   Div   F10  F0   F6   Mult1       No   Yes

Register result status (clock 18)
F0: Mult1, F6: Add, F10: Divide

Cycle 19

MULTD ends its execution (after 10 cycles) and will write in cycle 20; this will unblock DIVD and then ADDD.

Instruction status
Instruction  dest  j    k    Issue  Read Op  Ex complete  Write Result
LD           F6    34   R2   1      2        3            4
LD           F2    45   R3   5      6        7            8
MULTD        F0    F2   F4   6      9        19
SUBD         F8    F6   F2   7      9        11           12
DIVD         F10   F0   F6   8
ADDD         F6    F8   F2   13     14       16

Functional unit status
Name     Busy  Op    Fi   Fj   Fk   Qj     Qk   Rj   Rk
Integer  No
Mult1    Yes   Mult  F0   F2   F4
Mult2    No
Add      Yes   Add   F6   F8   F2
Divide   Yes   Div   F10  F0   F6   Mult1       No   Yes

Register result status (clock 19)
F0: Mult1, F6: Add, F10: Divide

Cycle 20

MULTD writes F0, unblocking DIVD.

Instruction status
Instruction  dest  j    k    Issue  Read Op  Ex complete  Write Result
LD           F6    34   R2   1      2        3            4
LD           F2    45   R3   5      6        7            8
MULTD        F0    F2   F4   6      9        19           20
SUBD         F8    F6   F2   7      9        11           12
DIVD         F10   F0   F6   8
ADDD         F6    F8   F2   13     14       16

Functional unit status
Name     Busy  Op    Fi   Fj   Fk   Qj   Qk   Rj   Rk
Integer  No
Mult1    No
Mult2    No
Add      Yes   Add   F6   F8   F2
Divide   Yes   Div   F10  F0   F6             Yes  Yes

Register result status (clock 20)
F6: Add, F10: Divide

Cycle 21

DIVD reads both F0 and F6 (F6 could not be overwritten by ADDD because of the WAR hazard). This unblocks ADDD, which can write F6 in the next cycle.

Instruction status
Instruction  dest  j    k    Issue  Read Op  Ex complete  Write Result
LD           F6    34   R2   1      2        3            4
LD           F2    45   R3   5      6        7            8
MULTD        F0    F2   F4   6      9        19           20
SUBD         F8    F6   F2   7      9        11           12
DIVD         F10   F0   F6   8      21
ADDD         F6    F8   F2   13     14       16

Functional unit status
Name     Busy  Op    Fi   Fj   Fk   Qj   Qk   Rj   Rk
Integer  No
Mult1    No
Mult2    No
Add      Yes   Add   F6   F8   F2
Divide   Yes   Div   F10  F0   F6

Register result status (clock 21)
F6: Add, F10: Divide

Cycle 22

Now ADDD can write F6, since the WAR hazard with DIVD has disappeared. ADDD writes F6 six cycles after its result became available (cycle 16).

Instruction status
Instruction  dest  j    k    Issue  Read Op  Ex complete  Write Result
LD           F6    34   R2   1      2        3            4
LD           F2    45   R3   5      6        7            8
MULTD        F0    F2   F4   6      9        19           20
SUBD         F8    F6   F2   7      9        11           12
DIVD         F10   F0   F6   8      21
ADDD         F6    F8   F2   13     14       16           22

Functional unit status
Name     Busy  Op    Fi   Fj   Fk   Qj   Qk   Rj   Rk
Integer  No
Mult1    No
Mult2    No
Add      No
Divide   Yes   Div   F10  F0   F6

Register result status (clock 22)
F10: Divide

Cycle 61

DIVD execution ends after 40 cycles.

Instruction status
Instruction  dest  j    k    Issue  Read Op  Ex complete  Write Result
LD           F6    34   R2   1      2        3            4
LD           F2    45   R3   5      6        7            8
MULTD        F0    F2   F4   6      9        19           20
SUBD         F8    F6   F2   7      9        11           12
DIVD         F10   F0   F6   8      21       61
ADDD         F6    F8   F2   13     14       16           22

Functional unit status
Name     Busy  Op    Fi   Fj   Fk   Qj   Qk   Rj   Rk
Integer  No
Mult1    No
Mult2    No
Add      No
Divide   Yes   Div   F10  F0   F6

Register result status (clock 61)
F10: Divide

Cycle 62

All executions have ended.

Instruction status
Instruction  dest  j    k    Issue  Read Op  Ex complete  Write Result
LD           F6    34   R2   1      2        3            4
LD           F2    45   R3   5      6        7            8
MULTD        F0    F2   F4   6      9        19           20
SUBD         F8    F6   F2   7      9        11           12
DIVD         F10   F0   F6   8      21       61           62
ADDD         F6    F8   F2   13     14       16           22

Functional unit status
Name     Busy
Integer  No
Mult1    No
Mult2    No
Add      No
Divide   No

Register result status (clock 62)
No register is waiting for a result.
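The clock cycles recorded in the walk-through can be reproduced with a small timing recurrence. The following is only a sketch under stated assumptions, not a cycle-accurate scoreboard: the names `scoreboard_schedule`, `LATENCY` and `UNIT` are hypothetical, one functional unit per class is modelled (enough for this sequence), integer base registers are assumed ready, and instructions are processed in program order.

```python
# Simplified scoreboard timing model (illustrative sketch, not real hardware).
LATENCY = {"LD": 1, "MULTD": 10, "SUBD": 2, "DIVD": 40, "ADDD": 2}
UNIT = {"LD": "Integer", "MULTD": "Mult", "SUBD": "Add",
        "DIVD": "Divide", "ADDD": "Add"}

PROGRAM = [  # (opcode, destination, FP source registers)
    ("LD", "F6", []),
    ("LD", "F2", []),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]


def scoreboard_schedule(program):
    unit_free = {}   # unit -> cycle in which its previous user wrote its result
    last_write = {}  # register -> write cycle of its most recent producer
    last_read = {}   # register -> latest operand-read cycle of earlier readers
    prev_issue = 0
    rows = []
    for op, dest, srcs in program:
        # Issue: in order, unit free (structural hazard), no pending WAW.
        issue = max(prev_issue + 1,
                    unit_free.get(UNIT[op], 0) + 1,
                    last_write.get(dest, 0) + 1)
        # Read operands: only after every producer has written the RF (RAW).
        read = max([issue + 1] + [last_write.get(s, 0) + 1 for s in srcs])
        done = read + LATENCY[op]
        # Write: only after earlier readers of dest have read it (WAR).
        write = max(done + 1, last_read.get(dest, 0) + 1)
        for s in srcs:
            last_read[s] = max(last_read.get(s, 0), read)
        unit_free[UNIT[op]] = write
        last_write[dest] = write
        prev_issue = issue
        rows.append((op, dest, issue, read, done, write))
    return rows


for row in scoreboard_schedule(PROGRAM):
    print(row)
```

Under these assumptions the last row shows ADDD completing execution in cycle 16 but writing only in cycle 22, one cycle after DIVD has read F6.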

Scoreboard limits
• Register values must in any case be read in parallel and only from the register file (which means that they must have been already stored in the registers: no RAW problem)
• An instruction can be emitted only if all previous instructions have been emitted

Example sequence with potential RAW, WAR and WAW hazards:
FDIV   F0, F2, F4
FADD   F6, F0, F8
FSTOR  F6, 0(R1)
FSUB   F8, F10, F14
FMUL   F6, F10, F8

N.B. The hazards of the sequence are only potential: their occurrence depends on the execution times of the instructions.
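The potential hazards of the sequence above can be enumerated mechanically by comparing the destination and source registers of every pair of instructions. This is a minimal sketch: the function name `potential_hazards` and the tuple encoding of the instructions are assumptions made for illustration, and the check is purely pairwise (it ignores timing, which is exactly why the hazards are only "potential").

```python
# Each instruction: (mnemonic, destination register or None, source registers).
SEQUENCE = [
    ("FDIV", "F0", ["F2", "F4"]),
    ("FADD", "F6", ["F0", "F8"]),
    ("FSTOR", None, ["F6", "R1"]),   # a store reads F6 and writes no register
    ("FSUB", "F8", ["F10", "F14"]),
    ("FMUL", "F6", ["F10", "F8"]),
]


def potential_hazards(program):
    """Return (earlier, later, kind, register) for every dependent pair."""
    hazards = []
    for i, (_, dest_i, srcs_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            _, dest_j, srcs_j = program[j]
            if dest_i is not None and dest_i in srcs_j:
                hazards.append((i, j, "RAW", dest_i))   # later reads earlier result
            if dest_j is not None and dest_j in srcs_i:
                hazards.append((i, j, "WAR", dest_j))   # later overwrites a source
            if dest_i is not None and dest_i == dest_j:
                hazards.append((i, j, "WAW", dest_i))   # both write the same register
    return hazards


for earlier, later, kind, reg in potential_hazards(SEQUENCE):
    print(f"{kind} on {reg}: {SEQUENCE[earlier][0]} -> {SEQUENCE[later][0]}")
```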

Renaming – Tomasulo Algorithm

«Renaming» indicates a location, different from the RF, where a requested datum is produced/stored and from which it can be obtained. The name «renaming» is used because it is as if the source registers of an instruction were renamed.

Tomasulo algorithm: «renaming» is based on the concept of «reservation stations», which are buffers of the functional units where instructions can be «parked» while waiting for the availability of the requested FU and of the needed data.
• A reservation station is a place of a FU where an instruction emitted from the instruction queue waits until the FU is free and the needed data arrive, as soon as they are produced (N.B. before being written in the RF). For its operands EITHER the source register data OR the reservation stations producing them are indicated (whence renaming). The renaming occurs at run-time.
• A reservation station captures a required operand exactly when and where it is produced (it does not wait until it is written, which avoids the register file access). This is similar to forwarding.
• When multiple writes to the same register occur (WAW, possible only if multiple busses between FUs and RF are available) only the most recently produced data are written (for each register a TAG is used, indicating the FU which has the right to write).

The following benefits occur:
• Hazard detection and execution control are distributed (not grouped as in the scoreboard): only the information stored in the reservation stations of each functional unit determines whether an instruction can execute in the FU, since the source (where the data is being produced, if not yet in the RF) and NOT the RF is indicated. RAW hazards are no longer a problem, since the requested data are provided as soon as they are produced. The same holds for WAR hazards (data are read by the reservation stations while they are written).
• Results are transferred directly to the reservation stations of the waiting FUs through the common data busses, without the need to read the RF (multiple reservation stations, in addition to the RF registers, can be accessed at the same time when multiple busses are available).

Tomasulo Algorithm

Tomasulo eliminates not only WAWs but also WARs.

Original sequence (possible WAW on F6 between the first FLD and FADD; possible WAR on F6 between FDIV and FADD):
FLD   F6, 32(R2)
FLD   F2, 44(R3)
FMUL  F0, F2, F4
FSUB  F8, F2, F6
FDIV  F10, F0, F6
FADD  F6, F8, F2

After renaming ([T] is the functional unit producing the data):
FLD   [T/F6], 32(R2)
FLD   F2, 44(R3)
FMUL  F0, F2, F4
FSUB  F8, F2, [T]
FDIV  F10, F0, [T]
FADD  F6, F8, F2

NB: When an instruction is inserted in a RS, it is checked whether one or more of its operands are being produced elsewhere by other RS: if so, the operand is renamed.

For the FADD a potential WAR with the FDIV could occur if FADD ended before FDIV has read its operands (F8 of FSUB and F2 of FLD were both immediately available for FADD). However, since FDIV points for F6 to the RS of FLD F6, 32(R2) and not to the RF, the problem does not occur. The same holds for FSUB. As far as the WAW on F6 between FLD and FADD is concerned, the mechanism guarantees that only the most recent instruction in the RS using a destination register can write that register.
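The renaming step performed at emission can be sketched as a lookup in a register status table. This is an illustration only: the function `issue_with_renaming`, the tag format (`FLD1`, `FMUL1`, ...) and the assumption that no instruction has written back yet (so every pending producer is renamed, F2 included) are not taken from the slides.

```python
PROGRAM = [  # (opcode, destination, FP source registers)
    ("FLD", "F6", []),
    ("FLD", "F2", []),
    ("FMUL", "F0", ["F2", "F4"]),
    ("FSUB", "F8", ["F2", "F6"]),
    ("FDIV", "F10", ["F0", "F6"]),
    ("FADD", "F6", ["F8", "F2"]),
]


def issue_with_renaming(program):
    """Rename each source to the tag of the station producing it, if any."""
    reg_status = {}  # register -> tag of the RS that will write it
    counters = {}    # opcode -> how many stations of that kind were used
    renamed = []
    for op, dest, srcs in program:
        counters[op] = counters.get(op, 0) + 1
        tag = f"{op}{counters[op]}"
        # A source still being produced is replaced by its producer's tag;
        # otherwise the register name stands for "value read from the RF".
        operands = [reg_status.get(s, s) for s in srcs]
        renamed.append((tag, dest, operands))
        # Only the most recent producer keeps the right to write dest (WAW).
        reg_status[dest] = tag
    return renamed, reg_status


renamed, status = issue_with_renaming(PROGRAM)
for tag, dest, operands in renamed:
    print(tag, dest, operands)
print(status)
```

In this sketch FDIV ends up pointing to `FLD1` rather than to F6, so a later write of F6 by FADD cannot disturb it, and the status table marks `FADD1` as the only station entitled to write F6.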

Tomasulo Algorithm

Very high performance without special compilers.

Differences with the scoreboard:
• Buffers and controls are distributed directly in the FUs (there is no centralized control): the buffers are called “reservation stations”
• Source register names are substituted by pointers to the buffers of the reservation stations (if the requested data are being produced there)
• “Renaming”: a direct pointer to the sources and not to the register
• One or more Common Data Buses (CDB) send the results to all the FUs requiring them
• Loads and stores are considered as FUs (a STORE can also be a source for a RS executing a LOAD)

Tomasulo Algorithm

In this example it is assumed that the MUL unit also executes the DIVs and that the ADD unit also executes the SUBs. LOADs and STOREs are handled like the other instructions.

In this example:
• 3 RS for add/sub
• 2 RS for mult/div
• 5 RS for store
• 5 RS for load
• Only one Common Data Bus, for the data produced by the FUs. Please notice that the same Common Data Bus is also used by the RS waiting for data

Each RS (more than one for each FU) stores an emitted instruction and, for each operand, one of two elements: either the operand value (i.e. read from the RF) or the name of the RS which is producing it (renaming).

Tomasulo Algorithm
• Load buffers are used to store the load addresses
• Store buffers contain the computed addresses and the data to be written in memory
• Loads and stores must be executed in sequence if they refer to the same addresses. In the other cases it is possible to anticipate the LOADs (never the STOREs)
• In the figure there are 3 phases (each of which can last several clocks):
  – Emission: the instructions are extracted in order from the general instruction queue when there is a free RS for the requested FU (the only condition); otherwise the instruction queue stalls. Operands are taken from the RF or from the producing FU, as indicated. In case of WAW it must be determined which instruction must provide the data
  – Execution: if one or more operands are not yet available, the CDB(s) must be monitored (data must be transferred over a bus anyway) in order to catch them (and their sources) as soon as they are available: RAW hazards are therefore avoided (we are sure not to read stale data from the RF)
  – Writeback: as soon as a datum is produced, it is transferred over one CDB (when more than one is available) to the RF and to the RS waiting for it
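The writeback phase can be sketched as a broadcast of a (tag, value) pair on the CDB. This is a hypothetical model for illustration: the `ReservationStation` class, its field names (mirroring Vj, Vk, Qj, Qk) and the `broadcast` function are assumptions, and the scenario in the usage lines is invented to show the WAW tag check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    vj: Optional[float] = None   # operand values, once captured
    vk: Optional[float] = None
    qj: Optional[str] = None     # tags of the stations producing the operands
    qk: Optional[str] = None

    def ready(self) -> bool:
        # Execution can start only when no operand is still pending.
        return self.busy and self.qj is None and self.qk is None


def broadcast(tag, value, stations, reg_file, reg_status):
    """Deliver a result over the CDB to waiting stations and to the RF."""
    for rs in stations:
        if rs.busy and rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.busy and rs.qk == tag:
            rs.vk, rs.qk = value, None
    # The RF is written only if this producer still owns the register tag:
    # an older result for a register renamed again is dropped (WAW).
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            reg_file[reg] = value
            del reg_status[reg]


# Hypothetical situation: a SUB waits for Load1 and Load2, while F6 has
# already been claimed by a younger instruction parked in Add2.
add1 = ReservationStation("Add1", busy=True, qj="Load1", qk="Load2")
reg_file = {}
reg_status = {"F6": "Add2", "F2": "Load2"}
broadcast("Load1", 1.5, [add1], reg_file, reg_status)
broadcast("Load2", 2.5, [add1], reg_file, reg_status)
print(add1, reg_file, reg_status)
```

After the two broadcasts the station holds both operands, F2 has been written, and F6 is left untouched because `Add2` still owns it.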

Tomasulo Algorithm

Let's see the scoreboard example in a Tomasulo architecture. Let's suppose that the execution times are the same as those of the scoreboard (FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles). NB: “+1” is for the writeback.

LD    F6, 34(R2)
LD    F2, 45(R3)
FMUL  F0, F2, F4
FSUB  F8, F6, F2
FDIV  F10, F0, F6
FADD  F6, F8, F2

Reservation Station
• Op: opcode of the instruction to be executed
• Vj, Vk: the source operands (read either from the RF or from the FUs producing them). If blank, the datum is being produced by the FU indicated in the corresponding Qj or Qk
• Qj, Qk: functional units producing the operands. A blank indicates that the source operand is already in Vj or Vk, or that it is not required
• Busy: the FU is busy
• Register File Status: indicates which FU will write the register (if any). A blank means that no instruction must write the register and therefore its value can be used directly

N.B. One instruction per clock is emitted from the general instruction queue when a RS of the FU required by that instruction is available; otherwise the queue stalls. In our example we assume only one CDB.

Cycle 0

Latencies: FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles (NB: “+1” is for the writeback).

NB. For LD (ST is not used here) there is a limited number of RS. Their BUSY status is displayed here differently from that of the FUs (see next slide). Load/store stations are not indicated in the status table.

Instruction status
Instruction  dest  j    k    Issue  Execution  Write
LD           F6    34   R2
LD           F2    45   R3
MULTD        F0    F2   F4
SUBD         F8    F6   F2
DIVD         F10   F0   F6
ADDD         F6    F8   F2

Reservation stations
Time  Name   Busy  Op  Vj  Vk  Qj  Qk
0     Add1   No
0     Add2   No
      Add3   No
0     Mult1  No
0     Mult2  No

• Vj, Vk (S1, S2): operand registers. If blank, the datum is being produced in the corresponding Q FU
• Qj, Qk (RS for j, RS for k): producing FU. If blank, the datum is in the RF
• For the sake of simplicity, Rj and Rk (ready/not ready) of the scoreboard are not displayed, since their values are implicit in the status of Qj and Qk

Register result status (clock 0): for each register F0, F2, F4, ... F31, the FU producing its new value
No register is waiting for a result.

Cycle 1

There are 5 RS for the LOADs, 3 RS for adder/sub and 2 RS for mul/div.
NB: here it is assumed that R2 and R3 are already available.

Instruction status
Instruction  dest  j    k    Issue  Execution  Write
LD           F6    34   R2   1
LD           F2    45   R3
MULTD        F0    F2   F4
SUBD         F8    F6   F2
DIVD         F10   F0   F6
ADDD         F6    F8   F2

Load buffers
Name   Busy  Address
Load1  Yes   34+R2

Reservation stations
Time  Name   Busy  Op  Vj  Vk  Qj  Qk
0     Add1   No
0     Add2   No
      Add3   No
0     Mult1  No
0     Mult2  No

Register result status (clock 1)
F6: Load1

Cycle 2

The second LD is emitted: one instruction per clock is emitted (when possible).
NB: a load takes 2 cycles: the first one for computing the address and the second one for reading the data.
N.B. A second LOAD has been emitted (not possible with the scoreboard) and parked in the RS. The value of R3 is already available in the RF.

Instruction status
Instruction  dest  j    k    Issue  Execution  Write
LD           F6    34   R2   1      2--
LD           F2    45   R3   2
MULTD        F0    F2   F4
SUBD         F8    F6   F2
DIVD         F10   F0   F6
ADDD         F6    F8   F2

Load buffers (5 RS for LOAD)
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3

Reservation stations
Name   Busy  Op  Vj  Vk  Qj  Qk
Add1   No
Add2   No
Add3   No
Mult1  No
Mult2  No

Register result status (clock 2)
F2: Load2, F6: Load1

Cycle 3

MULTD is emitted (a RS is free). MULTD can be emitted although F2 is NOT yet available: F2 is renamed to Load2. F4 is supposed to be already in the RF. A LD takes two cycles.

Instruction status
Instruction  dest  j    k    Issue  Execution  Write
LD           F6    34   R2   1      2--3
LD           F2    45   R3   2      3--
MULTD        F0    F2   F4   3
SUBD         F8    F6   F2
DIVD         F10   F0   F6
ADDD         F6    F8   F2

Load buffers
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3

Reservation stations
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
      Add1   No
      Add2   No
      Add3   No
10    Mult1  Yes   Mult      F4  Load2
      Mult2  No

Register result status (clock 3)
F0: Mult1, F2: Load2, F6: Load1

Cycle 4

SUBD is emitted (a RS is free). The data read from memory by LD F6, 34(R2) is written both in the RF and in the RS of SUBD, which captures it on the fly; F6 is available in the RF at the end of the cycle, and the load unit that produced it is freed at the end of the clock cycle. The FUs of the adder execute both sums and subtractions.

Instruction status
Instruction  dest  j    k    Issue  Execution  Write
LD           F6    34   R2   1      2--3       4
LD           F2    45   R3   2      3--4
MULTD        F0    F2   F4   3
SUBD         F8    F6   F2   4
DIVD         F10   F0   F6
ADDD         F6    F8   F2

Load buffers
Name   Busy  Address
Load2  Yes   45+R3

Reservation stations
Time  Name   Busy  Op    Vj             Vk  Qj     Qk
3     Add1   Yes   Sub   F6 (captured)             Load2
      Add2   No
      Add3   No
10    Mult1  Yes   Mult                 F4  Load2
      Mult2  No

Register result status (clock 4)
F0: Mult1, F2: Load2, F8: Add1

Cycle 5

DIVD is emitted (a RS is free). The datum read from memory by LD F2, 45(R3) is written both in register F2 and in the RS of SUBD and MULTD, which are waiting for it. The load unit that produced F2 is freed. The Time column shows the cycles yet to be executed for completing the execution.

Instruction status
Instruction  dest  j    k    Issue  Execution  Write
LD           F6    34   R2   1      2--3       4
LD           F2    45   R3   2      3--4       5
MULTD        F0    F2   F4   3
SUBD         F8    F6   F2   4
DIVD         F10   F0   F6   5
ADDD         F6    F8   F2

Reservation stations
Time  Name   Busy  Op    Vj          Vk          Qj     Qk
3     Add1   Yes   Sub   F6 (capt.)  F2 (capt.)
0     Add2   No
      Add3   No
10    Mult1  Yes   Mult  F2 (capt.)  F4
0     Mult2  Yes   Div               F6          Mult1       (waits for F0)

Register result status (clock 5)
F0: Mult1, F8: Add1, F10: Mult2

Cycle 6

ADDD is emitted (a RS is free). Now MULTD can execute (F2 and F4 are available). The Time column shows the cycles yet to be executed for completing the execution.

Instruction status
Instruction  dest  j    k    Issue  Execution  Write
LD           F6    34   R2   1      2--3       4
LD           F2    45   R3   2      3--4       5
MULTD        F0    F2   F4   3      6--
SUBD         F8    F6   F2   4      6--
DIVD         F10   F0   F6   5
ADDD         F6    F8   F2   6

Reservation stations
Time  Name   Busy  Op    Vj          Vk  Qj     Qk
2     Add1   Yes   Sub   F6 (capt.)  F2
      Add2   Yes   Add               F2  Add1        (waits for F8)
      Add3   No
9     Mult1  Yes   Mult  F2          F4
40    Mult2  Yes   Div               F6  Mult1       (waits for F0)

Register result status (clock 6)
F0: Mult1, F6: Add2, F8: Add1, F10: Mult2

Cycle 7

SUBD (like ADDD) takes two execution cycles. ADDD is stalled, waiting for F8 from SUBD. The data in F6 will be overwritten by ADDD, but it has already been read and is present in the RS of DIVD.

Instruction status
Instruction  dest  j    k    Issue  Execution  Write
LD           F6    34   R2   1      2--3       4
LD           F2    45   R3   2      3--4       5
MULTD        F0    F2   F4   3      6--
SUBD         F8    F6   F2   4      6--7
DIVD         F10   F0   F6   5
ADDD         F6    F8   F2   6

Reservation stations
Time  Name   Busy  Op    Vj          Vk          Qj     Qk
1     Add1   Yes   Sub   F6 (capt.)  F2 (capt.)
      Add2   Yes   Add               F2          Add1
      Add3   No
8     Mult1  Yes   Mult  F2          F4
40    Mult2  Yes   Div               F6          Mult1

Register result status (clock 7)
F0: Mult1, F6: Add2, F8: Add1, F10: Mult2

Cycle 8

NB: SUBD ends before MULTD and allows ADDD (which captures the result F8) to start executing. The FU that produced F8 is freed.

Instruction status
Instruction  dest  j    k    Issue  Execution  Write
LD           F6    34   R2   1      2--3       4
LD           F2    45   R3   2      3--4       5
MULTD        F0    F2   F4   3      6--
SUBD         F8    F6   F2   4      6--7       8
DIVD         F10   F0   F6   5
ADDD         F6    F8   F2   6

Reservation stations
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
0     Add1   No
2     Add2   Yes   Add   F8  F2
      Add3   No
7     Mult1  Yes   Mult  F2  F4
40    Mult2  Yes   Div       F6  Mult1

Register result status (clock 8)
F0: Mult1, F6: Add2, F10: Mult2

Cycle 9

ADDD is executing.

Instruction status
Instruction  dest  j    k    Issue  Execution  Write
LD           F6    34   R2   1      2--3       4
LD           F2    45   R3   2      3--4       5
MULTD        F0    F2   F4   3      6--
SUBD         F8    F6   F2   4      6--7       8
DIVD         F10   F0   F6   5
ADDD         F6    F8   F2   6      9--

Reservation stations
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
      Add1   No
2     Add2   Yes   Add   F8  F2
      Add3   No
6     Mult1  Yes   Mult  F2  F4
40    Mult2  Yes   Div       F6  Mult1

Register result status (clock 9)
F0: Mult1, F6: Add2, F10: Mult2

Cycle 10

ADDD completes its two execution cycles.

Instruction status
Instruction  dest  j    k    Issue  Execution  Write
LD           F6    34   R2   1      2--3       4
LD           F2    45   R3   2      3--4       5
MULTD        F0    F2   F4   3      6--
SUBD         F8    F6   F2   4      6--7       8
DIVD         F10   F0   F6   5
ADDD         F6    F8   F2   6      9--10

Reservation stations
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
      Add1   No
1     Add2   Yes   Add   F8  F2
      Add3   No
5     Mult1  Yes   Mult  F2  F4
40    Mult2  Yes   Div       F6  Mult1

Register result status (clock 10)
F0: Mult1, F6: Add2, F10: Mult2

Cycle 11

ADDD too ends before MULTD and DIVD. The FU that produced F6 is freed. The Time column shows the cycles yet to be executed for completing the execution.

Instruction status
Instruction  dest  j    k    Issue  Execution  Write
LD           F6    34   R2   1      2--3       4
LD           F2    45   R3   2      3--4       5
MULTD        F0    F2   F4   3      6--
SUBD         F8    F6   F2   4      6--7       8
DIVD         F10   F0   F6   5
ADDD         F6    F8   F2   6      9--10      11

Reservation stations
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
      Add1   No
0     Add2   No
      Add3   No
4     Mult1  Yes   Mult  F2  F4
40    Mult2  Yes   Div       F6  Mult1

Register result status (clock 11)
F0: Mult1, F10: Mult2

Cycle 12

DIVD is waiting for the data produced by MULTD. The Time column shows the cycles yet to be executed for completing the execution.

Instruction status
Instruction  dest  j    k    Issue  Execution  Write
LD           F6    34   R2   1      2--3       4
LD           F2    45   R3   2      3--4       5
MULTD        F0    F2   F4   3      6--
SUBD         F8    F6   F2   4      6--7       8
DIVD         F10   F0   F6   5
ADDD         F6    F8   F2   6      9--10      11

Reservation stations
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
      Add1   No
      Add2   No
      Add3   No
3     Mult1  Yes   Mult  F2  F4
40    Mult2  Yes   Div       F6  Mult1

Register result status (clock 12)
F0: Mult1, F10: Mult2

Cycle 15

MULTD ends its execution. DIVD is waiting for the data produced by MULTD.

Instruction status
Instruction  dest  j    k    Issue  Execution  Write
LD           F6    34   R2   1      2--3       4
LD           F2    45   R3   2      3--4       5
MULTD        F0    F2   F4   3      6--15
SUBD         F8    F6   F2   4      6--7       8
DIVD         F10   F0   F6   5
ADDD         F6    F8   F2   6      9--10      11

Reservation stations
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
      Add1   No
      Add2   No
      Add3   No
1     Mult1  Yes   Mult  F2  F4
40    Mult2  Yes   Div       F6  Mult1

Register result status (clock 15)
F0: Mult1, F10: Mult2

Cycle 16

MULTD writes F0 and its FU is freed. Now DIVD can execute.

Instruction status
Instruction  dest  j    k    Issue  Execution  Write
LD           F6    34   R2   1      2--3       4
LD           F2    45   R3   2      3--4       5
MULTD        F0    F2   F4   3      6--15      16
SUBD         F8    F6   F2   4      6--7       8
DIVD         F10   F0   F6   5
ADDD         F6    F8   F2   6      9--10      11

Reservation stations
Time  Name   Busy  Op   Vj  Vk  Qj  Qk
      Add1   No
0     Add2   No
      Add3   No
0     Mult1  No
40    Mult2  Yes   Div  F0  F6

Register result status (clock 16)
F10: Mult2

Cycle 56

DIVD ends its execution.

Instruction status
Instruction  dest  j    k    Issue  Execution  Write
LD           F6    34   R2   1      2--3       4
LD           F2    45   R3   2      3--4       5
MULTD        F0    F2   F4   3      6--15      16
SUBD         F8    F6   F2   4      6--7       8
DIVD         F10   F0   F6   5      17--56
ADDD         F6    F8   F2   6      9--10      11

Reservation stations
Time  Name   Busy  Op   Vj  Vk  Qj  Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
0     Mult2  Yes   Div  F0  F6

Register result status (clock 56)
F10: Mult2

Cycle 57

DIVD writes F10: all instructions have completed.

Instruction status
Instruction  dest  j    k    Issue  Execution  Write Result
LD           F6    34   R2   1      2--3       4
LD           F2    45   R3   2      3--4       5
MULTD        F0    F2   F4   3      6--15      16
SUBD         F8    F6   F2   4      6--7       8
DIVD         F10   F0   F6   5      17--56     57
ADDD         F6    F8   F2   6      9--10      11

Reservation stations
Name   Busy
Add1   No
Add2   No
Add3   No
Mult1  No
Mult2  No

Register result status (clock 57)
No register is waiting for a result.
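The issue, execution and write cycles of this Tomasulo walk-through follow a simple pattern, which the sketch below reproduces. It is a simplified model under explicit assumptions, not a full simulator: the names `tomasulo_schedule` and `EXEC_CYCLES` are hypothetical, a free RS is always available, no two results compete for the single CDB (true for this sequence), and a load is modelled with two execution cycles (address computation plus memory read), as in the tables above.

```python
# Execution cycles per opcode; one extra cycle is always spent on writeback.
EXEC_CYCLES = {"LD": 2, "MULTD": 10, "SUBD": 2, "DIVD": 40, "ADDD": 2}

PROGRAM = [  # (opcode, destination, FP source registers)
    ("LD", "F6", []),
    ("LD", "F2", []),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]


def tomasulo_schedule(program):
    """Return (op, dest, issue, exec_start, exec_end, write) per instruction."""
    last_write = {}  # register -> cycle in which its latest producer broadcasts
    rows = []
    # One instruction is emitted per clock, in program order.
    for issue, (op, dest, srcs) in enumerate(program, start=1):
        # Execution starts the cycle after issue, or the cycle after the last
        # operand has been captured from the CDB, whichever is later.
        start = max([issue + 1] + [last_write.get(s, 0) + 1 for s in srcs])
        end = start + EXEC_CYCLES[op] - 1
        write = end + 1
        # No WAR/WAW stall: operands were captured by tag at emission time.
        last_write[dest] = write
        rows.append((op, dest, issue, start, end, write))
    return rows


for row in tomasulo_schedule(PROGRAM):
    print(row)
```

In this model ADDD writes F6 in cycle 11 without waiting for DIVD, whereas the scoreboard had to delay the same write until cycle 22.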

A demo can be found at http://www.ecs.umass.edu/ece/koren/architecture/Tomasulo1/tomasulo_files/tomasulo.htm

Limits of Tomasulo Algorithm
• Very complex
• Each CDB must be connected to each RS
  – Complex cabling
  – Reducing the number of CDBs means reduced efficiency
• If a single CDB is present, only one instruction per cycle can end
• Out-of-order instruction completion!
• NO precise interrupts

Exceptions
• Exception/interrupt: non-programmed control transfer
  – The return address and all other information necessary to restore the interrupted situation must be saved
  – A «response» subroutine (handler) must be executed
• Two exception types: interrupt and trap
• Interrupts: external causes
  – The user program is interrupted and then restored
  – Asynchronous to the current process
  – Acknowledged at the end of the current instruction (if interrupts are enabled)
  – The handler is the responsibility of the user program
• Traps: internal causes
  – Exceptional conditions (overflow, division by zero etc.)
  – Errors (e.g. parity)
  – Page fault (or, see later, segment fault): data not available in memory
  – Synchronous to the current process
  – Handled by the operating system
  – An instruction can be interrupted during its execution (e.g. page fault) and therefore must be «restartable». The executing program is normally temporarily aborted.

Examples: Instruction Restart

Precise exceptions/interrupts
• Exceptions must be “precise”, that is, their behaviour must be the same that would occur in a “non-pipelined” architecture
• Precise: the machine status is saved as if the code had been executed up to the exception:
  – All preceding instructions must be terminated
  – All instructions following the instruction which provoked the exception must be handled as if they had never started
  – The same code must execute identically on different architectures
• Complex problem with pipelines, OOO execution (see later) etc.
• Scoreboard and Tomasulo have in-order emission but out-of-order execution (and therefore out-of-order termination)
• Precise exceptions (interrupts): in-order instruction commitment

Reorder Buffer (ROB)
• FIFO queue
• Stores pointers to all instructions in FIFO order as they are emitted. For the sake of simplicity we say that the instruction is virtually inserted in the ROB
• When instructions terminate, the results are stored in the ROB (instead of the RF), which also provides the operands to the other instructions which require them (renaming!)
• Commitment: the results of the instruction which has reached the top slot of the FIFO are transferred to the architectural registers (the registers which could be read by a test program)
• Easy “undo” of speculated instructions (see later), of erroneously predicted branches or of exceptions
• Automatic WAW avoidance

[Figure: FP Op Queue, ROB, FP Regs, reservation stations and FP adders; commitment goes from the ROB to the FP registers]
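The in-order commitment performed by the ROB can be sketched with a FIFO queue. This is a minimal illustration, not the hardware organisation: the class `ReorderBuffer`, its method names and the dictionary layout of an entry are assumptions.

```python
from collections import deque


class ReorderBuffer:
    """FIFO of emitted instructions; results leave it strictly in order."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()
        self.next_id = 1

    def emit(self, dest):
        """Reserve a slot at emission; return its id, or None if the ROB is full."""
        if len(self.entries) == self.size:
            return None
        entry = {"id": self.next_id, "dest": dest, "value": None, "done": False}
        self.next_id += 1
        self.entries.append(entry)
        return entry["id"]

    def complete(self, rob_id, value):
        """Store a result in the ROB (not in the RF) when execution ends."""
        for entry in self.entries:
            if entry["id"] == rob_id:
                entry["value"], entry["done"] = value, True

    def commit(self, reg_file):
        """Move results to the architectural registers, top slot first."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            if entry["dest"] is not None:
                reg_file[entry["dest"]] = entry["value"]
            committed.append(entry["id"])
        return committed


rob = ReorderBuffer(size=4)
regs = {}
first, second = rob.emit("F0"), rob.emit("F10")
rob.complete(second, 7.0)      # the younger instruction finishes first
print(rob.commit(regs), regs)  # nothing commits: the top slot is not done
rob.complete(first, 3.0)
print(rob.commit(regs), regs)  # both commit, in program order
```

Because the architectural registers change only in `commit`, an exception raised by the instruction in the top slot finds a register file that reflects exactly the preceding instructions.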

Tomasulo again

Tomasulo in 4 steps
• Emission: an instruction is emitted from the instruction queue when a RS and a ROB slot are available. The RS records the sources of the operands and the ROB slot where the instruction will be “parked” after its execution (this phase is called “dispatch”). The results are NOT written in the RF until the commitment phase. NB: the lack of one of the two conditions blocks the emission of the following instructions
• Execution: operands transformation. If the operands are not yet ready, they can be in the ROB (in this case the operand values computed by the nearest previous instructions are used) or still being computed in the FU. This phase is indicated as “issue”
• Result writeback: execution ends. The result is transmitted on the CDB to the RS waiting for it and to the ROB
• Commitment: the architectural registers (or memory) are updated with the results stored in the ROB when the instruction is at the top of the ROB FIFO (“graduation”). In case of an erroneously predicted branch the ROB results are just dropped

EMISSION IN ORDER, COMMITMENT IN ORDER

N.B. Sometimes several instructions can be committed simultaneously. If the destination is the same (unlikely, otherwise the compiler would have dropped the first one), the result of the most recent instruction is used.
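Dropping the ROB results after an erroneously predicted branch can be sketched as a filter on the ROB contents. The function `squash_after` and the entry layout are hypothetical, and the sketch assumes that slot ids grow monotonically with program order (no wrap-around of the circular queue).

```python
def squash_after(rob, branch_id):
    """Drop every ROB entry younger than the mispredicted branch.

    Returns the surviving entries and the ids of the dropped ones. Nothing
    has to be undone in the register file, because the dropped instructions
    never reached the commitment phase.
    """
    kept = [entry for entry in rob if entry["id"] <= branch_id]
    dropped = [entry["id"] for entry in rob if entry["id"] > branch_id]
    return kept, dropped


# Hypothetical ROB contents: a branch in slot 4 followed by speculated work.
rob = [
    {"id": 1, "instr": "LD F0, 10(R2)"},
    {"id": 2, "instr": "FADD F10, F4, F0"},
    {"id": 3, "instr": "FDIV F2, F10, F6"},
    {"id": 4, "instr": "BRNE F2, +100"},
    {"id": 5, "instr": "LD F4, 0(R3)"},
    {"id": 6, "instr": "FADD F0, F4, F6"},
    {"id": 7, "instr": "ST 0(R3), F4"},
]
kept, dropped = squash_after(rob, branch_id=4)
print([entry["id"] for entry in kept], dropped)
```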

HW with ROB
• The ROB is a circular queue
• The program counter is stored too (used e.g. for branches)

[Figure: Reorder Buffer entry fields: Program Counter, Valid (terminated), Exception?, Result, Destination Register; compare network; FP Op Queue, FP Regs, reservation stations and FP adders]

Example

LD    F0, 10(R2)     (3 cycles)
FADD  F10, F4, F0
FDIV  F2, F10, F6
BRNE  F2, +100
LD    F4, 0(R3)
FADD  F0, F4, F6
ST    0(R3), F4

Other latencies shown on the slide: 5 cycles, 20 cycles.

Tomasulo with ROB – cycle 1

FP Op queue: LD F0, 10(R2); FADD F10, F4, F0; FDIV F2, F10, F6; BRNE F2, +100; LD F4, 0(R3); FADD F0, F4, F6; ST 0(R3), F4

ROB (ROB1 is both the top and the end of the queue)
Slot  Dest  Source  Instruction      Completed?
ROB1  F0            LD F0, 10(R2)    N

Reservation stations
From memory: ROB position 1, address 10+R2 (M1)
FP adders: empty
FP multipliers: empty

Tomasulo with ROB – cycle 2

RAW on F0: renaming! FADD takes F0 from ROB1. There can also be two ROB sources. There are three slots for memory operations; a memory access takes 2 clocks.

ROB (top: ROB1, end: ROB2)
Slot  Dest  Source  Instruction               Completed?
ROB2  F10   ROB1    FADD F10, F4, F0 [ROB1]   N
ROB1  F0            LD F0, 10(R2)             Ex

Reservation stations
FP adders: ROB position 2, FADD F10, F4, ROB1
FP multipliers: empty
From memory: ROB position 1, address 10+R2 (M1)

Tomasulo with ROB – cycle 3

ROB (top: ROB1, end: ROB3)
Slot  Dest  Source  Instruction               Completed?
ROB3  F2    ROB2    FDIV F2, F10 [ROB2], F6   N
ROB2  F10   ROB1    FADD F10, F4, F0 [ROB1]   N
ROB1  F0            LD F0, 10(R2)             Ex

Reservation stations
FP adders: ROB position 2, FADD F10, F4, ROB1
FP multipliers: ROB position 3, FDIV F2, ROB2, F6
From memory: ROB position 1, address 10+R2 (M1)

Tomasulo with ROB – cycle 5

In cycle 4 (end of the first LD) FADD F10, F4, F0 started executing: F0 was captured on the fly and is no longer taken from the ROB. BRNE was emitted in cycle 4 in parallel with LD F4, 0(R3). ROB1 is completed and committed: F0 in the FP registers has been updated by the memory operation of ROB1. The later entries are not yet committed.

ROB (top: ROB2, end: ROB6)
Slot  Dest  Source  Instruction               Completed?
ROB6  F0    ROB5    FADD F0, F4 [ROB5], F6    N
ROB5  F4    --      LD F4, 0(R3)              Ex
ROB4  --    ROB3    BRNE F2 [ROB3], +100      N
ROB3  F2    ROB2    FDIV F2, F10 [ROB2], F6   N
ROB2  F10           FADD F10, F4, F0          Ex

Reservation stations
FP adders: ROB position 2, FADD F10, F4, F0; ROB position 6, FADD F0, ROB5, F6
FP multipliers: ROB position 3, FDIV F2, ROB2, F6
From memory: ROB position 5, address 0+R3 (M1)

Tomasulo with ROB – cycle 6

NB: ST can start its execution when LD F4, 0(R3) has terminated its execution, NOT when it is committed. F0 in the FP registers has been updated by the memory operation of ROB1.

ROB (top: ROB2, end: ROB7)
Slot  Dest  Source  Instruction               Completed?
ROB7                ST 0(R3), F4 [ROB5]       N
ROB6  F0    ROB5    FADD F0, F4 [ROB5], F6    N
ROB5  F4    --      LD F4, 0(R3)              Ex
ROB4  --    ROB3    BRNE F2 [ROB3], +100      N
ROB3  F2    ROB2    FDIV F2, F10 [ROB2], F6   N
ROB2  F10           FADD F10, F4, F0          Ex

Reservation stations
FP adders: ROB position 2, FADD F10, F4, F0; ROB position 6, FADD F0, ROB5, F6
FP multipliers: ROB position 3, FDIV F2, ROB2, F6
From memory: ROB position 5, address 0+R3 (M1)

Register Renaming
• When an emitted instruction must use a register, where can its value be found? In the ROB or in the RF? The entire ROB should be analysed to find the most recent slot (if any) whose destination is the required register: the instruction should point either to that slot or, if there is none, to the RF. This is a complex and slow procedure
• Solution: use a number of physical registers greater than the number of architectural registers (the registers known to the assembly language programmer, i.e. the ISA) and keep a pointer to the most recent one (possibly not yet architectural)
• Whenever an instruction inserted in the ROB must write a register (e.g. F17), it points to a new physical register associated with that register (F17), where the result will be temporarily stored. Any following instruction which must use register F17 will use that physical register
• At each commitment the pointer to the architectural register is set to the physical register linked to the committed instruction. When a new instruction regarding the same architectural register is committed, the pointer is changed again (and the physical register previously embodying the architectural register is freed)

An example with R2
  LD  R2, 10(R5)   ; R2-4 (destination)
  MUL R8, R2, R5   ; R2-4 (source)
  ADD R2, R9, R6   ; R2-5 (destination)
  DIV R2, R10      ; R2-6 (destination) and R2-5 (source)
  (here: commitment of the instruction using R2-2)
Circular queue of register R2: R2-0, R2-1, ..., R2-8
  R2-1: architectural register
  R2-2, R2-3: let us suppose they are already occupied by previous, not yet committed instructions
  R2-4: first free physical register associated with R2 (pointer to the first free register) when LD R2, 10(R5) is emitted
• When LD R2, 10(R5) is emitted, register R2-4 is given to it as destination; MUL will use it as source as soon as the new datum is computed. Now R2-2, R2-3 and R2-4 are «busy» and the first free register is R2-5. R2-2, R2-3 and R2-4 will be freed as soon as the related instructions end. If the commitment is "in-order" all hazards disappear. R2-1 is the architectural R2 register. R2-2 will become the architectural R2 register at the commitment of the related instruction. Busy registers are freed when no longer needed.
• There is no longer any distinction between register file and ROB locations. Normally there are 40-120 physical registers.
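The R2 example can be replayed with a small model of one circular queue per architectural register. This is a sketch under assumptions: the class `RegisterQueue` and its method names are invented for illustration, and the queue is treated as full when the next copy would overwrite the architectural one.

```python
class RegisterQueue:
    """Hypothetical circular queue of physical copies of one architectural register."""

    def __init__(self, name, size, arch, in_flight):
        self.name = name
        self.size = size
        self.arch = arch                  # index of the architectural copy
        self.in_flight = list(in_flight)  # busy copies, oldest first

    def label(self, index):
        return f"{self.name}-{index}"

    def newest(self):
        # Most recently assigned copy; falls back to the architectural one.
        return self.in_flight[-1] if self.in_flight else self.arch

    def source(self):
        """Copy that a newly emitted reader must use."""
        return self.label(self.newest())

    def allocate(self):
        """Copy given to a newly emitted writer; None means stall (queue full)."""
        nxt = (self.newest() + 1) % self.size
        if nxt == self.arch:
            return None
        self.in_flight.append(nxt)
        return self.label(nxt)

    def commit(self):
        """Oldest busy copy becomes architectural; the previous one is freed."""
        self.arch = self.in_flight.pop(0)
        return self.label(self.arch)


# State assumed on the slide: R2-1 architectural, R2-2 and R2-3 busy.
r2 = RegisterQueue("R2", size=9, arch=1, in_flight=[2, 3])
ld_dest = r2.allocate()    # LD  R2, 10(R5)
mul_src = r2.source()      # MUL R8, R2, R5
add_dest = r2.allocate()   # ADD R2, R9, R6
div_src = r2.source()      # DIV reads R2 ...
div_dest = r2.allocate()   # ... and writes a new copy
```

With these assumptions the labels come out as on the slide: R2-4 for LD and MUL, R2-5 for ADD and as the DIV source, R2-6 as the DIV destination.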

HW support for register renaming
• Free/busy register table. Two solutions: one pool of physical registers shared by all architectural registers, or one pool for each architectural register.
• Fast mapping between architectural and physical registers (at run time).
• A large number of physical registers.
• If no physical register is available (in the circular queue), the instruction is stalled. Emission is also blocked if no free ROB slot or no RS is available.
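The emission conditions in the last bullet amount to a simple check. The sketch below is illustrative only: `Resources` and `stall_reasons` are hypothetical names, and the `writes_register` flag is an added assumption (an instruction without a destination register would not need a physical register).

```python
from dataclasses import dataclass


@dataclass
class Resources:
    free_physical_regs: int
    free_rob_slots: int
    free_reservation_stations: int


def stall_reasons(res, writes_register=True):
    """Return the list of reasons why emission must stall (empty list = emit)."""
    reasons = []
    if writes_register and res.free_physical_regs == 0:
        reasons.append("no physical register")
    if res.free_rob_slots == 0:
        reasons.append("no ROB slot")
    if res.free_reservation_stations == 0:
        reasons.append("no reservation station")
    return reasons


can_emit = not stall_reasons(Resources(3, 2, 1))
```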

ROB «without Tomasulo»
• Instructions are emitted as soon as a free ROB slot and a physical destination register are available, using register renaming.
• For each FU there is a virtual queue whose slots point to the ROB slots that require that FU.
• The instructions of this queue are executed as soon as the required operands are available.
• When two instructions are ready for execution, the FIFO rule applies (so as to speed up the commitment, which is always in order).
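A possible reading of the per-FU virtual queue and the FIFO rule is sketched below. All names (`RobEntry`, `select_for_execution`) are hypothetical, the ROB is modelled as a plain list of entries that have not started executing, and slot numbers are assumed to grow with age order (wrap-around of the circular ROB is ignored).

```python
from dataclasses import dataclass, field


@dataclass
class RobEntry:
    slot: int        # ROB position; lower means older in this simplified model
    op: str
    fu: str          # functional unit required by the instruction
    waiting_on: set = field(default_factory=set)  # operands not yet available


def select_for_execution(rob, fu):
    """Pick the oldest ready entry that needs `fu`, or None if none is ready."""
    virtual_queue = [e for e in rob if e.fu == fu]   # views onto ROB slots
    ready = [e for e in virtual_queue if not e.waiting_on]
    return min(ready, key=lambda e: e.slot, default=None)


rob = [
    RobEntry(2, "FDIV", "mul", {"P0"}),   # still waiting for an operand
    RobEntry(3, "FMUL", "mul"),
    RobEntry(4, "FMUL", "mul"),
    RobEntry(5, "FADD", "add"),
]
chosen = select_for_execution(rob, "mul")   # the older of the two ready FMULs
```

Choosing the oldest ready instruction matters because commitment is in order: an old instruction that finishes late holds back everything behind it in the ROB.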

ROB and speculation
• Dynamic instruction execution with precise interrupts, which are checked at instruction commitment (always in order).
• Cancellation of speculative instructions when a branch is erroneously predicted:
Ø The prediction error must be revealed ASAP. Cancelling the erroneously executed post-branch instructions allows the instructions preceding the branch to keep executing. The erroneously executed instructions have not yet been committed.
Ø Early branch prediction avoids the execution of useless instructions (sometimes very time-expensive). It must be remembered that a misprediction causes not only the ROB flush but also the cancellation of all the instructions already in the pipeline.
Ø A separate Return Stack Buffer is needed for speculative calls (otherwise the stack could be damaged). It is a separate stack whose content is copied onto the stack if the branch has been correctly predicted as taken. All instructions following a not yet committed branch use this stack. In case of misprediction the RSB content is cancelled.
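The cancellation step can be sketched as removing every ROB entry younger than the mispredicted branch while leaving the older ones untouched. This is a simplified model: the function name `squash_after_branch` is hypothetical, the ROB is a deque ordered oldest first, and the release of renamed registers and reservation stations is not modelled. The entries reuse the ROB content of the cycle-6 slide purely as sample data.

```python
from collections import deque


def squash_after_branch(rob, branch_slot):
    """Remove all entries younger than the branch; return them oldest first."""
    if all(entry["slot"] != branch_slot for entry in rob):
        raise ValueError("branch is not in the ROB")
    squashed = []
    while rob[-1]["slot"] != branch_slot:
        squashed.append(rob.pop())   # youngest entries are at the right end
    squashed.reverse()
    return squashed


rob = deque([
    {"slot": 2, "op": "FADD F10, F4, F0"},
    {"slot": 3, "op": "FDIV F2, F10, F6"},
    {"slot": 4, "op": "BRNE F2, +100"},
    {"slot": 5, "op": "LD F4, 0(R3)"},
    {"slot": 6, "op": "FADD F0, F4, F6"},
    {"slot": 7, "op": "ST 0(R3), F4"},
])
cancelled = squash_after_branch(rob, branch_slot=4)
```

Because none of the cancelled entries has been committed, the architectural state is untouched and the older instructions (slots 2 to 4 here) simply continue.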

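Example - 1
  FLD  F4, 0(R10)
  FDIV F8, F0, F4
  FMUL F4, F2, F3
  FMUL F4, F4, F4
  FADD F6, F10, F4
  FLD  F4, 0(R5)
Hazards marked on the slide: WAW, RAW, WAW, RAW (F4 is the only register that is written and then read or rewritten, so all of them involve F4).
Same execution times as in the previous Tomasulo example.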
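The RAW and WAW dependences of this stream can be enumerated mechanically. The sketch below assumes the six-instruction stream FLD, FDIV, FMUL, FMUL, FADD, FLD suggested by the status tables and ROB contents of the following slides; `find_hazards` is a hypothetical helper that only compares each instruction with the nearest earlier writer of a register and does not look for WAR hazards.

```python
def find_hazards(stream):
    """Return (kind, register, earlier_index, later_index) for RAW and WAW pairs."""
    hazards = []
    last_writer = {}   # register -> index of the most recent instruction writing it
    for i, (op, dest, sources) in enumerate(stream):
        for src in dict.fromkeys(sources):   # de-duplicate, keep order
            if src in last_writer:
                hazards.append(("RAW", src, last_writer[src], i))
        if dest in last_writer:
            hazards.append(("WAW", dest, last_writer[dest], i))
        last_writer[dest] = i
    return hazards


# Assumed reconstruction of the example stream (destination, then sources).
stream = [
    ("FLD",  "F4", ["R10"]),
    ("FDIV", "F8", ["F0", "F4"]),
    ("FMUL", "F4", ["F2", "F3"]),
    ("FMUL", "F4", ["F4", "F4"]),
    ("FADD", "F6", ["F10", "F4"]),
    ("FLD",  "F4", ["R5"]),
]
hazards = find_hazards(stream)
```

Under these assumptions every reported hazard is on F4, which is why renaming F4 alone is enough to remove the WAW conflicts in the following slides.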

Tomasulo without ROB and with renaming (reservation stations)
The multiplication FU executes the divisions too.
CLOCK 0
Instruction status (Issue / Exe. Compl. / Write Result): nothing issued yet
  FLD  F4, 0(R10)
  FDIV F8, F0, F4
  FMUL F4, F2, F3
  FMUL F4, F4, F4
  FADD F6, F10, F4
  FLD  F4, 0(R5)
Load/store buffers (Busy, Address): Load1, Load2, Store1, Store2 free
Reservation stations (Time, Name, Busy, Op, Vj, Vk, Qj, Qk): Add1, Add2, Add3, Mult1, Mult2 free
Register result status (F2, F4, F6, F8, F10): no pending FU
Three RS for LOAD, 2 for STORE, 2 for MUL/DIV

CLOCK 1
Instruction status (Issue / Exe. Compl. / Write Result):
  FLD  F4, 0(R10): issue 1
Load/store buffers: Load1 busy, address R10; the others free
Reservation stations: Add1, Add2, Add3, Mult1, Mult2 free
Register result status: F4: Load1
Three RS for LOAD, 2 for STORE, 2 for MUL/DIV

CLOCK 2
Instruction status (Issue / Exe. Compl. / Write Result):
  FLD  F4, 0(R10): issue 1, exe. 2-
  FDIV F8, F0, F4: issue 2
Load/store buffers: Load1 busy, address R10; the others free
Reservation stations: Mult1 busy, Op div, Vj F0, Qk Load1; Add1, Add2, Add3, Mult2 free
Register result status: F4: Load1; F8: Mult1
Three RS for LOAD, 2 for STORE, 2 for MUL/DIV

CLOCK 3
Instruction status (Issue / Exe. Compl. / Write Result):
  FLD  F4, 0(R10): issue 1, exe. 2-3
  FDIV F8, F0, F4: issue 2
  FMUL F4, F2, F3: issue 3
Load/store buffers: Load1 busy, address R10; the others free
Reservation stations: Mult1 busy, Op div, Vj F0, Qk Load1; Mult2 busy, Op mul, Vj F2, Vk F3; Add1, Add2, Add3 free
Register result status: F4: Mult2; F8: Mult1
Three RS for LOAD, 2 for STORE, 2 for MUL/DIV

CLOCK 4
Instruction status (Issue / Exe. Compl. / Write Result):
  FLD  F4, 0(R10): issue 1, exe. 2-3, write 4
  FDIV F8, F0, F4: issue 2, exe. 4-
  FMUL F4, F2, F3: issue 3, exe. 4-
  FMUL F4, F4, F4: stalled for lack of a free RS until cycle 13 (end of the preceding multiplication; only two slots in the multiply FU). This blocks the emission of FADD, although there are free slots in the corresponding RS.
Load/store buffers: all free
Reservation stations: Mult1 busy, Op div, Vj F0, Vk F4, 39 cycles left; Mult2 busy, Op mul, Vj F2, Vk F3, 9 cycles left; Add1, Add2, Add3 free
Register result status: F4: Mult2; F8: Mult1
Three RS for LOAD, 2 for STORE, 2 for MUL/DIV

Example - 2
Same instruction stream:
  80000000: FLD  F4, 0(R10)
  80000004: FDIV F8, F0, F4
  80000008: FMUL F4, F2, F3
  8000000C: FMUL F4, F4, F4
  80000010: FADD F6, F10, F4
  80000014: FLD  F4, 0(R5)
Hazards marked on the slide: WAW, RAW, WAW, RAW.
ROB and register renaming. Instructions are in any case inserted in the ROB as soon as a free slot and a physical register (one of the many associated with the same architectural register) are available, and are then executed when the FU and the operands are available (the policy of all modern processors). In this way instructions not only terminate OOO (with results reordered in the ROB) but are also emitted even if the FU is not available. The execution is totally OOO, but with in-order commitment.

Initial situation
ROB (Addr, Op., Des, Src): slots 1-5 empty
RAT (Register Allocation Table): renaming registers for F4, F6 and F8
  F4: P0, P1, P2, P3, P4 Free; P5 Arch
  F6: Q0 Busy; Q1, Q2, Q3, Q4 Free; Q5 Arch
  F8: Z0 Busy; Z1, Z2, Z3, Z4 Free; Z5 Arch
  Top free registers of the circular queues: P0, Q1, Z1
Arch: these are the architectural registers, which a program monitor would display.
Busy: these are registers in use by not yet committed instructions. They will become architectural registers when the related instructions are committed.
Here we assume that the instruction using Z0 precedes the instruction using Q0. The RAT entries for R5, R10, F2 and F10 are not displayed.

CLOCK 1
Instruction status (Issue / Exe. Compl. / Write Result):
  FLD  F4, 0(R10): issue 1
Load/store buffers: Load1 busy, address R10; Load2, Load3, Store1, Store2 free
Reservation stations: Add1, Add2, Add3, Mult1, Mult2 free
ROB (Addr, Op., Des, Src):
  1: 80000000 FLD, P0, (0, R10)
Renaming: the first available register for F4 (P0) is used.
RAT:
  F4: P0 Busy; P1-P4 Free; P5 Arch
  F6: Q0 Busy; Q1-Q4 Free; Q5 Arch
  F8: Z0 Busy; Z1-Z4 Free; Z5 Arch

CLOCK 2
Instruction status (Issue / Exe. Compl. / Write Result):
  FLD  F4, 0(R10): issue 1, exe. 2-
  FDIV F8, F0, F4: issue 2
Load/store buffers: Load1 busy, address R10; the others free
Reservation stations: Mult1 busy, Op div, Vj F0, Qk P0; Add1, Add2, Add3, Mult2 free
ROB (Addr, Op., Des, Src):
  1: 80000000 FLD, P0, (0, R10)
  2: 80000004 FDIV, Z1, (F0, P0)
Renaming: P0 is the most recently attributed physical register for F4, so FDIV reads P0.
RAT:
  F4: P0 Busy; P1-P4 Free; P5 Arch
  F6: Q0 Busy; Q1-Q4 Free; Q5 Arch
  F8: Z0, Z1 Busy; Z2-Z4 Free; Z5 Arch

CLOCK 3
Instruction status (Issue / Exe. Compl. / Write Result):
  FLD  F4, 0(R10): issue 1, exe. 2-3
  FDIV F8, F0, F4: issue 2, waiting for F4 (P0)
  FMUL F4, F2, F3: issue 3
Load/store buffers: Load1 busy, address R10; the others free
Reservation stations: Mult1 busy, Op div, Vj F0, Qk P0; Mult2 busy, Op mul, Vj F2, Vk F3; Add1, Add2, Add3 free
ROB (Addr, Op., Des, Src):
  1: 80000000 FLD, P0, (0, R10)
  2: 80000004 FDIV, Z1, (F0, P0)
  3: 80000008 FMUL, P1, (F2, F3)
The previous instruction using Z0 has been committed: Z0 is now the architectural register.
RAT:
  F4: P0, P1 Busy; P2-P4 Free; P5 Arch
  F6: Q0 Busy; Q1-Q4 Free; Q5 Arch
  F8: Z0 Arch; Z1 Busy; Z2-Z5 Free

CLOCK 4
Instruction status (Issue / Exe. Compl. / Write Result):
  FLD  F4, 0(R10): issue 1, exe. 2-3, write 4 (ended but not yet committed!)
  FDIV F8, F0, F4: issue 2, exe. 4-
  FMUL F4, F2, F3: issue 3, exe. 4-
  FMUL F4, F4, F4: issue 4; not yet executable but nevertheless inserted in the ROB. It does not block the emission of the following instructions.
Load/store buffers: all free
Reservation stations: Mult1 busy, Op div, Vj F0, Vk P0, 39 cycles left; Mult2 busy, Op mul, Vj F2, Vk F3, 9 cycles left; Add1, Add2, Add3 free
ROB (Addr, Op., Des, Src):
  1: 80000000 FLD, P0, (0, R10)
  2: 80000004 FDIV, Z1, (F0, P0)
  3: 80000008 FMUL, P1, (F2, F3)
  4: 8000000C FMUL, P2, (P1, P1)
The instruction using Q0 has ended its execution: Q0 is now the architectural register.
RAT:
  F4: P0, P1, P2 Busy; P3, P4 Free; P5 Arch
  F6: Q0 Arch; Q1-Q5 Free
  F8: Z0 Arch; Z1 Busy; Z2-Z5 Free

CLOCK 5
Instruction status (Issue / Exe. Compl. / Write Result):
  FLD  F4, 0(R10): issue 1, exe. 2-3, write 4, committed
  FDIV F8, F0, F4: issue 2, exe. 4-
  FMUL F4, F2, F3: issue 3, exe. 4-
  FMUL F4, F4, F4: issue 4, waiting for F4 (P1)
  FADD F6, F10, F4: issue 5
Reservation stations: Add1 busy, Op add, Vj F10, Qk P2; Mult1 busy, Op div, Vj F0, Vk P0, 38 cycles left; Mult2 busy, Op mul, Vj F2, Vk F3, 8 cycles left; Add2, Add3 free
ROB (Addr, Op., Des, Src):
  2: 80000004 FDIV, Z1, (F0, P0)
  3: 80000008 FMUL, P1, (F2, F3)
  4: 8000000C FMUL, P2, (P1, P1)
  5: 80000010 FADD, Q1, (F10, P2)
FLD committed: the architectural register F4 is now P0.
RAT:
  F4: P0 Arch; P1, P2 Busy; P3-P5 Free
  F6: Q0 Arch; Q1 Busy; Q2-Q5 Free
  F8: Z0 Arch; Z1 Busy; Z2-Z5 Free