P6 Architecture
Computer Architecture M

PIPELINE
RAT = Register Allocation Table; ROB = ReOrder Buffer
• Stages, in order: IFU1, IFU2, IFU3, DEC1, DEC2, RAT (renaming), ROB, DIS (dispatcher, which issues the RISC-type μ-ops), EX, RET1, RET2
• Three main sections: bus interface management (in order), execution mechanism (out-of-order), results handling (in order)
• Timing labels in the figure: 8 clocks; variable number of clocks
• Compensation queues are inserted between the three main sections.
• The machine instructions are rotated in order to align them to the decoders.
• Superpipelined processor (the number of stages is greater than strictly necessary, in order to increase the clock frequency)

Behaviour
• Instruction extraction from the prefetch queue (a small set of instructions already extracted from the cache)
• Instruction decoding and alignment (in order)
• Translation of the machine instructions into RISC μ-operations (μ-ops) of fixed length, 118 bits (in order)
• Insertion of the μ-operations into the ROB (in order)
• Out-of-order execution of the μ-operations, to optimize the use of the functional modules
• In-order transfer of the results to the machine registers (commitment)

Pipeline stages
• IFU1 (Instruction Fetch Stage 1): loads the 2 x 16 = 32 bytes buffer directly from the L1 cache. While one half of the buffer transfers data to IFU2, the other is loaded from L1.
• IFU2 (Instruction Fetch Stage 2): detects the (CISC) instruction boundaries for IFU3. If a branch is detected, it is forwarded to the BTB.
• IFU3 (Instruction Fetch Stage 3): sends the instructions to the appropriate decoders (see later).
• DEC1 (Decoder Stage 1): transforms the machine instructions into μ-operations (118 bits wide). Up to three IA-32 instructions per clock can be processed. For very complex machine instructions a sequencer is used. A μ-operation consists of two sources and one destination plus the op-code (RISC).
• DEC2 (Decoder Stage 2): transfers the μ-operations to the decoded instruction queue. For very complex instructions (for instance string instructions), handled by the Micro Instruction Sequencer, many clocks may be required to complete the operation, since the μ-instruction queue accepts up to 3 elements per clock. The stage includes a second BTB (static, see later).
• RAT (Register Allocation Table): 40 additional registers which can be globally allocated.

Pipeline stages (continued)
• ROB: loads three μ-operations per clock into its buffer. A μ-operation is inserted into the RS queue (Reservation Station of the required functional unit) if all the data it requires are already available (produced by preceding μ-operations in the ROB or already present in the machine registers) and the RS queue has a free slot. Here the RS differs from Tomasulo's: it holds only ready μ-operations, that is, those whose operands are already available.
• DIS (DISpatch Stage): if a μ-operation could not be inserted into the RS in the previous clocks because the necessary data or slots were missing, it is inserted as soon as the required conditions are met.
• EX (EXecution Stage): executes the μ-operation in the functional modules. The number of clocks needed depends on the μ-operation. Several μ-operations are executed in a single clock period.
• RET1 (RETirement Stage 1): when a μ-operation has been executed and all the preceding conditional branches have been resolved, a ready-for-retirement tag is attached to it.
• RET2 (RETirement Stage 2): transfers the results to the architectural destination registers once all the preceding machine-level instructions have been committed. Up to 3 μ-ops per clock are retired.

IFU1-IFU2 stages
• IFU1 transfers a 16-byte line from the L1 cache to the prefetch queue.
• IFU2 detects the instruction boundaries within a 16-byte block (half a cache line).
• In IFU2 the address of any conditional branch is forwarded to the BTB (physical addresses!). Up to 4 addresses can be analyzed in parallel by the BTB.
• Initially the BTB is obviously empty; it is updated for each decision taken.
• If the branch is predicted as taken, the following instructions already loaded in the prefetch buffer are removed and the buffer is reloaded with the instructions at the destination. If the branch is predicted as not taken, nothing changes.
• When the branch is executed in the Jump Execution Unit, nothing happens if it was correctly predicted. Otherwise all the following μ-ops in the ROB are cancelled together with their results, and so are all the other instructions already in the pipeline. The prefetch buffer is emptied and reloaded with the correct instruction sequence.

Branch
• The BTB is a 4-way set-associative cache with 512 entries (for each index, 4 physical branch addresses are handled).
• The prediction algorithm is two-level: for each BTB entry there is a 4-bit register which stores the behaviour of the last occurrences of the branch at that address (BHT).
• A further buffer exists in the P6, the Return Stack Buffer (RSB), which stores the return addresses of speculated subroutine calls. When a call is speculated (executed before being at the top of the instruction queue), it is not yet certain that it must really be executed, since a previous branch could change the instruction flow. In that case the stack would have been «corrupted». The contents of the RSB are transferred to the real stack as soon as the call is actually executed.
• The RSB consists of 8 entries.
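
The two-level idea can be illustrated with a small sketch. The slides only state that each BTB entry has a 4-bit history register; the table of sixteen 2-bit saturating counters selected by that history is an assumption made here for illustration, and the names `TwoLevelEntry` and `count_hits` are hypothetical.

```python
class TwoLevelEntry:
    """One BTB entry: a 4-bit history plus 16 two-bit counters (assumed layout)."""

    HISTORY_BITS = 4

    def __init__(self):
        self.history = 0  # last 4 outcomes of this branch, 1 = taken
        # One 2-bit counter per possible history value, starting "weakly not taken".
        self.counters = [1] * (1 << self.HISTORY_BITS)

    def predict(self):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self.history] >= 2

    def update(self, taken):
        # Train the counter selected by the current history, then shift the history.
        value = self.counters[self.history]
        self.counters[self.history] = min(3, value + 1) if taken else max(0, value - 1)
        mask = (1 << self.HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def count_hits(entry, outcomes):
    """Predict each outcome, then train on it; return the number of correct predictions."""
    hits = 0
    for taken in outcomes:
        if entry.predict() == taken:
            hits += 1
        entry.update(taken)
    return hits
```

With a branch that repeats the pattern taken, taken, taken, not taken, every 4-bit history is followed by a single outcome, so once the counters are trained the sketch predicts the whole pattern correctly.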

Pipeline
• In order: IFU1 (prefetch buffer), IFU2 (instruction length detection, Branch Target Buffer), IFU3 (alignment for the decoding), DEC1 (decoders), DEC2 (decoder queue, up to 6 μ-ops/clock), RAT, ROB
• Out of order: DIS, EX
• In order: RET1, RET2
• Compensation queues are needed because the stages work at different speeds.
• In the ROB the μ-ops are stored in order, executed out of order and retired in order.
• Functionally this pipeline is triple.

IFU3 stage
• It prepares the instructions for the three decoders of stage DEC1.
• Using the «markers» inserted into the 16-byte block by IFU2, IFU3 rotates the three IA instructions, if needed, so as to align them for the next stage.
• If the three instructions are «simple», no rotation is needed and they are forwarded to the three decoders unchanged.
• If among the three instructions there are one «complex» and two «simple», a rotation takes place so as to align the «complex» one to decoder 0.
• If there are two (or more) «complex» instructions, the compiler-generated instruction sequence is not optimal and the operations take place in sequence.
Instruction types
• Simple (converted into a single μ-operation): register to register, memory read, etc.
• Complex-2 (converted into 2 μ-operations): memory write, read/modify, register-memory (sometimes requiring 3 μ-operations)
• Complex-3: MMX
• Complex-4: read/modify/write (e.g. add [BP], bx)
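
The alignment rule above can be sketched as follows. This is only an illustration of the rule as stated in the slide: the function name `align_to_decoders`, the representation of an instruction by its μ-op count, and the choice of a left rotation are assumptions.

```python
def align_to_decoders(uop_counts):
    """Return instruction indices in decoder order (decoder 0 first).

    uop_counts holds, in program order, the number of micro-ops each of the
    (up to three) instructions decodes into; more than one micro-op means
    "complex". Returns None when two or more instructions are complex, the
    case in which the slide says the operations take place in sequence.
    """
    complex_positions = [i for i, n in enumerate(uop_counts) if n > 1]
    if len(complex_positions) > 1:
        return None
    order = list(range(len(uop_counts)))
    if complex_positions:
        k = complex_positions[0]
        # Rotate (not swap) so that the complex instruction reaches decoder 0.
        order = order[k:] + order[:k]
    return order
```

For example, `align_to_decoders([1, 4, 1])` sends instruction 1 to decoder 0 and the two simple instructions to decoders 1 and 2.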

Decoding
• Fetch and aligning (IFU1-IFU2-IFU3): 16 bytes are delivered to DEC1
• DEC1: decoder 0 (complex), decoder 1 (simple), decoder 2 (simple), plus the Micro Instruction Sequencer (MIS); (4+1+1 = 6) x 118 bits of decoded μ-operations
• DEC2: queue (up to 6 μ-ops); 3 x 118 bits towards the RAT
• RAT: 3 x 118 bits towards the ROB
• ROB: 40 slots in the Pentium II, loaded with at most 3 μ-ops per clock
• RS1 to RS5: queue of 20 μ-ops for the Reservation Stations; from the RS to the FUs

DEC1 and DEC2 stages
DEC1
• Decoder 0 (complex) decodes IA instructions into 1 to 4 μ-ops; decoders 1 and 2 (simple) decode IA instructions into 1 μ-op each.
• Decoder 0 is able to convert in a single clock a complex instruction not longer than 7 bytes, generating at most 4 μ-operations.
• Decoders 1 and 2 are able to convert in a single clock a «simple» instruction not longer than 7 bytes, generating at most 1 μ-operation.
• Up to 6 μ-operations per clock can be generated.
• In all other cases the MIS is used. The Micro Instruction Sequencer is a ROM which stores the μ-operations associated with each complex IA instruction that cannot be decoded in a single clock period.
• The generated sequences (max 6 μ-ops per clock) are fed directly into stage DEC2.
• If the decoded instruction is a JMP, the instruction queue is immediately emptied and reloaded.
DEC2
• The static BTB (see next slide) is activated if among the μ-operations of the preceding clock there is a branch μ-op not handled by the dynamic BTB (not detected as a branch). It must be noted that here the instructions are already of RISC type: two sources and one destination!
• The μ-ops are queued in the same order as they were produced. The queue has 6 slots.

Static BTB
• The P6 uses a static BTB in stage DEC2 (the stage which decodes the op-code of the μ-ops). It handles the branches not present in the dynamic BTB.
• It is "static" because it uses fixed rules which do not depend on the previous instruction history.
• Decision tree:
  - IP relative? no: taken
  - IP relative, conditional? no: taken
  - IP relative, conditional, backward? yes: taken; no: not taken
• The static prediction includes the evaluation of the destination address too.
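
A minimal sketch of such a static rule, assuming the usual reading of the decision tree: branches that are not IP-relative and unconditional branches are predicted taken, backward conditional branches are predicted taken, forward conditional branches are predicted not taken. The function name and its parameters are illustrative, not taken from the slides.

```python
def static_predict(ip_relative, conditional, branch_address, target_address):
    """Return True when the branch is statically predicted as taken."""
    if not ip_relative:
        return True  # not IP-relative: predicted taken (assumed reading of the tree)
    if not conditional:
        return True  # unconditional IP-relative jump: always taken
    # Conditional branch: backward (typically a loop) is predicted taken,
    # forward is predicted not taken.
    return target_address < branch_address
```

A loop-closing branch at address 0x1010 that jumps back to 0x1000 is predicted taken, while a conditional branch at 0x1010 that skips ahead to 0x1040 is predicted not taken.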

RAT stage
Register Allocation Table (register renaming): the RAT maps the architectural registers EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP onto the renamed registers 0, 1, 2, ..., 39.
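
Register renaming can be sketched as below. The class and method names are hypothetical, and the sketch assumes a simple free list from which each destination takes the next renamed register; how the P6 recycles renamed registers is not covered by the slide and is left out.

```python
class RegisterAllocationTable:
    """Maps architectural registers onto a pool of renamed registers (40 by default)."""

    def __init__(self, n_renamed=40):
        self.free = list(range(n_renamed))  # renamed registers not yet in use
        self.mapping = {}                   # architectural name -> latest renamed register

    def rename(self, dest, sources):
        """Rename one micro-op; return (renamed destination, renamed sources).

        A source with no pending writer keeps its architectural name.
        Raises IndexError when no renamed register is free (recycling is omitted).
        """
        renamed_sources = [self.mapping.get(s, s) for s in sources]
        new_reg = self.free.pop(0)
        self.mapping[dest] = new_reg
        return new_reg, renamed_sources
```

Two successive writes to EAX receive different renamed registers, so the second write no longer has to wait for readers of the first value.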

ROB stage
• The μ-ops, with their registers renamed in the RAT stage, are stored in order, three per clock, in the ReOrder Buffer, which has 40 slots (many more in modern processors, which however derive from the P6 architecture).
• The Reservation Station (the unit which handles the availability of the functional units) extracts up to 5 μ-ops per clock from the ROB (there are 5 ports, that is, buses towards the RS) and stores them in a buffer with 20 slots, from which they are forwarded to the execution units.
• After execution the μ-ops are stored back into the ROB together with their results. In the ROB there are two pointers: one to the «oldest» μ-op not yet retired and one to the first free slot (if any) where new μ-ops can be stored.
• The μ-ops are always "committed" three at a time, in order. This entails that no μ-op is committed before every preceding branch has been resolved.
• The ROB can be viewed as a "40-instruction window".
NB: Very often the «ports» are shared by several functional units. The ports are the buses which link, for instance, the ROB with the FUs, and they always require a lot of space in the IC.
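
The two-pointer organisation can be sketched as a circular buffer. The names below are hypothetical, and the sketch retires up to three completed μ-ops per call (as in the RET2 description) rather than exactly three.

```python
class ReorderBuffer:
    """Circular buffer with a pointer to the oldest entry and one to the first free slot."""

    def __init__(self, size=40, retire_width=3):
        self.size = size
        self.retire_width = retire_width
        self.slots = [None] * size
        self.head = 0   # oldest micro-op not yet retired
        self.tail = 0   # first free slot
        self.count = 0

    def allocate(self, name):
        """Store a new micro-op in order; return its slot, or None if the ROB is full."""
        if self.count == self.size:
            return None
        slot = self.tail
        self.slots[slot] = {"name": name, "done": False}
        self.tail = (self.tail + 1) % self.size
        self.count += 1
        return slot

    def complete(self, slot):
        """Mark a micro-op as executed (execution may finish out of order)."""
        self.slots[slot]["done"] = True

    def retire(self):
        """Retire completed micro-ops strictly in order, at most retire_width per call."""
        retired = []
        while (self.count > 0 and len(retired) < self.retire_width
               and self.slots[self.head]["done"]):
            retired.append(self.slots[self.head]["name"])
            self.slots[self.head] = None
            self.head = (self.head + 1) % self.size
            self.count -= 1
        return retired
```

If the second and third μ-ops complete before the first, `retire()` returns nothing until the oldest one has completed too; this is the in-order commitment described above.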

EX Stage (5 functional units only)
• ROB to Reservation Station (RS, 20 slots): 5 μ-ops
• Port 2: Load Unit (typical instruction: MOV EAX, Mem)
• Port 3: Store Address Unit; Port 4: Store Data Unit (typical instruction: MOV Mem, EAX)
• Ports 0 and 1: Integer Unit, FP Unit, Jump Execution Unit; the units marked «same port» in the figure share one port (typical instructions: INC EAX, FMUL ST0, FDIV ST1, JMP xxxx)

Instructions and μ-ops execution
• IFU1, IFU2, IFU3 (3 CK): prefetch of 16 bytes
• DEC1, DEC2 (2 CK): decoders 0, 1, 2 and MIS, feeding the ID queue (slots 0 to 5)
• RAT, ROB (2 CK): ROB slots 0 to 39 (actually the size depends on the processor)
• Fields of each ROB slot: status, memory address of the first byte of the corresponding IA instruction, μ-op op-code, renamed register (RAT)

μ-ops in the ROB
• States of the μ-ops in the ROB:
  - SD: scheduled for execution. The μ-op has been inserted in the RS queue but not yet sent to the FU
  - DP: dispatchable. It is in "pole position" in the EU queue
  - EX: executing. It is being executed
  - WB: write back. It is about to be written back into the ROB after execution; this unblocks other μ-ops stalled waiting for its result
  - RR: ready for retirement. The μ-op can be retired
  - RT: retired. The μ-op is being retired
• Memory address: the memory address of the first byte of the IA-32 instruction corresponding to the μ-op(s). The address field of the following μ-ops of the same instruction is empty (an IA-32 instruction can correspond to many μ-ops). An address, therefore, signals a new IA-32 instruction.
• μ-op type: branch or non-branch
• Allocation register: one of the 40 allocation registers
• It must be noted that in case of an exception a flag is inserted into the last μ-op of the instruction: the exception is handled only when the μ-ops of the instruction have been retired. All preceding μ-ops have already been retired (precise interrupt).
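
The states and the role of the address field can be sketched as follows. The linear progression SD, DP, EX, WB, RR, RT is an assumption drawn from the order in which the states are listed; the names `UopState`, `next_state` and `count_instructions` are illustrative.

```python
from enum import Enum


class UopState(Enum):
    SD = "scheduled for execution"
    DP = "dispatchable"
    EX = "executing"
    WB = "write back"
    RR = "ready for retirement"
    RT = "retired"


# Assumed lifecycle: each micro-op moves through the states in this order.
LIFECYCLE = [UopState.SD, UopState.DP, UopState.EX,
             UopState.WB, UopState.RR, UopState.RT]


def next_state(state):
    """Return the state that follows `state`, or None after RT."""
    i = LIFECYCLE.index(state)
    return LIFECYCLE[i + 1] if i + 1 < len(LIFECYCLE) else None


def count_instructions(address_fields):
    """Count IA-32 instructions in a run of ROB slots.

    A slot carries an address only for the first micro-op of an instruction;
    the following micro-ops of the same instruction have an empty field (None).
    """
    return sum(1 for address in address_fields if address is not None)
```

For six consecutive slots whose address fields are 02000000, 02000001, empty, empty, 02000003 and 02000004, `count_instructions` reports four IA-32 instructions.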

RESET: initial jump
• Figure: pipeline snapshot at RESET (prefetch streaming buffer of 32 bytes, decoders, MIS, ID queue, RAT/ROB with slots 0 to 39 and the fields status, memory address, μ-op op-code, renamed register)
• The prefetch streaming buffer holds the jump (8 bytes); v = valid code byte, i = invalid byte
• N.B. The dynamic BTB is obviously unable to predict the branch.

RESET – IFUi stages
• Figure: the prefetch streaming buffer (IFU1), which stores 32 bytes (a cache line), with the first instruction boundary marked after the jump; v = valid byte, i = not significant byte. Address labels in the figure: FFFF:F and FFFFFFF:0
• The first instruction is always a backward jump (instruction present in IFU1).
• In IFU2 the first instruction boundary is detected (8 bytes). The remaining 24 bytes contain other, not significant, instructions.
• In IFU3 the first instruction is aligned to decoder 0.
NB: Each clock a 32-byte line is read by IFU1. In case of a «pipeline traffic jam» caused by the decoders, the pipeline stalls.

RESET (figure): pipeline snapshot (prefetch streaming buffer, decoders, MIS, ID queue, RAT/ROB slots 0 to 39) showing the JMP instruction.

RESET – DECi stages
• The detected instructions are decoded by DEC1.
• DEC1 transforms the JMP into a jump μ-op (in the P6 all jumps are transformed into branches taken).
• The instructions in the stages from IFU1 to DEC2 are discarded. This causes a stall in the pipeline, which must reload the instructions from the jump address. The μ-op is stored in the queue of the decoded instructions.

RESET (figure): pipeline snapshot (prefetch streaming buffer, decoders, MIS, ID queue, RAT/ROB slots 0 to 39) showing the branch μ-op, with a RESET label on the front-end stages.

RESET – RAT stage
• The μ-op is extracted from the queue of the decoded instructions (which still preserves the initial order) and passed to the RAT stage for possible register assignment (not used for a branch).

RESET (figure): pipeline snapshot (prefetch streaming buffer, ID queue, decoders, MIS, RAT/ROB slots 0 to 39) showing the branch μ-op, with a RESET label on the front-end stages.

RESET – ROB and RS stage
• The μ-op is then sent to the first free ROB slot (normally three μ-ops are transferred in order to the ROB, if there are free slots).
• From the ROB the μ-op is then sent to the RS queue (4 x 5 slots) as soon as a slot for its FU is available. This operation can be done in parallel with the previous one if there are slots available. This is the case for the first instruction at RESET.

RESET (figure): pipeline snapshot (prefetch streaming buffer, ID queue, decoders, MIS, RAT/ROB) in which ROB slot 0 holds the branch μ-op (memory address label FFFF 0 in the figure; renamed register: none).

RESET – execution and retirement
• After a branch has been executed, the RS informs the BTB so that the prediction can be updated.
• After execution the μ-op is tagged as «executed» in the ROB. If a μ-op produces a result (typically a register value) for another (stalled) μ-op waiting for it, the status of the waiting μ-op becomes "ready" in the ROB and it is inserted in the RS as soon as a slot is free.
• Three μ-ops are retired per clock, in order among themselves too.

Instructions execution
ROB snapshot: 40 slots (0 to 39), ROB start at slot 13. Columns: slot, status, memory address, μ-operation (branch or non-branch), renamed register. Slots described in the following slides:
Slot  Status  Mem. addr.
0     EX      020000042
1     RR      020000044
2     EX      020000045
7     RR      020000057  (branch μ-op)
13    RT      020000000
14    RT      020000001
15    RT
16    RR
17    RR      020000003
18    RR      020000004
19    EX      02000000A
20    WB      02000000C
21    RR      02000000F
22    RR
23    EX      020000010
24    DP      020000014
25    SD
26    RR

ROB – description (1)
13. This is the oldest μ-op in the ROB. It corresponds to the IA instruction whose first byte is at address 02000000 and will be retired together with the μ-ops of slots 14 and 15.
14. This μ-op (together with those in slots 15 and 16) corresponds to an IA instruction 2 bytes long starting at address 02000001. It must be noted that the 3 μ-ops related to the same IA instruction are NOT retired in the same clock. The address of the first byte of the following IA instruction is 02000003 (slot 17).
15. See previous description (μ-op now being retired).
16. See previous description (μ-op ready for retirement).
17. This μ-op corresponds to an IA instruction one byte long at address 02000003. It is ready for retirement and will be retired with the μ-ops in slots 16 and 18.
18. This μ-op is the only one generated by the 6-byte IA instruction at addresses 02000004-02000009. It is RR.
19. This μ-op corresponds to the two-byte IA instruction starting at address 0200000A. It is now being executed, which can last more than one clock. At the end its status will change from EX to WB. It will be retired when:
  - its execution is completed
  - the result is written in slot 19
  - all previous μ-ops in slots 13-18 have already been retired

ROB – description (2)
20. This μ-op is the only one generated by the instruction at addresses 0200000C-0200000E. Its execution is complete and the result is being written in slot 20 (status WB). The μ-op will then be RR, but it will not be retired until the μ-ops in slots 19 and 21 are RR.
21. This μ-op (similar to that of slot 22) corresponds to a single-byte IA instruction at address 0200000F. It is RR but must wait for the μ-ops in slots 19 and 20.
22. This μ-op too (similar to that of slot 21) corresponds to the same single-byte IA instruction at address 0200000F. It will be retired together with the μ-ops in slots 23 and 24.
23. This μ-op derives from the IA instruction at addresses 02000010-02000013. It is still being executed (EX). After execution its status will be WB; afterwards it will become RR and be retired together with the μ-ops in slots 22 and 24 (when they are RR).
24. This μ-op (like those of slots 25 and 26) corresponds to the two-byte IA instruction starting at address 02000014. It is waiting for execution at the top of the RS queue (DP status). It will be retired together with the μ-ops in slots 22 and 23.
25. This μ-op derives from the same instruction as the μ-op in slot 24, but its status is SD, that is, it is already in the RS queue but not at the top.
26. It derives again from the same IA instruction as slot 24. It is RR and will be retired together with the μ-ops in slots 25 and 27.
...

ROB – description (3)
...
1. This μ-op derives from the one-byte IA instruction at address 02000044. It is RR and will be retired together with the μ-ops in slots 0 and 2 as soon as:
  - the μ-ops of slots 0 and 2 have completed their execution and their results are in slots 0 and 2
  - all the μ-ops in slots 13-39 have already been retired
The μ-op in slot 2 derives from the IA instruction at hexadecimal addresses 02000045-02000050 (12 bytes).
...
6. This μ-op corresponds to the IA instruction at addresses 02000055-02000056.
7. This μ-op is an already executed branch (RR status) corresponding to the IA instruction at address 02000057. It will be retired together with the μ-ops of slots 6 and 8. The branch was predicted as taken and the prediction was found to be correct during execution, so...
8. ...the μ-op of this slot derives from the IA instruction at address 02000000 (the branch destination address).
N.B. If the prediction had been found to be incorrect, the μ-op of slot 8 and all the following μ-ops would have been cancelled.

After retiring 13, 14, 15
ROB snapshot (same columns as before): slots 13, 14 and 15 are now empty and the ROB start is at slot 16, whose μ-op is RR. The other slots are unchanged.