Lecture 11: Memory Data Flow Techniques

Load/store buffer design, memory-level parallelism, consistency model, memory disambiguation
Load/Store Execution Steps

Load: LW R2, 0(R1)
1. Generate the virtual address; may wait on the base register
2. Translate the virtual address into a physical address
3. Read the data cache

Store: SW R2, 0(R1)
1. Generate the virtual address; may wait on the base register and the data register
2. Translate the virtual address into a physical address
3. Write the data cache

Unlike register accesses, memory addresses are not known prior to execution.
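The three steps above can be sketched in software. This is a minimal, illustrative model: the page table, page size, and cache structures are invented for the example, not taken from any real machine.

```python
# Minimal sketch of the load/store execution steps (hypothetical page
# table and data cache; addresses and values are made up).
PAGE_SIZE = 4096
page_table = {0: 7}          # virtual page number -> physical frame number
data_cache = {}              # physical address -> value

def translate(vaddr):
    """Step 2: translate a virtual address into a physical address."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    return page_table[vpn] * PAGE_SIZE + offset

def execute_load(base, offset):
    vaddr = base + offset            # step 1: generate virtual address
    paddr = translate(vaddr)         # step 2: translate
    return data_cache.get(paddr, 0)  # step 3: read the data cache

def execute_store(base, offset, data):
    vaddr = base + offset            # step 1: needs base AND data registers
    paddr = translate(vaddr)         # step 2: translate
    data_cache[paddr] = data         # step 3: write the data cache

execute_store(base=100, offset=0, data=42)   # SW R2, 0(R1)
print(execute_load(base=100, offset=0))      # LW R2, 0(R1) -> prints 42
```

Note that a store cannot even begin step 1 until both source registers are available, which is why stores wait longer in the reservation station than loads.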
Load/Store Buffer in Tomasulo

Supports memory-level parallelism:
- Loads wait in the load buffer until their address is ready; memory reads then proceed
- Stores wait in the store buffer until both their address and data are ready; memory writes wait further until the stores commit

[Diagram: fetch unit and instruction memory feed decode/rename; reorder buffer, register file, and reservation stations connect to functional units FU1/FU2; a store buffer (S-buf) and load buffer (L-buf) sit between the pipeline and data memory]
Load/Store Unit with Centralized RS

A centralized reservation station includes part of the load/store buffer of the Tomasulo design. Loads and stores wait in the RS until they are ready.

[Diagram: fetch unit and instruction memory feed decode/rename; reorder buffer and register file connect to a centralized RS, which issues to a store unit (addr/data, backed by a store buffer), a load unit, and FU1/FU2, all accessing the cache]
Memory-level Parallelism

for (i = 0; i < 100; i++)
    A[i] = A[i] * 2;

Loop:  L.S   F2, 0(R1)
       MUL.S F2, F2, F4     ; F4 stores 2.0
       S.S   F2, 0(R1)
       ADDI  R1, R1, 4
       BNE   R1, R3, Loop

[Timing diagram: loads LW1, LW2, LW3 overlap with stores SW1, SW2, SW3]

Significant improvement over sequential reads/writes.
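A back-of-the-envelope sketch shows why the overlap on this slide matters. The latency and overlap model below is illustrative only (assumed numbers, not real hardware figures): fully pipelined accesses pay one full latency plus one cycle per additional access.

```python
# Illustrative cost model for the six accesses shown on the slide
# (3 loads + 3 stores); MEM_LATENCY is an assumed round number.
MEM_LATENCY = 100      # cycles per memory access (assumption)
N = 3                  # loads and stores per group in the diagram

# Fully sequential: each access waits for the previous one to finish.
sequential = 2 * N * MEM_LATENCY

# Memory-level parallelism: independent accesses overlap, so only the
# first pays full latency; the rest complete one cycle apart.
overlapped = MEM_LATENCY + (2 * N - 1)

print(sequential, overlapped)   # prints: 600 105
```

Even under this crude model, overlapping independent accesses cuts the memory time by more than 5x, which is the point of the load/store buffers.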
Memory Consistency

Memory contents must be the same as under sequential execution: RAW, WAW, and WAR dependences must be respected.

Practical implementation:
1. Reads may proceed out of order
2. Writes proceed to memory in program order
3. Reads may bypass earlier writes only if their addresses differ
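Rule 3 above is a simple address comparison; a minimal sketch (function name and address values are invented for illustration):

```python
def load_may_bypass(load_addr, older_store_addrs):
    """Rule 3: a read may bypass earlier writes only if its address
    differs from every older, not-yet-performed write."""
    return all(load_addr != addr for addr in older_store_addrs)

print(load_may_bypass(0x100, [0x200, 0x300]))   # no conflict -> True
print(load_may_bypass(0x100, [0x200, 0x100]))   # RAW hazard  -> False
```

The hardware version of this check is the associative store-buffer search on the following slides.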
Store Stages in Dynamic Execution

1. Wait in the RS until the base register and the store data are available (ready)
2. Move to the store unit for address calculation and address translation
3. Move to the store buffer (finished)
4. Wait for ROB commit (completed)
5. Write to the data cache (retired)

Stores always retire in order, to respect WAW and WAR dependences.

Source: Shen and Lipasti, page 197
Load Bypassing and Memory Disambiguation

To exploit memory parallelism, loads have to bypass earlier stores; but this may violate RAW dependences.

Dynamic memory disambiguation: detect memory dependences at run time by comparing each load address against the addresses of all older stores.
Load Bypassing Implementation

[Diagram: an in-order RS feeds a store unit (addr/data, with a numbered store buffer) and a load unit; the load address is matched against the buffered store addresses before the D-cache is accessed]

1. Address calculation
2. Address translation
3. If no match, access the cache and update the destination register

Associative search for a matching store address; assumes in-order execution of loads and stores.
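The associative search in step 3 can be sketched as a scan over the pending stores. The dictionaries below stand in for hardware structures, and addresses are assumed already translated; everything here is illustrative.

```python
# Sketch of the associative store-buffer search for load bypassing
# (one word per entry; all addresses and values are made up).
data_cache = {0x40: 5, 0x80: 6}
store_buffer = [                 # oldest -> youngest pending stores
    {"addr": 0x80, "data": 22},
]

def load_bypass(addr):
    """Let the load bypass pending stores only if no store address
    matches; otherwise the load must wait (no forwarding here)."""
    if any(entry["addr"] == addr for entry in store_buffer):
        return None              # match: potential RAW hazard, wait
    return data_cache.get(addr, 0)

print(load_bypass(0x40))   # no matching store -> reads 5 from the cache
print(load_bypass(0x80))   # matches a pending store -> None (must wait)
```

In hardware this scan is a parallel compare against every store-buffer entry, not a sequential loop, but the result is the same.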
Load Forwarding

Load forwarding: if a load address matches an older store's address, the store's data can be forwarded directly to the load's destination register (in the ROB).

Multiple matches may exist; the youngest (last) matching store wins.

[Diagram: as in load bypassing, but on a match the store-buffer data is forwarded to the destination register instead of reading the D-cache]
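The "last match wins" rule amounts to searching the store buffer youngest-first. A minimal sketch (structures and values invented for illustration):

```python
# Sketch of load forwarding: search pending stores youngest-first so the
# most recent matching store supplies the data; fall back to the cache.
def load_with_forwarding(addr, store_buffer, data_cache):
    for entry in reversed(store_buffer):     # youngest match wins
        if entry["addr"] == addr:
            return entry["data"]             # forward store data
    return data_cache.get(addr, 0)           # no match: read the cache

# Two pending stores to the same address: the younger one must win.
sbuf = [{"addr": 0x40, "data": 1}, {"addr": 0x40, "data": 2}]
print(load_with_forwarding(0x40, sbuf, {}))      # prints 2
print(load_with_forwarding(0x50, sbuf, {0x50: 9}))   # no match: prints 9
```

Taking the oldest match instead would return stale data and silently violate the RAW dependence, which is why the search direction matters.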
In-order Issue Limitation

for (i = 0; i < 100; i++)
    A[i] = A[i] / 2;

Loop:  L.S   F2, 0(R1)
       DIV.S F2, F2, F4
       S.S   F2, 0(R1)
       ADDI  R1, R1, 4
       BNE   R1, R3, Loop

Any store waiting in the RS may block all following loads. When is F2 of the store available? When is the next L.S ready? Assume reasonable FU latency and pipeline depth.
Speculative Load Execution

Forwarding does not always work when some store addresses are still unknown. Instead, predict that a load has no RAW dependence on older stores and execute it speculatively, recording it in a finished load buffer.

When a store completes, its address is matched against the finished load buffer. No match: the prediction was correct. Match: the prediction was wrong, and the pipeline is flushed at commit.

[Diagram: an out-of-order RS feeds the store unit and load unit; store addresses are matched at completion against the finished load buffer before the D-cache is updated]
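The check-at-completion mechanism can be sketched as follows. This is a simplified model (global lists stand in for the finished load buffer and cache; a real pipeline would track ages and flush selectively):

```python
# Sketch of speculative load execution: a load executes assuming no RAW
# hazard; each completing store checks the finished-load buffer, and a
# match means the speculation was wrong (flush at commit).
data_cache = {}
finished_loads = []              # addresses read by speculative loads

def speculative_load(addr):
    finished_loads.append(addr)          # remember for later checking
    return data_cache.get(addr, 0)       # predict: no conflicting store

def complete_store(addr, data):
    data_cache[addr] = data
    # Match at completion: a younger load already read this address and
    # got stale data -> signal a pipeline flush.
    return addr in finished_loads        # True = mispredicted

value = speculative_load(0x100)          # load runs early, reads 0
print(complete_store(0x100, 7))          # True: load was premature, flush
print(complete_store(0x200, 9))          # False: no conflict
```

The trade-off is the usual speculation bargain: most loads really are independent of nearby stores, so the common case gets faster while the rare conflict pays a flush.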
Alpha 21264 Pipeline

[Diagram: Alpha 21264 pipeline stages]
Alpha 21264 Load/Store Queues

[Diagram: the integer issue queue feeds two address ALUs and two integer ALUs over an 80-entry integer register file; the FP issue queue feeds FP ALUs over a 72-entry FP register file; the address ALUs access the D-TLB, the load queue (L-Q), the store queue (S-Q), and a dual-ported D-cache]

32-entry load queue, 32-entry store queue.
Load Bypassing, Forwarding, and RAW Detection

- Load: wait if the LQ head is not completed, then move the LQ head; if the load address matches an SQ entry, forward the store data
- Store: mark the SQ head as completed, then move the SQ head; if the store address matches an LQ entry, mark a store-load trap to flush the pipeline (at commit)

[Diagram: the issue queue and ROB commit logic coordinate the load queue (LQ) and store queue (SQ) with the D-cache]
Speculative Memory Disambiguation

A 1024-entry, 1-bit stWait table, indexed by the load's fetch PC:
- When a load is trapped at commit, set its stWait bit in the table
- When the load is next fetched, read its stWait bit from the table
- If the bit is set, the load waits in the issue queue until all older stores have issued
- The stWait table is cleared periodically
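The stWait table is simple enough to sketch directly. The 1024-entry, 1-bit sizing comes from the slide; the PC-to-index mapping (a modulo) is an assumption for illustration.

```python
# Sketch of the stWait predictor: a 1024-entry, 1-bit table indexed by
# the load's fetch PC (index function assumed; real hardware uses PC bits).
TABLE_SIZE = 1024
st_wait = [0] * TABLE_SIZE

def on_load_trap(pc):
    """Load was flushed at commit: be conservative next time."""
    st_wait[pc % TABLE_SIZE] = 1

def must_wait_for_stores(pc):
    """At fetch: should this load wait until older stores have issued?"""
    return st_wait[pc % TABLE_SIZE] == 1

def periodic_clear():
    """Cleared periodically so stale predictions do not linger."""
    for i in range(TABLE_SIZE):
        st_wait[i] = 0

on_load_trap(0x4000)
print(must_wait_for_stores(0x4000))   # prints True: this load now waits
```

The periodic clear is the interesting design choice: with only one bit and no tags, a set bit can only be un-learned by wiping the table, trading brief over-conservatism for very cheap hardware.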
Architectural Memory States

Speculative states: completed entries in the LQ and SQ
Committed states: L1 cache, L2 cache, L3 cache (optional), memory, disk/tape, etc.

A memory request searches the hierarchy from top to bottom.
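The top-to-bottom search can be sketched as an ordered scan over the levels. Each level is just a dictionary here (real caches are set-associative, and the LQ/SQ search is associative); all names and values are illustrative.

```python
# Sketch of a top-to-bottom search through the memory hierarchy:
# the topmost level holding the address supplies the value.
def memory_read(addr, levels):
    """levels: ordered list [LQ/SQ completed entries, L1, L2, ..., memory]."""
    for level in levels:
        if addr in level:
            return level[addr]
    raise KeyError(addr)        # would fall through to disk/backing store

lq_sq = {0x10: "new"}                 # speculative, youngest value
l1 = {0x10: "old", 0x20: "b"}         # committed but stale copy of 0x10
mem = {0x30: "c"}

print(memory_read(0x10, [lq_sq, l1, mem]))   # prints "new" (topmost wins)
print(memory_read(0x30, [lq_sq, l1, mem]))   # prints "c" (falls to memory)
```

Searching top-down is what makes the speculative LQ/SQ states visible to younger loads before they are committed to the caches below.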
Summary of Superscalar Execution

Instruction flow techniques: branch prediction, branch target prediction, and instruction prefetch
Register data flow techniques: register renaming, instruction scheduling, in-order commit, misprediction recovery
Memory data flow techniques: load/store units, memory consistency

Source: Shen & Lipasti