
Speculative Data-Driven Multithreading (an implementation of pre-execution)
Amir Roth and Gurindar S. Sohi
HPCA-7, Jan. 22, 2001


Slide 2. Pre-Execution
- Goal: high single-thread performance
- Problem: μarchitectural latencies of "problem" instructions (PIs)
  - Memory: cache misses; pipeline: mispredicted branches
- Solution: decouple μarchitectural latencies from the main thread
  - Execute copies of PI computations in parallel with the whole program
  - The copies execute PIs faster than the main thread: they "pre-execute"
    - Why faster? Fewer instructions
  - Cache misses are initiated earlier
  - Branch outcomes are pre-computed and relayed to the main thread
- DDMT: an implementation of pre-execution


Slide 3. Pre-Execution is a Supplement
- Fundamentally, tolerating non-execution latencies (pipeline, memory) requires values faster than execution can provide them
- Ways of providing values faster than execution:
  - Old: behavioral prediction (table lookup). + small effort per value; - less than perfect ("problems")
  - New: pre-execution (executes fewer instructions). + perfect accuracy; - more effort per value
- Solution: supplement behavioral prediction with pre-execution
  - Key: behavioral prediction must handle the majority of cases
  - Good news: it already does


Slide 4. Data-Driven Multithreading (DDMT)
- DDMT: an implementation of pre-execution
  - Data-Driven Thread (DDT): the pre-executed computation of a PI
- Implementation: an extension to simultaneous multithreading (SMT)
  - SMT is a reality (the 21464)
  - Low static cost: minimal additional hardware; pre-execution siphons execution resources
  - Low dynamic cost: fine-grain, flexible bandwidth partitioning
    - Take only as much as you need
    - Minimize contention and overhead
- Paper: metrics, algorithms, and mechanics. Talk: mostly mechanics


Slide 5. Talk Outline
- Working example in 3 parts
- Some details
- Numbers, numbers


Slide 6. Example, Part 1: Identify PIs
- Running example: same as in the paper, a simplified loop from EM3D
- Use profiling to find PIs
- A few static PIs cause most dynamic "problems"
  - Good coverage with few static DDTs

STATIC CODE:

    for (node = list; node != NULL; node = node->next)
        if (node->neighbor != NULL)
            node->val -= node->neighbor->val * node->coeff;

DYNAMIC INSN STREAM (two iterations; the PIs here are the load I5 and the branch I3):

    PC   INST
    I10  ldq  r1, 0(r1)
    I11  br   I1
    I1   beq  r1, I12
    I2   ldq  r2, 8(r1)
    I3   beq  r2, I10
    I4   ldt  f0, 16(r1)
    I5   ldt  f1, 16(r2)
    I6   ldt  f2, 24(r1)
    I7   mult f1, f2, f3
    I8   subt f0, f3, f0
    I9   stt  f0, 16(r1)
    I10  ldq  r1, 0(r1)
    I11  br   I1
    I1   beq  r1, I12
    I2   ldq  r2, 8(r1)
    I3   beq  r2, I10
    I4   ldt  f0, 16(r1)
    I5   ldt  f1, 16(r2)
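The load offsets in the Alpha code imply a node layout. The following C sketch is an assumption for illustration (the `node_t` type and field names are hypothetical, chosen to match the loop and the offsets: next at 0, neighbor at 8, val at 16, coeff at 24 on a 64-bit target):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical EM3D-style node; layout chosen to match the load
 * offsets in the Alpha code (assuming 8-byte pointers):
 *   next     -> offset 0   (ldq r1, 0(r1))
 *   neighbor -> offset 8   (ldq r2, 8(r1))
 *   val      -> offset 16  (ldt f0/f1, 16(r))
 *   coeff    -> offset 24  (ldt f2, 24(r1))            */
typedef struct node {
    struct node *next;
    struct node *neighbor;
    double val;
    double coeff;
} node_t;

/* The simplified EM3D loop from the slide. */
static void em3d_loop(node_t *node) {
    for (; node != NULL; node = node->next)
        if (node->neighbor != NULL)
            node->val -= node->neighbor->val * node->coeff;
}
```

On an LP64 target the field offsets line up with the assembly, so each load in the dynamic stream corresponds directly to one field access in the loop body.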


Slide 7. Example, Part 2: Extract DDTs
- Examine program traces
- Start with PIs
- Work backwards, gathering backward slices
- Eventually stop. When? (see paper)
- Pack the last N-1 slice instructions into the DDT
- Use the first instruction as the DDT trigger
  - Dynamic trigger instances signal a DDT fork
- Load the DDT into the DDT cache (DDTC)

DDTC (static) contents, triggered at I10:

    PC   INST
    I10  ldq r1, 0(r1)
    I2   ldq r2, 8(r1)
    I3   beq r2, I10
    I5   ldt f1, 16(r2)

(The dynamic instruction stream is the same as on slide 6.)
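The "work backwards, gather backward slices" step can be sketched in C over a toy trace representation. This is an assumption-laden illustration, not the paper's algorithm or data structures: the `insn_t` encoding and register numbering are invented, the sketch tracks data dependences only, and it simply stops at the start of the trace instead of applying the paper's stopping criteria.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy trace entry: one destination register and up to two sources
 * (-1 = unused). Hypothetical encoding, for illustration only. */
typedef struct { int pc; int dst; int src[2]; } insn_t;

/* Walk the trace backwards from the problem instruction at index pi,
 * keeping every instruction whose destination feeds the live set.
 * Marks slice members in in_slice[] and returns the slice size. */
static int backward_slice(const insn_t *trace, int n, int pi, bool *in_slice) {
    bool live[64] = { false };
    int count = 0;
    memset(in_slice, 0, n * sizeof(bool));
    in_slice[pi] = true;                       /* seed with the PI   */
    count++;
    for (int s = 0; s < 2; s++)
        if (trace[pi].src[s] >= 0) live[trace[pi].src[s]] = true;
    for (int i = pi - 1; i >= 0; i--) {
        if (trace[i].dst >= 0 && live[trace[i].dst]) {
            in_slice[i] = true;                /* producer joins slice */
            count++;
            live[trace[i].dst] = false;        /* value now produced   */
            for (int s = 0; s < 2; s++)        /* its inputs go live   */
                if (trace[i].src[s] >= 0) live[trace[i].src[s]] = true;
        }
    }
    return count;
}
```

Applied to one iteration of the example trace with I5 as the PI, this data-only sketch recovers {I10, I2, I5}; a branch like I3 is also packed into the real DDT (branches are executed too), which pure data slicing does not capture.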


Slide 8. Example, Part 3: Pre-Execute DDTs
- The main thread (MT) executes a trigger instruction: fork the DDT (in the μarchitecture)
- MT and DDT execute in parallel
- The DDT initiates the cache miss early and "absorbs" its latency
- The MT integrates DDT results
  - Instructions are not re-executed, which reduces contention
  - Shortens the MT critical path
  - The pre-computed branch avoids a misprediction

DDT (from the DDTC):

    PC   INST
    I10  ldq r1, 0(r1)
    I2   ldq r2, 8(r1)
    I3   beq r2, I10
    I5   ldt f1, 16(r2)

(The dynamic instruction stream is the same as on slide 6.)


Slide 9. Details, Part 1: More About DDTs
- Composed of instructions from the original program
  - Required by integration
  - Should look like normal instructions to the processor
- No runaway threads; better overhead control
- "Contain" any control-flow (e.g., unrolled loops)
- Data-driven: the instructions are not "sequential"
  - No explicit control-flow, so how are they sequenced?
  - Packed into "traces" (in the DDTC)
  - Execute all instructions (branches too)
  - Save results for integration
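The "execute all instructions, branches too" point can be rendered as a straight-line C sketch of the example DDT {I10, I2, I3, I5}: fetch is never redirected, and the branch I3 is executed only to record its outcome for the main thread. This is a hypothetical rendering (the `node_t` layout is assumed as before); a real DDT would issue I5 speculatively even on the taken path, whereas the sketch guards it to stay safe in plain C.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node { struct node *next, *neighbor; double val, coeff; } node_t;

/* Straight-line execution of the DDT {I10, I2, I3, I5}. The branch
 * outcome is returned for later relay to the main thread; it does not
 * redirect the DDT's own sequencing. */
static int ddt_execute(node_t *r1, double *f1) {
    r1 = r1->next;                 /* I10: ldq r1, 0(r1)                 */
    node_t *r2 = r1->neighbor;     /* I2:  ldq r2, 8(r1)                 */
    int taken = (r2 == NULL);      /* I3:  beq r2, I10 -- outcome saved  */
    if (!taken)                    /* guard is sketch-only; hardware     */
        *f1 = r2->val;             /* I5:  ldt f1, 16(r2), speculative   */
    return taken;
}
```

The returned outcome is what integration later matches against the main thread's instance of I3, avoiding the misprediction.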


Slide 10. Details, Part 2: More About Integration
- Centralized physical register file, used for DDT-MT communication
- Fork: copy the MT rename map to the DDT
  - The DDT locates MT values via lookups
  - This "roots" integration
- Integration: match PCs and physical registers to establish a DDT-MT instruction correspondence
  - The MT "claims" physical registers allocated by the DDT
  - Register renaming is modified to do this
- More on integration: the implementation of squash-reuse [MICRO-33]


Slide 11. Details, Part 3: More About DDT Selection
- A very important problem
- Fundamental aspects: metrics, algorithms
  - A promising start; see paper
- Practical aspects: who implements the algorithm? How do DDTs get into the DDTC?
  - Paper: profile-driven, offline, with executable annotations
  - Open question


Slide 12. Performance Evaluation
- SPEC2K, Olden; Alpha EV6; -O3 -fast
  - Chose programs with problems
- SimpleScalar-based simulation environment
  - DDT selection phase: functional simulation on a small input
  - DDT measurement phase: timing simulation on a larger input
- 8-wide superscalar out-of-order core
  - 128 ROB, 64 LDQ, 32 STQ, 80 RS (shared)
  - Pipe: 3 fetch, 2 rename/integrate, 2 schedule, 2 reg read, 2 load
  - 32 KB I$ / 64 KB D$ (2-way), 1 MB L2$ (4-way), memory b/w: 8 bytes/cycle
- DDTC: 16 DDTs, 32 instructions (max) per DDT


Slide 13. Numbers, Part 1: The Bottom Line
- Cache misses
  - Speedups vary, 10-15%
  - DDT "unrolling" increases latency tolerance (see paper)
- Branch mispredictions
  - Speedups lower, 5-10%
  - More PIs, lower coverage
  - Branch integration != perfect branch prediction
- The effects mix


Slide 14. Numbers, Part 2: A Closer Look
- DDT overhead: fetch utilization
  - ~5% (reasonable)
  - Fewer MT fetches (always), due to contention
  - Fewer total fetches, due to early branch resolution
- DDT utility: integration rates
  - Vary, mostly ~30% (low)
  - Completed DDTs: well done; not completed: a little late
  - Causes: input-set differences; greedy DDTs with no early exits


Slide 15. Numbers, Part 3: Is Full DDMT Necessary?
- How important is integration?
  - Important for branch resolution, less so for prefetching
- How important is decoupling? What if we just priority-scheduled?
  - Extremely important: without data-driven sequencing, there is no speedup


Slide 16. Summary
- Pre-execution supplements behavioral prediction
  - Decouples/absorbs the μarchitectural latencies of "problem" instructions
- DDMT: an implementation of pre-execution
  - An extension to SMT; few hardware changes
  - Flexible bandwidth allocation reduces overhead
- Future: DDT selection
  - Fundamental: better metrics/algorithms to increase utility/coverage
  - Practical: an easy implementation
- Coming to a university/research-lab near you