ECE 721 Overview, Spring '20, Prof. Rotenberg

Performance Strategies

| Application Class | Nature of Parallelism | Architecture Approach | Courses |
|---|---|---|---|
| sequential programs | Instruction-Level Parallelism (ILP): irregular and fine-grained; speculation is key theme | general-purpose superscalar or VLIW | ECE 721 |
| data-parallel programs | Data-Level Parallelism (DLP): regular and fine-grained | vector, SIMD, SIMT (GPGPU) | ECE 786 |
| thread-parallel programs | Thread-Level Parallelism (TLP): regular and coarse-grained | parallel computers, multi-core | ECE 506, ECE 706 |

ECE 721, Spring '20, Prof. Eric Rotenberg

The Big Five ILP Techniques (ECE 463/563 Review)
• Pipelining
  • Overlap instructions for higher throughput
• Caches, prefetching
  • Bridge processor-memory speed gap
• Branch prediction
  • Remove control dependencies for effective pipelining
• Out-of-order execution
  • Tolerate long-latency instructions (multi-cycle instructions, cache-missed loads) by fetching and executing future independent instructions while dependent instructions wait
  • Expose adequate instruction-level parallelism to supply the parallel execution lanes of a superscalar processor (next bullet)
• Superscalar
  • Exceed scalar (1 instr./cycle) performance via multiple-instruction issue (N instr./cycle)
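The throughput argument behind pipelining and superscalar issue can be sketched with simple cycle counts. This is a minimal sketch under idealized assumptions that are not on the slide: no stalls, no hazards, and every instruction spends one cycle per stage. The function names `unpipelined_cycles` and `pipelined_cycles` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Unpipelined k-stage datapath: each instruction occupies all k stages
// before the next one may start, so n instructions take n * k cycles.
std::uint64_t unpipelined_cycles(std::uint64_t n, std::uint64_t k) {
    return n * k;
}

// Ideal pipelined datapath that starts `width` instructions per cycle
// (width == 1 is scalar, width == N is an N-wide superscalar).
// The first group needs k cycles to drain; every later group finishes
// one cycle after the previous one. Assumes no stalls of any kind.
std::uint64_t pipelined_cycles(std::uint64_t n, std::uint64_t k,
                               std::uint64_t width) {
    assert(k >= 1 && width >= 1);
    if (n == 0) return 0;
    std::uint64_t groups = (n + width - 1) / width;  // ceil(n / width)
    return k + groups - 1;
}
```

For example, 100 instructions on a 5-stage datapath take 500 cycles unpipelined, 104 cycles with an ideal scalar pipeline, and 29 cycles with an ideal 4-wide pipeline. Real processors fall short of these bounds, which is what the remaining techniques (caches, branch prediction, out-of-order execution) address.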

ILP Scaling in Commercial Processors

| Processor Generation | Pipeline Depth (fetch to execute) | Issue Width | In-flight Instructions |
|---|---|---|---|
| Pentium | 5 | 1 instr. | ~5 |
| Pentium-III | 10 | 3 µ-ops | ~40 |
| Pentium-IV | 20 | 3 µ-ops | 126 |
| IBM Power 4 | 12 | 5 instr. | 200 |
| IBM Power 8 | | 10 (issue), 8 (retire) | 224 |

[Figure labels: ECE 463/563 terminology vs. ECE 721 terminology]
• ECE 463/563: Reorder Buffer; ECE 721: Active List
• Load Queue
• Store Queue
• Issue Queue, a.k.a. Reservation Stations, a.k.a. Scheduler
• ECE 463/563: ARF + Reorder Buffer; ECE 721: Physical Register File

ECE 721 Topics
• Modern Superscalar Processors
  • Contemporary organization
    • Physical Register File
    • Memory dependencies
  • Canonical superscalar pipeline
  • Implementation details
  • Superscalar complexity
• Simultaneous multithreading (SMT)
  • Firstly, to see how to implement it in a superscalar processor
  • Secondly, having multiple threads allows for helper-thread-based techniques to accelerate a single-threaded program
• Data flow bottleneck
  • Value prediction
• Cache miss bottleneck
  • Value prediction
  • Latest data prefetchers
  • Load pre-execution
  • Kilo-instruction processors: checkpoint processing and recovery (CPR), continual flow pipelines (CFP), dual core execution (DCE), runahead execution
• Branch misprediction bottleneck
  • Exotic history-based branch predictors: TAGE, Perceptron
  • Branch pre-execution
  • Control-flow decoupling
  • Custom branch predictors (see PSM below)
  • Dynamic hammock prediction
  • Control independence
  • Multipath execution
• Instruction fetch width bottleneck
  • Trace cache
  • Multiple branch prediction
• Post-silicon Microarchitecture (PSM)
  • My current research project
  • Connect key pipeline stages of a superscalar core to a reconfigurable fabric (FPGA/CGRA). Synthesize novel microarchitecture components (predictors, prefetchers, etc.) on the fly.

Style 1 (ECE 563)

Style 2 (ECE 721)
[Pipeline diagram. Labeled components:]
• Front end: branch predictor, I$, instruction fetch
• Decode, rename, dispatch: Rename Map, Arch. Map, Shadow Maps, Free List
• Issue Queue (IQ), OOO issue
• Execution: Function Units (FUs), Physical RF
• Active List (head, tail): complete, retire
• Recovery paths: exception recovery, mispredicted-branch recovery
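How the Rename Map, Arch. Map, Free List, and Active List in this diagram might interact can be sketched as follows. This is a simplified illustration, not the 721sim interface: the class name `Renamer` and all of its methods are hypothetical, the Shadow Maps used for mispredicted-branch recovery are omitted, and register values themselves are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical, simplified renamer (illustrative names only).
class Renamer {
    struct Entry { std::size_t logical; std::size_t physical; };

public:
    Renamer(std::size_t num_logical, std::size_t num_physical)
        : rename_map_(num_logical), arch_map_(num_logical) {
        assert(num_physical > num_logical);
        // Initially, logical register i lives in physical register i.
        for (std::size_t i = 0; i < num_logical; ++i) {
            rename_map_[i] = i;
            arch_map_[i] = i;
        }
        // All remaining physical registers start out free.
        for (std::size_t p = num_logical; p < num_physical; ++p)
            free_list_.push_back(p);
    }

    std::size_t free_count() const { return free_list_.size(); }

    // Source operand: read the current speculative mapping.
    std::size_t rename_src(std::size_t logical) const {
        return rename_map_[logical];
    }

    // Destination operand: pop a free physical register, update the
    // Rename Map, and allocate an Active List entry at the tail.
    std::size_t rename_dst(std::size_t logical) {
        assert(!free_list_.empty());
        std::size_t phys = free_list_.front();
        free_list_.pop_front();
        rename_map_[logical] = phys;
        active_list_.push_back({logical, phys});
        return phys;
    }

    // Retire the instruction at the Active List head: free the register
    // previously committed for this logical register, then commit the
    // new mapping to the Arch. Map.
    void retire() {
        assert(!active_list_.empty());
        Entry e = active_list_.front();
        active_list_.pop_front();
        free_list_.push_back(arch_map_[e.logical]);
        arch_map_[e.logical] = e.physical;
    }

    // Exception recovery: squash every in-flight instruction, return its
    // destination register to the Free List, and restore the Rename Map
    // from the Arch. Map.
    void recover_from_exception() {
        while (!active_list_.empty()) {
            free_list_.push_back(active_list_.back().physical);
            active_list_.pop_back();
        }
        rename_map_ = arch_map_;
    }

private:
    std::vector<std::size_t> rename_map_;  // speculative mappings
    std::vector<std::size_t> arch_map_;    // committed mappings
    std::deque<std::size_t> free_list_;    // unallocated physical registers
    std::deque<Entry> active_list_;        // in-flight instructions, in order
};
```

The key design point is that a physical register is freed only when a later writer of the same logical register retires, so every in-flight consumer of the old value is guaranteed to have read it.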

Superscalar Complexity

Value prediction
[Figure: a 4-wide instruction fetch bundle (A, B, C, D) passes through fetch, decode, rename, and dispatch. A produces value vA, which B consumes; B produces vB, which C consumes; C produces vC, which D consumes.]
A, B, C, and D issue and execute serially due to data dependencies.
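The cost of the dependency chain can be illustrated with a toy timing model. This is a sketch under assumptions that are not on the slide: every instruction has unit latency, issue width is unlimited, each instruction has at most one producer, and a predicted source value is assumed to be correct. The names `Instr` and `execute_cycles` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Instr {
    int producer;        // index of the older instruction it reads, or -1
    bool src_predicted;  // true if the producer's value was predicted
};

// Returns the cycle in which each instruction executes. An instruction
// with an unpredicted source must wait until the cycle after its
// producer executes; otherwise it can execute in cycle 0.
std::vector<int> execute_cycles(const std::vector<Instr>& bundle) {
    std::vector<int> cycle(bundle.size(), 0);
    for (std::size_t i = 0; i < bundle.size(); ++i) {
        const Instr& in = bundle[i];
        if (in.producer >= 0 && !in.src_predicted) {
            // Producers must be older than their consumers.
            assert(static_cast<std::size_t>(in.producer) < i);
            cycle[i] = cycle[in.producer] + 1;
        }
    }
    return cycle;
}
```

With the chain A, B, C, D and no prediction, the instructions execute in cycles 0, 1, 2, and 3. If the values feeding B, C, and D are all predicted, all four can execute in cycle 0, which is the parallelism value prediction aims to expose.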

Value prediction (cont.)
[Figure: A, B, C, and D pass through fetch, decode, and rename. A value-predict stage then writes the predicted values pA, pB, and pC into the PRF before dispatch. The instructions then proceed through issue, register read, execute, and a final stage that validates the predicted destination value: each computed value (vA, vB', vC') is compared (=) with its prediction (pA, pB, pC), and a mismatch signals a misprediction (misp).]
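The predict-then-validate flow can be sketched with a last-value predictor. The slide does not say which predictor is used, so last-value prediction is swapped in here purely for illustration. The class `LastValuePredictor`, its methods, and the unbounded PC-indexed table are all hypothetical simplifications.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical last-value predictor: predicts that an instruction will
// produce the same destination value it produced the last time.
class LastValuePredictor {
public:
    // Value-predict stage: returns true and sets `value` if a prediction
    // exists for the instruction at `pc`.
    bool predict(std::uint64_t pc, std::uint64_t& value) const {
        auto it = table_.find(pc);
        if (it == table_.end()) return false;
        value = it->second;
        return true;
    }

    // Record the value an instruction actually computed.
    void train(std::uint64_t pc, std::uint64_t computed) {
        table_[pc] = computed;
    }

    // Validate stage: compare the prediction with the computed value and
    // train on the outcome. Returns true if the prediction was correct;
    // false corresponds to the "misp" signal in the figure.
    bool validate(std::uint64_t pc, std::uint64_t predicted,
                  std::uint64_t computed) {
        train(pc, computed);
        return predicted == computed;
    }

private:
    std::unordered_map<std::uint64_t, std::uint64_t> table_;
};
```

A real design would use a fixed-size table with confidence counters and would trigger recovery when validation fails; those parts are left out of this sketch.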

Project Frameworks
• 721sim (C++)
  • Required
  • Cycle-level execute-at-execute simulator of a superscalar processor
  • Projects 2 and 3 are training ground
• AnyCore (Verilog)
  • Optional
  • Highly-parameterized synthesizable RTL design of a superscalar core
  • Superscalar widths and structure sizes are configurable
• Other
  • Must be approved by instructor
  • If needed by your custom research project
  • Compilers, other simulators, etc.