
SimpleScalar v3.0 Tutorial
U. of Wisconsin, CS 752, Fall 2004
Andrey Litvin
(main source: Austin & Burger)
(also Dana Vantrease’s slides)

Simulator Basics
• What is an architectural simulator?
  – Tool that reproduces the behavior of a computing device
• Why use a simulator?
  – Flexible
    • Rapid design space exploration
    • Tailor abstraction level to need
  – Cheap
• Why not use a simulator?
  – Slow
  – Correctness?

Functional vs. Performance
• Functional simulators implement the architecture
  – Perform the actual execution
  – Implement what programmers see
• Performance (or timing) simulators implement the microarchitecture
  – Model system resources/internals
  – Measure time
  – Implement what programmers do not see
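
The split above can be illustrated with a small Python sketch. This is not SimpleScalar code: the three-opcode toy ISA, the `LATENCY` table, and the function names are all invented for illustration. The functional part computes register values, while the timing part only counts cycles for an assumed non-pipelined machine.

```python
# Toy illustration only: the ISA, latencies, and names are hypothetical.
LATENCY = {"li": 1, "add": 1, "mul": 3}  # assumed cycles per opcode

def run_functional(program):
    """Execute the program and return architectural state (what programmers see)."""
    regs = {}
    for op, dst, a, b in program:
        if op == "li":            # load immediate: dst = a
            regs[dst] = a
        elif op == "add":
            regs[dst] = regs[a] + regs[b]
        elif op == "mul":
            regs[dst] = regs[a] * regs[b]
        else:
            raise ValueError(f"unknown opcode {op}")
    return regs

def run_timing(program):
    """Count cycles on a non-pipelined machine (what programmers do not see)."""
    return sum(LATENCY[op] for op, _dst, _a, _b in program)

program = [
    ("li", "r1", 6, None),
    ("li", "r2", 7, None),
    ("mul", "r3", "r1", "r2"),
    ("add", "r4", "r3", "r1"),
]
```

Changing `LATENCY` alters the cycle count but never the register values, which is the sense in which the two kinds of simulator model different things.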

Trace vs. Execution Driven (2)
• Trace
  + Easy to implement
  - Requires large disk files to store instruction stream
  - Limited state stored
  - Speculation? Multiprocessor?
• Execution
  - Hard to implement
  + Allows access to full state information at any point during program execution
  +/- Requires inclusion of an instruction set emulator and an I/O emulation module

SimpleScalar
• Developed by Todd Austin with Doug Burger at UW, ’94-’96
• Execution-driven
• Collection of simulators that emulate the microprocessor at different detail levels (functional, functional + cache/bpred, out-of-order cycle-timer, etc.)
• Tools:
  – C/Fortran compiler, assembler, linker (for PISA)
  – DLite: target-machine-level debugger
  – Pipeline trace viewer

Advantages of SimpleScalar
• Fast (given priority over clarity)
  – 4 MIPS functional, 300 KIPS OOO on 1.5 GHz host
• Relatively(!) simple: short learning curve
• Modular design
• Well documented and commented
• Popular: support community, extensions
Limitations will be summarized at the end

Sim-Fast
• Bare functional simulator
• Does not account for the behavior of any part of the microarchitecture
• Optimized for speed

Sim-Safe
• Similar to sim-fast (slower)
• Implements some memory op safeguards
  – Memory alignment
  – Memory access permission
• Good for debugging sim-fast crashes
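
The two safeguards can be pictured with the following Python sketch. It is an assumption-laden illustration, not sim-safe's implementation: `SafeMemory`, `MemoryFault`, the region list, and the permission strings are hypothetical names.

```python
# Hypothetical sketch of the two safeguards; not SimpleScalar's implementation.
class MemoryFault(Exception):
    pass

class SafeMemory:
    def __init__(self):
        self.regions = []   # list of (start, end, perms), end exclusive
        self.bytes = {}     # sparse byte store: address -> value

    def map(self, start, size, perms):
        self.regions.append((start, start + size, perms))

    def _check(self, addr, size, perm):
        # Safeguard 1: natural alignment of the access.
        if addr % size != 0:
            raise MemoryFault(f"misaligned {size}-byte access at {addr:#x}")
        # Safeguard 2: the access must fall in a region granting `perm`.
        for start, end, perms in self.regions:
            if start <= addr and addr + size <= end:
                if perm not in perms:
                    raise MemoryFault(f"no '{perm}' permission at {addr:#x}")
                return
        raise MemoryFault(f"unmapped address {addr:#x}")

    def store(self, addr, size, value):
        self._check(addr, size, "w")
        for i in range(size):                      # little-endian byte order
            self.bytes[addr + i] = (value >> (8 * i)) & 0xFF

    def load(self, addr, size):
        self._check(addr, size, "r")
        return sum(self.bytes.get(addr + i, 0) << (8 * i) for i in range(size))
```

A simulator without these checks would silently read or write the wrong bytes, which is why running the checked variant helps localize a crash.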

Sim-EIO
• External trace/checkpoint generator
• Functional simulator like sim-fast and sim-safe
• Implements checks like sim-safe

Sim-Profile
• Profiles by symbol and address
• Keeps track of and reports (many options)
  – Dynamic instruction counts
  – Instruction class counts
  – Usage of address modes
  – Profiles of the text & data segment

Sim-Cache/Sim-Bpred
• Functional core drives the detailed model of the cache/branch predictor
• Similar to trace-driven cache/bpred simulation
• Fast results for miss/misprediction rates
• No timing simulation/performance impact

Cache Implementation
• Block size, # of sets, associativity, all customizable
• Replacement policies: random, FIFO, LRU
• 2-level cache hierarchy supported (easily extended)
• Unified/separate I/D L1 and L2 caches
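
How the three parameters and the LRU policy interact can be sketched as a tag-only cache model in Python. This is a minimal illustration under assumed names (`Cache`, `access`, `miss_rate`), not the SimpleScalar cache module; it tracks hits and misses only and stores no data.

```python
from collections import OrderedDict

class Cache:
    """Set-associative cache model with LRU replacement (tags only, no data)."""
    def __init__(self, nsets, block_size, assoc):
        self.nsets, self.block_size, self.assoc = nsets, block_size, assoc
        self.sets = [OrderedDict() for _ in range(nsets)]  # tag -> True, oldest first
        self.hits = self.misses = 0

    def access(self, addr):
        """Return True on a hit, False on a miss (the block is then filled)."""
        block = addr // self.block_size
        index, tag = block % self.nsets, block // self.nsets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)        # mark as most recently used
            self.hits += 1
            return True
        if len(lines) >= self.assoc:
            lines.popitem(last=False)     # evict the least recently used tag
        lines[tag] = True
        self.misses += 1
        return False

    def miss_rate(self):
        total = self.hits + self.misses
        return self.misses / total if total else 0.0
```

Feeding the model an address stream from a functional core is enough to obtain miss rates, which is the sim-cache style of use described earlier. Swapping `popitem(last=False)` for a random choice, or dropping the `move_to_end` call, would give random or FIFO replacement respectively.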

Implemented Predictors
• Specifying branch predictor type: -bpred <type>
• Implemented predictors
  – nottaken: always predicts not taken
  – perfect: always right
  – bimod: BTB with 2-bit counters
  – 2lev: 2-level adaptive branch predictor
    • Specify level 1 size, level 2 size, history size (# bits), XOR PC and history?
    • GAg, GAp, PAg, PAp, gshare
  – combining (meta) predictor
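
The difference between a table of 2-bit counters and a history-based predictor that XORs the PC with history can be sketched as follows. This is illustrative Python, not the SimpleScalar predictor code; the class names, the initial counter value, and the assumption of 4-byte instructions in `index` are choices made only for this sketch.

```python
class Bimodal:
    """Table of 2-bit saturating counters indexed by the branch PC."""
    def __init__(self, size):
        self.size = size
        self.table = [1] * size            # 0-1 predict not taken, 2-3 predict taken

    def index(self, pc):
        return (pc >> 2) % self.size       # assumes 4-byte instructions

    def predict(self, pc):
        return self.table[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

class GShare(Bimodal):
    """Same counters, indexed by PC XOR a global history register."""
    def __init__(self, size, history_bits):
        super().__init__(size)
        self.history_bits = history_bits
        self.history = 0

    def index(self, pc):
        return ((pc >> 2) ^ self.history) % self.size

    def update(self, pc, taken):
        super().update(pc, taken)          # index still uses the pre-update history
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

def accuracy(predictor, outcomes, pc=0x400):
    """Fraction of correct predictions for one branch with the given outcomes."""
    correct = 0
    for taken in outcomes:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(outcomes)
```

On a strictly alternating taken/not-taken branch, the per-PC counter keeps flipping between its two middle states and mispredicts, while the history-indexed variant learns the pattern after a short warm-up.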

Sim-Outorder
• Detailed performance simulator
• Out-of-order execution core
• Register renaming, reorder buffer, speculative execution
• 2-level cache hierarchy
• Branch prediction

Making a Fast Timing Simulator
• Hardware advantage: parallelism
• Software emulator advantage: free space for auxiliary structures
• Functional and detailed performance parts can be decoupled

Simulator RUU vs. Logical RUU
• No consumer -> producer links (tags)
• Rather, producer -> consumer links for efficient result broadcast at writeback
• Values not tracked (except address)
• “Completed” bit (ROB and IQ combined)
• Same struct used for LSQ

Other Simulator Structures
• ready_queue
  – Ready to issue instructions
  – Used at issue stage
  – Built at writeback
  – Limited issue bandwidth
  – Policy: mul/div, load/store, branch first, then oldest first
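
The issue policy above can be expressed as a sort key. The Python below is a sketch, not the sim-outorder queue code: `Inst`, `PRIORITY_CLASSES`, and `issue_order` are hypothetical names, and ordering the priority classes among themselves by age is an assumption the slide does not state.

```python
from dataclasses import dataclass

# Operation classes that go first, per the policy above.
PRIORITY_CLASSES = {"mul", "div", "load", "store", "branch"}

@dataclass
class Inst:
    seq: int        # program-order sequence number (smaller = older)
    op_class: str

def issue_order(ready, issue_width):
    """Pick up to issue_width instructions: priority classes first, then oldest first."""
    # False sorts before True, so priority-class instructions come first;
    # ties are broken by age (assumed, see lead-in).
    ranked = sorted(ready, key=lambda i: (i.op_class not in PRIORITY_CLASSES, i.seq))
    return ranked[:issue_width]
```

For example, with an ALU op (seq 1), a load (seq 2), another ALU op (seq 3), and a multiply (seq 4) all ready and an issue width of 3, the sketch selects the load, the multiply, and then the oldest ALU op.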

Other Simulator Structures
• fu_pool: functional units
  – Issue latency vs. operational latency
  – Both constant, hard-coded in sim-outorder
  – Read port latency more detailed (cache simulation)
  – Issue latency: busy counter at FU
  – Operational latency: event queue
• event_queue
  – Ordered by cycle
  – Schedules writeback events
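
The two latencies play different roles: issue latency limits how often a unit accepts work, and operational latency decides when the result's writeback event fires. The Python sketch below models one unit with a busy counter and a cycle-ordered event queue. All names (`FunctionalUnit`, `EventQueue`, `simulate`) are hypothetical and the loop is a simplification, not the sim-outorder code.

```python
import heapq

class FunctionalUnit:
    def __init__(self, issue_lat, op_lat):
        self.issue_lat = issue_lat   # cycles before the unit accepts another op
        self.op_lat = op_lat         # cycles until the result is available
        self.busy = 0                # busy counter, decremented every cycle

class EventQueue:
    """Writeback events ordered by the cycle at which they fire."""
    def __init__(self):
        self.heap = []

    def schedule(self, cycle, inst):
        heapq.heappush(self.heap, (cycle, inst))

    def pop_due(self, now):
        due = []
        while self.heap and self.heap[0][0] <= now:
            due.append(heapq.heappop(self.heap)[1])
        return due

def simulate(fu, insts):
    """Issue insts in order on a single unit; return {inst: writeback cycle}."""
    events, done, pending, cycle = EventQueue(), {}, list(insts), 0
    while pending or events.heap:
        for inst in events.pop_due(cycle):         # writeback of finished ops
            done[inst] = cycle
        if fu.busy > 0:
            fu.busy -= 1
        if pending and fu.busy == 0:               # unit free: issue next op
            events.schedule(cycle + fu.op_lat, pending.pop(0))
            fu.busy = fu.issue_lat
        cycle += 1
    return done
```

With `issue_lat=1` and `op_lat=3` the unit behaves like a pipelined multiplier that starts one operation per cycle; with `issue_lat=3` it behaves like an unpipelined one.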

Other Simulator Structures
• create_vector
  – Register renaming
  – Maps logical register to RUU entry (architectural register up to date if entry is null)
  – Updated at dispatch and writeback
  – Similar to Qi in Tomasulo
  – Backed up during dispatch if sim “divines” misspeculation
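
The interplay between the create vector and the producer -> consumer links can be sketched in Python. This is an illustration under assumed names (`RUUEntry`, `Renamer`, `waiting_on`, `consumers`), not the sim-outorder data structures, and it omits the misspeculation backup.

```python
class RUUEntry:
    def __init__(self, seq, dest, srcs):
        self.seq, self.dest, self.srcs = seq, dest, srcs
        self.waiting_on = 0      # number of source operands not yet produced
        self.consumers = []      # producer -> consumer links used at writeback
        self.completed = False

class Renamer:
    def __init__(self):
        # logical register -> producing RUUEntry; a missing key plays the role
        # of the null entry (the architectural register is up to date)
        self.create_vector = {}

    def dispatch(self, entry):
        for src in entry.srcs:
            producer = self.create_vector.get(src)
            if producer is not None:
                producer.consumers.append(entry)   # link producer to consumer
                entry.waiting_on += 1
        if entry.dest is not None:
            self.create_vector[entry.dest] = entry

    def writeback(self, entry):
        """Mark entry complete and return the consumers that became ready."""
        entry.completed = True
        if self.create_vector.get(entry.dest) is entry:
            del self.create_vector[entry.dest]     # register file is current again
        woken = []
        for consumer in entry.consumers:
            consumer.waiting_on -= 1
            if consumer.waiting_on == 0:
                woken.append(consumer)
        return woken
```

Because each producer holds a list of its consumers, writeback walks only the instructions that actually wait on the result instead of searching every entry for a matching tag.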

Stage Implementation (greater detail in SS hack guide and v2.0 tutorial)
• Fetch (at ruu_fetch())
  – Fetch instructions up to cache line boundary
  – Block on miss
  – Put in Fetch Queue (FQ)
  – Probe branch predictor for next cache line to fetch from

Stage Implementation
• Decode (at ruu_dispatch())
  – Fetch from FQ
  – Decode
  – Rename registers
  – Put into RUU and LSQ
  – Update RUU dependency lists, renaming table
  – Sim (functional core):
    • Execute instruction, update machine state
    • Detect branch misprediction, backup state (checkpoint)

Stage Implementation
• Issue (at ruu_issue() and lsq_refresh())
  – Ready queue -> Event queue
  – Order based on policy (see ready_queue slide)
  – Check and reserve FUs
  – Memory ops check memory dependences
    • No load speculation (“maybe” dependence respected)

Stage Implementation
• Writeback (at ruu_writeback())
  – Get finished instructions from event queue
  – Wake up (put in ready queue) instructions with ready operands (use dependence list)
  – Performance core detects misprediction here and rolls back the state

Stage Implementation
• Commit (at ruu_commit())
  – Service D-TLB misses
  – Update register file (logically) and rename table
  – Retire stores to D-cache
  – Reclaim RUU/LSQ entries of retirees

Limitations
• I/O and other system calls
  – Only some limited functional simulation
• Lacks support for arbitrary speculation
  – Only branches can cause rollback
  – No speculative loads
  – Harder problem: decoupling of functional and timing cores (for performance) complicates data speculation extensions

Limitations of Memory System
• No causality in memory system
  – All events are calculated at time of first access
• Simple TLB and virtual memory model
  – No address translation
  – TLB miss is a (small) fixed latency parallel with cache access
• Bandwidth, non-blocking
  – Modeled as n FUs (read/write ports with fixed issue delay)
• Accurate if memory system is lightly utilized
  – SMT extensions?
• Overhaul required for multiprocessor simulation
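
One reading of "fixed latency parallel with cache access" is that a load's latency is the larger of the cache latency and the TLB latency, computed once when the access is made. The Python below sketches that reading. The function name and all default latency values are placeholders chosen for illustration, not values taken from the slides.

```python
def load_latency(cache_hit, tlb_hit, hit_lat=1, miss_lat=18, tlb_miss_lat=30):
    """Latency of one load, decided entirely at the time of the access.

    All default latencies are placeholder values for this sketch.
    """
    cache_lat = hit_lat if cache_hit else miss_lat
    tlb_lat = 0 if tlb_hit else tlb_miss_lat
    # The TLB miss overlaps the cache access, so the longer of the two wins.
    return max(cache_lat, tlb_lat)
```

Since the number is fixed when the access is made, later traffic cannot delay a request that is already in flight, which is the "no causality" limitation listed above.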

Extensions
www.simplescalar.com
• Trace cache
• Value prediction
• SMT
• Multiprocessor
• More target ISA / host OS support

Miscellaneous
• Enter “sim-outorder” with no options to get options list and usage examples
• Options database (options.h)
  – Interface to register an option and check if entered at initialization
• Stats database (stats.h)
  – Register new stats
  – Define secondary stats and track distributions
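
The idea of primary stats plus secondary stats derived from them can be sketched as a small registry. This Python class is a hypothetical stand-in and does not reproduce the stats.h interface; `StatsDB`, its method names, and the stat names are illustrative labels only.

```python
class StatsDB:
    """Hypothetical registry: plain counters plus formulas evaluated at report time."""
    def __init__(self):
        self.counters = {}
        self.formulas = {}

    def register_counter(self, name):
        self.counters[name] = 0

    def register_formula(self, name, fn):
        self.formulas[name] = fn      # fn takes the counter dict, returns a number

    def add(self, name, amount=1):
        self.counters[name] += amount

    def report(self):
        out = dict(self.counters)
        for name, fn in self.formulas.items():
            out[name] = fn(self.counters)   # secondary stat derived from counters
        return out

stats = StatsDB()
stats.register_counter("sim_num_insn")
stats.register_counter("sim_cycle")
stats.register_formula(
    "sim_IPC",
    lambda c: c["sim_num_insn"] / c["sim_cycle"] if c["sim_cycle"] else 0.0,
)
```

Evaluating formulas only at report time keeps the simulation loop down to simple counter increments.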
