Confessions of a RAMP Heretic: Fast, Full-System, Cycle-Accurate
- Slides: 17
Confessions of a RAMP Heretic: Fast, Full-System, Cycle-Accurate x86/PowerPC/ARM/Sparc Simulators
Derek Chiou, University of Texas at Austin, Electrical and Computer Engineering
6/15/06, Derek Chiou, UT Austin, RAMP
FAST Goals
- Fast: as fast as possible
  - 2-3 orders of magnitude slower than target?
  - Fast enough to run real datasets to completion
  - Interactive?
- Accurate: produce cycle-accurate numbers for modern microprocessors (Pentium M)
- Complete: run unmodified operating systems, applications, ISAs, …
- Transparent: full visibility, no performance hit
- Inexpensive: need thousands
- Usable: quick changes, use RTL to generate
- I/O: the MOST important part of systems
Functional/Timing Partitioning
- Proven partitioning: Asim, Simplescalar, Timing-First, Memoized, etc.
- Simplifies simulator
- Promotes reuse
- Same performance in software
  - Asim at 10 KHz
  - Most of the time spent in timing model!
- Hardware???
[Figure: a Functional Model (ISA) sends an instruction stream to a Timing Model (Micro-architecture). Functional model: instructions, architectural registers, peripheral functionality, … Timing model: fetch, decode, rename, reservation stations, scheduling window, reorder buffer, …]
FAST
- Functional model could be
  - Pure software (QEMU, Bochs, Simics, SimNow)
    - Use JIT for performance, very fast
    - No better hardware for executing ISA than processor
    - Can operate under the covers (flush cache for example)
  - Pure hardware (Hoe et al)
  - Hybrid (Hoe et al)
- Timing model very simple hardware
[Figure: a full-system simulator acting as the Functional Model (ISA) sends an instruction stream to the Timing Model (Micro-architecture) on an FPGA]
What is a FAST Timing Model?
[Figure: pipelined processor datapath; labels include Trace, PC, Instruction Memory, IR pipeline registers, GPR File, Immed. Extend, ALU, Data Memory, and Bypass/interlock logic]
More Complexity
- Caches/TLBs?
  - Keep tags, pass address (virtual and physical if necessary)
  - Hits, misses determined but don't need data
- Superscalar (multiple issue)?
  - "Fetch and issue" multiple instructions assuming they meet boundary constraints
  - Multiple "functional units"
- Schedulers
- Reorder buffer/instruction window
- Pipeline control along with instructions
- NO DATAPATH (and only part of control path)!!!!
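The "keep tags, pass address" idea can be sketched in software. The following is a minimal illustration, not the FAST implementation: the class name `TagOnlyCache` and the LRU replacement policy are assumptions (the slide does not name a policy). The model decides hits and misses from addresses alone and never stores data.

```python
class TagOnlyCache:
    """Set-associative cache model that stores tags but never data.

    Hypothetical sketch: LRU replacement is an assumption, not from the slides.
    """

    def __init__(self, size_bytes, ways, line_bytes):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (ways * line_bytes)
        # One list of tags per set, ordered least- to most-recently used.
        self.sets = [[] for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Record one access; return True on a hit, False on a miss."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            tags.remove(tag)   # move the tag to the most-recently-used slot
            tags.append(tag)
            self.hits += 1
            return True
        if len(tags) == self.ways:
            tags.pop(0)        # evict the least-recently-used tag
        tags.append(tag)
        self.misses += 1
        return False


# 32 KB, 4-way, 16 B lines: one of the configurations quoted later in the deck.
cache = TagOnlyCache(size_bytes=32 * 1024, ways=4, line_bytes=16)
results = [cache.access(a) for a in (0x1000, 0x1004, 0x1010, 0x1000)]
print(results, cache.hits, cache.misses)
```

Because only tags are kept, the state per line is a few bits rather than a full cache line, which is the reason such a model needs no datapath.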
Driving a Timing Model
[Figure: the Functional Model drives a timing model made of iTLB, iCache, Align & Pick, Decode, Sched, dTLB, dCache, L2 Cache, and Memory & I/O timing models]
Functional Model Complexity: BP
- Wrong-path instructions!
  - Implement BP in timing model
  - Timing model forces ISA simulator to mis-speculate
  - Rollback, restore
- BP only works in processor if it's fairly accurate
  - Degrades to trace driven!
- FAST simulators take advantage of the fact that most of the time micro-architecture is on the right path
- Most complexity (BP, parallelism) can be handled this way
[Figure: timing-model block diagram: iTLB, iCache, Align & Pick, Decode, Sched, dTLB, dCache, L2 Cache, Memory & I/O timing models]
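As a rough illustration of keeping the branch predictor in the timing model, the sketch below compares a 2-bit saturating-counter prediction against the outcome reported by the functional model. The names `TwoBitPredictor` and `count_mispredictions`, the table size, the PC-modulo indexing, and the initial counter value are all assumptions made for the example. Each mismatch marks a point where the timing model would have to force the functional model down the wrong path and later roll it back.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by PC (hypothetical sketch)."""

    def __init__(self, entries=1024):
        self.entries = entries
        # Counter values 0-1 predict not-taken, 2-3 predict taken.
        self.counters = [1] * entries  # assumed start: weakly not-taken

    def predict(self, pc):
        return self.counters[pc % self.entries] >= 2

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(branch_trace, predictor):
    """branch_trace: iterable of (pc, taken) pairs from the functional model."""
    mispredicts = 0
    for pc, taken in branch_trace:
        if predictor.predict(pc) != taken:
            # The timing model would request wrong-path instructions here.
            mispredicts += 1
        predictor.update(pc, taken)
    return mispredicts


# A loop branch taken five times, then falling through once.
trace = [(0x400, True)] * 5 + [(0x400, False)]
print(count_mispredictions(trace, TwoBitPredictor()))
```

When mismatches are rare, the functional model mostly runs ahead undisturbed, which is the property the slide relies on.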
Parallelism: Detect Problem & Rollback
[Figure: functional models (FM) and timing models (TM) connected to Memory, Network, and Memory Model blocks]
Functional Model Rollback
- Need to
  - Rollback, force branch
  - Rollback, restore and continue
- How?
  - set_pc(inst_num, pc)
    - Set a particular dynamic instance of an instruction to a particular instruction pointed to by PC
- Currently implemented with checkpoints
  - Sufficient ISA state, memory, peripherals
- Works for parallelism too
[Figure: instruction streams with branch (BR) points]
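A toy version of checkpoint-based rollback might look like the following. Only the call shape `set_pc(inst_num, pc)` comes from the slide; the class `ToyFunctionalModel`, its two-instruction ISA (`addi`, `bnez`), and the choice to checkpoint after every instruction are hypothetical simplifications. A real functional model would also have to checkpoint memory and peripheral state, as the slide notes.

```python
import copy


class ToyFunctionalModel:
    """Hypothetical functional model with per-instruction checkpoints."""

    def __init__(self, program):
        self.program = program              # list of (op, arg) tuples
        self.state = {"pc": 0, "acc": 0}    # stand-in for full ISA state
        self.inst_num = 0                   # dynamic instructions executed
        # checkpoints[n] holds the state just before dynamic instruction n.
        self.checkpoints = {0: copy.deepcopy(self.state)}

    def step(self):
        op, arg = self.program[self.state["pc"]]
        if op == "addi":
            self.state["acc"] += arg
            self.state["pc"] += 1
        elif op == "bnez":
            taken = self.state["acc"] != 0
            self.state["pc"] = arg if taken else self.state["pc"] + 1
        self.inst_num += 1
        self.checkpoints[self.inst_num] = copy.deepcopy(self.state)

    def set_pc(self, inst_num, pc):
        """Roll back so that dynamic instruction inst_num is fetched from pc."""
        self.state = copy.deepcopy(self.checkpoints[inst_num])
        self.state["pc"] = pc
        self.inst_num = inst_num
        # Checkpoints of younger (squashed) instructions are no longer valid.
        self.checkpoints = {n: s for n, s in self.checkpoints.items()
                            if n <= inst_num}


program = [("addi", 1), ("bnez", 3), ("addi", 10), ("addi", 100)]
fm = ToyFunctionalModel(program)
fm.step()          # inst 0: acc = 1
fm.step()          # inst 1: branch taken, pc = 3
fm.set_pc(2, 2)    # rollback, force branch: run inst 2 on the wrong path
fm.step()          # wrong-path addi 10
fm.set_pc(2, 3)    # rollback, restore: resume inst 2 on the right path
fm.step()          # right-path addi 100
print(fm.state, fm.inst_num)
```

The same call serves both uses listed on the slide: forcing a mispredicted path and restoring the correct one once the branch resolves.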
RTL to Timing Model
[Figure: pipelined processor datapath; labels include Trace, PC, Instruction Memory, IR pipeline registers, GPR File, Immed. Extend, ALU, Data Memory, and Bypass/interlock logic]
- Timing model perfectly models RTL
- Verification???
Current FAST System
QEMU on Xilinx PowerPC
Status
- x86 functional model boots Linux, targeting 80486 to Pentium D-like and beyond (Dam Sunwoo)
  - Modified Bochs and QEMU
- Branch-predicted multi-function unit, OOO timing model compiles in Bluespec (FAST group)
  - Synthesized for FPGA, 8.5K lines of code, rated Top 5 User!
- Memory, disk models
  - Hope to have network model soon
- Have straight pipeline 486 model with TLBs and caches
- Preliminary statistics gathered in hardware timing model
- RTL-to-timing model (Nikhil Patil)
- Defining tools for ISA extension and timing model assembly
Timing Model Resources
- OOO, superscalar, 2b branch prediction, five functional units, 32 KB DCache
  - [INTERFACE: Fast_if] + [TM: IfcVB (interface between Bluespec & Verilog)/CmdQ/Fetch/Decode/Rename/Execute]: 26% of V2P30 (3593 slices)
  - 22 block RAMs (out of 136)
  - ROB broken right now
- Early configurable cache model (state shouldn't change much)
  - 32 KB 4-way set-associative cache with 16 B cache-lines
    - 165 slices (1% of a 2VP30)
    - 17 block RAMs (12% of a 2VP30)
  - 2 MB 4-way set-associative cache with 64 B cache-lines
    - 140 slices (1% of a 2VP30)
    - 40 block RAMs (29% of a 2VP30)
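The arithmetic behind the two cache configurations can be checked with a short sketch. The helper names `cache_geometry` and `bram_share` are hypothetical, and the 32-bit address width is an assumption made only to obtain a tag width. The figure of 136 block RAMs is taken from the slide. The tag-store estimate ignores valid and replacement-state bits, so it is a lower bound rather than a prediction of the block RAM counts above.

```python
def cache_geometry(size_bytes, ways, line_bytes, addr_bits=32):
    """Derive line/set counts and tag width for a power-of-two cache.

    addr_bits=32 is an assumption; the slides do not state an address width.
    """
    lines = size_bytes // line_bytes
    sets = lines // ways
    offset_bits = line_bytes.bit_length() - 1   # log2 for powers of two
    index_bits = sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    return {
        "lines": lines,
        "sets": sets,
        "tag_bits": tag_bits,
        "tag_store_bits": lines * tag_bits,     # tags only, no state bits
    }


def bram_share(used, total=136):
    """Percentage of the device's block RAMs in use (136 is from the slide)."""
    return 100.0 * used / total


small = cache_geometry(32 * 1024, ways=4, line_bytes=16)
large = cache_geometry(2 * 1024 * 1024, ways=4, line_bytes=64)
print(small, f"{bram_share(17):.1f}%")
print(large, f"{bram_share(40):.1f}%")
```

The 2 MB configuration has 16 times as many lines as the 32 KB one, yet it models 64 times the capacity, because only tags are stored.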
Current Performance
- Functional model
  - Up to 500K x86 inst/sec today on V2P30 FPGA
    - Includes rollbacks assuming 5% mis-speculation
    - Not that optimized
  - PowerPC ISA should be much faster!
    - PowerPC on PowerPC
  - 5 MIPS unmodified
  - 10M+ on 3.0 GHz Pentium 4
    - DRC box should give this performance
- Timing model
  - Not bottleneck!
Conclusions
- 1 MHz to 100 MHz, cycle-accurate, full-system, multiprocessor x86, x86-64, PowerPC, ARM, Sparc simulator
- Leverage extant full-system simulators
- FPGA timing models maximize performance and statistic gathering capabilities
- Pretty much any timing model seems to fit into a single FPGA (Pentium M in V2P30?)
- Uniprocessor, multi-processor capable
- Tools can minimize creation/modification effort