Confessions of a RAMP Heretic: Fast, Full-System, Cycle-Accurate x86/PowerPC/ARM/Sparc Simulators
Derek Chiou
University of Texas at Austin
Electrical and Computer Engineering
6/15/06, Derek Chiou, UT Austin, RAMP

FAST Goals
  • Fast: as fast as possible
    • 2-3 orders of magnitude slower than target?
    • Fast enough to run real datasets to completion
    • Interactive?
  • Accurate: produce cycle-accurate numbers for modern microprocessors (Pentium M)
  • Complete: run unmodified operating systems, applications, ISAs, …
  • Transparent: full visibility, no performance hit
  • Inexpensive: need thousands
  • Usable: quick changes, use RTL to generate
  • I/O: the MOST important part of systems

Functional/Timing Partitioning
  • Proven partitioning
    • Asim, SimpleScalar, Timing-First, Memoized, etc.
  • Simplifies simulator
  • Promotes reuse
  • Same performance in software
    • Asim at 10 KHz
    • Most of the time spent in timing model!
    • Hardware???
  [Figure: the Functional Model (ISA: instructions, architectural registers, peripheral functionality, …) sends an inst stream to the Timing Model (micro-architecture: fetch, decode, rename, reservation stations, scheduling window, reorder buffer, …)]

FAST
  [Figure: the Functional Model (ISA), a full-system simulator, sends an inst stream to the Timing Model (micro-architecture) on an FPGA]
  • Functional model could be
    • Pure software (QEMU, Bochs, Simics, SimNow)
      • Use JIT for performance, very fast
      • No better hardware for executing ISA than processor
      • Can operate under the covers (flush cache for example)
    • Pure hardware (Hoe et al)
    • Hybrid (Hoe et al)
  • Timing model very simple hardware
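
The functional/timing split can be illustrated with a minimal software sketch. Everything here is an assumption made for illustration: the three-instruction toy ISA, the `TraceRecord` fields, and the one-cycle load-use stall are hypothetical and are not taken from the FAST implementation. The point is only that the functional model computes values while the timing model sees the instruction stream and never touches data.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple

# Hypothetical toy instruction: (op, dest, src1, src2_or_imm)
Inst = Tuple[str, int, int, int]

@dataclass
class TraceRecord:
    """What the functional model hands to the timing model (no data values)."""
    pc: int
    op: str
    dest: Optional[int]
    srcs: Tuple[int, ...]

def functional_model(program: List[Inst], regs: List[int],
                     mem: List[int]) -> Iterator[TraceRecord]:
    """Executes the toy ISA and emits one trace record per instruction."""
    pc = 0
    while pc < len(program):
        op, d, a, b = program[pc]
        if op == "addi":
            regs[d] = regs[a] + b
            yield TraceRecord(pc, op, d, (a,))
        elif op == "add":
            regs[d] = regs[a] + regs[b]
            yield TraceRecord(pc, op, d, (a, b))
        elif op == "load":
            regs[d] = mem[regs[a] + b]
            yield TraceRecord(pc, op, d, (a,))
        else:
            raise ValueError(f"unknown op {op}")
        pc += 1

def timing_model(trace: Iterator[TraceRecord]) -> int:
    """Counts cycles for an assumed in-order, single-issue pipeline
    with a one-cycle stall when an instruction uses the previous load."""
    cycles = 0
    prev_load_dest = None
    for rec in trace:
        cycles += 1
        if prev_load_dest is not None and prev_load_dest in rec.srcs:
            cycles += 1  # load-use stall
        prev_load_dest = rec.dest if rec.op == "load" else None
    return cycles
```

Because the timing model only consumes `TraceRecord` objects, either side could be replaced independently, which is the reuse argument made for this partitioning.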

What is a FAST Timing Model?
  [Figure: pipelined datapath; labels include Trace, PC, Add, Instruction Memory, IR, GPR File, Immed. Extend, ALU, Data Memory, and Bypass/interlock]

More Complexity
  • Caches/TLBs?
    • Keep tags, pass address (virtual and physical if necessary)
    • Hits, misses determined but don’t need data
  • Superscalar (multiple issue)?
    • “Fetch and issue” multiple instructions assuming they meet boundary constraints
    • Multiple “functional units”
  • Schedulers
  • Reorder buffer/instruction window
  • Pipeline control along with instructions
  • NO DATAPATH (and only part of control path)!!!!
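
The "keep tags, hits and misses determined but don't need data" idea can be sketched as a cache model that stores only tags. This is a software illustration under assumptions, not the FAST hardware: the class name `TagOnlyCache` is hypothetical, and LRU replacement is assumed because the slides do not state a policy.

```python
from collections import OrderedDict

class TagOnlyCache:
    """Set-associative cache model that tracks tags but holds no data."""

    def __init__(self, size_bytes: int, ways: int, line_bytes: int):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (ways * line_bytes)
        # One ordered dict per set; insertion order doubles as LRU order.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Returns True on a hit, False on a miss. No data is read or written."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            tags.move_to_end(tag)      # mark as most recently used
            self.hits += 1
            return True
        if len(tags) >= self.ways:
            tags.popitem(last=False)   # evict least recently used (assumed policy)
        tags[tag] = True
        self.misses += 1
        return False
```

A timing model built this way needs only the addresses from the functional model to decide how many cycles an access takes.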

Driving a Timing Model
  [Figure: the Functional Model drives a timing model whose blocks are labeled iTLB, iCache, Align & Pick, Decode, Sched, dTLB, dCache, L2 Cache, and Memory & I/O timing models]

Functional Model Complexity: BP
  • Wrong-path instructions!
    • Implement BP in timing model
    • Timing model forces ISA simulator to mis-speculate
    • Rollback, restore
  • BP only works in processor if it’s fairly accurate
    • Degrades to trace driven!
  • FAST simulators take advantage of the fact that most of the time micro-architecture is on the right path
    • Most complexity (BP, parallelism) can be handled this way
  [Figure: timing-model diagram with blocks labeled iTLB, iCache, Align & Pick, Decode, Sched, dTLB, dCache, L2 Cache, and Memory & I/O timing models]
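
A minimal sketch of branch prediction living in the timing model: the predictor is compared against the right-path outcomes reported by the functional model, and only the disagreements require the functional model to be forced onto the wrong path. The 2-bit saturating counter scheme, the table size, and the names `TwoBitPredictor` and `find_mispredictions` are assumptions chosen for illustration.

```python
from typing import Iterable, List, Tuple

class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by PC (assumed organization)."""

    def __init__(self, entries: int = 1024):
        self.entries = entries
        self.counters = [1] * entries  # start weakly not-taken

    def predict(self, pc: int) -> bool:
        return self.counters[pc % self.entries] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

def find_mispredictions(branches: Iterable[Tuple[int, int, bool]],
                        predictor: TwoBitPredictor) -> List[int]:
    """branches: (inst_num, pc, taken) tuples from right-path execution.
    Returns the dynamic instruction numbers at which the timing model
    would have to make the functional model mis-speculate."""
    wrong = []
    for inst_num, pc, taken in branches:
        if predictor.predict(pc) != taken:
            wrong.append(inst_num)
        predictor.update(pc, taken)
    return wrong
```

For a loop branch taken nine times and then not taken, this sketch flags only the first and last iterations, which matches the observation that the micro-architecture is on the right path most of the time.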

Parallelism: Detect Problem & Rollback
  [Figure: functional models (FM) and timing models (TM) shown with blocks labeled Memory and Network Memory Model]

Functional Model Rollback
  • Need to
    • Rollback, force branch
    • Rollback, restore and continue
  • How?
    • set_pc(inst_num, pc)
      • Set a particular dynamic instance of an instruction to a particular instruction pointed to by PC
  • Currently implemented with checkpoints
    • Sufficient ISA state, memory, peripherals
  • Works for parallelism too
  [Figure: instruction stream with branches (BR) marked]
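
The `set_pc(inst_num, pc)` operation with checkpoints can be sketched on a toy functional model. Only the `set_pc` name and its two arguments come from the slide; the accumulator ISA, the per-instruction checkpoint granularity, and the class name are hypothetical simplifications. A real model would also have to checkpoint memory and peripheral state.

```python
from typing import Dict, List, Tuple

class ToyFunctionalModel:
    """Accumulator machine with a checkpoint before every dynamic instruction."""

    def __init__(self, program: List[Tuple[str, int]]):
        self.program = program
        self.pc = 0
        self.acc = 0
        self.inst_num = 0  # number of the next dynamic instruction
        # checkpoints[n] = (pc, acc) just before dynamic instruction n
        self.checkpoints: Dict[int, Tuple[int, int]] = {0: (0, 0)}

    def step(self) -> None:
        op, arg = self.program[self.pc]
        if op == "add":
            self.acc += arg
            self.pc += 1
        elif op == "bnez":  # branch to arg if acc != 0
            self.pc = arg if self.acc != 0 else self.pc + 1
        else:
            raise ValueError(f"unknown op {op}")
        self.inst_num += 1
        self.checkpoints[self.inst_num] = (self.pc, self.acc)

    def set_pc(self, inst_num: int, pc: int) -> None:
        """Roll back so that dynamic instruction inst_num is fetched from pc."""
        _, saved_acc = self.checkpoints[inst_num]
        self.acc = saved_acc
        self.pc = pc
        self.inst_num = inst_num
        # Discard checkpoints of squashed instructions, refresh this one.
        for n in [n for n in self.checkpoints if n > inst_num]:
            del self.checkpoints[n]
        self.checkpoints[inst_num] = (pc, saved_acc)
```

Calling `set_pc` with a wrong-path address corresponds to "rollback, force branch"; calling it again with the correct address once the branch resolves corresponds to "rollback, restore and continue".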

RTL to Timing Model
  [Figure: pipelined datapath; labels include Trace, PC, Add, Instruction Memory, IR, GPR File, Immed. Extend, ALU, Data Memory, and Bypass/interlock]
  • Timing model perfectly models RTL
  • Verification???

Current FAST System

QEMU on Xilinx PowerPC

Status
  • x86 functional model boots Linux, targeting 80486 to Pentium D-like and beyond (Dam Sunwoo)
    • Modified Bochs and QEMU
  • Branch-predicted multi-function unit, OOO timing model compiles in Bluespec (FAST group)
    • Synthesized for FPGA, 8.5 K lines of code, rated Top 5 User!
    • Memory, disk models
      • Hope to have network model soon
  • Have straight pipeline 486 model with TLBs and caches
  • Preliminary statistics gathered in hardware timing model
  • RTL-to-timing model (Nikhil Patil)
  • Defining tools for ISA extension and timing model assembly

Timing Model Resources
  • OOO, superscalar, 2b branch prediction, five functional units, 32 KB DCache
    • [INTERFACE: Fast_if] + [TM: IfcVB (interface bt. Bluespec & Verilog)/CmdQ/Fetch/Decode/Rename/Execute]: 26% of V2P30 (3593 slices)
    • 22 block RAMs (out of 136)
    • ROB broken right now
  • Early configurable cache model (state shouldn’t change much)
    • 32 KB 4-way set-associative cache with 16 B cache-lines
      • 165 slices (1% of a 2VP30)
      • 17 block RAMs (12% of a 2VP30)
    • 2 MB 4-way set-associative cache with 64 B cache-lines
      • 140 slices (1% of a 2VP30)
      • 40 block RAMs (29% of a 2VP30)
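
As a worked example of why a tag-only cache model stays small, the geometry of the two cache configurations listed above can be computed directly. The 32-bit address width is an assumption (the slides do not give one), the function name is hypothetical, and the tag-storage figure excludes valid, dirty, and replacement state, so it is not meant to reproduce the block RAM counts reported on the slide.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int,
                   addr_bits: int = 32) -> dict:
    """Derives set count and address field widths for a power-of-two cache.
    addr_bits=32 is an assumed address width."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    offset_bits = line_bytes.bit_length() - 1  # log2 for powers of two
    index_bits = sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": tag_bits,
        "tag_storage_bits": lines * tag_bits,
    }

small = cache_geometry(32 * 1024, ways=4, line_bytes=16)
large = cache_geometry(2 * 1024 * 1024, ways=4, line_bytes=64)
print(small)
print(large)
```

Under these assumptions the 2 MB cache needs 425,984 bits of tags, against 16,777,216 bits for the data it stands in for.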

Current Performance
  • Functional model
    • Up to 500 K x86 inst/sec today on V2P30 FPGA
      • Includes rollbacks assuming 5% mis-speculation
      • Not that optimized
    • 5 MIPS unmodified
    • 10 M+ on 3.0 GHz Pentium 4
      • DRC box should give this performance
    • PowerPC ISA should be much faster!
      • PowerPC on PowerPC
  • Timing model
    • Not bottleneck!

Conclusions
  • 1 MHz to 100 MHz, cycle-accurate, full-system, multiprocessor x86, x86-64, PowerPC, ARM, Sparc simulator
    • Leverage extant full-system simulators
    • FPGA timing models maximize performance and statistics-gathering capabilities
  • Pretty much any timing model seems to fit into a single FPGA (Pentium M in V2P30?)
    • Uniprocessor, multi-processor capable
  • Tools can minimize creation/modification effort