FPGA-based Fast, Cycle-Accurate Full System Simulators
Derek Chiou, Huzefa Sanjeliwala, Dam Sunwoo, John Xu and Nikhil Patil
University of Texas at Austin

Wouldn’t it be nice to have a simulator that is
- Fast
  - 10M cycles per second, fast enough to run real datasets to completion
- Accurate
  - Produces cycle-accurate numbers
- Complete
  - Runs real operating systems and applications
- Transparent
  - Can see everything in the processor, with no performance hit
- Inexpensive
  - Need thousands
- Usable
  - Quick changes, easy to see performance

Software?
- Software-based simulators inherently cannot achieve this speed and be cycle-accurate at the same time
  - A 128-entry, fully-associative TLB at the limit requires 128 load and compare operations
  - Arbitration requires first looking across multiple bidders
  - There are lots of these structures in a complex processor!
    - Thousands to tens of thousands of events
- Even with perfect parallelism, need a lot of CPUs
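To make the TLB point concrete, here is a minimal sketch of how a software simulator might look up a fully-associative TLB. The class and field names (`SoftwareTlb`, `vpn`, `ppn`, `compares`) are hypothetical and not taken from any simulator named in these slides; the counter only illustrates that a miss, or a hit in the last slot, costs one load and compare per entry, where hardware would check all entries in parallel.

```python
from dataclasses import dataclass


@dataclass
class TlbEntry:
    valid: bool = False
    vpn: int = 0  # virtual page number (hypothetical field name)
    ppn: int = 0  # physical page number (hypothetical field name)


class SoftwareTlb:
    """Fully-associative TLB modeled sequentially, as software must do it."""

    def __init__(self, entries=128):
        self.entries = [TlbEntry() for _ in range(entries)]
        self.compares = 0  # counts load-and-compare operations performed

    def insert(self, slot, vpn, ppn):
        self.entries[slot] = TlbEntry(True, vpn, ppn)

    def lookup(self, vpn):
        # Software visits entries one at a time; hardware compares all at once.
        for entry in self.entries:
            self.compares += 1
            if entry.valid and entry.vpn == vpn:
                return entry.ppn
        return None  # miss: every entry was examined
```

A single simulated access can therefore cost up to 128 host operations for this one structure alone.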

Hardware
- Clearly, hardware is necessary
- Reconfigurability (read: FPGAs) is required for flexibility
- But how?

Full Implementation?
- Take RTL code, compile for FPGA
  - Implementing a full system in an FPGA is prohibitively large
  - Shih-Lien Lu’s group has a single original Pentium (586, 3.1M transistors) in the largest Xilinx FPGA
  - Emulate a Pentium M in a single FPGA?
    - 140M transistors
- Instead, what about
  - Accurately (to cycle resolution) simulating its behavior
  - Running real, unmodified applications and OS
  - With full visibility at full speed?
- If execution speeds are reasonable, do I care?

Can I Partition the Problem?
- A 64b adder is way too big to be implemented as a single monolithic entity
- But I can implement 64 1b adders very easily, with very little state and complexity
- Partitioning is good if possible
- But how to partition?
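The adder analogy can be sketched in a few lines: a wide adder built by chaining 64 one-bit full adders, each holding almost no state. This is the standard ripple-carry construction, shown here only as an illustration of partitioning; the function names are hypothetical.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return sum_bit, carry_out


def ripple_add(x, y, width=64):
    """Add two unsigned integers by chaining `width` one-bit full adders."""
    carry = 0
    result = 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry  # carry is the carry-out of the top bit
```

Each one-bit stage is trivial on its own; the only thing passed between stages is a single carry bit.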

Classic Partitioning
- On module boundary
  - Caches, memories, ALUs, processors, memory controllers
- Partitioning doesn’t save state or complexity, but enables the design to be partitioned over multiple FPGAs and software
- Problems?
[Figure: pipelined processor datapath (PC, instruction cache/memory, GPR file, ALU, data cache/memory, immediate extend, bypass paths)]

Functional/Timing Partition
- Functional model simulates the ISA
- Timing model simulates the micro-architecture
- Asim and SimpleScalar are written like this
  - Software
  - One processor
  - Lots of interaction between functional and timing
    - Intended to avoid rollback of any component
- Put the timing model in an FPGA???
  - Parallel component executed in hardware!

UT FAST Partitioning
- On ISA/micro-architecture boundary (ISA + FPGA)
- Instruction trace generated by ISA simulator (e.g., Bochs, Simics)
  - Fast, full system but no timing information (could be hardware!!!)
- What do we need to simulate in the timing model?
[Figure: pipelined processor datapath (PC, Trace, instruction memory, GPR file, ALU, data memory, immediate extend, bypass paths)]
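As a rough illustration of the split described on this slide, the sketch below has a functional model that emits an instruction trace with no timing information and a separate timing model that consumes the trace and accounts for cycles. Everything here is an assumption made for illustration: the trace fields, the three instruction classes, and the fixed latencies are hypothetical and are not the FAST trace format.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class TraceEntry:
    pc: int
    op: str                     # "alu", "load" or "store" (assumed classes)
    addr: Optional[int] = None  # memory address, if any


def functional_model(program) -> Iterator[TraceEntry]:
    """Stand-in for an ISA simulator: runs in order, emits a trace, no timing."""
    for index, (op, addr) in enumerate(program):
        yield TraceEntry(pc=index * 4, op=op, addr=addr)


class TimingModel:
    """Consumes the trace and charges cycles; never touches data values."""

    LATENCY = {"alu": 1, "load": 3, "store": 2}  # assumed cycle costs

    def __init__(self):
        self.cycles = 0
        self.retired = 0

    def consume(self, entry: TraceEntry):
        self.cycles += self.LATENCY[entry.op]
        self.retired += 1


def simulate(program) -> TimingModel:
    timing = TimingModel()
    for entry in functional_model(program):
        timing.consume(entry)
    return timing
```

The one-way flow from `functional_model` to `TimingModel` is the point of the sketch: the two sides can be developed, and in principle hosted, separately.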

UT FAST Complex Processors
- Straight pipelines are easy; what about
  - Caches/TLBs?
    - Keep tags, pass address (virtual and physical if necessary)
    - Hits, misses determined but don’t need data
  - Superscalar (multiple issue)?
    - “Fetch and issue” multiple instructions assuming they meet boundary constraints
    - Multiple “functional units”
    - Reservation stations
    - Reorder buffer
    - Pipeline control along with instructions
- NO DATAPATH!!!
- Timing model speed almost unimportant!
  - Multi-cycle memories to create more ports
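The "keep tags, don't need data" idea can be sketched as a cache model that stores only tags and reports hit or miss for each address. The direct-mapped organization, the sizes, and the names below are assumptions chosen to keep the example short; they are not a description of the FAST timing model.

```python
class TagOnlyCache:
    """Direct-mapped cache model that tracks tags only, never data."""

    def __init__(self, num_sets=64, line_bytes=32):
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        self.tags = [None] * num_sets  # one tag per set, no data array
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Return True on a hit; on a miss, install the tag and return False."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag
        self.misses += 1
        return False
```

Because the functional model already produced the correct data, the timing side only needs the address stream to decide how many cycles each access should cost.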

Example of Complication: Branch Prediction
- Must process mis-speculated instructions in the timing model
  - Implement BP in the timing model
  - Timing model forces the ISA simulator to mis-speculate
    - Rollback, restore
    - Requires support from the ISA simulator
- Branch predictor in the ISA simulator?
- BP only works in a processor if it’s fairly accurate
  - FAST simulators take advantage of the fact that most of the time the micro-architecture is on the right path
  - Most complexity (BP, parallelism) can be handled this way
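A toy sketch of the mis-speculation handshake follows. It assumes a predict-not-taken predictor, a fixed branch resolution depth, and a functional model whose only state is the PC; the method names `misspeculate` and `rollback` are hypothetical. A real ISA simulator would also have to checkpoint and restore registers and memory.

```python
class FunctionalModel:
    """Toy ISA-simulator stand-in; its only architectural state is the PC."""

    def __init__(self, program):
        # Each instruction is ("alu",) or ("br", taken, target_index).
        self.program = program
        self.pc = 0
        self.checkpoint = None

    def step(self):
        """Execute one instruction; return (pc, instr, correct next pc)."""
        pc = self.pc
        instr = self.program[pc]
        if instr[0] == "br" and instr[1]:
            self.pc = instr[2]
        else:
            self.pc = pc + 1
        return pc, instr, self.pc

    def misspeculate(self, wrong_pc, resume_pc):
        """Timing model forces execution down the wrong path."""
        self.checkpoint = resume_pc
        self.pc = wrong_pc

    def rollback(self):
        """Restore the correct-path PC saved at the mispredicted branch."""
        self.pc = self.checkpoint
        self.checkpoint = None


def run(program, resolve_depth=2):
    """Return (cycles, wrong_path_instructions) under predict-not-taken."""
    fm = FunctionalModel(program)
    cycles = 0
    wrong_path = 0
    while fm.pc < len(program):
        pc, instr, next_pc = fm.step()
        cycles += 1
        if instr[0] == "br":
            predicted = pc + 1  # assumed predictor: always not taken
            if predicted != next_pc:
                fm.misspeculate(predicted, next_pc)
                for _ in range(resolve_depth):
                    if fm.pc >= len(program):
                        break
                    fm.step()  # wrong-path instruction occupies the pipeline
                    cycles += 1
                    wrong_path += 1
                fm.rollback()
    return cycles, wrong_path
```

When the predictor is right, which is the common case, the loop never touches `misspeculate` or `rollback`, so the rollback cost is paid only on mispredictions.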

Status & Conclusions
- 1 MHz to 100 MHz, cycle-accurate, full-system, multiprocessor simulator
  - Well, not quite that fast right now, but we are using an embedded 300 MHz PowerPC 405 to simplify
  - x86, boots Linux, Windows, targeting 80486 to Pentium D-like and beyond (Dam Sunwoo, Nikhil Patil)
  - Bochs functional model (looking at much faster models)
    - Heavily modified for instruction trace and rollback
- Have straight pipeline 486 model with TLBs and caches
  - Branch-predicted superscalar model almost done in Bluespec and Verilog (John Xu, Huzefa Sanjeliwala)
- Statistics gathered in hardware
  - Very little if any probe effect
- Tools to semi-automate micro-architectural and ISA level exploration
  - Orthogonality of models makes both simpler