Full-System Simulation at Near-Native Speed: Parallelizing and Accelerating gem5 through Hardware Virtualization
Andreas Sandberg (ARM), David Black-Schaffer, Erik Hagersten, Trevor Carlson
Uppsala University, Uppsala Architecture Research Team
trevor.carlson@it.uu.se
2021-03-04

Slide 2 | Problem: Simulation is Slow
Time to simulate SPEC to completion in gem5:
  • ~1 year per SPEC benchmark in gem5 OoO mode
  • <1 hour per SPEC benchmark on native x86 hardware
Speeds:
  • Native: 3,000 MIPS
  • Functional: 1 MIPS
  • OoO: 0.1 MIPS
  • This work: 2,000 MIPS
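The relation between the MIPS figures and the wall-clock times can be checked with a few lines of arithmetic. This is a minimal sketch: the benchmark length is an assumption, back-computed from the "~1 year at 0.1 MIPS" figure rather than taken from SPEC, and `sim_time_seconds` is a helper name introduced here for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def sim_time_seconds(instructions, mips):
    """Wall-clock seconds needed to execute `instructions` at `mips` million instr/s."""
    return instructions / (mips * 1e6)


# Assumption: benchmark length implied by "~1 year per benchmark at 0.1 MIPS".
instructions = 0.1e6 * SECONDS_PER_YEAR

# Speeds as listed on the slide.
speeds = [("Native", 3000), ("Functional", 1), ("OoO", 0.1), ("This work", 2000)]

for name, mips in speeds:
    hours = sim_time_seconds(instructions, mips) / 3600
    print(f"{name:>10}: {hours:12.2f} hours")
```

Under that assumed length, the native run lands well below one hour, which is consistent with the "<1 hour" figure on the slide.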

Slide 3 | Problem: Simulation is Slow
Time to simulate SPEC to completion in gem5.
Near-native speed & detailed simulation.
[Chart: Native 3,000 MIPS; Functional 1 MIPS; OoO 0.1 MIPS; This work 2,000 MIPS]

Slide 4 | Simulating Faster: SMARTS
Sampled simulation:
  – Trade off accuracy for performance
  – 0.1% of instructions in OoO, 99.9% of instructions in Fast
  – Performance ≈ Fast, Accuracy ≈ OoO (with continuous cache warming*)
[Timeline, simulation time: 95% Fast, 5% OoO, repeated]
Note: Speed is limited by fast-forwarding to the next sample. Need ultra-fast-forwarding.
*Wunderlich, ISCA 03
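Why "Performance ≈ Fast" holds can be seen from a time-weighted (harmonic) average over the instruction mix. The sketch below assumes the nominal rates quoted elsewhere in the deck (1 MIPS for Fast, 0.1 MIPS for OoO); `effective_mips` is a helper name introduced here, not part of gem5 or SMARTS.

```python
def effective_mips(mix):
    """Overall speed for a mix given as (instruction_fraction, mips) pairs.

    Total time is the sum of per-mode times, so the result is a weighted
    harmonic mean of the per-mode speeds.
    """
    total_fraction = sum(frac for frac, _ in mix)
    total_time = sum(frac / mips for frac, mips in mix)
    return total_fraction / total_time


# Assumed nominal rates: 99.9% of instructions at 1 MIPS, 0.1% at 0.1 MIPS.
smarts_mix = [(0.999, 1.0), (0.001, 0.1)]
print(f"SMARTS effective speed: {effective_mips(smarts_mix):.3f} MIPS")
```

Because almost all instructions run in the fast mode, the overall speed stays close to the fast mode's speed, which is why faster fast-forwarding is the lever that matters.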

Slide 5 | Fast-forwarding to Sample Points
To make sampling fast, we need ultra-fast-forwarding.
How can we fast-forward x86 programs really quickly?
How about using this: a 3,000 MIPS x86 simulator… the cores of the host CPU (6 of them!)

Slide 6 | Extending gem5: Native CPU Module
Use hardware virtualization to execute on the native CPU inside gem5.
CPU modules:
  • OoO (~0.1 MIPS). Detailed: pipeline simulator (timing, queues, speculation…) + caches, TLBs, branch predictor
  • Fast (~1 MIPS). Functional: 1 instruction per cycle + caches, TLBs, branch predictor
  • vFF (~3,000 MIPS). Virtualized Fast Forward: hardware CPU via kvm virtualization. Measures nothing.
Can switch between CPU modules during simulation.
Key problems:
  • Memory mapping
  • Switching CPU modules
  • I/O

Slide 7 | Implementing a Virtual CPU Module in gem5
3 key problems: memory mapping, switching CPUs, and I/O.

Slide 8 | gem5-kvm: Memory Mapping
  • Map kvm to gem5's memory image
[Diagram: gem5 simulator (OoO CPU, Functional CPU, simulated memory hierarchy with caches, I/O device) with standard gem5 memory image access; kvm hypervisor (virtualized CPU) with transparent access to gem5's memory image in physical memory; I/O trap.]

Slide 9 | gem5-kvm: Switching CPU Modules
  • Transfer processor state: registers, PC, status, segment mappings, etc.
  • From gem5: flush simulated memory hierarchy
[Diagram: gem5 simulator (OoO CPU, Functional CPU, simulated memory hierarchy with caches, I/O device) and kvm hypervisor (virtualized CPU), with the memory image in physical memory and the I/O trap path.]

Slide 10 | gem5-kvm: I/O
  • kvm traps on I/O; gem5 simulates the device
    1. I/O access trapped in hypervisor
    2. I/O access through gem5 memory system and device
    3. Results returned to HW CPU via hypervisor
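The three-step I/O path can be illustrated with a toy dispatch loop. This sketch does not call KVM or gem5: `IoExit`, `ToyDevice`, and `run_guest` are hypothetical names, and the list of exits stands in for what a real hypervisor would report when the guest traps (the Linux KVM API reports such traps as exit reasons such as `KVM_EXIT_IO` and `KVM_EXIT_MMIO`).

```python
from dataclasses import dataclass


@dataclass
class IoExit:
    """Hypothetical record of one trapped guest I/O access (step 1)."""
    port: int
    is_write: bool
    value: int = 0


class ToyDevice:
    """Hypothetical simulated device with a single 8-bit register."""

    def __init__(self):
        self.reg = 0

    def write(self, value):
        self.reg = value & 0xFF

    def read(self):
        return self.reg


def run_guest(exits, devices):
    """Handle each trapped access in the simulator and collect the replies."""
    replies = []
    for exit_ in exits:
        device = devices[exit_.port]      # step 2: route to the simulated device
        if exit_.is_write:
            device.write(exit_.value)
            replies.append(None)          # writes return nothing to the guest
        else:
            replies.append(device.read()) # step 3: value handed back to the CPU
    return replies


devices = {0x3F8: ToyDevice()}
trace = [IoExit(0x3F8, True, 0x141), IoExit(0x3F8, False)]
print(run_guest(trace, devices))
```

The point of the design is that device models stay in the simulator, so the virtualized CPU and the simulated CPUs see identical device state.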

Slide 11 | Accelerating Simulation with a Virtual CPU Module
Bounding cache warming errors and parallel simulation.

Slide 12 | Full Speed Ahead (FSA) Simulation
Combine:
  – Virtual Fast-Forwarding (get to the sample fast). But it simulates nothing…
  – Functional Simulation (warm the caches)
  – OoO Simulation (get detailed statistics)
[Timeline, instructions per sample: 500M vFF, 25M Fast (warm), 20k OoO]
[Timeline, simulation time per sample: 10% vFF, 80% Fast (warm), 10% OoO]
10% of time/95% of instructions in vFF. 2.0% IPC error @ 300 MIPS.
Only warming caches for 25M instructions…
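The sampling period can be written down as a small schedule generator, which also reproduces the "95% of instructions in vFF" figure from the per-sample instruction counts. This is a sketch: the period lengths are the ones shown on the timeline, while `PERIOD`, `instruction_share`, and `schedule` are names introduced here for illustration.

```python
# One sampling period, as (mode, instruction count), taken from the timeline.
PERIOD = [("vFF", 500_000_000), ("Fast (warm)", 25_000_000), ("OoO", 20_000)]


def instruction_share(period):
    """Fraction of a period's instructions executed in each mode."""
    total = sum(count for _, count in period)
    return {mode: count / total for mode, count in period}


def schedule(period, n_samples):
    """Yield (mode, first_instruction, length) segments for n_samples periods."""
    position = 0
    for _ in range(n_samples):
        for mode, count in period:
            yield mode, position, count
            position += count


for mode, share in instruction_share(PERIOD).items():
    print(f"{mode:>12}: {share:.2%} of instructions")

for segment in schedule(PERIOD, n_samples=2):
    print(segment)
```

A simulator driver would switch CPU modules at each segment boundary; the time spent per mode then depends on each module's speed, which this sketch deliberately leaves out.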

Slide 13 | Estimating Warming Error
Without constant cache warming we have no guarantee of accuracy.
Crazy idea: simulate both possible outcomes…
  – Clone architecture state after warming
  – Optimistic: if miss to a cold set, simulate a HIT
  – Pessimistic: if miss to a cold set, simulate a MISS
Could be used to adjust warming time dynamically.
Simulation speed: 300 MIPS; 270 MIPS with dual simulation.
[Timelines, simulation time: 10% vFF, 80% Fast (warm), 10% OoO; with dual simulation the OoO sample is run twice, once as HIT and once as MISS.]
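The optimistic and pessimistic cases can be illustrated with a toy set-associative LRU cache that flags misses to cold sets. Two assumptions are made here: a set counts as "cold" while it still has an unfilled way, and the bounds are computed on the miss ratio in a single pass, whereas the slides clone the state and run the detailed sample twice. `WarmingCache` and `miss_bounds` are hypothetical names.

```python
from collections import OrderedDict


class WarmingCache:
    """Toy set-associative LRU cache that flags misses to not-yet-full sets."""

    def __init__(self, n_sets, ways, line_bytes=64):
        self.n_sets = n_sets
        self.ways = ways
        self.line_bytes = line_bytes
        self.sets = [OrderedDict() for _ in range(n_sets)]

    def access(self, addr):
        """Return 'hit', 'miss', or 'cold_miss' for one address."""
        block = addr // self.line_bytes
        cache_set = self.sets[block % self.n_sets]
        tag = block // self.n_sets
        if tag in cache_set:
            cache_set.move_to_end(tag)       # mark as most recently used
            return "hit"
        cold = len(cache_set) < self.ways    # assumption: unfilled way means cold
        if not cold:
            cache_set.popitem(last=False)    # evict the least recently used line
        cache_set[tag] = True
        return "cold_miss" if cold else "miss"


def miss_bounds(cache, trace):
    """Return (optimistic, pessimistic) miss ratios for an address trace."""
    outcomes = [cache.access(addr) for addr in trace]
    certain = outcomes.count("miss")
    uncertain = outcomes.count("cold_miss")  # HIT if optimistic, MISS if pessimistic
    return certain / len(outcomes), (certain + uncertain) / len(outcomes)


low, high = miss_bounds(WarmingCache(n_sets=2, ways=1), [0, 0, 64, 128, 0])
print(f"miss ratio between {low:.0%} and {high:.0%}")
```

A narrow gap between the two bounds suggests the warming period was long enough; a wide gap is the signal that could drive the dynamic adjustment of warming time mentioned above.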

Slide 14 | Fork: OS gives copy-on-write
Can we go faster? Speed is limited by warmup and detailed simulation.
  – If we can clone the architecture state…
  – Each (warmup + simulation) is independent…
  – We can execute them in parallel
Execution & simulation at near-native speed (vFF and fork overhead).
[Timeline: 1 core fast-forwards (vFF) and forks gem5 onto another core at each sample point; 8 cores simulate the samples (Fast (warm) + OoO).]
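The fork-based scheme can be sketched with plain POSIX `fork`: the parent keeps advancing its state, and each child inherits a copy-on-write snapshot to work on independently. This is only an illustration of the process structure, not gem5 code; `simulate_sample` and `run_parallel` are hypothetical, the "state" is a small list, and the sketch does not cap the number of concurrent children at the core count.

```python
import os


def simulate_sample(state):
    """Hypothetical stand-in for warm-up plus detailed simulation of one sample."""
    return sum(state)


def run_parallel(n_samples, ff_step):
    """Parent fast-forwards; one forked child per sample simulates the snapshot."""
    state = [0]
    children = []
    for index in range(n_samples):
        state.append(state[-1] + ff_step)    # parent "fast-forwards" to next sample
        read_fd, write_fd = os.pipe()
        pid = os.fork()                      # child gets a copy-on-write snapshot
        if pid == 0:
            os.close(read_fd)
            result = simulate_sample(state)
            os.write(write_fd, str(result).encode())
            os.close(write_fd)
            os._exit(0)                      # leave without running parent cleanup
        os.close(write_fd)
        children.append((index, pid, read_fd))

    results = {}
    for index, pid, read_fd in children:
        with os.fdopen(read_fd) as pipe:
            results[index] = int(pipe.read())
        os.waitpid(pid, 0)
    return results


if __name__ == "__main__":
    print(run_parallel(n_samples=3, ff_step=10))
```

The parent never waits for a child before moving on to the next sample point, which is what lets the fast-forwarding core run ahead while the other cores work through the samples.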

Slide 15 | Results: Accuracy, Speed, and Scalability

Slide 16 | Results: Accuracy
[Charts: Optimistic vs. Pessimistic, with 5M and with 25M instructions of fast warmup]
  • Reference: 30B instructions in detailed OoO (~1 week)
  • gem5 SMARTS: 1.9% IPC error
  • Parallel Full Speed Ahead (pFSA): 2.0% IPC error

Slide 17 | Results: Speed
[Charts: 5M and 25M instructions of fast warmup]
  • 8 cores (2-socket Intel Xeon E5520)
    – 2 MB L2: 2,000 MIPS
    – 8 MB L2: 900 MIPS
(Warm-up error estimation overhead < 4%)
Scalability to >8 cores?

Slide 18 | Results: Scalability @ 32 Cores
[Charts: measured scaling against ideal scaling, bounded by vFF overhead and forking overhead; one benchmark with I/O every 3.6M instructions, one with I/O every 13M instructions]
Scalability bounds from vFF and forking (I/O and CoW):
  – 471.omnetpp: 49% of native @ 32 cores
  – 416.gamess: 84% of native @ 16 cores
Max % of native depends on runtime (fixed 1000 samples per app).
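One way to reason about such bounds is a simple two-stage pipeline model, offered here as an assumption rather than as the model behind the measurements: one core fast-forwards and forks, the remaining cores simulate samples, and throughput is set by whichever stage is slower. All function names and timing parameters below are hypothetical.

```python
import math


def samples_per_second(cores, t_ff, t_fork, t_sample):
    """Sample throughput when 1 core fast-forwards and cores-1 simulate.

    t_ff:     seconds the parent needs to fast-forward to the next sample
    t_fork:   seconds of forking overhead per sample (e.g. copy-on-write cost)
    t_sample: seconds a worker needs for warm-up plus detailed simulation
    """
    parent_rate = 1.0 / (t_ff + t_fork)
    worker_rate = (cores - 1) / t_sample
    return min(parent_rate, worker_rate)


def saturation_cores(t_ff, t_fork, t_sample):
    """Smallest core count at which the fast-forwarding parent is the bottleneck."""
    return 1 + math.ceil(t_sample / (t_ff + t_fork))


# Illustrative numbers only, not measurements from the slides.
for cores in (2, 8, 16, 32):
    rate = samples_per_second(cores, t_ff=1.0, t_fork=0.25, t_sample=10.0)
    print(f"{cores:2d} cores: {rate:.2f} samples/s")
print("parent-bound from", saturation_cores(1.0, 0.25, 10.0), "cores")
```

In this model, adding cores helps only until the parent becomes the bottleneck; past that point, throughput is capped by fast-forwarding speed plus forking overhead.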

Slide 19 | DEMO

Slide 20 | pFSA (Parallel Full Speed Ahead Simulation)
Virtualized Fast Forward
  – vFF CPU module in gem5 distribution today
Virtualized Sampled Simulation
  – Serial FSA: 300 MIPS
  – Parallel FSA: 2,000 MIPS (8 cores)
  – Good scaling (tested to 32 cores)
  – Cache warmup analysis
Great Accuracy
  – SMARTS: 1.9% IPC error
  – pFSA: 2.0% IPC error
Future Work
  – Multicore
Single-core simulation: SMARTS accuracy, 1000x faster, commodity hardware, community software.