NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA, SURATHKAL
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
Presentation on
ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS
Authors: Daniel Sanchez, Christos Kozyrakis
Presented by: Vaibhav Ashtikar (13IS24F), Govind Dhonddev (13IS06F)

Architectural Simulator
• A piece of software that models a computer system or its components and, for a given input, predicts output and performance.
• Evaluates different hardware designs without building costly physical hardware systems.
• Gives access to computer components or systems that do not yet exist.
• Yields detailed performance metrics: a single simulator run can often generate a large set of performance data.
• Debugging: on real hardware, reproducing a problem typically requires rebooting and re-running the code. In contrast, some simulators provide a fully controlled environment and let software developers run code backward once an error is detected.

Ideal Architectural Simulator
• A piece of software that models a computer system or its components and, for a given input, predicts output and performance.
• Fast
• Accurate
• Executes a wide range of workloads
• Easy to use, easy to modify

Problem: Architectural Simulation Is Time-Consuming
• Current detailed simulators are slow (~200 KIPS).
• Time to simulate 1000 cores @ 2 GHz for 1 second:
  • at 200 KIPS: 4 months
  • at 200 MIPS: 3 hours
• Simulation performance wall:
  • More complex targets (multicore, memory hierarchy, …)
  • Hard to parallelize
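
The two figures above can be reproduced with simple arithmetic. The sketch below assumes each simulated core retires one instruction per cycle (IPC = 1); that assumption and the function name are illustrative and are not stated on the slide.

```python
def simulation_time_seconds(cores, clock_hz, target_seconds, sim_ips, ipc=1.0):
    """Host seconds needed to simulate the target system.

    sim_ips is the simulator's speed in simulated instructions per second.
    ipc=1.0 is an assumption: one instruction per cycle per simulated core.
    """
    total_instructions = cores * clock_hz * target_seconds * ipc
    return total_instructions / sim_ips


# 1000 cores at 2 GHz, 1 second of simulated time
slow = simulation_time_seconds(1000, 2e9, 1.0, 200e3)  # 200 KIPS
fast = simulation_time_seconds(1000, 2e9, 1.0, 200e6)  # 200 MIPS

print(f"200 KIPS: {slow / 86400:.0f} days")
print(f"200 MIPS: {fast / 3600:.1f} hours")
```

Under that assumption the slow case comes to roughly 116 days (about 4 months) and the fast case to just under 3 hours, which matches the slide.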

Sequential vs. Parallel Simulation
• Sequential simulation: the more cores are simulated, the slower the simulation becomes.
• Parallel simulation: scales poorly because of excessive synchronization.
• Tradeoff: relaxing synchronization sacrifices accuracy by allowing events to be reordered.

Tradeoff Between Speed and Accuracy
[Figure: speed versus accuracy of simulation approaches; labels: performance measures, OOO, parallel scaling, sequential]

ZSIM
• Speeds up detailed core models
• Bound-weave
• Lightweight user-level virtualization

ZSIM: Fast
• Speeds up detailed core models with instruction-driven timing models that use dynamic binary translation (DBT).

ZSIM: Accurate
• Bound-weave: a two-phase parallelization technique that scales parallel simulation on multicore hosts.

ZSIM: Wide Range of Workloads
• Lightweight user-level virtualization
• Supports complex workloads, e.g. multiprogrammed and client-server applications.
• Bridges the gap between user-level and full-system simulation.

Dynamic Binary Translation
• Simulate each basic block using host instructions.
• Binary code:
    Load r3, 16(r1)
    Add r4, r3, r2
    Jump 0x48074
• Translated code:
    Load t1, simRegs[1]
    Load t2, 16(t1)
    Store t2, simRegs[3]
    Load t1, simRegs[2]
    Load t2, simRegs[3]
    Add t3, t1, t2
    Store t3, simRegs[4]
    Store 0x48074, simPc
    J dispatch_loop
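
The translate-once, dispatch-many pattern on this slide can be sketched in a few lines of Python. This is a toy illustration only: the tuple-based guest instruction encoding, the start address 0x48000, and the names `translate_block` and `dispatch_loop` are assumptions made for the example, and it does not reflect how ZSIM or any real binary translator is implemented.

```python
def translate_block(block):
    """Emit Python source for one guest basic block and compile it once."""
    lines = ["def run(regs, mem):"]
    next_pc = None
    for op, *args in block:
        if op == "load":                      # load rd, off(rs)
            rd, off, rs = args
            lines.append(f"    regs[{rd}] = mem[regs[{rs}] + {off}]")
        elif op == "add":                     # add rd, ra, rb
            rd, ra, rb = args
            lines.append(f"    regs[{rd}] = regs[{ra}] + regs[{rb}]")
        elif op == "jump":                    # jump target ends the block
            next_pc = args[0]
        elif op == "halt":                    # stop the dispatch loop
            next_pc = None
    lines.append(f"    return {next_pc!r}")   # like "Store pc; J dispatch_loop"
    namespace = {}
    exec("\n".join(lines), namespace)
    return namespace["run"]


def dispatch_loop(program, pc, regs, mem):
    """Run translated blocks, translating each guest block only on first use."""
    cache = {}
    translations = 0
    while pc is not None:
        if pc not in cache:
            cache[pc] = translate_block(program[pc])
            translations += 1
        pc = cache[pc](regs, mem)
    return translations


# Guest program: the slide's three instructions, then a hypothetical halt block.
program = {
    0x48000: [("load", 3, 16, 1), ("add", 4, 3, 2), ("jump", 0x48074)],
    0x48074: [("halt",)],
}
regs = [0] * 8
regs[1], regs[2] = 100, 5
mem = {116: 37}                               # address regs[1] + 16
translations = dispatch_loop(program, 0x48000, regs, mem)
print(regs[3], regs[4], translations)
```

Each block is translated the first time its address is reached and then served from the cache, which is where the speed advantage over instruction-by-instruction interpretation comes from.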

Dynamic Binary Translation
• Parallelization alone is not sufficient for simulating thousand-core systems.
• An instrumentation-based approach eliminates the need for functional modeling of x86.
• Timing models:
  • Simple core model
  • OOO core model

Simple Core Model (SCM)
• Instruments load and store instructions.
• Counts cycles and instructions, and drives the memory hierarchy.
• Simulates at up to 90 MIPS per simulated core.
• Pitfall: does not represent the OOO cores used in desktop and server processor chips.

OOO Core Model
• Models:
  • Branch prediction
  • Instruction-length predecoder
  • Instruction decoding
  • Issue stalls
  • Register renaming
• Conventional simulators with OOO models execute at around 100 KIPS.
• ZSIM accelerates the OOO core model by pushing most of the work into the instrumentation phase.

OOO Core Modeling
• Basic block:
    mov (%rbp), %rcx
    add %rax, %rbx
    mov %rdx, (%rbp)
    ja 40530a
• Instrumented basic block with basic block descriptor:
    BasicBlock(DecodedBBL)
    Load(addr = -0x38(%rbp))
    mov -0x38(%rbp), %rcx
    lea -0x2040(%rbp), %rdx
    add %rax, %rdx
    mov %rdx, -0x2068(%rbp)
    Store(addr = -0x2068(%rbp))
    cmp $0x1fff, %rax
    jne 10840530a
• Basic block descriptor contents: instruction → μop decoding; μop dependencies, functional units, and latency; front-end delays.
• Instruction-driven approach: simulate all stages at once for each instruction / μop.
• The schedule of a given μop must not depend on future μops.
• The execution time of every μop must be known in advance.
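
A minimal sketch of the instruction-driven idea, under heavy simplification: the block is decoded once into a descriptor of μops with fixed latencies, and each execution walks the μops in program order, so a μop's schedule depends only on earlier μops. The `Uop` structure, the latency values, and the single-value dataflow model are all assumptions for illustration; ZSIM's actual OOO model tracks far more (functional units, issue width, front-end delays).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Uop:
    dsts: Tuple[str, ...]   # registers written
    srcs: Tuple[str, ...]   # registers read
    latency: int            # fixed and known in advance (hypothetical values)


# Descriptor for: mov (%rbp),%rcx ; add %rax,%rbx ; mov %rdx,(%rbp) ; ja
# Built once when the block is instrumented, then reused on every execution.
DESCRIPTOR = (
    Uop(("rcx",), ("rbp",), 4),                 # load
    Uop(("rbx", "flags"), ("rax", "rbx"), 1),   # add
    Uop((), ("rdx", "rbp"), 1),                 # store
    Uop((), ("flags",), 1),                     # conditional branch
)


def simulate_block(uops, reg_ready, start_cycle):
    """Return the cycle at which the last uop of the block completes.

    Each uop issues once the block has started and its sources are ready;
    nothing here looks ahead to later uops.
    """
    finish = start_cycle
    for uop in uops:
        issue = max([start_cycle] + [reg_ready.get(r, 0) for r in uop.srcs])
        done = issue + uop.latency
        for r in uop.dsts:
            reg_ready[r] = done
        finish = max(finish, done)
    return finish


reg_ready = {}
first = simulate_block(DESCRIPTOR, reg_ready, 0)
second = simulate_block(DESCRIPTOR, reg_ready, first)
print(first, second)
```

Because the descriptor is computed ahead of time, the per-execution work is only the short loop in `simulate_block`, which is the point of moving work into the instrumentation phase.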

Parallelism and Interference: Path-Altering Interference
• Two accesses, if simulated out of order, change their paths through the memory hierarchy.
• Root causes:
  • The two accesses address the same line (unless both are reads).
  • The second access, if simulated out of order, causes the first access to miss.
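
The first root cause can be written as a small predicate. This sketch covers only the same-line condition from the slide, not the case where one access makes the other miss; the 64-byte line size and the `Access` type are assumptions for the example.

```python
from dataclasses import dataclass

LINE_BYTES = 64  # assumed cache line size


@dataclass(frozen=True)
class Access:
    addr: int
    is_write: bool


def may_alter_path(first, second):
    """True if reordering the two accesses could change their paths:
    they touch the same cache line and at least one of them is a write."""
    same_line = first.addr // LINE_BYTES == second.addr // LINE_BYTES
    both_reads = not first.is_write and not second.is_write
    return same_line and not both_reads


write_then_read = may_alter_path(Access(0x1000, True), Access(0x1008, False))
two_reads = may_alter_path(Access(0x1000, False), Access(0x1008, False))
different_lines = may_alter_path(Access(0x1000, True), Access(0x1040, True))
print(write_then_read, two_reads, different_lines)
```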

Parallelism and Interference: Path-Preserving Interference
• Two accesses, if simulated out of order, change their timing, but their paths through the memory hierarchy remain unaffected.
• Example: two accesses to different cache sets in the same bank.
• In small intervals (1-10K cycles), path-altering interference is very rare (<1 in 10K accesses).

Bound-Weave Algorithm
• Path-altering interference is rare within small intervals, so accuracy is needed mainly on path-preserving interference.
• Simulation proceeds in small intervals, each with two phases:
1. Bound phase: each core is simulated in parallel for the interval, assuming zero-load latencies.
2. Weave phase: the events recorded in the bound phase are simulated in parallel; prior knowledge of the events lets this phase scale efficiently.
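
The two phases can be illustrated with a deliberately small model: each core issues memory accesses separated by compute gaps, and all cores share one memory bank. The latency constants, the single shared bank, and the sequential event loop are assumptions for this sketch; in ZSIM the weave phase itself runs in parallel across domains.

```python
import heapq

ZERO_LOAD = 10   # hypothetical uncontended access latency, in cycles
BANK_BUSY = 4    # hypothetical bank occupancy per access, in cycles


def bound_phase(gaps):
    """Simulate one core in isolation with zero-load latencies.

    Returns the recorded access times and a lower bound on the end cycle.
    """
    t, trace = 0, []
    for gap in gaps:
        t += gap            # compute before the access
        trace.append(t)     # record the access event
        t += ZERO_LOAD      # assume no contention
    return trace, t


def weave_phase(traces):
    """Replay recorded accesses in time order through one shared bank.

    Returns the extra delay each core accumulates due to contention.
    """
    delay = [0] * len(traces)
    heap = [(trace[0], core, 0) for core, trace in enumerate(traces) if trace]
    heapq.heapify(heap)
    bank_free = 0
    while heap:
        t, core, i = heapq.heappop(heap)
        start = max(t, bank_free)       # wait if the bank is busy
        delay[core] += start - t
        bank_free = start + BANK_BUSY
        if i + 1 < len(traces[core]):
            # Later events of this core shift by its accumulated delay.
            heapq.heappush(heap, (traces[core][i + 1] + delay[core], core, i + 1))
    return delay


trace0, end0 = bound_phase([5, 5])
trace1, end1 = bound_phase([5, 20])
delays = weave_phase([trace0, trace1])
print(end0 + delays[0], end1 + delays[1])
```

In this example both cores access the bank at cycle 5; the bound phase ignores the conflict, and the weave phase charges the second core the 4 cycles it spends waiting, without changing which accesses occur.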

Bound Phase
• Limits the skew between simulated cores: threads execute in parallel and synchronize at an interval barrier.
• Moderates parallelism: only as many threads as the host has hardware threads run concurrently.
  • Example: in a 1024-core simulation on a host with 32 hardware threads, the barrier wakes up only 32 threads in each interval.
• Avoids systematic bias: at the end of each interval, the barrier shuffles the thread wake-up order to avoid consistently prioritizing a few threads.
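
The barrier policy described here (cap concurrency at the host's hardware threads, reshuffle the wake-up order every interval) might look like the following. The function names are hypothetical, and the sketch only computes which simulated threads start first; it does not model threads finishing and handing off to the next one.

```python
import random


def interval_wake_order(num_sim_threads, rng):
    """Return a freshly shuffled wake-up order for one interval."""
    order = list(range(num_sim_threads))
    rng.shuffle(order)
    return order


def initial_wakeups(order, host_threads):
    """Split the order into threads woken immediately and threads left waiting."""
    return order[:host_threads], order[host_threads:]


rng = random.Random(0)                      # seeded only to make the demo repeatable
order_a = interval_wake_order(1024, rng)    # interval 1
order_b = interval_wake_order(1024, rng)    # interval 2, reshuffled
running, waiting = initial_wakeups(order_a, 32)
print(len(running), len(waiting))
```

Reshuffling per interval means no simulated thread is systematically first to run, which is the bias the slide warns about.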

Bound-Weave Example
• 2-core host simulating a 4-core system
• 1000-cycle intervals
• Components divided among 2 domains

Complex Workloads
• Multiprocess simulation
• Scheduler: simple round-robin scheduler
• Avoiding simulator-OS deadlock
• Time virtualization: virtualizes the rdtsc counter.
• System virtualization: pregenerated virtual instruction.
• Fast-forwarding: DBT performs pre-processing at close to native speed.
• Challenge: accounting for OS execution time.

Accuracy
• For 18 out of 29 benchmarks, ZSIM is within 10% of the real system (average performance error around 9.7%).
[Figure: absolute performance error, simulated versus real]

Accuracy: Cache Level
• MPKI (misses per thousand instructions) error
• The error grows at deeper levels of the memory hierarchy.
• TLB misses are currently not modeled.
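
For reference, MPKI is misses per thousand instructions, and an MPKI error compares the simulated value with the measured one. The sketch below uses made-up counter values purely to show the calculation; they are not results from the paper, and whether the error is reported signed or absolute is an assumption here.

```python
def mpki(misses, instructions):
    """Misses per thousand instructions."""
    return misses * 1000 / instructions


def mpki_error(simulated_mpki, real_mpki):
    """Signed difference between simulated and measured MPKI."""
    return simulated_mpki - real_mpki


# Hypothetical counter values for one cache level.
real = mpki(25_000, 10_000_000)
simulated = mpki(30_000, 10_000_000)
error = mpki_error(simulated, real)
print(real, simulated, error)
```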

Thousand-Core Performance
• Single system, simulations run in sequence:
  • 1.32 trillion instructions take from 1.8 hours (IPC1-NC) to 8.9 hours (OOO-C).
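
These run times translate directly into average simulation speed. The helper name below is an assumption; the inputs are the figures from the slide.

```python
def average_mips(instructions, hours):
    """Average simulation speed in millions of simulated instructions per second."""
    return instructions / (hours * 3600) / 1e6


fastest = average_mips(1.32e12, 1.8)   # simple-core run, 1.8 hours
slowest = average_mips(1.32e12, 8.9)   # OOO-core run, 8.9 hours
print(f"{fastest:.0f} MIPS, {slowest:.0f} MIPS")
```

That works out to roughly 204 MIPS for the fastest configuration and roughly 41 MIPS for the slowest, averaged over the whole run.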

Speedup
• Simulation performance of single-threaded ZSIM on SPEC CPU2006 with four models: IPC1 or OOO cores, with and without contention.

Speedup
• Average ZSIM speedup across workloads as the number of host threads increases from 1 to 32 (16 cores with 2 hardware threads per core).

Comparison with Other Simulators
• Other parallel simulators report 1-10 MIPS.
• Many factors (host, workloads, memory-intensive applications) can lead to differences in reported speed.
• ZSIM is 2-3 orders of magnitude faster than other simulators.

Conclusion
• New techniques achieve both speed and accuracy:
  • DBT-based timing models
  • Bound-weave parallelization
  • Lightweight virtualization of user processes
• Result: simulation speeds of 1-1500 MIPS on thousand-core simulations.

References
[1] Daniel Sanchez and Christos Kozyrakis, "ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems," ISCA 2013.