
HETSIM: SIMULATING LARGE-SCALE HETEROGENEOUS SYSTEMS USING A TRACE-DRIVEN, SYNCHRONIZATION AND DEPENDENCY-AWARE FRAMEWORK
Subhankar Pal, Kuba Kaszyk, Siying Feng, Björn Franke, Murray Cole, Michael O’Boyle, Trevor Mudge, Ronald G. Dreslinski
Email: subh – AT – umich.edu

BACKGROUND
IEEE International Symposium on Workload Characterization, 10/28/20

A SHIFT IN COMPUTING PARADIGM
[Figure: Block Diagram of a Highly Heterogeneous SoC. Labels: OoO and InO cores, accelerators (Acc), PE array, GPU, CPU, CGRA, scratchpad memory; annotations "Faster compute" and "More memory bandwidth"]
[Figure: Roofline Model of Different Platforms and Applications, adapted from [*]. Axes: Ops/s vs. Ops/B; memory types HBM, GDDR, DDR; regions for compute-bound and memory-bound applications]
• The end of Dennard scaling has spurred the growth of heterogeneous architectures, e.g. CPUs, accelerators and CGRAs on the same SoC
• Compute improvement through heterogeneity has outpaced memory bandwidth growth
• This leads to applications becoming increasingly memory-bound
[*] https://www.micron.com/about/blog/2019/february/ai-matters-getting-to-the-heart-of-data-intelligence-with-memory-and-storage
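The roofline argument on this slide can be illustrated with a few lines of arithmetic. The sketch below assumes the standard roofline bound (attainable throughput is the smaller of the compute roof and bandwidth times operational intensity); the function names and the platform numbers in the usage note are hypothetical and do not come from the slides.

```python
def attainable_ops_per_s(peak_ops_per_s, mem_bw_bytes_per_s, ops_per_byte):
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_ops_per_s, mem_bw_bytes_per_s * ops_per_byte)


def is_memory_bound(peak_ops_per_s, mem_bw_bytes_per_s, ops_per_byte):
    """An application is memory-bound when its operational intensity
    (Ops/B) lies to the left of the ridge point peak / bandwidth."""
    ridge_ops_per_byte = peak_ops_per_s / mem_bw_bytes_per_s
    return ops_per_byte < ridge_ops_per_byte
```

For a made-up platform with a 1e12 Ops/s compute roof and 1e11 B/s of bandwidth, a kernel at 2 Ops/B is capped by bandwidth, while one at 50 Ops/B reaches the compute roof. Raising the compute roof without raising bandwidth moves the ridge point to the right, which is the effect the last bullet describes.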

FROM SIMULATION TO PROTOTYPING
• Trace-driven architectural simulation is widely adopted for early-stage design-space exploration
  • Such approaches generate offline traces of an application and feed them into a model of the target hardware for performance estimation
  • Demonstrates a good trade-off between simulation speed and accuracy
• Prior work demonstrated use of trace-based simulation for chip multiprocessors (CMPs)
• Lack of a solution to simulate heterogeneous systems with 1000s of cores
• Insufficient information captured in the trace format
  • Tokens defining inter-PE communication
  • Annotations of dependent memory addresses
  • Program counters, to support PC-based prefetching
[Figure: Trade-Offs between Accuracy and Resource Requirement, adapted from [*]. Axes: accuracy of measurement (time, power, etc.) vs. resource requirement (time, cost, etc.); points: functional simulators, arch. simulators, FPGA emulation, target chip]
[*] Takamaeda-Yamazaki et al., “An FPGA-based scalable simulation accelerator for tile architectures”, ACM SIGARCH Computer Architecture News, December 2011

PROPOSED APPROACH

INPUTS FROM THE USER
• Detailed architectural model of the target
  • Comprises a set of parallel PEs and a memory subsystem
  • HetSim also supports a standalone mode that doesn’t require detailed PE models
• Power model of the target (optional)
  • Any power modeling tool compatible with the architectural model
• Multithreaded C/C++ implementation of target application
  • Modified to use “primitive” calls
  • Primitives refer to HW functionality that can be emulated in SW
  • HetSim supports a set of primitives commonly used in accelerators
  • Can also be extended by the user – more on that later
[Figure: Target Architecture Model (PEs, Memory Subsystem) and Power Model]
Application code (CPP, excerpt):
    [...]
    t_args[tid].gbar = global_bar;
    pthread_create(threads+i, NULL, work, &t_args[i]);
    [...]
    __push(i + 1, (uintptr_t)global_bar);
    [...]
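To make "HW functionality that can be emulated in SW" concrete, here is a minimal sketch of a push/pop pair emulated with software FIFOs. It assumes that a push names its destination PE and a pop names its source PE, which is suggested by the `__push(i + 1, ...)` and `__pop(0)` calls in the slides but not stated by them; the names `push`, `pop`, `run` and the per-pair queue layout are hypothetical and are not HetSim's emulation library.

```python
import queue
import threading

NUM_WORKERS = 4

# Hypothetical layout: one FIFO per (source PE, destination PE) pair.
_fifos = {(0, w): queue.Queue() for w in range(1, NUM_WORKERS + 1)}
_local = threading.local()  # holds the PE id of the calling thread


def push(dst, value):
    """Emulated primitive: send `value` from the calling PE to PE `dst`."""
    _fifos[(_local.pe_id, dst)].put(value)


def pop(src):
    """Emulated primitive: block until PE `src` has sent a value."""
    return _fifos[(src, _local.pe_id)].get()


def worker(pe_id, results):
    _local.pe_id = pe_id
    results[pe_id] = pop(0) * 2  # receive from the manager (PE 0)


def run():
    _local.pe_id = 0  # the calling thread plays the manager
    results = {}
    threads = [threading.Thread(target=worker, args=(w, results))
               for w in range(1, NUM_WORKERS + 1)]
    for t in threads:
        t.start()
    for w in range(1, NUM_WORKERS + 1):
        push(w, w + 100)
    for t in threads:
        t.join()
    return results
```

Running the application against an emulation layer of this kind on the native machine is what allows functional correctness to be checked before any trace is generated.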

INPUTS FROM THE USER
• Tracing specification for primitives in the target hardware
  • Addresses to which loads/stores should be captured
  • Types of IR-level instructions to be captured
  • Latencies and token-names for primitives, etc.
• Native SMP machine with sufficient resources for trace generation and replay
  • More threads, faster secondary storage → faster trace generation
  • Faster single-threaded performance, more memory capacity → faster trace replay

STEP 1: GENERATING COMPILER PLUGIN
• This is done once for each target architecture
• The generation module takes the user specification as input
• Processes it in 2 phases – JSON parsing and model-specific extraction
[Figure: Automatic Generation of Tracing Infrastructure. Tracing Specification (JSON) → Generator (generate_model.py) → clang Plugin and Tracing Library]

generate_model.py (excerpt):
    class function():
        def __init__(self, pe, func, args):
            self.func_signature = func
            # token name without its operand placeholder, e.g. "PUSH #0" -> "PUSH"
            self.token = args["token"].split(' ')[0]
            # PE name + function name without the parameter list
            self.trace_func_llvm_name = pe.name + func.split('(')[0]
            self.trace_func = "_C_" + self.trace_func_llvm_name
            self.runtime_func = func
            self.cycles = args["cycles"]
            …

    class processing_element():
        def __init__(self, name):
            self.ids = []
            self.num_elems = len(self.ids)
            self.functions = []
            self.name = name

        def add_func(self, func, args):
            self.functions.append(function(self, func, args))

    def parse(json):
        addr_list = []
        pe_list = []
        # address ranges whose loads/stores are traced
        for entry in json["addr_space"]:
            pair = (json["addr_space"][entry]['start'],
                    json["addr_space"][entry]['end'])
            addr_list.append(pair)
        # one processing_element per PE type, holding its enabled primitives
        for pe in json["pe"]:
            PE = processing_element(pe)
            PE.ids.append(json["pe"][pe]["id"])
            for entry in json["pe"][pe]:
                if entry != "id":
                    if json["pe"][pe][entry]["enable"] == 1:
                        PE.add_func(entry, json["pe"][pe][entry])
            pe_list.append(PE)
        llvm_instr = json["llvm_instr"]
        return llvm_instr, addr_list, pe_list

spec.json (excerpt):
    "llvm_instr": ["Add", "Mul"],
    "pe": {
      "mgr": {
        "id": [0],
        "__push(unsigned int, unsigned long)": { "token": "PUSH #0", "cycles": 1, "enable": 1 },
        …
      "wrkr": {
        "id": [1, 2, 3, 4, 5, 6, 7, 8],
        "__pop(unsigned int)": { "token": "POP #0", "cycles": 1, "enable": 1 },
        …
    }
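The parsing phase above can be tried out on a small, self-contained specification. The sketch below follows the naming rule visible in the excerpt (`"_C_"` + PE name + function name), but the function `enabled_primitives` is hypothetical, and the `addr_space` entry and the disabled `__barrier_wait` entry are made-up fillers for parts the slide elides.

```python
import json

# Specification modeled on the slide's spec.json excerpt; the "addr_space"
# block and the "__barrier_wait" entry are hypothetical additions.
SPEC = json.loads("""
{
  "llvm_instr": ["Add", "Mul"],
  "addr_space": {"heap": {"start": "0x1000", "end": "0x2000"}},
  "pe": {
    "mgr":  {"id": [0],
             "__push(unsigned int, unsigned long)":
                 {"token": "PUSH #0", "cycles": 1, "enable": 1}},
    "wrkr": {"id": [1, 2, 3, 4, 5, 6, 7, 8],
             "__pop(unsigned int)":
                 {"token": "POP #0", "cycles": 1, "enable": 1},
             "__barrier_wait(pthread_barrier_t*)":
                 {"token": "BARWAIT #0", "cycles": 10, "enable": 0}}
  }
}
""")


def enabled_primitives(spec):
    """Return one record per enabled primitive, with its derived names."""
    records = []
    for pe_name, pe in spec["pe"].items():
        for signature, args in pe.items():
            if signature == "id" or args["enable"] != 1:
                continue  # skip the id list and disabled primitives
            records.append({
                "pe": pe_name,
                "signature": signature,
                "token": args["token"].split(" ")[0],
                "trace_func": "_C_" + pe_name + signature.split("(")[0],
                "cycles": args["cycles"],
            })
    return records
```

Calling `enabled_primitives(SPEC)` yields records for `_C_mgr__push` and `_C_wrkr__pop`; the disabled barrier entry is filtered out, which is the same effect the `enable` check has in `parse`.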

GENERATING COMPILER PLUGIN
• The next steps generate an architecture-specific compiler pass and a tracing library

generate_model.py (excerpt):
    def generate(pes):
        for pe in pes:
            for f in pe.functions:
                if count == 0:
                    identify_call += '\nif (std::string("' + f.func_signature + '''") == func_name) { emit_count++; llvm::SmallVector<llvm::Value*, 2> args; \nintrinsics.push_back({&I, std::string("''' + f.func_signature + '"), args}); \n}'''
        …

hetsim-analysis.cpp (generated clang plugin, excerpt):
    …
    // identify loads, stores, stalls
    identifyLSS();
    // identify intrinsics
    if (I.getOpcode() == llvm::Instruction::Call) {
      // getCalledFunction() returns a pointer, hence ->
      std::string func_name = cast<CallInst>(I).getCalledFunction()->getName().str();
      if (std::string("__pop(int)") == func_name) {
        emit_count++;
        llvm::SmallVector<llvm::Value*, 2> args;
        intrinsics.push_back({&I, std::string("__pop(int)"), args});
      }
      …
    // instrument loads, stores, stalls
    instrumentLSS();
    // instrument intrinsics
    for (auto I : intrinsics) {
      if (I.trace_func == std::string("__pop(int)")) {
        if (auto *func = M.getFunction(intrinsic_trace_map[I.trace_func])) {
          llvm::SmallVector<llvm::Value*, 1> args;
          args.push_back(I.instr->getOperand(0));
          // insert the tracing call before the original instruction
          auto *call = llvm::CallInst::Create(llvm::cast<llvm::Function>(func), args, "", I.instr);
        }
      }
      …
    }

hetsim-default_rt.cpp (generated tracing library, excerpt):
    void _C_wrkr__pop(unsigned int arg0) {
      if (is_log_open()) {
        __register_entry(get_id(), "POP " + std::to_string((uint64_t)arg0) + " 1\n");
      }
    }

    void _C_wrkr__barrier_init(pthread_barrier_t* arg0, unsigned int arg1) {
      if (is_log_open()) {
        __register_entry(get_id(), "BARINIT " + toHex(arg0) + " " + std::to_string((uint64_t)arg1) + " 10\n");
      }
    }

    void _C_wrkr__barrier_wait(pthread_barrier_t* arg0) {
      if (is_log_open()) {
        __register_entry(get_id(), "BARWAIT " + toHex(arg0) + " 10\n");
        __register_entry(get_id(), "LDB " + toHex(arg0) + " ( )\n");
        __register_entry(get_id(), "STB " + toHex(arg0) + " ( )\n");
      }
    }
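The tracing library shown above is itself generated text, so the generator boils down to string templating over the parsed specification. The following is a minimal sketch of that idea, not the actual generate_model.py: `make_tracing_stub` is a hypothetical helper, it only handles integer arguments (pointer arguments such as `pthread_barrier_t*` would need the `toHex` path seen in the excerpt), and the emitted C++ merely imitates the shape of `_C_wrkr__pop`.

```python
def make_tracing_stub(pe_name, signature, token, cycles):
    """Emit C++ source for one tracing function (integer arguments only)."""
    # "__pop(unsigned int)" -> name "__pop", parameter types ["unsigned int"]
    name, params = signature.rstrip(")").split("(", 1)
    types = [p.strip() for p in params.split(",") if p.strip()]
    c_params = ", ".join(f"{t} arg{i}" for i, t in enumerate(types))
    # Each argument is appended to the trace entry as a decimal string.
    fields = "".join(
        f' + " " + std::to_string((uint64_t)arg{i})' for i in range(len(types))
    )
    return (
        f"void _C_{pe_name}{name}({c_params}) {{\n"
        f"  if (is_log_open()) {{\n"
        f'    __register_entry(get_id(), std::string("{token}"){fields}'
        f' + " {cycles}\\n");\n'
        f"  }}\n"
        f"}}\n"
    )
```

For example, `print(make_tracing_stub("wrkr", "__pop(unsigned int)", "POP", 1))` prints a function named `_C_wrkr__pop` that registers a `POP <arg0> 1` entry. Generating the stubs from the specification means the plugin and the library cannot drift apart on function names.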

COMPILATION AND EMULATOR VERIFICATION
[Figure: compilation flow. Labels: clang/gcc, Emulation Lib (Primitive #1, Primitive #2, Primitive #3), Application Binary, Native Execution on SMP, Functionally Verified Code, clang, clang Plugin, Tracing Library, Instrumented Binary]
• Two binaries are generated using a pre-packaged library of primitives for emulation
  • The un-instrumented binary is run on the native machine to verify functional correctness
  • The instrumented binary is used for trace generation
• Instrumented functions include
  • Implicit primitives: memory accesses, compute operations and misc. operations
  • Explicit primitives: hardware-specific functionality modeled as function calls

TRACE GENERATION AND REPLAY
• Instrumentation calls are made to the tracing library, which emits trace tokens to file(s)
• The tracing API is exposed to the user for manual instrumentation, e.g. if fine-grained performance tuning is required
• The instrumented binary is run on the native machine to generate trace file(s)
• The detailed model is modified to swap out the cores with trace replay engines (TREs)
• The traces are fed through the model and simulation is run

workq_mutex.cpp (excerpts):
    // loop over nsteps that is set by the mgr
    for (int i = 0; i < nsteps; ++i) {
      // increment my_shared_var by thread ID
      *my_shared_var += tid;
      …
    …
    for (int i = 0; i < ITER; ++i) {
      // receive ack from mgr to go
      __pop(0);

workq_mutex.ll (excerpt; Instrumentation of Tracing Calls into the IR):
    …
    %gbar2 = getelementptr inbounds i8, i8* %args, i64 8
    %2 = bitcast i8* %gbar2 to %union.pthread_barrier_t**
    %3 = call i32 (...) @emit_load(%union.pthread_barrier_t** %2, i32 1, i32 0)
    %4 = load %union.pthread_barrier_t*, %union.pthread_barrier_t** %2, align 8, !tbaa !8
    …
    %conv = trunc i64 %call to i32
    %7 = call i32 (...) @_C_wrkr__pop(i32 0)
    %call3 = tail call i64 @_Z5__popj(i32 0)
    %8 = inttoptr i64 %call3 to i32*
    …

[Figure: Trace Generation and Replay Step. Labels: Instrumented Binary, Native Execution on SMP, trace files (TRC), TRE Library, Target Architecture Model (compute subsystem swapped for TREs, Memory Subsystem), Power Model, Performance Estimates, Power Estimates]

TRACE FORMAT
• Figure shows generated traces for two PEs, for the example of a vector-addition application
• PE-to-PE communication is achieved through tokens corresponding to primitives
• Each memory access token is annotated with the virtual PC (prefixed with @)
• The arrows indicate dependencies captured between the loads and the addition operation
• This enables modeling multiple outstanding accesses by tracking these dependencies
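Why dependency annotations matter for timing can be shown with a toy replay loop. The sketch below is an illustration only: the `(name, latency, deps)` tuples are a hypothetical stand-in for HetSim's trace tokens, the latencies are made up, and the model (in-order issue, one token per cycle, a token waits only for the tokens it depends on) is an assumption rather than the TRE's actual behavior.

```python
def replay(tokens, blocking=False):
    """Replay (name, latency, deps) tokens in order; return total cycles.

    blocking=False: a token stalls only until its listed dependencies are
    done, so independent loads overlap (multiple outstanding accesses).
    blocking=True: every token must finish before the next one issues.
    """
    done = {}   # token name -> cycle at which its result is available
    issue = 0   # earliest cycle at which the next token may issue
    for name, latency, deps in tokens:
        start = max([issue] + [done[d] for d in deps])
        done[name] = start + latency
        issue = done[name] if blocking else start + 1
    return max(done.values())


# One iteration of vector addition c[i] = a[i] + b[i] (hypothetical latencies).
VECTOR_ADD = [
    ("ld_a", 10, []),
    ("ld_b", 10, []),
    ("add", 1, ["ld_a", "ld_b"]),
    ("st_c", 10, ["add"]),
]
```

With these numbers, `replay(VECTOR_ADD)` finishes in 22 cycles because the two loads overlap, whereas `replay(VECTOR_ADD, blocking=True)` takes 31. A trace without dependency information would force the second, more pessimistic model.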

USING AND EXTENDING HETSIM

HETSIM USE CASES
[Figure: Typical Usage Flow using HetSim with a New Target Architecture. Steps shown: Modify user specification for target (JSON); Implement model in gem5 and hook up TREs; Write/modify app using primitives (CPP); Verify functionality of app; Generate compiler plugin and tracing lib; Generate traces using plugin (TRC); Run on TRE-enabled model; Gather statistics]
• In addition to supporting an arbitrary architectural model compatible with gem5-SE, HetSim enables
  • Profiling new applications on existing models
  • Exploring new design points on an existing model
  • Simulating with modifications to an existing model
  • Design space exploration with different PE types
  • Modeling ISA extensions
• HetSim provides a standalone mode, where the TREs are directly used in the model instead of detailed cores

SUPPORT FOR CUSTOM PRIMITIVES – AN EXAMPLE
[Code panels: Emulation library code (emu/src/util.cpp); User spec file (spec/spec.json); gem5 TRE code (gem5/src/cpu/tre.cpp)]
• Emulation library: the corresponding __wake() call notifies this cv after incrementing sleep_cntr
• gem5 TRE: the TRE returns without scheduling tick(); tick() is instead scheduled by the TRE that calls WAKE
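The emulation side of such a sleep/wake primitive can be sketched with a condition variable and a counter, following the slide's note that `__wake()` increments `sleep_cntr` and then notifies the cv. Everything else below is an assumption: the class name `SleepWake`, the decrement on wake-up, and the single-sleeper demo are hypothetical and are not taken from emu/src/util.cpp.

```python
import threading


class SleepWake:
    """Counter-based sleep/wake pair guarded by one condition variable."""

    def __init__(self):
        self._cv = threading.Condition()
        self._sleep_cntr = 0  # number of wake-ups not yet consumed

    def sleep(self):
        """Block the calling PE until a wake-up is available."""
        with self._cv:
            while self._sleep_cntr == 0:
                self._cv.wait()
            self._sleep_cntr -= 1  # consume one wake-up (assumed behavior)

    def wake(self):
        """Increment the counter first, then notify, as the slide describes."""
        with self._cv:
            self._sleep_cntr += 1
            self._cv.notify()


def demo():
    sw = SleepWake()
    log = []

    def sleeper():
        sw.sleep()
        log.append("awake")

    t = threading.Thread(target=sleeper)
    t.start()
    log.append("waking")
    sw.wake()
    t.join()
    return log
```

Keeping a counter rather than a bare notification means a wake-up issued before the sleeper reaches `sleep()` is not lost, which matters when the two PEs race on the native machine.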

EVALUATION

EXPERIMENTAL SETUP
• Evaluated HetSim on two target architectures
• Transmuter is a recently proposed reconfigurable architecture [*]
  • Tiled architecture with one LCP core and many tiny in-order cores (GPEs) per tile with L1/L2 level that can be reconfigured
  • Evaluated three workloads (GeMM, GeMV and SpMM)
  • Implemented in two hardware configurations (shared cache – shared cache (S) and private cache – private cache (P))
• An SpMM accelerating chip composed of three PE types [**]
  • A fixed function PE, an Arm Cortex-M0 and an Arm Cortex-M4F
• Native system: a 32-core AMD Ryzen Threadripper with 128 GB main memory

Workloads evaluated:
    Workload   Arith. Intensity   Boundedness
    GeMM       High               Compute
    GeMV       Med                Memory
    SpMM       Low                Memory

[Figure: High-Level Overview of Transmuter [*] and Workloads Evaluated. Labels: general purpose processing element, local control processor, crossbar, caches ($), memory]
[Figure: 40 nm SpMM Accelerating Chip with Overview of a Tile [**]. Labels: PE, M0, M4F, L0$/SPM, crossbar]

[*] S. Pal, S. Feng, D.-h. Park, S. Kim, A. Amarnath, C.-S. Yang, X. He, J. Beaumont, K. May, Y. Xiong, K. Kaszyk, J. Morton, J. Sun, M. O’Boyle, M. Cole, C. Chakrabarti, D. Blaauw, H.-S. Kim, T. Mudge, R. Dreslinski, “Transmuter: Bridging The Efficiency Gap Using Memory And Dataflow Reconfiguration”, PACT 2020
[**] S. Pal, D.-h. Park, S. Feng, P. Gao, J. Tan, A. Rovinski, S. Xie, C. Zhao, A. Amarnath, T. Wesley, J. Beaumont, K.-Y. Chen, C. Chakrabarti, M. Taylor, T. Mudge, D. Blaauw, H.-S. Kim and R. Dreslinski, “A 7.3 M Output Non-Zeros/J Sparse Matrix-Matrix Multiplication Accelerator Using Memory Reconfiguration in 40 nm”, VLSI 2019

SCALABILITY STUDIES USING HETSIM
[Figure: Scalability and Timing Profile of HetSim with Transmuter Target. Simulation time (s), split into trace generation and trace replay, for system sizes from 4x8 to 64x64; panels: GeMM (dim. 512), GeMV (dim. 8k), SpMM – Multiply and SpMM – Merge (dim. 4k, den. 0.64%)]
[Figure: HetSim-based Simulation Results Showing Bandwidth and Throughput Scaling with Number of Cores on Transmuter]
• Trace generation overhead plateaus with increasing number of simulated cores, except for SpMM
• Trace replay using gem5 scales with the simulated system size
• Trace generation is a one-time cost for experiments that do not affect the instructions executed
  • E.g. sizing hardware structures, sizing bus widths, sweeping clock speed, sweeping cache capacities, etc.
• In our experiments, the cost for trace generation is 0.1-2.5× the time for one trace replay run
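The one-time nature of trace generation can be put into numbers. The sketch below expresses all times in units of one trace replay run and uses the 0.1-2.5× range reported on this slide; the function names and the choice of a 10-point sweep in the usage note are hypothetical.

```python
def sweep_cost(gen_over_replay, num_runs):
    """Total cost of a sweep, in units of one replay run, when the trace
    is generated once and replayed num_runs times."""
    return gen_over_replay + num_runs


def overhead_per_run(gen_over_replay, num_runs):
    """Trace generation cost amortized over the runs of the sweep."""
    return gen_over_replay / num_runs
```

For a 10-point sweep of, say, cache capacities, the worst reported ratio of 2.5× gives `sweep_cost(2.5, 10) == 12.5` replay-run equivalents, i.e. an amortized overhead of 0.25 per run; at the best reported ratio of 0.1× the overhead per run is about 0.01.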

CORRELATION & COMPARISON W/ DETAILED MODEL
• Speedup of 3.2-10.4× (avg. 5.0×)
• Runtime deviation: 0.2-57.0% (avg. 15.1%)
• Power deviation: 0.0-24.2% (avg. 10.9%)
[Figure panels: GeMM, GeMV, SpMM]
Callouts on the figure:
• Accuracy improves with increasing problem size
• Accuracy for private cache is better because it incurs more cache misses
• Blocking version is more accurate because of self-correction with synch.
• Power deviations are small since traces mostly consist of memory accesses
• Differences in # accesses due to dynamism such as contention for mutexes
• Speedups improve for larger problem sizes

ACCURACY OF DESIGN-SPACE EXPLORATION
[Figures: Performance with Problem Size for GeMM; Performance with Problem Size for GeMV; Performance with Problem Size for SpMM. Each experiment is marked with the faster configuration, P or S]
• For GeMM, HetSim made the same prediction about the faster configuration in 7 of 8 experiments
• For both GeMV and SpMM, HetSim achieved 100% accuracy in predicting the faster Transmuter configuration
• Takeaway: HetSim can be used effectively despite % timing errors for sweeps of categorical parameters

EVALUATION AND CORRELATION WITH CHIP
[Figure: Performance with Number of PEs and Available Bandwidth for SpMM]
• HetSim in standalone mode was used to model the 40 nm chip with three heterogeneous PE types
• Average error in estimated performance of HetSim compared to the chip
  • 32% for the multiply phase
  • 16% for the merge phase
• The higher error for multiply is due to the behavior of the custom off-chip memory interface, which is modeled approximately using the stock DDR model in gem5

KNOWN SOURCES OF ERROR
• Effect of not modeling pipelined execution within a PE
• Effect of frequent synchronization and busy-waiting
• Differences due to trace generation on a system with a different ISA
• Effect of instrumenting at the RISC-like LLVM-IR level
• Impact of over-filtering primitives in the user specification file
• Effect of bandwidth sharing and impact due to not accounting for I-cache misses
• Power impact of instruction fetch/decode in the pipeline, and SRAM accesses within I-caches

CONCLUSION
• Proposed HetSim to address the issue of simulating heterogeneous systems with thousands of cores within practical time and resource limitations
• Introduced the notion of hardware primitives and implemented common primitives that are packaged with the HetSim release
• Enabled support for complex in-order cores and prefetching by embedding crucial information, e.g. dependent memory addresses and program counter values, within traces
• HetSim speeds up measured simulation times by 3.2-10.4× over detailed gem5 models
• Deviations of 0.2-57.0% and 0-24.2% in terms of simulated time and power, respectively

HETSIM IS AVAILABLE ON GITHUB!
https://github.com/umich-cadre/hetsim-gem5/tree/master/demos/iiswc-20/tutorial.ipynb

ACKNOWLEDGEMENT
• The material is based on research sponsored by Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) under agreement number FA8650-18-2-7864. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) or the U.S. Government.

THANK YOU FOR ATTENDING!
For questions, please email Subhankar Pal at subh – AT – umich.edu