Integration for Heterogeneous SoC Modeling
Yakun Sophia Shao, Sam Xi, Gu-Yeon Wei, David Brooks
Harvard University
- Slides: 54

Today’s Accelerator-CPU Integration
• Simple interface to accelerators: DMA
• Easy to integrate lots of IP
• Hard to program and share data
[Figure: block diagram with Core, L1 $, L2 $, Acc #1 … Acc #n, SPAD, DMA, On-Chip System Bus, DRAM]

Typical DMA Flow
• Flush and invalidate input data from CPU caches.
• Invalidate a region of memory to be used for receiving accelerator output.
• Program a buffer descriptor describing the transfer (start, length, source, destination).
  – When data is large, program multiple descriptors.
• Initiate accelerator.
• Initiate data transfer.
• Wait for accelerator to complete.

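The descriptor step in this flow can be sketched in C. This is a minimal model under stated assumptions, not a real driver: the `dma_desc` struct, the `MAX_DESC_BYTES` limit, and `build_descriptors` are hypothetical names chosen for illustration, and real DMA engines define their own descriptor layout and size limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-descriptor size limit; real engines differ. */
#define MAX_DESC_BYTES 4096u

/* Hypothetical buffer descriptor: one contiguous transfer. */
typedef struct {
    uintptr_t src;    /* source address */
    uintptr_t dst;    /* destination address */
    size_t    length; /* bytes to move */
} dma_desc;

/* Split one transfer of `total` bytes into descriptors of at most
 * MAX_DESC_BYTES each. Returns the number of descriptors written.
 * Returns 0 for an empty transfer or if `out` (capacity `cap`) is
 * too small to hold them all. */
size_t build_descriptors(uintptr_t src, uintptr_t dst, size_t total,
                         dma_desc *out, size_t cap)
{
    assert(out != NULL);
    size_t needed = (total + MAX_DESC_BYTES - 1) / MAX_DESC_BYTES;
    if (needed > cap)
        return 0;
    for (size_t i = 0; i < needed; i++) {
        size_t offset = i * MAX_DESC_BYTES;
        size_t remaining = total - offset;
        out[i].src = src + offset;
        out[i].dst = dst + offset;
        out[i].length = remaining < MAX_DESC_BYTES ? remaining
                                                   : MAX_DESC_BYTES;
    }
    return needed;
}
```

With these placeholder values, a 10,000-byte transfer needs three descriptors: two full 4,096-byte ones and a final 1,808-byte one.
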
DMA can be very expensive
• Only 20% of total time!
[Figure: 16-way parallel md-knn accelerator]

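A rough bound makes the point concrete. It rests on two assumptions that the slide text does not confirm: that the 20% is the share of end-to-end time spent in accelerator compute (with the rest going to cache flushes and DMA transfers), and that extra datapath parallelism speeds up only that compute share, by a factor $s$. Amdahl's law then gives:

```latex
\[
\text{speedup}(s) = \frac{1}{(1 - f) + f/s}, \qquad f = 0.2
\quad\Longrightarrow\quad
\lim_{s \to \infty} \text{speedup}(s) = \frac{1}{1 - f} = \frac{1}{0.8} = 1.25
\]
```

Under those assumptions, end-to-end speedup from a more parallel datapath alone is capped at 1.25x, however aggressive the design.
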
Co-Design vs. Isolated Design
• No need to build such an aggressively parallel design!

gem5-Aladdin: An SoC Simulator

Features
• End-to-end simulation of accelerated workloads.
• Models hardware-managed caches and DMA + scratchpad memory systems.
• Supports multiple accelerators.
• Enables system-level studies of accelerator-centric platforms.
• Xenon: A powerful design sweep system.
• Highly configurable and extensible.

DMA Engine
• Extend the existing DMA engine in gem5 to accelerators.
• Special dmaLoad and dmaStore functions.
  – Insert into accelerated kernel.
  – Trace will capture them.
  – gem5-Aladdin will handle them.
• Data is sent back and forth as required.
• Analytical model for cache flush and invalidation latency.

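The slides do not give the form of the analytical model for flush and invalidation latency, so the following is only one plausible shape for it: latency proportional to the number of cache lines touched. The function names and the per-line cost passed in are placeholders, not gem5-Aladdin parameters.

```c
#include <assert.h>
#include <stddef.h>

/* Number of cache lines covering `bytes` bytes, assuming the buffer
 * starts on a cache-line boundary. */
size_t lines_covering(size_t bytes, size_t line_bytes)
{
    assert(line_bytes > 0);
    return (bytes + line_bytes - 1) / line_bytes;
}

/* Linear latency model (an assumption): a fixed cost per cache line
 * flushed or invalidated. */
double cache_op_latency_ns(size_t bytes, size_t line_bytes,
                           double ns_per_line)
{
    return (double)lines_covering(bytes, line_bytes) * ns_per_line;
}
```

For example, a 4,096-byte buffer with 64-byte lines covers 64 lines, so the modeled latency is 64 times whatever per-line cost is measured on the target platform.
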
DMA Engine
/* Code representing the accelerator */
void fft1D_512(TYPE work_x[512], TYPE work_y[512]) {
    int tid, hi, lo, stride;
    /* more setup */

    dmaLoad(&work_x[0], 0, 512 * sizeof(TYPE));
    dmaLoad(&work_y[0], 0, 512 * sizeof(TYPE));

    /* Run FFT here... */

    dmaStore(&work_x[0], 0, 512 * sizeof(TYPE));
    dmaStore(&work_y[0], 0, 512 * sizeof(TYPE));
}

Caches and Virtual Memory
• Gaining traction on multiple platforms.
  – Intel QuickAssist QPI-Based FPGA Accelerator Platform (QAP)
  – IBM POWER8’s Coherent Accelerator Processor Interface (CAPI)
• System vendors provide a Host Service Layer with virtual memory and cache coherence support.
• Host service layer communicates with CPUs through an agent.
[Figure: Processors (Core, L1 $, L2 $) linked over QPI/PCIe to an FPGA containing the Acc Agent, Host Service Layer, and Accelerator]

Caches and Virtual Memory
• Accelerator caches are connected directly to system bus.
• Support for multi-level cache hierarchies.
• Hybrid memory system: can use both caches and scratchpads.
• MOESI coherence protocol.
• Special Aladdin TLB model.
  – Map trace address space to simulated address space.

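The trace-to-simulated address mapping can be sketched as a per-array base-and-offset translation. This is an illustrative model only: the `array_mapping` struct and `translate` function are hypothetical, and the slides do not describe how the Aladdin TLB model is implemented internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mapping record for one array. */
typedef struct {
    const char *name;       /* array name as it appears in the trace */
    uintptr_t   trace_base; /* base address recorded in the trace */
    uintptr_t   sim_base;   /* base address in the simulated system */
    size_t      size;       /* length of the region in bytes */
} array_mapping;

/* Translate a trace address for array `name` into a simulated address.
 * Returns 1 and writes *sim_addr on success; returns 0 if the array is
 * unknown or the address falls outside the mapped region. */
int translate(const array_mapping *maps, size_t n, const char *name,
              uintptr_t trace_addr, uintptr_t *sim_addr)
{
    assert(maps != NULL && name != NULL && sim_addr != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(maps[i].name, name) != 0)
            continue;
        if (trace_addr < maps[i].trace_base)
            return 0;
        uintptr_t offset = trace_addr - maps[i].trace_base;
        if (offset >= maps[i].size)
            return 0;
        *sim_addr = maps[i].sim_base + offset;
        return 1;
    }
    return 0;
}
```

Keying the lookup on the array name rather than on the raw address reflects the idea that trace addresses come from a different run of the program than the simulated one, so only the offset within each named array is meaningful in both.
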
Two ways to run gem5-Aladdin
• Standalone
  – Aladdin + gem5 memory system models
  – No CPUs in the system
  – Easily test accelerator and memory system designs
• With-CPU
  – Write user program to invoke one or more accelerators.
  – Evaluate end-to-end workload performance.

Validation
• Implemented accelerators in Vivado HLS.
• Designed complete system in Vivado Design Suite 2015.1.

CASE STUDY: REDUCING DMA OVERHEADS

Reducing DMA Overhead

DMA Optimization Results
• Overlap of flush and data transfer
• Overlap of data transfer and compute
• md-knn is able to completely overlap computation with communication!

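The benefit of overlapping data transfer with compute can be estimated with a simple two-stage pipeline model. This is a sketch under a strong simplifying assumption that every chunk takes the same number of cycles to transfer and to compute. The function names are illustrative and not part of gem5-Aladdin.

```c
#include <assert.h>

/* No overlap: each chunk is transferred, then computed, in sequence. */
unsigned long serial_cycles(unsigned long chunks, unsigned long xfer,
                            unsigned long comp)
{
    return chunks * (xfer + comp);
}

/* Two-stage pipeline: chunk i+1 is transferred while chunk i computes.
 * The first transfer and the last compute cannot be hidden; in between,
 * the slower of the two stages sets the pace. */
unsigned long overlapped_cycles(unsigned long chunks, unsigned long xfer,
                                unsigned long comp)
{
    assert(chunks > 0);
    unsigned long slower = xfer > comp ? xfer : comp;
    return xfer + (chunks - 1) * slower + comp;
}
```

In this model, when compute per chunk is at least as long as the transfer, every transfer after the first is hidden behind compute, which is the situation the md-knn result describes.
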
CPU – Accelerator Cosimulation
• CPU can invoke an attached accelerator.
• We use the ioctl system call.
• Communicate status through shared memory.
• Spin wait for accelerator, or do something else (e.g. start another accelerator).

Code example
/* Code running on the CPU. */
void run_benchmark(TYPE work_x[512], TYPE work_y[512]) {
    /* Establish a mapping from simulated to trace address space.
     *   Argument 1: ioctl request code.
     *   Argument 2: array name, associated with the addresses of
     *               memory accesses in the trace.
     *   Arguments 3-4: starting address and length of one memory
     *               region that the accelerator can access.
     * The length is written as 512 * sizeof(TYPE) because an array
     * parameter decays to a pointer, so sizeof(work_x) here would
     * yield the size of a pointer, not of the array. */
    mapArrayToAccelerator(MACHSUITE_FFT_TRANSPOSE, "work_x",
                          work_x, 512 * sizeof(TYPE));
    mapArrayToAccelerator(MACHSUITE_FFT_TRANSPOSE, "work_y",
                          work_y, 512 * sizeof(TYPE));

    // Start the accelerator and spin until it finishes.
    invokeAcceleratorAndBlock(MACHSUITE_FFT_TRANSPOSE);
}

One accelerator, multiple calls
• Call an accelerated function in a loop with different data each time.
• Build the trace as usual.
• Trace will contain all iterations of this loop.
• Aladdin identifies call and ret instructions to mark as boundaries of an invocation.
• On each invocation, Aladdin reads only the part of the trace between one call and its ret, then continues as usual.
• On the next iteration, Aladdin resumes reading the trace at the last position.
[Figure: CPU code alternating with ACCEL invocations for i=0, i=1, i=2; in the trace, each invocation is bracketed by call and ret]

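The call/ret boundary scheme can be sketched as a cursor that advances through the trace one invocation at a time. This is an illustrative model, not Aladdin's trace reader: the `op_kind` enum and `read_invocation` are hypothetical, a real trace holds full instruction records, and the sketch assumes the kernel contains no nested calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical trace record kinds. */
typedef enum { OP_CALL, OP_RET, OP_OTHER } op_kind;

/* Read one invocation starting at *pos: skip to the next OP_CALL, then
 * consume records up to the next OP_RET (no nested calls assumed).
 * Returns the number of records strictly between the call and the ret,
 * and leaves *pos just past the ret so the next invocation resumes
 * there. Returns -1, leaving *pos unchanged, if no complete invocation
 * remains. */
long read_invocation(const op_kind *trace, size_t len, size_t *pos)
{
    assert(trace != NULL && pos != NULL);
    size_t i = *pos;
    while (i < len && trace[i] != OP_CALL)
        i++;
    if (i == len)
        return -1;
    long body = 0;
    for (i = i + 1; i < len; i++) {
        if (trace[i] == OP_RET) {
            *pos = i + 1;
            return body;
        }
        body++;
    }
    return -1;
}
```

Calling `read_invocation` repeatedly with the same `pos` variable walks through the loop iterations in order, one per accelerator invocation.
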
Multiple accelerators
• Build the trace as usual. Then, either:
• Divide it up into separate traces, one for each kernel.
  – In the user code, we call invokeAccelerator() with a different request code for each accelerator.
  – Easier to distinguish output of different accelerators.
• Or leave it as a single trace.
  – invokeAccelerator() has the same request code each time, even though a different workload is modeled.

How can I use gem5-Aladdin?
• Investigate optimizations to the DMA flow.
• Study cache-based accelerators.
• Study impact of system-level effects on accelerator design.
• Multi-accelerator systems.
• Near-data processing.
• All these will require design sweeps!

Xenon: Design Sweep System
• A small declarative command language for generating design sweep configurations.
• Implemented as a Python embedded DSL.
• Highly extensible.
• Not gem5-Aladdin specific.
• Not limited to sweeping parameters on benchmarks.
• Why “Xenon”?

1,000 ft view of Xenon
• Xenon operates on Python objects and attributes.
• Define a data model.
• Instantiate the data model.
• Execute Xenon commands over the data.

Xenon: Data Model
[Figure: object tree for the md-knn benchmark]
• md-knn: cycle_time, pipelining
• md_kernel
• force_x, force_y, force_z: partition_type, partition_factor, memory_type
• loop_i, loop_j: unrolling

Xenon: Commands
set unrolling 4
set partition_type "cyclic"
set unrolling for md_knn.* 8
set partition_type for md_knn.force_x "block"
sweep cycle_time from 1 to 5
sweep partition_factor from 1 to 8 expstep 2
set partition_factor for md_knn.force_x 8
generate configs
generate trace

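To make the two sweep commands concrete, the sketch below expands them into value lists. It rests on an interpretation the slides do not spell out: that both endpoints are inclusive, that a plain `sweep` steps by 1, and that `expstep 2` multiplies by 2 at each step. The function names are illustrative. Xenon itself is a Python DSL, and this C code only models the arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Expand a linear sweep: start, start+step, ... up to and including end.
 * Returns the number of values written (at most cap). */
size_t sweep_linear(int start, int end, int step, int *out, size_t cap)
{
    assert(step > 0 && out != NULL);
    size_t n = 0;
    for (int v = start; v <= end && n < cap; v += step)
        out[n++] = v;
    return n;
}

/* Expand an exponential sweep: start, start*factor, ... up to and
 * including end. Intended for small ranges; overflow is not checked. */
size_t sweep_exp(int start, int end, int factor, int *out, size_t cap)
{
    assert(start > 0 && factor > 1 && out != NULL);
    size_t n = 0;
    for (int v = start; v <= end && n < cap; v *= factor)
        out[n++] = v;
    return n;
}
```

Under that reading, `cycle_time` takes the 5 values 1 through 5 and `partition_factor` takes the 4 values 1, 2, 4, 8. If the sweeps combine as a full cross product, that is 20 configurations before the per-array `set` override is applied.
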
Xenon: Generation Procedure
1. Read sweep configuration file.
2. Execute sweep commands.
3. Generate all configurations.
4. Export configurations to JSON.
5. Backend: read JSON, rewrite into desired format.
6. Backend: generate any additional outputs.

Xenon: Execute
"Benchmark(\"md-knn\")": {
  "Array(\"NL\")": {
    "memory_type": "cache",
    "name": "NL",
    "partition_factor": 1,
    "partition_type": "cyclic",
    "size": 4096,
    "type": "Array",
    "word_length": 8
  },
  "Array(\"force_x\")": {
    "memory_type": "cache",
    "name": "force_x",
    "partition_factor": 1,
    "partition_type": "cyclic",
    "size": 256,
    "type": "Array",
    "word_length": 8
  },
  "Array(\"force_y\")": {
    "memory_type": "cache",
    "name": "force_y",
    "partition_factor": 1,
    "partition_type": "cyclic",
    "size": 256,
    "type": "Array",
    "word_length": 8
  }
  ...
• Every configuration is exported to a JSON file.
• A backend is then invoked to load this JSON object and write application-specific config files.

gem5-Aladdin
• System effects have significant impacts on accelerator performance and design.
• gem5-Aladdin enables the study of end-to-end accelerated workloads, including data movement, cache coherency, and shared resource contention.
• Download gem5-Aladdin at: http://vlsiarch.eecs.harvard.edu/gem5-aladdin

DEMOS

Demo: DMA
• Exercise: change system bus width and see effect on accelerator performance.
• Open up your VM.
• Go to: ~/gem5-aladdin/sweeps/tutorial/dma/stencil-stencil2d/0
• Examine these files:
  – stencil-stencil2d.cfg
  – ../inputs/dynamic_trace.gz
  – gem5.cfg
  – run.sh

Demo: DMA
• Run the accelerator with DMA simulation.
• Change the system bus width to 32 bits.
  – Set xbar_width=4 in run.sh (the width is given in bytes: 4 bytes = 32 bits).
• Run again.
• Compare results.

Demo: Caches
• Exercise: see effect of cache size on accelerator performance.
• Go to: ~/gem5-aladdin/sweeps/tutorial/cache/stencil-stencil2d/0
• Examine these files:
  – ../inputs/dynamic_trace.gz
  – stencil-stencil2d.cfg
  – gem5.cfg

Demo: Caches
• Run the accelerator with caches simulation.
• Change the cache size to 1kB.
  – Set cache_size = 1kB in gem5.cfg.
• Run again.
• Compare results.
• Play with some other parameters (associativity, line size, etc.).

Demo: disparity
• You can just watch for this one.
• If you want to follow along: ~/gem5-aladdin/sweeps/tutorial/cortexsuite_sweep/0
• This is a multi-kernel, CPU + accelerator cosimulation.

Tutorial References
• Y. S. Shao, S. Xi, V. Srinivasan, G.-Y. Wei, D. Brooks, “Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin,” MICRO, 2016.
• Y. S. Shao, S. Xi, V. Srinivasan, G.-Y. Wei, D. Brooks, “Toward Cache-Friendly Hardware Accelerators,” SCAW, 2015.
• Y. S. Shao and D. Brooks, “ISA-Independent Workload Characterization and its Implications for Specialized Architectures,” ISPASS, 2013.
• B. Reagen, Y. S. Shao, G.-Y. Wei, D. Brooks, “Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware,” ISLPED, 2013.
• Y. S. Shao, B. Reagen, G.-Y. Wei, D. Brooks, “Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures,” ISCA, 2014.
• B. Reagen, B. Adolf, Y. S. Shao, G.-Y. Wei, D. Brooks, “MachSuite: Benchmarks for Accelerator Design and Customized Architectures,” IISWC, 2014.