The MachSuite Benchmark
Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Sam Xi, Gu-Yeon Wei, David Brooks
Who Cares about Accelerators
• Architecture
– Cause: transistor scaling
– Effect: specialization and SoCs
• CAD
– Cause: RTL design costs
– Effect: C-to-RTL tools
• ASICs
– Cause: performance needs
– Effect: build a tuned IC
What’s Next
• Architecture
– System integration
– Composability
– Flexibility
• CAD
– Faster turnaround
– Larger app space
– Complex designs
• ASICs
– Not much change
– Need high-performance ICs
– H.266
What’s Missing
• Architecture
– System integration
– Composability
– Flexibility
• CAD
– Faster turnaround
– Larger app space
– Complex designs
Workload definition, common baseline
• ASICs
– Not much change
– Need high-performance ICs
– H.266
Well-defined specs
Tower of Babel Effect: Big Problem
MachSuite is/has
• 19 application-specific accelerator workloads
• HLS and Aladdin compatible
• Workloads researchers are using today
• Diverse workloads for app space coverage
• Establishes standards without stifling creativity
Why MachSuite
• Existing benchmarks are not applicable/sufficient
• Works with accelerator simulators and CAD tools
• Representative applications covering a wide space
• Kernel selection
• Algorithm choice
• Implementation details
WHY MACHSUITE: COMPARING BENCHMARKS
Existing Benchmarks are Insufficient
High-level synthesis is good at:
• Crypto { AES, DES, SHA }
• Image/Multimedia { Stencils, JPEG, SAD }
• Scientific Codes { GEMM, FFT }
Needs improvement:
• Irregular Behavior { BFS, SPMV CRS }
• Complex App Codes { BackProp, MD }
• Application space coverage
Coverage: 3 of 13 Berkeley Dwarves [CHStone, ISCAS] vs. 12 of 13 Berkeley Dwarves [MachSuite, IISWC/BARC]
Existing Benchmarks not Applicable
• Many existing GPU benchmarks
– Rodinia, Parboil, SHOC, ...
• GPU and accelerator design spaces differ
– Tuned for GPU architecture
– Implemented in CUDA/OpenCL
– GPU workloads are a subset of accelerator workloads
WHY MACHSUITE: SIMULATOR/HLS FRIENDLY
Works with Accelerator CAD Tools
[Figure: C code (functions) plus directives (units, resource sharing, loop pipelining, memory bandwidth) feed high-level synthesis (Vivado HLS), which emits RTL in a hardware description language]
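As a rough sketch of what the "C code plus directives" input can look like, the kernel below annotates a small dot-product loop. The pragma spellings follow Vivado HLS (`PIPELINE`, `ARRAY_PARTITION`); the kernel `dot_product`, the size `N`, and the factor of 8 are hypothetical choices for illustration and are not taken from MachSuite. An ordinary C compiler ignores these pragmas (possibly with a warning), so the same source also runs as plain software.

```c
#include <stdint.h>

#define N 64  /* hypothetical problem size, chosen for illustration */

/* Hypothetical kernel: dot product of two fixed-size integer arrays. */
int32_t dot_product(const int32_t a[N], const int32_t b[N])
{
/* Memory bandwidth: split each array into 8 banks so that several
 * elements can be read in the same cycle. */
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=8 dim=1
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=8 dim=1
    int32_t sum = 0;
    for (int i = 0; i < N; i++) {
/* Loop pipelining: ask the tool to start one iteration per cycle. */
#pragma HLS PIPELINE II=1
        sum += a[i] * b[i];
    }
    return sum;
}
```

The functional behavior stays in portable C, while the directives only steer how the tool maps it to hardware, which is what lets one benchmark source serve both a CAD flow and a simulator.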
Works with Simulators
[Figure labels: MachSuite; functions; unit selection; directives; loop pipelining; memory bandwidth; power/performance trade-off]
WHY MACHSUITE: WORKLOAD DIVERSITY AND COVERAGE
Incorporates Applications of Interest
Covers Application Space
[Figure labels: FFT, GEMM, STENCIL]
12 of 13 Dwarves
MachSuite Design
• Existing benchmarks are not applicable/sufficient
• Works with accelerator simulators and CAD tools
• Representative applications covering a wide space
• Kernel selection
• Algorithm choice
• Implementation details
MACHSUITE DESIGN: KERNEL SELECTION
Kernel Selection
• Kernel = a specific problem
– E.g.: SORT
• The problem
– Not everyone uses the same kernels
– Comparing similar-sounding kernels doesn’t work
Let’s just pick one
MACHSUITE DESIGN: ALGORITHM CHOICE
Algorithm Choice
• Algorithm = a specific solution
– A type of kernel
– E.g.: merge or radix SORT
• The problem
– Reporting only the kernel is too high-level
– Ideal algorithms differ across SoCs
Standardization without limitation
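To make the kernel/algorithm distinction concrete, the sketch below solves one kernel (SORT) with two algorithms (merge and radix). Both functions give the same answer but have very different memory-access and control patterns, which is why naming only the kernel under-specifies a benchmark. The names `sort_merge`, `sort_radix`, and `MAX_N` are hypothetical, and this is illustrative code, not the MachSuite source; the recursion is used only for brevity.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_N 256  /* hypothetical upper bound on input size */

/* Algorithm 1: top-down merge sort using a scratch buffer. */
static void merge_sort_rec(uint32_t *a, uint32_t *tmp, size_t lo, size_t hi)
{
    if (hi - lo < 2) return;              /* 0 or 1 element: already sorted */
    size_t mid = lo + (hi - lo) / 2;
    merge_sort_rec(a, tmp, lo, mid);
    merge_sort_rec(a, tmp, mid, hi);
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[j] < a[i]) ? a[j++] : a[i++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof *a);
}

/* Inputs longer than MAX_N are left untouched in this sketch. */
void sort_merge(uint32_t *a, size_t n)
{
    uint32_t tmp[MAX_N];
    if (n <= MAX_N) merge_sort_rec(a, tmp, 0, n);
}

/* Algorithm 2: least-significant-digit radix sort, 8 bits per pass. */
void sort_radix(uint32_t *a, size_t n)
{
    uint32_t tmp[MAX_N];
    if (n > MAX_N) return;                /* same size limit as above */
    for (unsigned shift = 0; shift < 32; shift += 8) {
        size_t count[257] = {0};
        for (size_t i = 0; i < n; i++)    /* histogram of the current digit */
            count[((a[i] >> shift) & 0xFF) + 1]++;
        for (size_t d = 0; d < 256; d++)  /* prefix sum gives start offsets */
            count[d + 1] += count[d];
        for (size_t i = 0; i < n; i++)    /* stable scatter into tmp */
            tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
}
```

Merge sort is comparison- and branch-heavy, while radix sort is dominated by histogram and scatter memory traffic, so an accelerator tuned for one says little about the other.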
MACHSUITE DESIGN: IMPLEMENTATION DETAILS
Implementation Details
• Implementation = specific code for an algorithm
– E.g.: Stencil in Rodinia vs. Parboil
• The problem
– Can cause misleading results
– Performance depends on tuning
Separate signal from noise
Performance Variance due to Implementation Details
• 1 kernel, 1 algorithm, 2 implementations
• ~10x performance, same power
Root Causing Inefficiency
Same directives:
– Single-port SRAMs
– 8-way partition
– Same loops pipelined
Different implementations for parallel SCAN
What Happened
• “Unoptimized C Code”
– Pipelining result: Target II: 1, Final II: 30
• “Optimized C Code”
– Pipelining result: Target II: 1, Final II: 8
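For context, the initiation interval (II) is the number of cycles between the starts of consecutive loop iterations, so a pipelined loop takes roughly depth + II × (trips − 1) cycles. The helper below is a sketch of that standard estimate. The function name `pipelined_cycles` is hypothetical; only the two II values (30 and 8) come from the slide, and any trip count or pipeline depth passed in is an assumption.

```c
#include <stdint.h>

/* Estimated cycles for a pipelined loop: the first iteration takes
 * `depth` cycles, and each later iteration starts `ii` cycles after
 * the previous one. All three inputs are supplied by the caller. */
uint64_t pipelined_cycles(uint64_t trips, uint64_t ii, uint64_t depth)
{
    if (trips == 0) return 0;   /* an empty loop costs nothing here */
    return depth + ii * (trips - 1);
}
```

For long loops the depth term becomes negligible, so under this model the cycle count of the loop scales almost linearly with the final II the tool achieves.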
What Happened: Unoptimized C Code
for i = 1 : Block
    for radixID : Radix
        bucket[i*Block + radixID] += bucket[i*Block + radixID - 1];
What Happened: Optimized C Code
for radixID : Radix
    for i = 1 : Block
        bucket[i*Block + radixID] += bucket[i*Block + radixID - 1];
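The two loop orders above can be sketched as compilable C. This is a simplified stand-in, not the MachSuite source: the sizes `BLOCKS` and `WIDTH`, the function names, and the loop bounds (the inner index starts at 1 so that no block reads its neighbor) are assumptions made so that the interchange is provably safe.

```c
#include <string.h>

#define BLOCKS 4   /* hypothetical number of blocks */
#define WIDTH  8   /* hypothetical elements per block */

/* Inner loop walks along the dependence: iteration r needs the value
 * written by iteration r-1, so back-to-back iterations cannot overlap. */
void scan_inner_dependent(int bucket[BLOCKS * WIDTH])
{
    for (int i = 0; i < BLOCKS; i++)
        for (int r = 1; r < WIDTH; r++)
            bucket[i * WIDTH + r] += bucket[i * WIDTH + r - 1];
}

/* Interchanged loops: consecutive inner iterations touch different
 * blocks, so they do not depend on each other. */
void scan_inner_independent(int bucket[BLOCKS * WIDTH])
{
    for (int r = 1; r < WIDTH; r++)
        for (int i = 0; i < BLOCKS; i++)
            bucket[i * WIDTH + r] += bucket[i * WIDTH + r - 1];
}

/* Returns 1 if both loop orders compute the same per-block prefix sums. */
int scans_agree(void)
{
    int a[BLOCKS * WIDTH], b[BLOCKS * WIDTH];
    for (int k = 0; k < BLOCKS * WIDTH; k++) a[k] = b[k] = k % 5 + 1;
    scan_inner_dependent(a);
    scan_inner_independent(b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Both versions produce identical results; what changes is whether the loop-carried dependence sits in the innermost, pipelined loop, which is the kind of detail that shifts the final II without changing the kernel or the algorithm.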
Solution
[Figure labels: MEMORY, SCAN, Accelerator; final build marked ✔]
MachSuite
• 19 application-specific accelerator workloads
• Benchmarks work with HLS and Aladdin
• Represents workloads researchers are using
• Diverse workloads, broad application space
• Standards with limited restrictions
MachSuite Available on GitHub
http://breagen.github.io/MachSuite/
Publications
• Aladdin: [ISCA’14]
• MachSuite: [IISWC’14]
• Quantifying Acceleration: [ISLPED’13]