ABACUS: A Hardware-Based Software Profiler for Modern Processors
Eric Matthews • Lesley Shannon (School of Engineering Science)
Sergey Blagodurov • Sergey Zhuravlev • Alexandra Fedorova (School of Computing Science)
Simon Fraser University, Vancouver, BC, Canada
Overview: Legendary Introduction to ABACUS; Delicious Profiling Units; Epic Conclusion
Introduction to ABACUS
ABACUS
ABACUS: ASPLOS rocks!
Performance comparison (Memory Reuse Profile): ABACUS avg runtime: 48.5 seconds. Simics avg runtime: 1 hour 6 minutes. [Figures: ABACUS and Simics memory reuse profiles]
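The ratio between the two reported average runtimes can be worked out directly. The short sketch below only restates the figures on this slide; the function name `speedup` is illustrative and not part of ABACUS.

```python
# Average runtimes as reported on the slide, converted to seconds.
abacus_seconds = 48.5
simics_seconds = 1 * 3600 + 6 * 60  # 1 hour 6 minutes


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """Return how many times longer the slow run took than the fast run."""
    return slow_seconds / fast_seconds


ratio = speedup(simics_seconds, abacus_seconds)
print(f"Simics took roughly {ratio:.0f}x as long as ABACUS")
```

With these inputs the ratio comes out at roughly 82.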
Conclusion: ABACUS is a generic profiler that can be easily integrated into modern processors. It can be used by the OS to obtain runtime information about a thread's behaviour to make better thread assignments.
Thank you! Questions?
Motivation: Future systems will be multi-core and heterogeneous. How does the OS place threads on this architecture? Characterize thread behaviour: instruction mix, memory reuse profile, effectiveness of pre-fetching, memory bandwidth utilization.
Motivation (cont'd): How are these metrics collected? Offline analysis; code instrumentation; simulation (e.g., Simics, a software-based instruction set simulator that models systems with full OS support).
Motivation (cont'd): Why not use current hardware counters? They are architecture-specific; not all desired metrics are provided; they help detect symptoms, not causes; they are limited in number and in concurrent use.
Goal: Create a hardware profiler to collect thread characteristics at runtime. Imposed constraints: external to processor; minimally invasive; cycle accurate; OS controllable.
ABACUS (hArdware-Based Analyzer for the Characterization of User Software): A collection of runtime-configurable profiling units. Collects metrics useful for thread placement. Controllable through the OS.
Hardware Platform: Proof-of-concept system: LEON3 (SPARC v8 Instruction Set Architecture), single core, single threaded. Test system: OpenSPARC Niagara T1 soft processor, 1 to 4 hardware threads, multi-core, multi-board support.
Hardware Platform (cont'd)
External Interface: Bus slave and master modules. Processing required on processor signals. Designed such that only the external interface changes with a different processor/system.
Portability: Previously integrated with a LEON3 (SPARC v8 ISA) based system. Differences: AMBA Advanced High-performance Bus (AHB) vs. Processor Local Bus (PLB); processor internals.
Controller: Starts or stops profiling. Can limit profiling to a specific address range. DMA interface for retrieving collected data. Linux device driver support.
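The gating behaviour described on this slide (start/stop plus an optional address-range limit) can be modelled in a few lines of software. This is only a hypothetical model for illustration: the class and method names are assumptions, and it says nothing about the actual hardware registers, the DMA interface, or the Linux driver.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ProfilingController:
    """Hypothetical software model of the controller's gating logic."""

    enabled: bool = False
    # Inclusive (low, high) address range; None means "no limit".
    addr_range: Optional[Tuple[int, int]] = None

    def start(self) -> None:
        self.enabled = True

    def stop(self) -> None:
        self.enabled = False

    def limit_to(self, low: int, high: int) -> None:
        if low > high:
            raise ValueError("low must not exceed high")
        self.addr_range = (low, high)

    def should_profile(self, pc: int) -> bool:
        """True if an event at address `pc` should reach the profiling units."""
        if not self.enabled:
            return False
        if self.addr_range is None:
            return True
        low, high = self.addr_range
        return low <= pc <= high


ctrl = ProfilingController()
ctrl.start()
ctrl.limit_to(0x2000, 0x2FFF)
print(ctrl.should_profile(0x1000), ctrl.should_profile(0x2004))
```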
Profiling Units: Operate on one or more processor signals (instruction, PC, cache reuse distance, etc.). Store data in a collection of counters.
Profiling Units (cont'd): Focus on two-dimensional metrics, which give a bigger picture and greater insight. Aim to be as architecture-independent as possible.
Profile Unit: Behaves like a traditional software profiler. Operates on Program Counter: Code Space Range Overlap Range Non-Overlap Trace.
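A traditional PC-based profiler counts how often the program counter falls inside regions of interest. The sketch below assumes one counter per configured address range and allows ranges to overlap; how the real unit distinguishes its modes is not stated on the slide, so the function name and the range semantics here are assumptions.

```python
from typing import Iterable, List, Tuple


def count_pc_hits(pcs: Iterable[int], ranges: List[Tuple[int, int]]) -> List[int]:
    """Count program-counter samples falling in each inclusive (low, high) range.

    Ranges may overlap; a PC inside several ranges increments each of them.
    """
    counters = [0] * len(ranges)
    for pc in pcs:
        for i, (low, high) in enumerate(ranges):
            if low <= pc <= high:
                counters[i] += 1
    return counters


# Two overlapping code ranges and a short, made-up PC trace.
ranges = [(0x100, 0x1FF), (0x180, 0x27F)]
trace = [0x100, 0x190, 0x200, 0x300]
print(count_pc_hits(trace, ranges))
```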
Memory Reuse Unit: Collects a measure of code or data reuse. Utilizes a Least Recently Used (LRU) stack. Reuse distance is the movement in the LRU stack, or a miss. Uses in cache contention management.
Memory Reuse Unit: Creates a histogram of the cache reuse pattern. Range: [0, set associativity - 1] or cache miss. [Figure: 4-way set-associative reuse profile; axis: Reuse Distance]
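The reuse-distance histogram can be illustrated with a small software model of a single cache set. This is a minimal sketch under stated assumptions: one LRU stack, the most recently used entry at position 0, and one extra bin for misses. The function name and the trace are made up for illustration; the hardware unit itself is not described at this level on the slides.

```python
from typing import Hashable, Iterable, List


def reuse_histogram(tags: Iterable[Hashable], ways: int = 4) -> List[int]:
    """Histogram of LRU stack positions for one cache set.

    Bins 0..ways-1 count hits by reuse distance (0 = most recently used);
    the final bin counts misses.
    """
    stack: List[Hashable] = []  # index 0 is the most recently used tag
    hist = [0] * (ways + 1)
    for tag in tags:
        if tag in stack:
            hist[stack.index(tag)] += 1  # reuse distance = current stack depth
            stack.remove(tag)
        else:
            hist[ways] += 1  # miss
            if len(stack) == ways:
                stack.pop()  # evict the least recently used tag
        stack.insert(0, tag)  # accessed tag becomes most recently used
    return hist


trace = ["A", "B", "C", "A", "B", "B", "D", "E", "A"]
print(reuse_histogram(trace, ways=4))  # [1, 0, 2, 1, 5]
```

In this trace the five misses are the first touches of A through E; the hit at distance 3 is the final access to A, which by then sits at the bottom of the 4-entry stack.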
Instruction Mix: Identify the current instruction subset in use. Divide instructions into logical categories: load/store, floating point, control flow. Opcode-based table lookup.
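A table lookup of this kind might look as follows in software. The sketch keys the table on SPARC mnemonics for readability, whereas a hardware unit would index on opcode bits; the table contents, the "Other" category, and the function name are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable

# Illustrative mnemonic-to-category table; a real unit would cover the full ISA.
CATEGORY_TABLE = {
    "ld": "Load/Store",
    "st": "Load/Store",
    "fadds": "Floating Point",
    "fmuld": "Floating Point",
    "ba": "Control Flow",
    "call": "Control Flow",
}


def instruction_mix(mnemonics: Iterable[str]) -> Counter:
    """Count executed instructions per category via a table lookup."""
    return Counter(CATEGORY_TABLE.get(m, "Other") for m in mnemonics)


mix = instruction_mix(["ld", "add", "fadds", "st", "ba", "ld"])
for category, count in sorted(mix.items()):
    print(f"{category}: {count}")
```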
Latency Unit: Break down miss latency into constituent sources: bus contention, DRAM latency, etc. For each category, create a histogram of latency in cycles.
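The per-category latency histograms can be sketched as below. The bin width, the number of bins, and the saturating last bin are assumptions chosen for the example, as is the function name; the slide does not specify how the hardware bins its counters.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def latency_histograms(
    samples: Iterable[Tuple[str, int]], bin_width: int = 10, num_bins: int = 4
) -> Dict[str, List[int]]:
    """Build one latency histogram (in cycles) per latency source.

    Each sample is (category, cycles) with cycles >= 0. Latencies beyond the
    last bin are accumulated in the last bin.
    """
    hists: Dict[str, List[int]] = defaultdict(lambda: [0] * num_bins)
    for category, cycles in samples:
        index = min(cycles // bin_width, num_bins - 1)
        hists[category][index] += 1
    return dict(hists)


samples = [
    ("bus contention", 4),
    ("bus contention", 12),
    ("DRAM latency", 35),
    ("DRAM latency", 200),
]
print(latency_histograms(samples))
```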
Stall Unit: Break down Cycles Per Instruction (CPI). Attribute cycles to their sources: cache miss, Translation Lookaside Buffer (TLB) miss, floating point busy stalls, etc.
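The arithmetic behind a CPI breakdown is simple: each source contributes its stall cycles divided by the instruction count, and the remainder is the base CPI. The sketch below assumes stall cycles from different sources do not overlap; the function name and the numbers are hypothetical, not measurements from ABACUS.

```python
from typing import Dict


def cpi_breakdown(
    total_cycles: int, instructions: int, stall_cycles: Dict[str, int]
) -> Dict[str, float]:
    """Split CPI into per-source stall contributions plus a 'base' remainder."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    breakdown = {src: cycles / instructions for src, cycles in stall_cycles.items()}
    # Cycles not attributed to any stall source form the base CPI.
    breakdown["base"] = (total_cycles - sum(stall_cycles.values())) / instructions
    return breakdown


# Made-up counts: 2000 cycles for 1000 instructions gives an overall CPI of 2.0.
parts = cpi_breakdown(2000, 1000, {"cache miss": 600, "TLB miss": 100, "FP busy": 300})
for source, cpi in parts.items():
    print(f"{source}: {cpi:.2f}")
```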
Verification: Run a subset of the SPEC CPU2006 benchmarks (those with memory usage within board specs). Collect metrics with ABACUS and Simics. Profile for a few billion instructions (limited by Simics performance).
Test Platform: Proof-of-concept system, single core, single threaded. Processor: LEON3 (SPARC v8 ISA) (50 MHz). Memory: 256 MB DDR RAM. OS: Debian Etch (4.0). XUP V2Pro: 90% slice utilization.
Simulation Platform: Simics system. Processor: UltraSPARC II (SPARC v9 ISA). Memory: 256 MB DDR RAM. OS: Debian Etch (4.0). Differences: SPARC v9 ISA (64-bit processor); local filesystem vs. NFS.
LEON3 Comparison: [Figures: ABACUS vs. Simics]
LEON3 Comparison (cont'd): DC Memory Reuse Profile. [Figures: ABACUS vs. Simics]
Resource Usage: [Figure: configurations compared] Default (2-way LRU instruction cache, 2-way LRU data cache, 5 instruction types, 32-bit counters); 40-bit counters; 32-bit counters, Profile Unit added.
Future Plans: Move to a multi-core/multi-threaded system. Memory reuse distance independent of existing cache implementation. Process tracking. Integrate results into the OS scheduler.
Questions?