Dynamic Feedback: An Effective Technique for Adaptive Computing

Pedro Diniz and Martin Rinard, Department of Computer Science, University of California, Santa Barbara, http://www.cs.ucsb.edu/~{pedro,martin}

Basic Issue: Efficient Implementation of Atomic Operations in Object-Based Languages Approach: Reduce Lock Overhead by Coarsening Lock Granularity Problem: Coarsening Lock Granularity May Reduce Available Concurrency

Solution: Dynamic Feedback • Multiple Lock Coarsening Policies • Dynamic Feedback: Generate Multiple Versions of Code; Measure Dynamic Overhead of Each Policy; Dynamically Select Best Version • Context: Parallelizing Compiler; Irregular Object-Based Programs; Pointer-Based Data Structures; Commutativity Analysis

Talk Outline • Lock Coarsening • Dynamic Feedback • Experimental Results • Related Work • Conclusions

Model of Computation • Parallel Programs: Serial Phases; Parallel Phases • Atomic Operations on Shared Objects • Mutual Exclusion Locks: Acquire Constructs; Release Constructs [Figure: Serial Phase, Parallel Phase with Atomic Operations, Serial Phase; a Mutual Exclusion Region bracketed by L.acquire() and L.release()]

Problem: Lock Overhead [Figure: code bracketed by L.acquire() and L.release()]

Solution: Lock Coarsening [Figure: Original code and code After Lock Coarsening, with L.acquire() and L.release()] Reference: Diniz and Rinard, “Synchronization Transformations for Parallel Computing”, POPL 97
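
The transformation can be illustrated with a minimal sketch. The `CountingLock` and `Body` classes and the two `add_forces_*` functions are hypothetical names introduced here for illustration; they are not taken from the compiler described in the talk. The sketch only shows that coarsening replaces one acquire/release pair per update with a single pair around the whole sequence.

```python
import threading


class CountingLock:
    """Mutual exclusion lock that counts how many times it is acquired."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquires = 0

    def acquire(self):
        self._lock.acquire()
        self.acquires += 1  # updated only while the lock is held

    def release(self):
        self._lock.release()


class Body:
    """Hypothetical shared object updated by atomic operations."""

    def __init__(self):
        self.L = CountingLock()
        self.force = 0.0


def add_forces_original(body, contributions):
    # Original: one acquire/release pair per atomic update.
    for f in contributions:
        body.L.acquire()
        try:
            body.force += f
        finally:
            body.L.release()


def add_forces_coarsened(body, contributions):
    # After lock coarsening: a single pair encloses the whole sequence.
    body.L.acquire()
    try:
        for f in contributions:
            body.force += f
    finally:
        body.L.release()
```

Both versions compute the same result; the coarsened version executes one acquire instead of one per contribution, but holds the lock for the entire loop.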

Lock Coarsening Trade-Off • Advantage: Reduces Number of Executed Acquires and Releases; Reduces Acquire and Release Overhead • Disadvantage: May Introduce False Exclusion: Multiple Processors Attempt to Acquire Same Lock; Processor Holding the Lock is Executing Code that was Originally Outside Any Mutual Exclusion Region

False Exclusion [Figure: Original code and code After Lock Coarsening, with L.acquire() and L.release(); False Exclusion marked in the coarsened version]

Lock Coarsening Policy Goal: Limit Potential Severity of False Exclusion Mechanism: Multiple Lock Coarsening Policies • Original: Never Coarsen Granularity • Bounded: Coarsen Granularity Only Within Cycle-Free Subgraphs of ICFG • Aggressive: Always Coarsen Granularity

Choosing Best Policy • Best Lock Coarsening Policy May Depend On: Topology of Data Structures; Dynamic Schedule Of Computation • Information Required to Choose Best Policy Unavailable at Compile Time • Complications: Different Phases May Have Different Best Policy; In Same Phase, Best Policy May Change Over Time

Solution: Dynamic Feedback • Generated Code Executes: Sampling Phases (Measure Performance of Different Policies); Production Phases (Use Best Policy From Sampling Phase) • Periodically Resample to Discover Best Policy Changes [Figure: Overhead of each Code Version over Time across a Sampling Phase, a Production Phase, and the next Sampling Phase; version labels in order: Original, Bounded, Aggressive, Original]
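
The alternation of sampling and production phases can be sketched as a small driver loop. The function `dynamic_feedback` and its callback parameters `measure_overhead` and `run_production` are assumptions made for this illustration; in the talk the corresponding logic is generated by the compiler.

```python
def dynamic_feedback(policies, measure_overhead, run_production, rounds):
    """Alternate sampling and production phases.

    policies: names of the available code versions.
    measure_overhead(policy): runs that version for one sampling
        interval and returns its measured overhead (hypothetical callback).
    run_production(policy): runs that version for one production
        interval (hypothetical callback).
    Returns the policy chosen in each round.
    """
    chosen = []
    for _ in range(rounds):
        # Sampling phase: measure every version in turn.
        sampled = {p: measure_overhead(p) for p in policies}
        # Select the version with the lowest sampled overhead.
        best = min(sampled, key=sampled.get)
        # Production phase: run the selected version, then resample.
        run_production(best)
        chosen.append(best)
    return chosen
```

Because every round begins with a fresh sampling phase, a change in the best policy is picked up at the next resampling point.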

Guaranteed Performance Bounds • Assumptions: Overhead Changes Bounded by Exponential Decay Functions • Worst Case Scenario: No Useful Work During Sampling Phase; Sampled Overheads Are Same For All Versions; Overhead of Selected Version Increases at Maximum Rate; Overhead of Other Versions Decreases at Maximum Rate [Figure: Overhead vs. Time, starting from sampled overhead v; sampling intervals S followed by production interval P]

Guaranteed Performance Bound
Definition 1. Policy $p_i$ is at most $\epsilon$ worse than policy $p_j$ over a time interval $T$ if $\mathrm{Work}_j^{T} - \mathrm{Work}_i^{T} \le \epsilon T$, where $\mathrm{Work}_i^{T} = \int_0^{T} (1 - o_i(t))\,dt$.
Definition 2. Dynamic feedback is at most $\epsilon$ worse than the optimal if $\mathrm{Work}_{opt}^{P+SN} - \mathrm{Work}_0^{P} \le \epsilon\,(P+SN)$, where $\mathrm{Work}_{opt}^{P+SN} = \int_0^{P+SN} (1 - o_1(t))\,dt$.
Result 1. To guarantee this bound: $(1 - \epsilon)P + (1/\lambda)\,e^{-\lambda P} \le (\epsilon - 1)SN + (1/\lambda)$.

Guaranteed Performance Bounds [Figure: Constraint Values of $(1 - \epsilon)P + (1/\lambda)\,e^{-\lambda P}$ and $(\epsilon - 1)SN + (1/\lambda)$ plotted against Production Interval P, with the Feasible Region marked] • Production Interval Too Short: Unable to Amortize Sampling Overhead • Production Interval Too Long: May Execute Suboptimal Policy for Long Time • Basic Constraint: Decay Rate ($\lambda$) Must be Small Enough

Dynamic Feedback: Implementation • Code Generation • Measuring Policy Overhead • Interval Selection • Interval Expiration • Policy Switch

Code Generation • Statically Generate Different Code Versions for Each Policy • Alternative: Dynamic Code Generation • Advantages of Static Code Generation: • Simplicity of Implementation • Fast Policy Switching • Potential Drawback of Static Code Generation • Code Size (In Practice Not a Problem)

Measuring Policy Overhead • Sources of Overhead: Locking Overhead; Waiting Overhead • Compute Locking Overhead: Count Number of Executed Acquire/Release Constructs • Estimate Waiting Overhead: Count Number of Spins on Locks Waiting to be Released • Sampled Overhead = ((Number of Acquire/Release × Acquire/Release Execution Time) + (Number of Spins × Spin Time)) / Sampling Time
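
The sampled overhead formula translates directly into code. The function and parameter names below are chosen for this sketch, and the numbers in the usage note are made up for illustration; they are not measurements from the talk.

```python
def sampled_overhead(num_acquire_release, acquire_release_time,
                     num_spins, spin_time, sampling_time):
    """Fraction of the sampling interval spent on locking and waiting.

    All times must use the same unit (for example, seconds).
    """
    if sampling_time <= 0:
        raise ValueError("sampling_time must be positive")
    locking = num_acquire_release * acquire_release_time  # locking overhead
    waiting = num_spins * spin_time                       # estimated waiting overhead
    return (locking + waiting) / sampling_time
```

For example, with hypothetical counts of 1000 acquire/release pairs at 2 microseconds each and 500 spins at 1 microsecond each in a 10 millisecond sampling interval, the sampled overhead is (0.002 + 0.0005) / 0.01 = 0.25.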

Interval Selection and Expiration • Fixed Interval Values: Sampling Interval: 10 milliseconds; Production Interval: 10 seconds; Good Results for Wide Range of Interval Values • Polling Code for Expiration Detection: Location: Back Edges of Parallel Loop; Advantage: Low Overhead; Disadvantage: Potential Interaction with Iteration Size [Figure: Polling Points between Atomic Operations]
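
Polling at loop back edges can be sketched as follows. The function `run_until_expired` and its injectable `clock` parameter are assumptions for this illustration; the real polling code is emitted by the compiler into the parallel loop.

```python
import time


def run_until_expired(iterations, body, interval_seconds, clock=time.monotonic):
    """Run loop iterations until the interval expires.

    The timer is polled once per iteration, at the loop back edge,
    so expiration is only detected at iteration boundaries.
    Returns the number of iterations completed.
    """
    deadline = clock() + interval_seconds
    done = 0
    for item in iterations:
        body(item)
        done += 1
        if clock() >= deadline:  # polling point on the back edge
            break
    return done
```

The interaction with iteration size is visible here: if each iteration takes 4 milliseconds and the interval is 10 milliseconds, expiration is first detected after the third iteration, at 12 milliseconds, so the interval overruns by part of an iteration.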

Policy Switch • Synchronous: Processors Poll Timer to Detect Interval Expiration; Barrier At End of Each Interval • Advantages: Consistent Transitions; Clean Overhead Measurements • Disadvantages: Need to Synchronize All Processors; Potential Idle Time At Barrier
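
A synchronous policy switch can be sketched with a barrier at the end of each interval. This is a simplified illustration under stated assumptions: `run_workers` is a hypothetical helper, threads stand in for processors, and each interval is one call to `work` rather than a timed period detected by polling.

```python
import threading


def run_workers(num_workers, intervals, work):
    """Run work(worker_id, policy) once per interval on every worker.

    intervals: the policy to use in each successive interval.
    All workers meet at a barrier at the end of each interval, so
    every worker switches to the next policy at the same point.
    Returns, per worker, the list of policies it executed.
    """
    state = {"index": 0}
    log = [[] for _ in range(num_workers)]

    def advance():
        # Called by exactly one thread, after all workers have arrived
        # and before any of them is released from the barrier.
        state["index"] += 1

    barrier = threading.Barrier(num_workers, action=advance)

    def worker(wid):
        for _ in range(len(intervals)):
            policy = intervals[state["index"]]
            work(wid, policy)
            log[wid].append(policy)
            barrier.wait()  # potential idle time while waiting for the others

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Since the policy index advances only while all workers are stopped at the barrier, no two workers ever run different policies in the same interval, which is what keeps the overhead measurements clean.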

Experimental Results • Parallelizing Compiler Based on Commutativity Analysis [PLDI 96] • Set of Complete Scientific Applications: Barnes-Hut N-Body Solver (1500 lines of C++); Liquid Water Simulation Code (1850 lines of C++); Seismic Modeling String Code (2050 lines of C++) • Different Lock Coarsening Policies • Dynamic Feedback • Performance on Stanford DASH Multiprocessor

Code Sizes [Figure: Size of Text Segment (Kbytes), axis 0 to 60, for the Serial, Original, and Dynamic versions; panels: Barnes-Hut, Water, String]

Lock Overhead: Percentage of Time that the Single Processor Execution Spends Acquiring and Releasing Mutual Exclusion Locks [Figure: Percentage Lock Overhead, axis 0 to 60; panels: Barnes-Hut (16K Particles): Original, Bounded, Aggressive; Water (512 Molecules): Original, Bounded, Aggressive; String (Big Well Model): Original, Aggressive]

Contention Overhead: Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors [Figure: Contention Percentage (0 to 100) vs. Processors (0 to 16) for the Aggressive, Bounded, and Original policies; panels: Barnes-Hut (16K Particles), Water (512 Molecules), String (Big Well Model)]

Performance Results: Barnes-Hut [Figure: Speedup (0 to 16) vs. Number of Processors (0 to 16), Barnes-Hut on DASH (16K Particles); curves: Ideal, Aggressive, Dynamic Feedback, Bounded, Original]

Performance Results: Water [Figure: Speedup (0 to 16) vs. Number of Processors (0 to 16), Water on DASH (512 Molecules); curves: Ideal, Bounded, Dynamic Feedback, Original, Aggressive]

Performance Results: String [Figure: Speedup (0 to 16) vs. Number of Processors (0 to 16), String on DASH (Big Well Model); curves: Ideal, Original, Dynamic Feedback, Aggressive]

Summary • Code Size Is Not An Issue • Lock Coarsening Has Significant Performance Impact • Best Lock Coarsening Policy Varies With Application • Dynamic Feedback Delivers Code With Performance Comparable to The Best Static Lock Coarsening Policy

Related Work • Adaptive Execution Techniques (Saavedra and Park: PACT 96) • Dynamic Dispatch Optimizations (Hölzle and Ungar: PLDI 94) • Dynamic Code Generation (Engler: PLDI 96) • Profiling (Brewer: PPoPP 95) • Synchronization Optimizations (Plevyak et al.: POPL 95)

Conclusions • Dynamic Feedback • Generated Code Adapts to Different Execution Environments • Integration with Parallelizing Compiler • Irregular Object-Based Programs • Pointer-Based Linked Data Structures • Commutativity Analysis • Evaluation with Three Complete Applications • Performance Comparable to Best Hand-Tuned Optimization

BACKUP SLIDES

Performance Results: Barnes-Hut [Figure: Speedup (0 to 16) vs. Number of Processors (0 to 16), Barnes-Hut (16K Particles); curves: Ideal, Aggressive, Bounded, Original]

Performance Results: Water [Figure: Speedup (0 to 16) vs. Number of Processors (0 to 16), Water (512 Molecules); curves: Ideal, Bounded, Original, Aggressive]

Performance Results: String [Figure: Speedup (0 to 16) vs. Number of Processors (0 to 16), String (Big Well Model); curves: Ideal, Original, Aggressive]

Policy Switch [Figure: switch from Policy 1 to Policy 2 when Timer Expires]

Motivation Challenges: • Match Best Implementation to Environment • Heterogeneous and Mobile Systems Goal: • Develop Mechanisms to Support Code that Adapts to Environment Characteristics Technique: • Dynamic Feedback

Overhead for Barnes-Hut [Figure: Sampled Overhead (0 to 0.5) vs. Execution Time (0 to 25 Seconds) for the Original, Bounded, and Aggressive versions; Barnes-Hut on DASH (8 Processors), FORCES Loop, Data Set: 16K Particles]

Overhead for Water [Figure: Sampled Overhead (0 to 0.5) vs. Execution Time (0 to 60 Seconds) for the Original and Bounded versions; Water on DASH (8 Processors), INTERF Loop, Data Set: 512 Molecules]

Overhead for Water [Figure: Sampled Overhead (0 to 1) vs. Execution Time (0 to 60 Seconds) for the Aggressive and Original versions; Water on DASH (8 Processors), POTENG Loop, Data Set: 512 Molecules]

Overhead for String [Figure: Sampled Overhead (0 to 1) vs. Execution Time (0 to 500 Seconds) for the Aggressive and Original versions; String on DASH (8 Processors), PROJFWD Loop, Data Set: Big Well]

Dynamic Feedback [Figure: Overhead of each Code Version over Time across a Sampling Phase, a Production Phase, and the next Sampling Phase; version labels in order: Aggressive, Bounded, Original, Aggressive]