Memory Consistency in Vector IRAM
David Martin

The Memory Consistency Model
• Consistency model applies to instructions in a single instruction stream (different than multi-processor consistency!).

Sync types (table columns): SaS, SaV, VaS, VaV, VPaVP
RaW: * + +
WaR: * + +
WaW: * + +

Legend: a = after, R = read, W = write, S = scalar, V = vector, VP = virtual processor, * = no sync required, + = sync required

• Definition of an “XaY” sync:
– All operations of type Y occurring before the sync in program order appear to execute before any operation of type X occurring after the sync in program order.
• Definition of an “XaY” sync to vector register $vri:
– The most recent operation of type Y to $vri appears to execute before any operation of type X occurring after the sync in program order.
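The naming scheme above can be made concrete with a small helper. This is only an illustrative sketch: the function name `classify` and the tuple encoding of an access are hypothetical, it assumes both accesses touch the same address, and it ignores the VPaVP case. It names the hazard and the sync type that would order a pair of accesses; it does not decide whether the architecture requires that sync.

```python
from typing import Optional, Tuple

Access = Tuple[str, str]  # (unit, kind): unit is "S" or "V", kind is "R" or "W"


def classify(earlier: Access, later: Access) -> Optional[Tuple[str, str]]:
    """Name the hazard and the sync type for two accesses to the same address.

    `earlier` precedes `later` in program order. Returns None for
    read-after-read, which is not a hazard.
    """
    unit_y, kind_y = earlier
    unit_x, kind_x = later
    if kind_y == "R" and kind_x == "R":
        return None
    hazard = f"{kind_x}a{kind_y}"  # e.g. "RaW" = read after write
    sync = f"{unit_x}a{unit_y}"    # e.g. "SaV" = scalar after vector
    return hazard, sync
```

For example, a scalar load that follows a vector store to the same address is a RaW hazard ordered by an SaV sync.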

Why Relax Memory Consistency?
[Diagram: Fetch, Scalar Core, Sync, Vector Unit, Memory]
• Natural microarchitecture has multiple paths to memory
• Want to decouple scalar and vector units without complex hardware
• Trade-off between more complex hardware (speculation, disambiguation, cache coherence) and more complex software (sync instructions)
• Should explore solutions to this trade-off that involve more hardware: e.g. hardware guarantees SaV and VaS ordering, but leaves VaV and VP orderings to software.

Software Conventions for Syncs
[Diagram: scalar code calling a vector function; VaS, VaV syncs on entry to the vector code, SaV sync on exit]
Conventions:
1. Execute VaS and VaV syncs on entry to vector code.
2. Execute SaV sync on exit from vector code.
• Vector code is responsible for not messing things up.
– Allows us to vectorize libraries to speed up existing programs.
– Don’t want to assume that our compiler will compile and globally optimize all non-vector code that we run.
• Alternative model: Pass around flags to communicate sync requirements or history
– Must assume that our compiler compiles all code run on IRAM.
– Not sure we want to accept that restriction.
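A minimal simulation of the entry/exit convention, assuming a Python stand-in for vector code. The names `trace`, `sync`, `vector_routine` and `vadd` are hypothetical; real code would issue sync instructions rather than append strings to a list.

```python
from functools import wraps

trace = []  # records the order of syncs and routine bodies


def sync(kind):
    """Stand-in for issuing a sync instruction of the given type."""
    trace.append(f"sync.{kind}")


def vector_routine(fn):
    """Apply the convention: VaS and VaV on entry, SaV on exit."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        sync("VaS")
        sync("VaV")
        try:
            return fn(*args, **kwargs)
        finally:
            sync("SaV")  # issued on every exit path
    return wrapper


@vector_routine
def vadd(a, b):
    trace.append("vadd body")
    return [x + y for x, y in zip(a, b)]
```

Because the wrapper knows nothing about its caller, it always pays for all three syncs, which is the cost the later optimization slides try to remove.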

Sync Implementations and Costs
• SaV: Stall fetch unit until vector unit has committed all vector memory instructions.
– Could take 1000s of cycles with many indexed vector memory operations in flight!
– Very difficult to delay issue since it is often issued at the end of a vector routine.
• VaS: Stall fetch unit until scalar unit has committed all scalar memory instructions.
– Not too expensive (10s of cycles?) because scalar unit is ahead of the vector unit, because the scalar core is simple, and because the data cache is write-through.
– Easy to delay issue because it is often issued at the start of a vector routine.
• VaV and VPaVP: No operation.
– Nop because we have 1 vector memory unit and no vector caches.

Current Sync Analysis Tool
• Executes a program and tells you:
1. Whenever two memory references are not:
• Ordered by architectural guarantees
• Ordered by register dependencies
• Ordered by an intervening sync instruction
2. Whenever a sync instruction is not used to resolve any hazard, as described in (1).

Two Examples of Synchronization Chains
Hazard? Write(A) <- r1; RAW SYNC; Read(A) <- r2; WAR SYNC; Write(A) <- r3
Hazard? Write(A) <- r1; RAW SYNC; Read(A) <- r2; Write(A) <- r2

• Caveats:
– Hazards are detected from a single program execution: information may not hold true for all possible executions of the program.
– Hazard detection is conservative in the presence of synchronization chains.
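The two checks the tool performs can be sketched on a recorded trace. This is not the actual tool: the `Access` and `Sync` event types and the `analyze` function are hypothetical, syncs are named by unit type (e.g. "SaV") rather than by hazard, scalar-after-scalar ordering is assumed to be the only architectural guarantee, and register dependencies are not modeled. The sketch also does not follow synchronization chains, so like the tool it is conservative there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    unit: str  # "S" (scalar) or "V" (vector)
    kind: str  # "R" (read) or "W" (write)
    addr: str


@dataclass(frozen=True)
class Sync:
    kind: str  # "XaY", e.g. "SaV"


def analyze(trace):
    """Return (hazards, unused_syncs) as lists of trace indices.

    hazards: (i, j) pairs of conflicting accesses with no matching sync between.
    unused_syncs: indices of syncs that order no conflicting pair.
    """
    hazards, used = [], set()
    for j, later in enumerate(trace):
        if not isinstance(later, Access):
            continue
        for i in range(j):
            earlier = trace[i]
            if not isinstance(earlier, Access) or earlier.addr != later.addr:
                continue
            if earlier.kind == "R" and later.kind == "R":
                continue  # read-after-read is not a hazard
            if earlier.unit == "S" and later.unit == "S":
                continue  # assumed architectural guarantee
            needed = f"{later.unit}a{earlier.unit}"
            between = [k for k in range(i + 1, j)
                       if isinstance(trace[k], Sync) and trace[k].kind == needed]
            if between:
                used.update(between)
            else:
                hazards.append((i, j))
    unused = [k for k, e in enumerate(trace)
              if isinstance(e, Sync) and k not in used]
    return hazards, unused
```

As with the tool itself, the result describes one execution only; a different input could take a path with different hazards.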

Optimizing Code
• Basic problem:
– Vector unit requires setup: VL, VPW, mask, exceptions
– Vector code responsible for issuing syncs
– Both of these are required in a vector routine if nothing is known about the calling context!
• All solutions share the notion of giving control of the calling context to the compiler. Two options:
(1) Pass around flags so that syncs and setup code can be avoided at run time
(2) Do global optimizations so that syncs and setup code can be eliminated at compile time
[Diagram: ... Scalar code → Vector setup → VaS and VaV sync → Vector function → SaV sync → Scalar code ...]

Optimization Example
• Demonstrates potential benefit from optimizing scalar-vector communication
• Code computes A+B+C+D+E+F in the following manner:
[Diagram: addition tree over A, B, C, D, E, F]
• Unoptimized code calls a general vector add routine 5 times
• First optimization inlines the 5 routines and removes vector initialization sequences
• Second optimization also removes unnecessary sync instructions
• Optimization goal is to avoid “sawtooth” in instantaneous performance graphs caused by draining the vector pipelines between vector loops
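The three code variants can be compared by counting the overhead operations each one issues. This is a rough sketch under stated assumptions, not the measured experiment: the operation names are hypothetical labels, the first optimization is assumed to leave a single setup sequence, and the second is assumed to keep only the entry syncs and one final SaV sync.

```python
from collections import Counter

SETUP = "setup"  # stands in for VL, VPW, mask, exceptions setup


def unoptimized(n_adds=5):
    """Each call to the general routine pays setup plus all syncs."""
    seq = []
    for _ in range(n_adds):
        seq += [SETUP, "sync.VaS", "sync.VaV", "vadd", "sync.SaV"]
    return seq


def inlined(n_adds=5):
    """First optimization (assumed): one setup, syncs kept around each add."""
    seq = [SETUP]
    for _ in range(n_adds):
        seq += ["sync.VaS", "sync.VaV", "vadd", "sync.SaV"]
    return seq


def inlined_fewer_syncs(n_adds=5):
    """Second optimization (assumed): entry syncs once, one SaV at the end."""
    return [SETUP, "sync.VaS", "sync.VaV"] + ["vadd"] * n_adds + ["sync.SaV"]


def overhead(seq):
    """Count every operation in the sequence by name."""
    return Counter(seq)
```

Under these assumptions the number of SaV syncs, the most expensive sync per the costs slide, drops from five to one, so the vector pipelines drain once rather than between every add.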

• Large optimization potential for short vector loops.
• SaV syncs are most important to eliminate or delay.
• VaS sync performance impact is unclear.
• VaV syncs are virtually free in VIRAM-1.
• Setup code is expensive. For this example, it is as expensive as the SaV syncs.