Parallel Programming and Timing Analysis on Embedded Multicores

Eugene Yip, The University of Auckland. Supervisors: Dr. Partha Roop, Dr. Morteza Biglari-Abhari. Advisor: Dr. Alain Girault.

Outline • Introduction • ForeC language • Timing analysis • Results • Conclusions

Introduction • Safety-critical systems: – Perform specific tasks. – Must behave correctly at all times. – Must comply with strict safety standards. [IEC 61508, DO-178] – Time-predictability is useful in real-time designs. [Paolieri et al 2011] Towards Functional-Safe Timing-Dependable Real-Time Architectures.

Introduction • Safety-critical systems: – Shift from single-core to multicore processors. – Better power and execution performance. [Figure: cores 0 to n connected by a system bus to shared resources.] [Blake et al 2009] A Survey of Multicore Processors. [Cullmann et al 2010] Predictability Considerations in the Design of Multi-Core Embedded Systems.

Introduction • Parallel programming: – From supercomputers to mainstream computers. – Threaded programming model. – Frameworks designed for systems without resource constraints or safety concerns. – Aimed at improving average-case performance (FLOPS), not time-predictability.

Introduction • Parallel programming: – Programmer responsible for shared resources. – Concurrency errors: • Deadlock • Race condition • Atomicity violation • Order violation – Non-deterministic thread interleaving. – Determinism essential for understanding and debugging. [McDowell et al 1989] Debugging Concurrent Programs.

Introduction • Synchronous languages – Deterministic concurrency. – Based on the synchrony hypothesis. – Threads execute in lock-step to a global clock. – Concurrency is logical. Typically compiled away. [Figure: inputs sampled and outputs emitted at global ticks 1 to 4.] [Benveniste et al 2003] The Synchronous Languages 12 Years Later.

Introduction • Synchronous languages – The time between each tick is defined by the timing requirements of the system. – Must validate: max(Reaction time) < min(Time of each tick). [Figure: reaction time within ticks 1 to 4 along physical time.] [Benveniste et al 2003] The Synchronous Languages 12 Years Later.

Introduction • Synchronous languages – Esterel – Lustre – Signal – Synchronous extensions to C: • PRET-C: concurrent threads are scheduled sequentially in a cooperative manner; atomic execution of threads ensures thread-safe access to shared variables. • ReactiveC with shared variables. • Synchronous C (SC – see Michael’s talk). • Esterel C Language. [Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs. [Boussinot 1993] Reactive Shared Variables Based Systems. [Hanxleden et al 2009] SyncCharts in C - A Proposal for Light-Weight, Deterministic Concurrency. [Lavagno et al 1999] ECL: A Specification Environment for System-Level Design.

Introduction • Synchronous languages – Esterel – Lustre – Signal – Synchronous extensions to C: • PRET-C. • ReactiveC with shared variables: writes to shared variables are delayed to the end of the global tick. At the end of the global tick, the writes are combined with an associative and commutative “combine function” and assigned to the shared variable. • Synchronous C (SC – see Michael’s talk). • Esterel C Language. [Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs. [Boussinot 1993] Reactive Shared Variables Based Systems. [Hanxleden et al 2009] SyncCharts in C - A Proposal for Light-Weight, Deterministic Concurrency. [Lavagno et al 1999] ECL: A Specification Environment for System-Level Design.

Outline • Introduction • ForeC language • Timing analysis • Results • Conclusions

ForeC language (“Foresee”) • Deterministic parallel programming of embedded multicores. • C with a minimal set of synchronous constructs for deterministic parallelism. • Fork/Join parallelism (explicit). • Shared memory model. • Deterministic thread communication using shared variables.

ForeC language • Constructs: – par(t1, …, tn) • Forks threads t1 to tn to execute in parallel, in any order. • Parent thread is suspended until all child threads terminate. – thread t1(...) {b} • Thread definition. – pause • Synchronisation barrier. • When a thread pauses, it completes a local tick. • When all threads pause, the program completes a global tick.

ForeC language • Constructs: – abort {b} when (c) • Preempts the body b when the condition c is true. The condition is checked before executing the body. – weak abort {b} when (c) • Preempts the body when the body reaches a pause and the condition c is true. The condition is checked before executing the body.

ForeC language • Variable type qualifiers: – input • Variable gets its value from the environment. – output • Variable emits its value to the environment.

ForeC language • Variable type qualifiers: – shared • Variable which may be accessed by multiple threads. • At the start of a thread’s local tick, it creates local copies of the shared variables that it accesses. • During the thread’s local tick, it modifies its local copy (isolation). • At the end of the global tick, copies that have been modified are combined using a commutative and associative function (combine function). • The combined result is committed back to the original shared variable.

ForeC language • Example program:
shared int x = 0;
void main(void) {
    x = 1;
    par(t0(), t1());
    x = x - 1;
}
thread t0(void) {
    x = 10;
    x = x + 1;
    pause;
    x = x + 1;
}
thread t1(void) {
    x = x * 2;
    pause;
    x = x * 2;
}

ForeC language • Concurrent Control-Flow Graph (CCFG) of the example program. [Figure: CCFG with a fork node for par(t0(), t1()), pause nodes in t0 and t1, and a join node.]

ForeC language • Sequential control-flow along a single path. • Parallel control-flow along branches from a fork node. • Global tick ends when all threads pause or terminate.

ForeC language • State of the shared variables (walk-through of the example program):
1. Initially, global x = 0.
2. Thread main creates a local copy of x: main = 0.
3. main executes x = 1: main = 1.
4. Threads t0 and t1 take over main’s copy of the shared variable x: t0 = 1, t1 = 1.
5. t0 executes x = 10: t0 = 10, t1 = 1.
6. t0 executes x = x + 1: t0 = 11, t1 = 1.
7. t1 executes x = x * 2: t0 = 11, t1 = 2.
8. Global tick is reached. The copies of x are combined using a (programmer-defined) associative and commutative function. Assuming the combine function for x implements summation, the combined value 11 + 2 = 13 is assigned back to x: global x = 13.
9. Next global tick. Active threads create a copy of x: t0 = 13, t1 = 13.
10. t0 executes x = x + 1: t0 = 14, t1 = 13.
11. t1 executes x = x * 2: t0 = 14, t1 = 26.
12. Threads t0 and t1 terminate and join back to the parent thread main. Local copies of x are combined into a single copy and given back to main: main = 40 (global x is still 13).
13. main executes x = x - 1: main = 39.
14. main’s copy is committed back to the shared variable: global x = 39.
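The copy-and-combine steps of this walk-through can be replayed in plain C. The following is only a hand-written sketch of the semantics described on the slides, not the code a ForeC compiler generates; the names `sum_combine` and `simulate_example` are hypothetical, and the combine function is assumed to be summation, as on the slides.

```c
#include <assert.h>

/* Assumed combine function for x: summation (associative and commutative). */
int sum_combine(int a, int b) { return a + b; }

/* Replays the example program by hand. Each thread works on its own copy of x;
 * copies are merged only at the end of a global tick or when threads join. */
int simulate_example(void) {
    int x = 0;                          /* global copy of the shared variable */

    /* Global tick 1 */
    int main_x = x;                     /* main creates a local copy (0) */
    main_x = 1;                         /* x = 1 */
    int t0_x = main_x;                  /* par(): t0 and t1 take over main's copy */
    int t1_x = main_x;
    t0_x = 10;                          /* t0: x = 10 */
    t0_x = t0_x + 1;                    /* t0: x = x + 1 gives 11, then pause */
    t1_x = t1_x * 2;                    /* t1: x = x * 2 gives 2, then pause */
    x = sum_combine(t0_x, t1_x);        /* end of tick: 11 + 2 = 13 */
    assert(x == 13);

    /* Global tick 2 */
    t0_x = x;                           /* active threads copy x again */
    t1_x = x;
    t0_x = t0_x + 1;                    /* t0: 14, then terminates */
    t1_x = t1_x * 2;                    /* t1: 26, then terminates */
    main_x = sum_combine(t0_x, t1_x);   /* join: 14 + 26 = 40 goes back to main */
    main_x = main_x - 1;                /* main: x = x - 1 gives 39 */
    x = main_x;                         /* only one modified copy is left to commit */
    return x;
}
```

Calling `simulate_example()` returns 39, the final value of x shown on the slides. Reordering the statements of t0 and t1 within a tick does not change the result, because neither thread reads the other's copy.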

ForeC language • Shared variables. – Threads modify local copies of shared variables. – Isolates thread execution behaviour. – Order/interleaving of thread execution has no impact on the final result. – Prevents concurrency errors. – Associative and commutative combine functions. • Order of combining doesn’t matter.
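Why the combine function must be associative and commutative can be illustrated by folding the same thread copies in two different orders. This is a sketch with hypothetical helper names (`combine_forward`, `combine_backward`, `order_independent`); subtraction is included only as a counter-example and is not something the slides propose.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*combine_fn)(int, int);

/* Associative and commutative: a valid combine function. */
int combine_add(int a, int b) { return a + b; }

/* Neither associative nor commutative: used here as a counter-example. */
int combine_sub(int a, int b) { return a - b; }

/* Fold the copies from the first thread to the last. */
int combine_forward(combine_fn f, const int *copies, size_t n) {
    assert(n > 0);
    int acc = copies[0];
    for (size_t i = 1; i < n; i++)
        acc = f(acc, copies[i]);
    return acc;
}

/* Fold the copies from the last thread to the first. */
int combine_backward(combine_fn f, const int *copies, size_t n) {
    assert(n > 0);
    int acc = copies[n - 1];
    for (size_t i = n - 1; i > 0; i--)
        acc = f(acc, copies[i - 1]);
    return acc;
}

/* Returns 1 if both combining orders agree for these copies. */
int order_independent(combine_fn f, const int *copies, size_t n) {
    return combine_forward(f, copies, n) == combine_backward(f, copies, n);
}
```

For the copies {11, 2, 5}, `combine_add` gives 18 in both orders, while `combine_sub` gives 4 forward and -8 backward, so the result would depend on which core finished first.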

Scheduling • Light-weight static scheduling. – Take advantage of multicore performance while delivering time-predictability. – Thread allocation and scheduling order on each core are decided at compile time by the programmer. – Cooperative (non-preemptive) scheduling. – Fork/join semantics and the notion of a global tick are preserved via synchronisation.

Scheduling • One core performs housekeeping tasks at the end of the global tick. – Combining shared variables. – Emitting outputs. – Sampling inputs and starting the next global tick.

Outline • Introduction • ForeC language • Timing analysis • Results • Conclusions

Timing analysis • Compute the program’s worst-case reaction time (WCRT). • The time between each tick is defined by the timing requirements of the system. • Must validate: max(Reaction time) < min(Time of each tick). [Figure: reaction time within ticks 1 to 4 along physical time.]
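The validation condition can be checked mechanically once a bound on each reaction time and the tick lengths are known. A minimal sketch, assuming all times are given in clock cycles; the function name `timing_is_valid` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when max(reaction time) < min(time of each tick), 0 otherwise.
 * reaction[i] and tick[i] are assumed to be in the same unit (e.g. cycles). */
int timing_is_valid(const unsigned *reaction, const unsigned *tick, size_t n) {
    assert(n > 0);
    unsigned max_reaction = reaction[0];
    unsigned min_tick = tick[0];
    for (size_t i = 1; i < n; i++) {
        if (reaction[i] > max_reaction) max_reaction = reaction[i];
        if (tick[i] < min_tick) min_tick = tick[i];
    }
    return max_reaction < min_tick;
}
```

With reaction times {40, 75, 60} and tick lengths {100, 90, 120} the check passes (75 < 90). Raising the second reaction time to 95 makes it fail, even though 95 is below that tick's own length of 100, because the condition compares the worst reaction against the shortest tick.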

Timing analysis • Existing approaches for synchronous programs. • Integer Linear Programming (ILP) – Execution time of the program described as a set of integer equations. – Solving ILP is known to be NP-hard. [Ju et al 2010] Timing Analysis of Esterel Programs on General-Purpose Multiprocessors. • Max-Plus – Compute the WCRT of each thread. – Using the thread WCRTs, the WCRT of the program is computed. – Assumes there is a global tick where all threads execute their worst-case. • Model Checking – Eliminate false paths by explicit path exploration (reachability over the program’s CFG). – Binary search: check that the WCRT is less than “x”. – State-space explosion problem. – Trades off analysis time for precision. – Provides an execution trace for the WCRT.
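The Max-Plus idea of deriving the program WCRT from the thread WCRTs can be sketched for a single fork/join as follows. This is a simplified illustration, not the analysis from the cited work: it assumes a flat set of threads statically allocated to cores, ignores scheduling overheads and bus delays, and uses hypothetical names (`maxplus_wcrt`, `MAX_CORES`).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CORES 16  /* arbitrary limit for this sketch */

/* thread_wcrt[i]: worst-case reaction time of thread i (cycles).
 * core_of[i]:     core that thread i is allocated to.
 * Threads on the same core run one after another ("plus");
 * cores run in parallel, so the slowest core decides ("max"). */
unsigned maxplus_wcrt(const unsigned *thread_wcrt, const unsigned *core_of,
                      size_t nthreads, size_t ncores) {
    assert(ncores > 0 && ncores <= MAX_CORES);
    unsigned core_time[MAX_CORES] = {0};
    for (size_t i = 0; i < nthreads; i++) {
        assert(core_of[i] < ncores);
        core_time[core_of[i]] += thread_wcrt[i];
    }
    unsigned wcrt = 0;
    for (size_t c = 0; c < ncores; c++)
        if (core_time[c] > wcrt) wcrt = core_time[c];
    return wcrt;
}
```

For thread WCRTs {30, 50, 40} allocated to cores {0, 1, 0}, the result is max(30 + 40, 50) = 70. The estimate is pessimistic whenever the three threads cannot all hit their worst case in the same global tick, which is the assumption noted in the Max-Plus bullets.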

Timing analysis • Our approach using Reachability: – Same benefits as model checking, but a binary search of the WCRT is not required. – To handle state-space explosion: • Reduce the program’s CCFG before analysis. [Flow: Program binary (annotated) → Reconstruct the program’s CCFG → Find the global ticks (Reachability) → WCRT]

Timing analysis • Programs will execute on the following multicore: [Figure: cores 0 to n, each with its own instruction memory and data memory, connected by a TDMA shared bus to a global memory.]

Timing analysis • Computing the execution time: 1. Overlapping of thread execution time from parallelism and inter-core synchronisations. 2. Scheduling overheads. 3. Variable delay in accessing the shared bus.

Timing analysis 1. Overlapping of thread execution time from parallelism and inter-core synchronisations. • An integer counter to track each core’s execution time. • Synchronisation occurs when forking/joining, and ending the global tick. • Advance the execution time of participating cores. [Figure: main forks t1 and t2; Core 1 runs main and t1, Core 2 runs t2.]
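The per-core counters and their advancement at synchronisation points can be sketched as below. The function names and all cycle counts are hypothetical; the allocation (main and t1 on one core, t2 on the other) follows the figure, and synchronisation itself is assumed to cost nothing here, since overheads are treated separately.

```c
#include <assert.h>
#include <stddef.h>

/* A core executes for some cycles: only its own counter advances. */
void core_execute(unsigned *core_time, size_t core, unsigned cycles) {
    core_time[core] += cycles;
}

/* Fork, join or end of global tick: every participating core waits for the
 * slowest one, so all counters advance to the latest time. Returns that time. */
unsigned cores_synchronise(unsigned *core_time, const size_t *cores, size_t n) {
    assert(n > 0);
    unsigned latest = core_time[cores[0]];
    for (size_t i = 1; i < n; i++)
        if (core_time[cores[i]] > latest) latest = core_time[cores[i]];
    for (size_t i = 0; i < n; i++)
        core_time[cores[i]] = latest;
    return latest;
}

/* Index 0 plays "Core 1" (main, t1), index 1 plays "Core 2" (t2). */
unsigned example_fork_join(void) {
    unsigned t[2] = {0, 0};
    const size_t both[2] = {0, 1};
    core_execute(t, 0, 100);               /* main runs up to the fork */
    cores_synchronise(t, both, 2);         /* fork: second core starts at 100 */
    core_execute(t, 0, 300);               /* t1 */
    core_execute(t, 1, 450);               /* t2 */
    return cores_synchronise(t, both, 2);  /* join at max(400, 550) */
}
```

In this example the join happens at cycle 550: the first core finishes t1 at 400 and idles until t2 completes on the second core.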

Timing analysis 2. Scheduling overheads. – Synchronisation: Fork/join and global tick. • Via global memory. – Thread context-switching. • Copying of shared variables at the start and end of a thread’s local tick via global memory. [Figure: timelines of Core 1 (main, t1) and Core 2 (t2) showing synchronisation, thread context-switch and global tick overheads.]

Timing analysis 2. Scheduling overheads. – Required scheduling routines statically known. – Analyse the scheduling control-flow. – Compute the execution time for each scheduling overhead. [Figure: Core 1 and Core 2 timelines for main, t1 and t2.]

Timing analysis 3. Variable delay in accessing the shared bus. – Global memory accessed by scheduling routines. – TDMA bus delay has to be considered. [Figure: Core 1 (main, t1) and Core 2 (t2) timelines against alternating TDMA bus slots 1, 2, 1, 2, 1, 2; t2’s global memory access waits for Core 2’s slot.]
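The TDMA bus delay depends on where in the bus schedule a core issues its access, which is why the analysis must track absolute core time. A minimal sketch under simplifying assumptions that are not taken from the slides: slots are equal-length, core i owns the i-th slot of every round, and an access issued inside the core's own slot is granted immediately. `tdma_wait` and `tdma_worst_wait` are hypothetical names.

```c
#include <assert.h>

/* Cycles a core must wait at time `now` before its next bus slot begins.
 * Returns 0 if `now` already falls inside the core's own slot. */
unsigned tdma_wait(unsigned now, unsigned core, unsigned ncores, unsigned slot) {
    assert(ncores > 0 && slot > 0 && core < ncores);
    unsigned round = slot * ncores;   /* length of one full bus schedule */
    unsigned pos = now % round;       /* position within the current round */
    unsigned start = core * slot;     /* start of this core's slot */
    if (pos >= start && pos < start + slot)
        return 0;
    return (start + round - pos) % round;
}

/* Worst case under the same assumptions: the core just missed its slot
 * and has to sit through every other core's slot. */
unsigned tdma_worst_wait(unsigned ncores, unsigned slot) {
    assert(ncores > 0 && slot > 0);
    return (ncores - 1) * slot;
}
```

With two cores and 5-cycle slots, core 1 issuing an access at cycle 0 waits 5 cycles, while the same access at cycle 7 proceeds at once. With four cores the worst-case wait grows to 15 cycles, which is one way global memory becomes more expensive as cores are added.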

Timing analysis • CCFG optimisations: – merge: Reduce the number of CFG nodes that need to be traversed for each local tick. – merge-b: Reduce the number of alternate paths between CFG nodes. • Reduces the number of reachable global ticks. [Figures: example CCFG after merge and after merge-b.]

Outline • Introduction • ForeC language • Timing analysis • Results • Conclusions

Results • For the proposed reachability-based timing analysis, we demonstrate: – the precision of the computed WCRT. – the efficiency of the analysis, in terms of analysis time.

Results • Timing analysis tool: [Flow: Program binary (annotated) → Program CCFG (optimisations) → Explicit path exploration (Reachability) or Implicit path exploration (Max-Plus), taking into account the 3 factors → WCRT]

Results • Multicore simulator (Xilinx MicroBlaze): – Based on http://www.jwhitham.org/c/smmu.html and extended to be cycle-accurate and support multiple cores and a TDMA bus. – Per core (Core 0 to Core n): 16 KB instruction memory and data memory, 1 cycle access. – Global memory: 32 KB, reached over the TDMA shared bus at 5 cycles/core (bus schedule round = 5 * no. cores).

Results • Benchmark programs. [Table: benchmark programs, with * and # marking their sources.] • Mix of control/data computations, thread structure and computation load. * [Pop et al 2011] A Stream-Computing Extension to OpenMP. # [Nemer et al 2006] A Free Real-Time Benchmark.

Results • Each benchmark program was distributed over a varying number of cores. – Up to the maximum number of parallel threads. • Observed the WCRT: – Test vectors to elicit different execution paths. • Computed the WCRT: – Reachability – Max-Plus

802.11a Results [Figure: WCRT (clock cycles, 0 to 200 000) against number of cores (1 to 10) for Observed, Reachability and Max-Plus.] • WCRT decreases until 5 cores. • Global memory increasingly expensive. • Scheduling overheads. • Reachability: – ~2% overestimation. – Benefit of explicit path exploration. • Max-Plus: – Loss of execution context: uses only the thread WCRTs. – Assumes one global tick where all threads execute their worst-case. – Max execution time of the scheduling routines. • Both approaches: – Estimation of synchronisation cost is conservative. Assumed that the receive only starts after the last sender.

802.11a Results [Figure: analysis time (seconds, 0 to 2 500) against number of cores (1 to 10) for Reachability, Reachability (merge) and Reachability (merge-b).] • Max-Plus takes less than 2 seconds. • merge: – Reduction of ~9.34x. • merge-b: – Reduction of ~342x. – Less than 7 sec.

802.11a Results • Number of global ticks explored. • A reduction in states leads to a reduction in analysis time.

Results [Figures: WCRT against number of cores for FmRadio, Fly by Wire, Life and Matrix, each showing Observed, Reachability and Max-Plus.] • Reachability: – ~1 to 8% over-estimation. – Loss in precision mainly from over-estimating the synchronisation costs. • Max-Plus: – Over-estimation very dependent on program structure. – FmRadio and Life very imprecise. Loops iterate over par statement(s) multiple times, so over-estimations are multiplied. – Matrix quite precise. Executes in one global tick, so the thread WCRT assumption is valid.

Results • Timing trace of the WCRT. – For each core: thread start/end time, context-switching, fork/join, ... – Can be used to tune thread distribution. – Used to find good thread distributions for each benchmark program.

Outline • Introduction • ForeC language • Timing analysis • Results • Conclusions

Conclusions • ForeC language for deterministic parallel programming. • Based on the synchronous framework. • Able to achieve WCRT speedup while providing time-predictability. • Very precise, fast and scalable timing analysis for multicore programs using reachability.

Future work • Complete the formal semantics of ForeC. • Prune additional infeasible paths using value analysis. • WCRT-guided, automatic thread distribution. • Include the cache hierarchy in the analysis.

Questions?

Introduction • Existing parallel programming solutions. – Shared memory model. • OpenMP, Pthreads • Intel Cilk Plus, Threading Building Blocks • Unified Parallel C, ParC, X10 – Message passing model. • MPI, SHIM – Provide ways to manage shared resources but do not prevent concurrency errors. [OpenMP] http://openmp.org [Pthreads] https://computing.llnl.gov/tutorials/pthreads/ [X10] http://x10-lang.org/ [Intel Cilk Plus] http://software.intel.com/en-us/intel-cilk-plus [Intel Threading Building Blocks] http://threadingbuildingblocks.org/ [Unified Parallel C] http://upc.lbl.gov/ [Ben-Asher et al] ParC – An Extension of C for Shared Memory Parallel Processing. [MPI] http://www.mcs.anl.gov/research/projects/mpi/ [SHIM] SHIM: A Language for Hardware/Software Integration.

Introduction • Deterministic runtime support. – Pthreads • dOS, Grace, Kendo, CoreDet, Dthreads. – OpenMP • Deterministic OMP – Concept of logical time. – Each logical time step broken into an execution and communication phase. [Bergan et al 2010] Deterministic Process Groups in dOS. [Olszewski et al 2009] Kendo: Efficient Deterministic Multithreading in Software. [Bergan et al 2010] CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution. [Liu et al 2011] Dthreads: Efficient Deterministic Multithreading. [Aviram 2012] Deterministic OpenMP.

ForeC language • Behaviour of shared variables is similar to: • Intel Cilk+ (Reducers) • Unified Parallel C (Collectives) • DOMP (Workspace consistency) • Grace (Copy-on-write) • Dthreads (Copy-on-write)

ForeC language • Parallel programming patterns: – Specifying an appropriate combine function. – Sacrifice for deterministic parallel programs. – Map-reduce – Scatter-gather – Software pipelining – Delayed broadcast or point-to-point communication.