Predictable Programming on a Precision Timed Architecture
Ben Lickly - UC Berkeley
Isaac Liu - UC Berkeley
Sungjun Kim - Columbia University
Hiren D. Patel - UC Berkeley
Stephen A. Edwards - Columbia University
Edward A. Lee - UC Berkeley

Edwards and Lee - Case for PRET
• 2007 – Edwards and Lee made a case for precision timed computers (PRET machines)
  – Predictability
  – Repeatability
S. A. Edwards and E. A. Lee, The case for the precision timed (PRET) machine. In Proceedings of the 44th Annual Conference on Design Automation (San Diego, California, June 04-08, 2007). DAC '07. ACM, New York, NY, 264-265.
Patel, UC Berkeley, PRET 2

Edwards and Lee - Case for PRET
• Unpredictability
  – Difficulty in determining timing behavior through analysis
• Non-repeatability
  – Different executions may yield different timing behavior
• Brittleness
  – Small changes have big effects on timing behavior

Brittleness
• Expensive affair
• Tight coupling of software and hardware
• Reliance on testing for validation
• Upgrading difficult
• Solution: stockpile
[Image source: www.skycontrol.net]

But wait …
• Real-time scheduling
  – Worst-case execution time
    • Detailed model of hardware
    • Large engineering effort
    • Valid for particular hardware models
  – Interrupts, interprocess communication, locks …
• Bench testing
  – Brittle
Sebastian Altmeyer, Christian Hümbert, Björn Lisper, and Reinhard Wilhelm. Parametric Timing Analysis for Complex Architectures. In Proceedings of the 14th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA'08), pages 367-376, Kaohsiung, Taiwan, August 2008. IEEE Computer Society.

Precise Timing and High Performance
Traditional                  Alternative
Caches                       Scratchpads
Deep pipelines               Thread-interleaved pipelines
Function-only ISAs           ISAs with timing instructions
Function-only languages      Languages and programming models with timing
Best-effort communication    Fixed-latency communication
Time-sharing                 Multiple independent processors

Outline
• Introduction
• Related Work
• PRET Machine
• Programming Example
• Future Work
• Conclusion

Related Work
• Java Optimized Processor
  – Schoeberl et al. [2003]
• Timing instructions
  – Ip and Edwards [2006]
• Reactive processors
  – Von Hanxleden et al. [2005]
  – Salcic et al. [2005]
• Virtual Simple Architecture
  – Mueller et al. [2003]

Semantics of Timing Instructions
• Ip and Edwards [2007]
• Deadline instructions
  – Denote the required execution time of a block
• When decoded
  – Stall instruction until timer value is 0
  – Then set timer value to new value
[Figure: assembly listing in which deadi instructions on timer $t0 (operands $t0, 10; $t0, 8; $t0, 0; $t0, 10) delimit Straight Line Block 0, Straight Line Block 1, and a Loop Block of the form L0: deadi … b L0]

Tracing A Program Fragment
  A: deadi $t0, 6
  B: sethi %hi(0x3f800000), %g1
  C: or %g1, 0x200, %g1
  D: st %g1, [%fp + -12]
  E: deadi $t0, 8
  F: …
[Figure: cycle-by-cycle trace of the $t0 timer value while the fragment executes]
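The deadline semantics traced above can be approximated with a small model. This is only a sketch under stated assumptions, not the PRET simulator: every instruction is assumed to occupy one cycle, a single timer is assumed to count down once per cycle starting in the cycle where it is set, and the function name `issue_cycles` is hypothetical.

```python
def issue_cycles(program):
    """Return the cycle at which each instruction issues.

    program: list of ("deadi", n) or ("op",) tuples.
    Assumptions (not from the slides): one cycle per instruction,
    one timer, decremented once per cycle and saturating at 0.
    """
    cycle = 0
    timer = 0
    issued = []
    for instr in program:
        if instr[0] == "deadi":
            cycle += timer      # stall until the timer has reached 0
            timer = instr[1]    # then load the new deadline
        issued.append(cycle)
        cycle += 1              # the instruction itself takes one cycle
        timer = max(0, timer - 1)
    return issued


# Fragment A..F from the slide: deadi 6, three ordinary ops, deadi 8, next op.
fragment = [("deadi", 6), ("op",), ("op",), ("op",), ("deadi", 8), ("op",)]
print(issue_cycles(fragment))
```

In this model instruction E issues six cycles after A, so the block A to D takes the time requested by `deadi $t0, 6` even though it contains only three ordinary instructions.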

Precision Timed Architecture
[Figure: architecture block diagram with the following labels]
• Round-robin thread scheduling
• Thread-interleaved pipeline
• Scratchpad memories
• Time-triggered main memory access

Clocks and Memory Hierarchy
• Clocks
  – Main clock
  – Derived clocks
• Instruction and data scratchpad memories
  – 1 cycle access latency
• Main memory
  – 16 MB size
  – Latency of 50 ns
  – Frequency: 250 MHz
    • ~13 cycles latency
[Figure: core with scratchpad memories (SPM), DMA, and main memory]
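The ~13 cycle figure follows from the 50 ns latency and the 250 MHz clock: 50 ns × 250 MHz = 12.5 cycles, rounded up to a whole cycle. A minimal check, with a hypothetical helper name:

```python
def latency_in_cycles(latency_ns, clock_mhz):
    """Whole clock cycles needed to cover a latency given in nanoseconds.

    ns * MHz gives thousandths of a cycle, so divide by 1000 and round up.
    Integer arithmetic avoids floating-point rounding surprises.
    """
    thousandths = latency_ns * clock_mhz
    return -((-thousandths) // 1000)   # ceiling division


print(latency_in_cycles(50, 250))
```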

Thread-interleaved Pipeline
• Thread stalls
  – Main memory access
  – Deadline instructions
• Replay mechanism
  – Execute same PC next iteration
[Figure: pipeline stages Fetch, Decode, Reg. Access, Execute, Memory, Write Back, separated by pipeline registers F/D, D/R, R/E, E/M, M/W. Annotations: decrement deadline timers; stall if deadline instruction; check main memory access; increment PC]
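The replay mechanism can be illustrated with a toy round-robin model. This is a sketch only: it ignores the individual pipeline stages, assumes six hardware threads (matching thread 0 to thread 5 on the memory wheel slide), and the `stall_turns` table is a hypothetical stand-in for memory or deadline stalls.

```python
def run_interleaved(n_threads, stall_turns, n_cycles):
    """Trace (cycle, thread, pc) for a round-robin, thread-interleaved fetch.

    stall_turns maps (thread, pc) to the number of turns on which that
    instruction must be replayed before it can complete (hypothetical input).
    """
    pc = [0] * n_threads
    remaining = dict(stall_turns)
    trace = []
    for cycle in range(n_cycles):
        thread = cycle % n_threads           # round-robin thread selection
        key = (thread, pc[thread])
        trace.append((cycle, thread, pc[thread]))
        if remaining.get(key, 0) > 0:
            remaining[key] -= 1              # replay: same PC on the next turn
        else:
            pc[thread] += 1                  # instruction completes
    return trace


# Thread 2 stalls once on its first instruction; the other threads are unaffected.
trace = run_interleaved(6, {(2, 0): 1}, 18)
print([pc for _, t, pc in trace if t == 2])
print([pc for _, t, pc in trace if t == 0])
```

A stalled thread simply repeats its PC on its next turn, so the other threads never lose a slot.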

Time-Triggered Access through Memory Wheel
• Decouple thread's access pattern
• Time-triggered access
• Each thread must make and complete access within its window
[Figure: memory wheel with one window each for thread 0 through thread 5. Accesses that arrive on time proceed in their window; a thread 0 access that misses its window takes 90 cycles until it completes]
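The 90-cycle figure can be reproduced with a simple model of the wheel. The sketch assumes a 13-cycle window per thread (the main-memory latency quoted on the memory hierarchy slide), six threads, and that an access can only begin at the start of its thread's window; the names are hypothetical.

```python
WINDOW = 13                      # cycles per window (assumed: one memory access)
N_THREADS = 6                    # thread 0 .. thread 5
ROTATION = WINDOW * N_THREADS    # 78 cycles per full rotation of the wheel


def access_latency(thread, request_cycle):
    """Cycles from the request until the access completes.

    Assumes the access must start exactly at the beginning of the
    thread's window so that it also completes inside that window.
    """
    window_start = thread * WINDOW
    wait = (window_start - request_cycle) % ROTATION
    return wait + WINDOW


best = min(access_latency(0, c) for c in range(ROTATION))
worst = max(access_latency(0, c) for c in range(ROTATION))
print(best, worst)
```

Under these assumptions a request that arrives one cycle after the window opens waits 77 cycles for the next rotation and then needs 13 cycles for the access itself, which gives the 90 cycles shown on the slide.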

Tool Flow
• GCC 3.4.4, SystemC 2.2, Python 2.4
[Figure: boot code and C programs with timing instructions are compiled by GCC (boot code and program code) into Motorola SREC files]

Simple Mutual Exclusion Example
• Producer followed by Consumer and Observer
  – Consumer and Observer execute together
• Loop rate of two rotations of memory wheel
  – 1st for Producer to write
  – 2nd for Consumer and Observer to read
[Figure: Producer writes to shared data; Consumer and Observer read from shared data and write to output]

Video Game Example
[Figure: the Main Control Thread sends commands through an even and an odd command queue to the Graphic Thread, which writes pixel data into an even and an odd buffer read by the VGA Driver Thread. Labels: Swap (when sync requested and when odd queue empty); Swap (when sync requested and when vertical blank); Update Screen (sync request); Refresh (sync request); Sync (after queue swapped); Sync (after buffer swapped)]

Timing Requirements
Signal             Timing Requirement   Pixel Cycles
V. Sync            64 µs                1611
V. Back-porch      1.02 ms              25679
Draw 480 lines     15.25 ms
V. Front-porch     350 µs               8811
H. Sync            3.77 µs              96
H. Back-porch      1.89 µs              48
Draw 640 pixels    25.42 µs
H. Front-porch     0.64 µs              16

Timing Implementation
• Pixel-clock using derived clock
  – 25.175 MHz
• Drawing 16 pixels
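The pixel-cycle counts in the timing requirements are roughly the signal duration multiplied by the 25.175 MHz pixel clock. The helper below is a hypothetical illustration of that conversion. The listed durations should be read as approximate: for H. Sync, 3.77 µs gives about 95 cycles while the table lists 96.

```python
PIXEL_CLOCK_MHZ = 25.175   # derived pixel clock from the slide


def pixel_cycles(duration_us):
    """Pixel-clock cycles spanned by a duration in microseconds (µs * MHz)."""
    return duration_us * PIXEL_CLOCK_MHZ


# Durations from the timing requirements table, in microseconds.
for name, duration_us in [("V. Sync", 64), ("V. Front-porch", 350),
                          ("H. Back-porch", 1.89), ("H. Front-porch", 0.64)]:
    print(name, round(pixel_cycles(duration_us), 1))
```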

Future Work
• Architecture
  – DMA
  – DDR2 main memory model
  – Thread synchronization primitives
  – Shared data between threads
• Real-time Benchmarks
  – With timing requirements
• Programming models
  – Memory allocation schemes
  – Synchronizations

Conclusion
• What we want …
  – Time as a first class citizen of embedded computing
  – Predictability
  – Repeatability
• Where we are at …
  – PRET cycle-accurate simulator
  – Release …
    • http://chess.eecs.berkeley.edu/pret/

Extras

More on Brittleness
• Small changes may have big effects on timing behavior
Theorem (Richard's anomalies): If a task set with fixed priorities, execution times, and precedence constraints is optimally scheduled on a fixed number of processors, then increasing the number of processors, reducing execution times, or weakening precedence constraints can increase the schedule length.
Richard L. Graham, "Bounds on the performance of scheduling algorithms", in E. G. Coffman, Jr. (ed.), Computer and Job-Shop Scheduling Theory, John Wiley, New York, 1975.

Richard's Anomalies
• 9 tasks, 3 processors, priority list, precedence order, execution times.
Tasks (name/execution time): T1/3, T2/2, T3/2, T4/2, T5/4, T6/4, T7/4, T8/4, T9/9
[Figure: precedence graph and resulting schedule; time axis marked 0, 3, 12]

Richard's Anomalies: Reducing Execution Times
• eTime' = eTime - 1
Tasks (name/execution time): T1/2, T2/1, T3/1, T4/1, T5/3, T6/3, T7/3, T8/3, T9/8
[Figure: precedence graph and resulting schedule; time axis marked 0, 3, 12]

Richard's Anomalies: More Processors
• 4 processors
Tasks (name/execution time): T1/3, T2/2, T3/2, T4/2, T5/4, T6/4, T7/4, T8/4, T9/9
[Figure: precedence graph and resulting schedule; time axis marked 0, 3, 12, 15]

Richard's Anomalies: Changing Priority List
• L = (T1, T2, T4, T5, T6, T3, T9, T7, T8)
Tasks (name/execution time): T1/3, T2/2, T3/2, T4/2, T5/4, T6/4, T7/4, T8/4, T9/9
[Figure: precedence graph and resulting schedule; time axis marked 0, 3, 12]
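The anomalies on the preceding slides can be reproduced with a small greedy list scheduler. This is a sketch under one important assumption: the precedence graph did not survive in this transcript, so the constraints used here (T1 before T9; T4 before T5, T6, T7, and T8) are taken from Graham's classic example and may differ from the slide's figure. The function and constant names are hypothetical.

```python
def list_schedule(times, preds, priority, n_procs):
    """Schedule length under greedy list scheduling.

    Whenever a processor is idle it takes the first task in the priority
    list whose predecessors have all finished.
    """
    finish = {}                  # task -> finish time of started tasks
    free_at = [0] * n_procs      # time at which each processor becomes idle
    pending = list(priority)
    now = 0
    while pending:
        for p in range(n_procs):
            if free_at[p] > now:
                continue         # processor still busy
            ready = next(
                (t for t in pending
                 if all(finish.get(q, float("inf")) <= now
                        for q in preds.get(t, ()))),
                None,
            )
            if ready is None:
                break            # nothing can start at this instant
            pending.remove(ready)
            finish[ready] = now + times[ready]
            free_at[p] = finish[ready]
        if pending:
            now = min(f for f in finish.values() if f > now)  # next completion
    return max(finish.values())


TIMES = {1: 3, 2: 2, 3: 2, 4: 2, 5: 4, 6: 4, 7: 4, 8: 4, 9: 9}
PREDS = {9: (1,), 5: (4,), 6: (4,), 7: (4,), 8: (4,)}   # assumed, see lead-in
BASE = (1, 2, 3, 4, 5, 6, 7, 8, 9)
ALT = (1, 2, 4, 5, 6, 3, 9, 7, 8)                       # list from the slide
REDUCED = {t: c - 1 for t, c in TIMES.items()}

print(list_schedule(TIMES, PREDS, BASE, 3))     # baseline
print(list_schedule(REDUCED, PREDS, BASE, 3))   # shorter execution times
print(list_schedule(TIMES, PREDS, BASE, 4))     # one more processor
print(list_schedule(TIMES, PREDS, ALT, 3))      # changed priority list
```

Under the assumed precedence graph the four calls give schedule lengths of 12, 13, 15, and 14: each seemingly helpful change makes the schedule longer than the baseline.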

Brittleness Again…
• In general, all task scheduling strategies are brittle