Presentation of Licentiate Thesis
Scheduling and Optimization of Fault-Tolerant Embedded Systems
Viacheslav Izosimov
Embedded Systems Lab (ESLAB), Linköping University, Sweden

Motivation
• Hard real-time applications
  • Time-constrained
  • Cost-constrained
  • Fault-tolerant
  • etc.
• Focus on transient faults and intermittent faults

Motivation: Transient faults
• Happen for a short time
• Corruption of data, miscalculation in logic
• Do not cause permanent damage to circuits
• Causes are outside system boundaries: electromagnetic interference (EMI), radiation, lightning storms

Motivation: Intermittent faults
• Manifest similarly to transient faults
• Happen repeatedly
• Causes are inside system boundaries: crosstalk, internal EMI, power supply fluctuations, software errors (Heisenbugs)

Motivation
• Transient faults become more likely as transistor sizes shrink and operating frequencies grow
• Errors caused by transient faults have to be tolerated before they crash the system
• However, fault tolerance against transient faults leads to significant performance overhead

Motivation
• Hard real-time applications: time-constrained, cost-constrained, fault-tolerant, etc.
• The need for design optimization of embedded systems with fault tolerance

Outline
• Motivation
→ Background and limitations of previous work
• Thesis contributions:
  • Scheduling with fault tolerance requirements
  • Fault tolerance policy assignment
  • Checkpoint optimization
  • Trading-off transparency for performance
  • Mapping optimization with transparency
• Conclusions and future work

General Design Flow
[Figure: design flow with the stages System Specification, Architecture Selection, Mapping & Hardware/Software Partitioning, Fault Tolerance Techniques, Scheduling, and Back-end Synthesis, connected by feedback loops]

Fault Tolerance Techniques
• Re-execution: process P1 on node N1 is executed again (P1/1, P1/2) after a fault; involves error-detection overhead and recovery overhead
• Rollback recovery with checkpointing: P1 on N1 is divided into segments by checkpoints; additionally involves checkpointing overhead
• Active replication: replicas P1(1) and P1(2) run on nodes N1 and N2
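As a rough illustration of the cost of these techniques, the Python sketch below computes the worst-case completion time of a single process under re-execution and the number of replicas needed under active replication. The function names, the simplified timing model (error-detection overhead folded into the execution time), and the sample numbers are assumptions made for illustration; they are not taken from the thesis.

```python
def reexecution_worst_case(wcet, k, recovery_overhead):
    """Worst-case completion time of one process under re-execution.

    Assumed model: every one of the k faults hits this process, and each
    fault costs one recovery overhead plus one full re-execution.
    """
    return wcet + k * (wcet + recovery_overhead)


def replicas_needed(k):
    """Replicas required so that at least one survives k faults."""
    return k + 1


# Hypothetical numbers: 30 ms execution time, k = 2 faults, 5 ms recovery.
worst = reexecution_worst_case(30, 2, 5)
replicas = replicas_needed(2)
print(worst, replicas)
```

Under this model re-execution costs time on one node, while replication costs additional nodes, which is the trade-off the later policy-assignment slides explore.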

Limitations of Previous Work
• Design optimization with fault tolerance is limited
• Process mapping is not considered together with fault tolerance issues
• Multiple faults are not addressed in the framework of static cyclic scheduling
• Transparency, if addressed at all, is restricted to a whole computation node

Outline
• Motivation
• Background and limitations of previous work
→ Thesis contributions:
  • Scheduling with fault tolerance requirements
  • Fault tolerance policy assignment
  • Checkpoint optimization
  • Trading-off transparency for performance
  • Mapping optimization with transparency
• Conclusions and future work

Fault-Tolerant Time-Triggered Systems
• Transient faults
• Processes: re-execution, active replication, rollback recovery with checkpointing, …
• Messages: fault-tolerant predictable protocol
• Maximum k transient faults within each application run (system period)
[Figure: application graph with processes P1 to P5 and messages m1 and m2]

Scheduling with Fault Tolerance Requirements
• Conditional Scheduling
• Shifting-based Scheduling

Conditional Scheduling
[Animated figure, shown over six slides: example with k = 2 in which process P1 sends message m1 to process P2. The schedule is drawn on a time axis from 0 to 200 and includes the re-executions P1/1 to P1/3 and P2/1 to P2/3; a schedule table with the start times of P1 and P2 under each fault condition is filled in step by step.]

Fault-Tolerance Conditional Process Graph
[Figure: fault-tolerance conditional process graph for the example P1, m1, P2 with k = 2. It contains three copies of P1, three copies of m1, and six copies of P2, and is the input to conditional scheduling.]

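Conditional Schedule Table
[Table: conditional schedule for P1, m1 and P2 on nodes N1 and N2 with k = 2. Start times listed per activity (the fault conditions heading each column are not recoverable from the extracted text):]
P1: 0, 45, 90
m1: 40, 85, 130
P2: 50, 95, 105, 140, 150, 160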

Conditional Scheduling
• Conditional scheduling:
  + Generates short schedules
  + Allows trading off transparency for performance (to be discussed later...)
  – Requires a lot of memory to store schedule tables
  – Scheduling algorithm is very slow
• Alternative: shifting-based scheduling
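One way to see why schedule tables grow quickly is to count fault scenarios. The Python sketch below counts the ways in which at most k faults can be distributed over n processes, looking only at which process each fault hits. This counting model and the function name are assumptions for illustration; the thesis may count scenarios differently.

```python
from math import comb


def fault_scenarios(n_processes, k):
    """Number of ways to distribute at most k faults over n processes.

    A scenario is a multiset of faulty processes of size 0..k, so the
    count is C(n + k, k).
    """
    return comb(n_processes + k, k)


for n, k in [(2, 2), (20, 2), (40, 4)]:
    print(n, k, fault_scenarios(n, k))
```

For two processes and k = 2 this gives six scenarios. For 40 processes and k = 4 the count already exceeds one hundred thousand, which is consistent with the memory and run-time concerns listed above.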

Shifting-based Scheduling
• Messages sent over the bus should be scheduled at one time
• Faults on one computation node must not affect other computation nodes
  + Requires less memory
  + Schedule generation is very fast
  – Schedules are longer
  – Does not allow trading off transparency for performance (to be discussed later...)

Ordered FT-CPG
[Figure: ordered FT-CPG for k = 2 with the ordering constraints "P2 after P1" and "P3 after P4", messages m1, m2 and m3, and synchronization nodes S]

Root Schedules
[Figure: root schedule on nodes N1 and N2 and the bus, showing the recovery slack for P1 and P2 and the worst-case scenario for P1]
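The idea of a recovery slack shared by several processes can be sketched as follows. The Python code compares the length of a single-node schedule when all processes share one slack against the length when every process gets its own slack. The single-node model without messages, the function names, and the sample numbers are assumptions for illustration only.

```python
def shared_slack_length(wcets, k, recovery_overhead):
    """Schedule length with one recovery slack shared by all processes.

    Assumed model: the slack must absorb k faults hitting the longest
    process, each costing a recovery overhead plus a full re-execution.
    """
    slack = k * (max(wcets) + recovery_overhead)
    return sum(wcets) + slack


def private_slack_length(wcets, k, recovery_overhead):
    """Schedule length when every process reserves its own slack."""
    return sum(c + k * (c + recovery_overhead) for c in wcets)


# Hypothetical node with two processes (30 ms and 20 ms), k = 2, 5 ms recovery.
print(shared_slack_length([30, 20], 2, 5))
print(private_slack_length([30, 20], 2, 5))
```

Sharing works because at most k faults occur per application run, so the slack never has to cover k faults in every process at once.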

Extracting Execution Scenarios
[Figure: execution scenario extracted from the root schedule on nodes N1 and N2 and the bus, in which P4 is executed three times (P4/1, P4/2, P4/3)]

Memory Required to Store Schedule Tables

Frozen        20 proc.              40 proc.               60 proc.                80 proc.
          k=1   k=2   k=3      k=1   k=2    k=3      k=1    k=2    k=3      k=1    k=2    k=3
100%     0.13  0.28  0.54     0.36  0.89   1.73     0.71   2.09   4.35     1.18   4.21   8.75
75%      0.22  0.57  1.37     0.62  2.06   4.96     1.20   4.64  11.55     2.01   8.40  21.11
50%      0.28  0.82  1.94     0.82  3.11   8.09     1.53   7.09  18.28     2.59  12.21  34.46
25%      0.34  1.17  2.95     1.03  4.34  12.56     1.92  10.00  28.31     3.05  17.30  51.30
0%       0.39  1.42  3.74     1.17  5.61  16.72     2.16  11.72  34.62     3.41  19.28  61.85

⇒ Applications with more frozen nodes require less memory

Memory Required to Store Root Schedules
[Table: same layout as for schedule tables (20, 40, 60 and 80 processes; k = 1, 2, 3). Only part of the 100% row survives extraction: 0.016, 0.034, 0.054, 0.070; the column assignment is not recoverable.]
⇒ Shifting-based scheduling requires very little memory

Schedule Generation Time and Quality
• Shifting-based scheduling requires 0.2 seconds to generate a root schedule for an application of 120 processes and 10 faults
• Conditional scheduling already takes 319 seconds to generate a schedule table for an application of 40 processes and 4 faults
⇒ Shifting-based scheduling is much faster than conditional scheduling
⇒ It is ~15% worse than conditional scheduling with 100% of inter-processor messages set to frozen (in terms of fault tolerance overhead)

Fault Tolerance Policy Assignment
Checkpoint Optimization

Fault Tolerance Policy Assignment
[Figure: three policies for process P1]
• Re-execution: P1/1, P1/2, P1/3 on node N1
• Replication: P1(1) on N1, P1(2) on N2, P1(3) on N3
• Re-executed replicas: P1(1)/1 and P1(1)/2 on N1, P1(2) on N2

Re-execution vs. Replication
[Figure: schedules on nodes N1 and N2 and the bus for two applications, A1 and A2, each consisting of processes P1, P2 and P3. In one case re-execution is better (deadline met with re-execution, missed with replication); in the other replication is better (deadline met with replication, missed with re-execution).]

Execution times:
        N1   N2
P1      40   50
P2      40   50
P3      60   70

Fault Tolerance Policy Assignment
[Figure: application with processes P1 to P4 and messages m1 to m3, scheduled on nodes N1 and N2 and the bus. Depending on the policy assigned to each process, the deadline is met or missed, which motivates optimization of the fault tolerance policy assignment.]

Execution times:
        P1   P2   P3   P4
N1      40   60   60   40
N2      50   80   80   50

Optimization Strategy
• Design optimization (tabu search):
  • Fault tolerance policy assignment
  • Mapping of processes and messages
• Root schedules: shifting-based scheduling
• Three tabu-search optimization algorithms:
  1. Mapping and fault tolerance policy assignment (MRX): re-execution, replication, or both
  2. Mapping and only re-execution (MX)
  3. Mapping and only replication (MR)

Experimental Results
[Figure: schedulability improvement under resource constraints. Y axis: average % deviation from MRX (0 to 100). X axis: number of processes (20 to 100). Curves: mapping and replication (MR), mapping and re-execution (MX), mapping and policy assignment (MRX).]

Checkpoint Optimization
[Figure: process P1 on node N1, executed with checkpoints and recovered after faults (P1/1, P1/2)]

Locally Optimal Number of Checkpoints
[Figure: process P1 with k = 2 and C1 = 50 ms, executed with 1 to 5 checkpoints. Overheads given on the slide: χ1 = 5 ms, plus two further overheads of 10 ms and 15 ms whose symbols were lost in extraction.]
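The trade-off behind a locally optimal number of checkpoints can be sketched with a simple cost model. In the Python code below, more checkpoints add a fixed overhead each but shorten the segment that must be re-executed after a fault. The cost formula, the parameter names, and the way the slide's overhead values are combined (15 ms per checkpoint, 15 ms per recovery) are assumptions for illustration; the thesis may use a different formula.

```python
def worst_case_time(wcet, k, per_checkpoint_overhead, recovery_overhead, n):
    """Worst-case execution time of one process with n checkpoints.

    Assumed model: n checkpoints each add a fixed overhead, and each of
    the k faults costs one recovery overhead plus one segment of length
    wcet / n.
    """
    segment = wcet / n
    return wcet + n * per_checkpoint_overhead + k * (segment + recovery_overhead)


def best_checkpoint_count(wcet, k, per_checkpoint_overhead, recovery_overhead,
                          max_n=20):
    """Checkpoint count in 1..max_n that minimizes the worst-case time."""
    return min(
        range(1, max_n + 1),
        key=lambda n: worst_case_time(
            wcet, k, per_checkpoint_overhead, recovery_overhead, n),
    )


# C1 = 50 ms and k = 2 come from the slide; the two overheads are assumed.
for n in range(1, 6):
    print(n, round(worst_case_time(50, 2, 15, 15, n), 2))
print("best:", best_checkpoint_count(50, 2, 15, 15))
```

In this model the continuous minimum lies at n = sqrt(k * wcet / per_checkpoint_overhead), so the search only needs to look near that value; the exhaustive loop is kept for clarity.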

Globally Optimal Number of Checkpoints
[Animated figure, shown over three slides: application in which P1 sends message m1 to P2, with k = 2, C1 = 50 ms and C2 = 60 ms. Two checkpoint distributions, a) and b), are compared; their schedule lengths are 265 and 255.]

Global Optimization vs. Local Optimization
Does the optimization reduce the fault tolerance overheads on the schedule length?
[Figure: 4 nodes, 3 faults. Y axis: % deviation from MC0 (how much smaller the fault tolerance overhead is), 0% to 40%. X axis: application size (number of tasks), 40 to 100. Curves: global optimization of checkpoint distribution (MC) and local optimization of checkpoint distribution (MC0).]

Trading-off Transparency for Performance
Mapping Optimization with Transparency

FT Implementations with Transparency
• Transparency is achieved with frozen processes and messages
• Good for debugging and testing
[Figure: application graph with processes P1 to P5 and messages m1 and m2; the legend distinguishes regular and frozen processes/messages]

No Transparency
• Processes start at different times
• Messages are sent at different times
[Figure: schedules on nodes N1 and N2 and the bus for the no-fault scenario and for the worst-case fault scenario, relative to the deadline. Application with processes P1 to P4 and messages m1 to m3; k = 2; a 5 ms overhead whose symbol was lost in extraction.]

Execution times:
        N1   N2
P1      30    X
P2      20    X
P3       X   20
P4       X   30

Customized Transparency
[Figure: schedules relative to the deadline for the no-fault scenario under three settings: no transparency, full transparency, and customized transparency]

Trading-Off Transparency for Performance
How much longer is the schedule with fault tolerance?
Setup: four (4) computation nodes, recovery time 5 ms. Transparency increases from left to right.

               0%             25%             50%             75%            100%
Proc.     k=1 k=2 k=3     k=1 k=2 k=3     k=1 k=2 k=3     k=1 k=2 k=3     k=1 k=2 k=3
20         24  44  63      32  60  92      39  74 115      48  83 133      48  86 139
40         17  29  43      20  40  58      28  49  72      34  60  90      39  66  97
60         12  24  34      13  30  43      19  39  58      28  54  79      32  58  86
80          8  16  22      10  18  29      14  27  39      24  41  66      27  43  73

⇒ Trading transparency for performance is essential

Mapping with Transparency
[Figure: schedules on nodes N1 and N2 and the bus, relative to the deadline: the optimal mapping without transparency, and the worst-case fault scenario for that mapping, in which P4 is executed three times (P4/1, P4/2, P4/3). Application with processes P1 to P6 and messages m1 to m4; k = 2; a 10 ms overhead whose symbol was lost in extraction.]

Execution times:
        P1   P2   P3   P4   P5   P6
N1      30   40   50   60   40   50
N2      30   40   50   60   40   50

Mapping with Transparency
[Figure: schedules on nodes N1 and N2 and the bus, relative to the deadline: the worst-case fault scenario with transparency for the "optimal" mapping (P2 executed three times: P2/1, P2/2, P2/3), and the worst-case fault scenario with transparency and optimized mapping (P4 executed three times: P4/1, P4/2, P4/3). Application with processes P1 to P6 and messages m1 to m4; k = 2; a 10 ms overhead whose symbol was lost in extraction.]

Execution times:
        P1   P2   P3   P4   P5   P6
N1      30   40   50   60   40   50
N2      30   40   50   60   40   50

Design Optimization
• Hill-climbing mapping optimization heuristic
• Schedule length obtained by either:
  1. Conditional Scheduling (CS): slow
  2. Schedule Length Estimation (SE): fast
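A hill-climbing mapping heuristic driven by a fast length estimate can be sketched as follows. The Python code repeatedly moves one process to another node whenever the move lowers the estimated schedule length. The estimate used here (per-node load plus a shared recovery slack, ignoring precedence constraints and bus messages) is a stand-in invented for illustration; it is not the thesis's Schedule Length Estimation, and all names are hypothetical.

```python
def estimate_length(mapping, wcet, k, recovery_overhead):
    """Crude schedule length estimate for a process-to-node mapping.

    Assumed model: processes on a node run back to back, and each node
    reserves a slack for k faults hitting its longest process.
    """
    load, longest = {}, {}
    for proc, node in mapping.items():
        c = wcet[proc][node]
        load[node] = load.get(node, 0) + c
        longest[node] = max(longest.get(node, 0), c)
    return max(load[n] + k * (longest[n] + recovery_overhead) for n in load)


def hill_climb(mapping, wcet, nodes, k, recovery_overhead):
    """Move single processes between nodes while the estimate improves."""
    current = dict(mapping)
    best = estimate_length(current, wcet, k, recovery_overhead)
    improved = True
    while improved:
        improved = False
        for proc in sorted(current):
            for node in nodes:
                if node == current[proc] or node not in wcet[proc]:
                    continue
                candidate = dict(current)
                candidate[proc] = node
                cost = estimate_length(candidate, wcet, k, recovery_overhead)
                if cost < best:
                    current, best, improved = candidate, cost, True
    return current, best


# Hypothetical instance: six processes with equal times on both nodes.
times = {"P1": 30, "P2": 40, "P3": 50, "P4": 60, "P5": 40, "P6": 50}
wcet = {p: {"N1": t, "N2": t} for p, t in times.items()}
start = {p: "N1" for p in times}
initial_cost = estimate_length(start, wcet, 2, 10)
final_mapping, final_cost = hill_climb(start, wcet, ["N1", "N2"], 2, 10)
print(initial_cost, final_cost, final_mapping)
```

Because the estimate is evaluated once per candidate move, its speed dominates the run time of the heuristic, which is why a fast estimator matters more here than an exact scheduler.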

Experimental Results
How much faster is schedule length estimation (SE) than conditional scheduling (CS)?
Setup: 4 nodes, 15 applications, recovery overhead = 5 ms, 25% of processes and 50% of messages are frozen. Times in seconds.

                  k = 2 faults      k = 3 faults      k = 4 faults
                   SE      CS        SE      CS        SE       CS
20 processes     0.01    0.07      0.02    0.28      0.04     1.37
30 processes     0.13    0.39      0.19    2.93      0.26    31.50
40 processes     0.32    1.34      0.50   17.02      0.69   318.88

• Schedule length estimation (SE) is more than 400 times faster than conditional scheduling (CS): 0.69 s versus 318.88 s for 40 processes and k = 4

Experimental Results
How much is the improvement when transparency is taken into account?
Setup: 4 computation nodes, 15 applications, recovery overhead = 5 ms, 25% of processes and 50% of messages are frozen.
[Table: improvement in schedule length for 20, 30 and 40 processes and k = 2, 3, 4 faults. Values surviving extraction, in extracted order (cell positions are not recoverable): 32.89%, 35.62%, 28.88%, 32.20%, 30.56%, 31.68%, 30.58%, 31.68%, 28.11%, 28.03%.]
• The schedule length of fault-tolerant applications is 31.68% shorter on average if transparency is considered during mapping

Outline
• Motivation
• Background and limitations of previous work
• Thesis contributions:
  • Scheduling with fault tolerance requirements
  • Fault tolerance policy assignment
  • Checkpoint optimization
  • Trading-off transparency for performance
  • Mapping optimization with transparency
→ Conclusions and future work

Conclusions
• Scheduling with fault tolerance requirements
  • Two novel scheduling techniques
  • Handling customized transparency requirements, trading off transparency for performance
  • Fast scheduling alternative with low memory requirements for schedules

Conclusions
• Design optimization with fault tolerance
  • Policy assignment optimization strategy
  • Estimation-driven mapping optimization that can handle customized transparency requirements
  • Optimization of the number of checkpoints
⇒ Approaches and algorithms have been evaluated on a large number of synthetic applications and a real-life example (a vehicle cruise controller)

Design Optimization of Embedded Systems with Fault Tolerance is Essential

Future Work
• Fault-tree analysis
• Probabilistic fault model
• Soft real-time
• Some more…