The Θ-Model
Ulrich Schmid, Josef Widder, Martin Hutle, Daniel Albeseder
Vienna University of Technology, Embedded Computing Systems Group
http://www.ecs.tuwien.ac.at
Gérard Le Lann, Jean-François Hermant
INRIA Rocquencourt, Project Novaltis
http://www.inria.fr
3/6/2021
Theta-Model (Version 1.3)

Motivation

Timed Algorithms
• Most FT algorithms for distributed RTS have explicit time values (unit „seconds“) in their code / variables
• Toy example: Local real-time clock for timing out a crashed process

    msg_pong = do_roundtrip(msg_ping, p):
        send msg_ping to p
        TIMEOUT := C(t) + 2τ+    /* max. end-to-end delay τ+ (sec) */
        while C(t) < TIMEOUT do nothing
        if msg_pong did not arrive then msg_pong := NIL
        return msg_pong
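The toy example can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `send_ping`, `poll_pong` and `FakeClock` are hypothetical helpers that do not appear on the slides, and the busy-wait mirrors the "do nothing" loop of the pseudocode rather than any recommended practice.

```python
import time

NIL = None  # stands for the NIL result used on the slide


def do_roundtrip_timed(send_ping, poll_pong, tau_plus, clock=time.monotonic):
    """Timed round-trip: wait 2 * tau_plus seconds of local clock time.

    send_ping and poll_pong are hypothetical callables supplied by the caller;
    poll_pong returns the received pong message, or NIL if none has arrived.
    """
    send_ping()
    timeout = clock() + 2 * tau_plus   # explicit real-time value in seconds
    while clock() < timeout:
        pass                           # busy-wait, as in the slide's pseudocode
    return poll_pong()


class FakeClock:
    """Deterministic stand-in for C(t): advances 1 ms on every reading."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        self.now += 0.001
        return self.now
```

The point to notice is that `tau_plus` is a value in seconds baked into the algorithm; if real delays exceed it, the function returns NIL for a process that is merely slow.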

Implications?
• Safety properties like consistency of replicated data may depend upon non-NIL operation of do_roundtrip
• Usual assumption: Real-time systems must always meet their timeliness properties
  ⇒ Only possible if all end-to-end delays δ ≤ τ+
  ⇒ Safety properties also guaranteed in this case
• BUT: Bounds like τ+ that always hold are very difficult to determine for real systems
• Fail-operational systems might be allowed to sometimes lose timeliness, but never lose consistency

Why is determining τ+ difficult?
• Queuing phenomena:
  – Simultaneous messages from different peers (CPU)
  – Multiple processes (CPU)
  – Multiple messages (Link)
• End-to-end delays hence depend upon
  – message & computational complexity of algorithms
  – interaction („blocking factors“)
  – load conditions
  – scheduling disciplines

Importance of Scheduling?
• τ+ can be huge in real systems since all messages [including application-level] must be taken into account
  – Maximum determines synchronous round duration: too conservative for most messages
  – Escape: Appropriate scheduling
• Fast Failure Detectors by Hermant & Le Lann [HLL 02]
  – Use Head-of-the-Line Scheduling for FD-level processes and messages
  – Only blocking factors due to non-preemptible resources can lead to priority inversion phenomena on FD-level
  ⇒ τ+ relevant for failure detection latency reduced by orders of magnitude

But still …
Hermant & Le Lann [HLL 02]: τ+ = γ(n) with
[Formula for γ(n) not preserved in this transcript]
(Note that woutQ, woutq and winq are the problematic parts here)
• Do you trust a real system to always obey this, during the whole mission time?
• Do you really want your safety and liveness properties to depend on this?

Alternatives?
• Are there ways to guarantee logical safety & liveness properties independently of the timing properties of the underlying system?
  YES: Asynchronous algorithms (time-free, message-driven)
• Are there suitable time-free computational models and algorithms?
  YES: Θ-Model

Roadmap of our Presentation
• Overview of Computational Models
• The Θ-Model
• First Experimental Results
• Applications

Overview of Computational Models

The FLP Asynchronous Model (I)
• Fischer, Lynch & Paterson [FLP 85]
• System of n processes communicating via reliable point-to-point network
  – Every message sent is eventually delivered
  – No bounded-drift clocks available
• Computational step times are non-negative, finite but unbounded (i.e., can exceed any a priori given bound)
• Message transmission delays are non-negative, finite but unbounded

The FLP Asynchronous Model (II)
• FLP model has no timing assumption at all ⇒ cannot be violated at runtime
• BUT: In the FLP-Model, it is impossible to distinguish a slow from a crashed process
  ⇒ Important DC problems like consensus impossible to solve in the FLP-Model in the presence of failures
  ⇒ For solvability, some property/properties must be added to the pure FLP model.

The FLP Asynchronous Model (III)
• Resulting spectrum of models: FLP to partially synchronous
• Clearly: The stronger the added property, the lower the assumption coverage in real systems
• Usually: Add explicit timeliness properties to the FLP-Model
• Sometimes: Add implicit timeliness properties to the FLP-Model (time-free models)

(Close to) Synchronous Models
• Synchronous model allows simulation of lock-step rounds
  – Transmission delay bound Δ
  – Computing step time bound σ
  – Bounded-drift local clocks available
• Timed Asynchronous Model by Cristian & Fetzer [CF 99]
  – Transmission delay bound Δ
  – Computing step time bound σ
  – Bounded-drift local clocks available
  – BUT: Fail awareness allows bounds Δ and σ to be violated arbitrarily often ⇒ fail-safe behavior

Partially Synchronous Models (I)
• Dwork, Lynch & Stockmeyer [DLS 88], Ponzio & Strong [PS 92], Attiya, Dwork, Lynch & Stockmeyer [ADLS 94]
  – Transmission delay bound Δ
  – Bounded ratio of max. over min. computing step times Φ
  – Bounds unknown / known but hold from unknown time GST on
• Every process can locally time-out messages:
  – [PS 92, ADLS 94]: Semi-synchronous model assumes availability of bounded-drift local clocks
  – [DLS 88]: Computing steps of fastest processor are used as real-time units [= unit of Δ!] ⇒ local clock with bounded rate [1/Φ, 1] implementable via spin-loop

Partially Synchronous Models (II)
• Archimedean model by Vitányi [Vit 85]
  – Bounded ratio s ≥ u/m on min. computing step time (m) and max. computing step time + max. transmission delay (u)
  – s is dimensionless
  – Every process can again locally time-out messages [via spinning for s steps]
• Finite Average Round-Trip-Time Model by Fetzer & Schmid [FS 04]
  – Unknown lower bound for computing step time
  – Stubborn links with unknown average round-trip time bound
  – Every process can implement „weak clock“ via spin-loop

FLP-Model with Failure Detectors
• Replace explicit timeliness properties by unreliable failure detectors
• FDs are local oracles based upon a list of suspected processes
  – Completeness: Every crashed process is eventually suspected
  – Accuracy: No correct process is suspected
• FLP-Model + FDs allow most important distributed computing problems to be solved
• BUT: Implementing FDs in a real system necessarily requires a system model stronger than FLP ⇒ back at initial problem

The Θ-Model

Time-Free Message-Timeout in ParSync?
• Implementation of do_roundtrip(msg_ping, p) using a spin-loop in the parsync models of [DLS 88] or [Vit 85]:

    send msg_ping to p
    for i=1 to x do no-op    /* x=f(Δ, Φ) resp. x=f(s) is dimensionless! */
    if msg_pong did not arrive then msg_pong := NIL
    return msg_pong

• The algorithm is
  – time-free since neither code nor variables contain real-time values (unit „seconds“)!
  – not message-driven

But …
• There is the ([DLS 88]: hidden, [Vit 85]: explicit) assumption that all timing values/bounds are multiples of the min. computing step time (m)
• The algorithm would be time-free only if m could vary arbitrarily
• Since there is no physically evident correlation between transmission delay and computing step time, however,
  – m cannot vary arbitrarily without violating the physical (real-time) transmission delay bound [since Δ resp. s are fixed]
  – Assuming fixed Δ resp. s hence makes sense for essentially constant m only
• Not time-free in reality since m acts as the unit of real-time!

Still: Can we make this idea work?
• The problem with the previous algorithm is that computing step times and transmission delays are uncorrelated
• Key idea: Replace unit time „fastest computing step“ of [DLS 88], [Vit 85] by „fastest end-to-end delay“
  ⇒ Just assume that, during any round-trip, there may not be more than Θ other successive round-trips (anywhere in the system)

Time-free implementation of do_roundtrip(.)

    send msg_ping to p
    for i=1 to Θ do    /* Θ is dimensionless! */
    begin              /* do additional round-trips for waiting */
        send delay_ping(i) to process q
        wait for delay_pong(i) from process q
    end
    if msg_pong did not arrive then msg_pong := NIL
    return msg_pong

The algorithm is
  – time-free since Θ is dimensionless
  – fully message-driven since all events are triggered by message receptions only
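A minimal Python sketch of the same idea is given below, assuming an integer Θ and three hypothetical callables (`send`, `wait_for`, `poll_pong`) that stand in for whatever messaging layer is actually used; none of these names come from the slides.

```python
NIL = None  # stands for the NIL result used on the slide


def do_roundtrip_time_free(theta, send, wait_for, poll_pong):
    """Time-free round-trip: wait by performing theta delay round-trips.

    theta     -- dimensionless bound Θ (assumed to be an integer here)
    send      -- hypothetical callable send(message, destination)
    wait_for  -- hypothetical callable that blocks until `message`
                 has been received from `source`
    poll_pong -- hypothetical callable returning the pong, or NIL
    """
    send("msg_ping", "p")
    for i in range(1, theta + 1):
        # Each delay round-trip is driven purely by message reception.
        send(("delay_ping", i), "q")
        wait_for(("delay_pong", i), "q")
    return poll_pong()
```

No clock is read anywhere: the waiting time is whatever the Θ delay round-trips happen to take in the running system.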

Time-free implementation of do_roundtrip(.)
[Figure: processes q, r, p with Θ=5; delay round-trips 1 to 5 alongside msg_ping and msg_pong; total duration D]
• Timing behavior solely emerges from the underlying system [D adapts automatically to actual speed]
• Consider execution in a synchronous system:
  – End-to-end delays δ satisfy τ− ≤ δ ≤ τ+ with τ+ / τ− ≤ Θ = 5 ⇒ Termination within 10 τ− ≤ D ≤ 10 τ+
  – τ+ = 100 µs ⇒ D ≤ 1 ms
  – τ+ = 1 s ⇒ D ≤ 10 s
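The numbers on this slide can be reproduced with a small calculation, assuming that D is the time taken by Θ successive delay round-trips and that each round-trip consists of two one-way delays between τ− and τ+. The function name and the use of integer microseconds are illustrative choices, not part of the slides.

```python
def roundtrip_duration_bounds(theta, tau_minus, tau_plus):
    """Return (lower, upper) bounds on D for theta delay round-trips.

    Each round-trip is two one-way messages, so it takes between
    2 * tau_minus and 2 * tau_plus; theta of them run back to back.
    Units are whatever tau_minus and tau_plus are given in.
    """
    return 2 * theta * tau_minus, 2 * theta * tau_plus
```

For example, with Θ = 5 and delays in microseconds, `roundtrip_duration_bounds(5, 20, 100)` gives an upper bound of 1000 µs (1 ms), and the same call with τ+ = 1 s gives 10 s.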

Performance?
• Is doing continuous successive round-trips for delay purposes prohibitively expensive? NO!
• (a) Reasonably large delay * bandwidth product:
  – τ+ = 1 ms with 1 Mbit/sec peer-to-peer bandwidth allows sending 1000 bits per message
  – do_roundtrip(.) needs only a few bits of message data
  ⇒ Only a few % overhead for continuous round-trips!
• (b) Small delay * bandwidth product:
  – Use timer to separate multiple instances of do_roundtrip(.)
  – No bounded drift timer required here ⇒ Implementable without hardware clock by counting some local events
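The delay * bandwidth estimate in case (a) can be checked with a short sketch. The 32-bit message size used in the example is a made-up figure for illustration only; the slides merely say "a few bit".

```python
def bits_per_delay(delay_us, bandwidth_bit_per_s):
    """Bits that fit on the link during one end-to-end delay (delay * bandwidth)."""
    return delay_us * bandwidth_bit_per_s // 1_000_000


def overhead_fraction(message_bits, delay_us, bandwidth_bit_per_s):
    """Fraction of the delay * bandwidth budget used by one delay message."""
    return message_bits / bits_per_delay(delay_us, bandwidth_bit_per_s)
```

With τ+ = 1 ms (1000 µs) and 1 Mbit/s, `bits_per_delay(1000, 1_000_000)` yields the 1000 bits quoted on the slide, and a hypothetical 32-bit delay message would use roughly 3% of that budget.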

The Θ-Model (Simple Version)
• FLP-Model +
• End-to-end delays δ of all messages in transit at t
  – minimum τ−(t)
  – maximum τ+(t)
• τ+(t) and τ−(t) may vary arbitrarily with time, but
• ratio Θ(t) = τ+(t)/τ−(t) must remain bounded by some [known or even unknown] Θ for every time t
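As a rough illustration of this definition, the sketch below computes Θ(t) from the delays of the messages in transit at one instant and checks a whole trace against a bound. Representing a trace as a list of per-instant delay lists is an assumption made for the example, as are the function names.

```python
def theta_at(delays_in_transit):
    """Θ(t): ratio of the largest to the smallest delay in transit at time t."""
    tau_minus = min(delays_in_transit)
    tau_plus = max(delays_in_transit)
    if tau_minus <= 0:
        raise ValueError("the simple Θ-Model needs strictly positive delays")
    return tau_plus / tau_minus


def satisfies_simple_theta_model(snapshots, theta_bound):
    """True if Θ(t) <= theta_bound at every sampled instant.

    snapshots is a list of delay lists, one list per sampled time t.
    """
    return all(theta_at(delays) <= theta_bound for delays in snapshots)
```

Note that the absolute delays may differ by orders of magnitude between snapshots; only the ratio within each snapshot is constrained.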

Key Question
• Can we indeed expect a (positive) correlation between τ+(t) and τ−(t) in a real system?
• Shared channel-type networks [Deterministic Ethernet]: Theoretical analysis by Hermant & Widder [HW 04] has shown that Θ close to 1 can be achieved
• Fully connected systems: First experimental evaluation of a simple Θ clock synchronization algorithm by Albeseder [Alb 04] confirms correlation

Reason for such a correlation?
• Restriction to broadcast communication (shared channel or multiple point-to-point sends in a fully connected network)
• (Part of) the messages populating the queues from p → q are also sure/likely to populate queues from p → r, and even from s → r
[Figure: senders p and s, receivers q and r, with CPU and channel queues on links p → q, p → r, s → r, q → x, r → y; example delays δpq = 10, δpr = 7 over time t]

Correlation Coverage Expansion
• Given some bounds τ+ and τ− assumed during system design (also used in synchronous systems), compute Θ = τ+ / τ−
• Unanticipated overload: τ+(t) > τ+ ⇒ Synchronous system out of spec
• If τ+(t) ≤ Θτ−(t), however ⇒ Θ-system still OK
[Figure: end-to-end delays δ over time t]
Note:
• τ+(t) = τ+ + α(t)
• τ−(t) = τ− + α(t)/Θ suffices for Θ to hold
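The note can be verified in one line, assuming the second relation reads τ−(t) = τ− + α(t)/Θ (the superscript appears to have been lost in the transcript) and using Θ = τ+/τ− from the first bullet:

```latex
\[
\frac{\tau^+(t)}{\tau^-(t)}
  = \frac{\tau^+ + \alpha(t)}{\tau^- + \alpha(t)/\Theta}
  = \frac{\Theta\,\bigl(\tau^+ + \alpha(t)\bigr)}{\Theta\,\tau^- + \alpha(t)}
  = \frac{\Theta\,\bigl(\tau^+ + \alpha(t)\bigr)}{\tau^+ + \alpha(t)}
  = \Theta ,
\qquad \text{since } \Theta\,\tau^- = \tau^+ .
\]
```

So if an overload adds α(t) to the slowest delay, it is enough for the fastest delay to grow by α(t)/Θ for the ratio to stay at Θ, even though τ+(t) exceeds the design bound τ+.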

Still: Shortcomings of the Simple Θ-Model
• The predicted correlation need not exist for every fast message but only for some
  – Some very fast messages [even τ− = 0] may be in transit somewhere in the system even during a slow message
  – Correlation and hence coverage expansion does not exist in such cases
• Need a more relaxed definition of the relation between slow and fast messages
  – All that is actually needed is to constrain the number of fast messages during a slow one
  – No need for a correlation at every point in time t

The Θ-Model (Generalized Version)
• Consider chain of k ≥ 1 successive messages
• Longest chain of „covered“ causal messages ≤ kΘ
[Figure: Θ = 4.5; k=2 successive (slow) messages with delays τ+(t1), τ+(t2) cover ≤ kΘ = 9 causally dependent (fast) messages]
Advantage: Messages with τ−(t) = 0 allowed here!
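Read as a counting condition, the generalized model can be sketched as the check below. This is one possible reading of the slide, not a formal definition: it assumes the lengths of the slow chain and of the longest covered fast chain have already been extracted from an execution, and the function name is hypothetical.

```python
def chain_is_admissible(k_slow, fast_chain_length, theta):
    """True if a chain of k_slow successive slow messages covers at most
    k_slow * theta causally dependent fast messages."""
    if k_slow < 1:
        raise ValueError("the model considers chains of k >= 1 messages")
    return fast_chain_length <= k_slow * theta
```

With the figure's values, `chain_is_admissible(2, 9, 4.5)` holds while a tenth fast message would violate the bound. Individual delays never enter the check, which is why zero-delay messages are unproblematic here.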

Partial Order of Partially Synchronous Models
• DLS … [DLS 88] with a priori known Δ, Φ
• Θ … Θ-Model with a priori known Θ
• DLSu … [DLS 88] with a priori unknown Δ, Φ
• Θu … Θ-Model with a priori unknown Θ
• FLP … FLP-Model
[Figure: partial-order diagram with nodes FLP, Θu, Θ, DLSu, DLS]

Existing Θ-Algorithms
• Perfect failure detectors [Schmid and Le Lann 2003]
• Clock synchronization (+ system booting) [Widder 2003], [Widder and Schmid 2003]
• Eventually perfect failure detectors / system booting [Widder, Le Lann and Schmid 2003]
• Fast failure detectors atop of Deterministic Ethernet [Widder and Hermant 2004]
• Self-stabilizing failure detectors & impossibility results [Hutle and Widder 2004]
• Synchronizer, SDD problem, atomic commitment, etc. [Widder’s Ph.D. 2004]

Thanks!
http://www.ecs.tuwien.ac.at/~widder/Theta/

First Experimental Results

Remember Key Question:
• Can we indeed expect a (positive) correlation between τ+(t) and τ−(t) in a real system?
• Alternatively: Let Θ = τ+ / τ− with
  – τ− = min_t τ−(t) being the total minimum for all t
  – τ+ = max_t τ+(t) being the total maximum for all t
  Is it the case that Θ(t) < Θ?
• How often, and how much gain Θ/Θ(t)?

Evaluation Setup
• Master's thesis by Daniel Albeseder [Alb 04]
• Pentium 4 workstations (2.4 GHz, FSB 533)
• Fully switched Fast-Ethernet over two Cisco Catalyst 2950 switches (connected over fiber Gigabit-Ethernet backbone)
• Red Hat Linux 7.2 with 2.4.20 kernel, patched with High-Resolution-Timers and Kernel-Preemption

Evaluation Parameter Settings
• n = 4 processors with at most f = 1 faulty ones
• Head-of-line process scheduling (Linux RT Priorities)
• High message priority (low latency bit in TOS-byte), but no head-of-the-line message scheduling
• Simulated broadcast (= multiple point-to-point sends)
• Fixed message length: 36 bytes
• Inter-round delay: 1 ms
• Duration of evaluation run: 10 … 100 s range

System Design
[Figure: ctrlpsa and evalpsa nodes connected via fully switched Fast-Ethernet]

Control Communication
[Figure: control messages between ctrlpsa and evalpsa over time t (boot, change parameters, start, stop, done, …); evalpsa actions: init, run algorithm, store; phases: booting, running, collecting]

Evalpsa Structure

Data Analysis
• Consider only clock synchronization messages
• τ−(t), τ+(t), Θ(t) etc. only evaluated at times t where some rule of the algorithm fires („effective Θ“)
• Approximation of one-way delays via round-trip delays for simplicity (i.e., we assume that both messages of a round-trip have the same delay)
• The clock of one designated processor is used as global timebase, all timestamps are a-posteriori adjusted to this global timebase
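The one-way delay approximation amounts to halving the measured round-trip time. A tiny sketch, with hypothetical parameter names and both timestamps assumed to be taken on the same local clock:

```python
def one_way_delay(t_ping_sent, t_pong_received):
    """Approximate a one-way delay as half the measured round-trip time.

    Assumes, as stated in the data analysis, that both messages of the
    round-trip experienced the same delay.
    """
    return (t_pong_received - t_ping_sent) / 2
```

Because both timestamps come from the sender's clock, this estimate does not depend on how well the clocks of different processors agree.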

Glossary of variables
• τ−(t), τ+(t): Min. and max. delay of all messages in transit at some time t
• Θ(t) = τ+(t)/τ−(t)
• Θmax = max_t Θ(t)
• τ−, τ+: Min. and max. delay of all messages in transit at all times during the evaluation run
• Θ = τ+/τ−
• Gain = Θ/Θmax
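To make the glossary concrete, the sketch below computes the gain from a recorded trace. It assumes the gain is the overall ratio τ+/τ− divided by the largest instantaneous ratio max_t Θ(t); the two Θ symbols are not distinguishable in this transcript, so that reading is an interpretation. The trace layout (one list of delays per evaluated time t) and the function name are likewise assumptions.

```python
def gain(snapshots):
    """Overall delay ratio divided by the largest instantaneous ratio.

    snapshots: list of lists, each holding the delays of all messages
    in transit at one evaluated time t (all delays strictly positive).
    """
    if any(min(s) <= 0 for s in snapshots):
        raise ValueError("delays must be strictly positive")
    theta_max = max(max(s) / min(s) for s in snapshots)   # max over t of Θ(t)
    tau_minus = min(min(s) for s in snapshots)            # total minimum delay
    tau_plus = max(max(s) for s in snapshots)             # total maximum delay
    return (tau_plus / tau_minus) / theta_max
```

A gain of 1 means the instantaneous ratio is as bad as the overall one somewhere in the run; larger values indicate that slow and fast messages tend to occur together.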

[Figure: Θ]

[Figure: Θ/Θ (gain)]

Continuously Increasing Network Load
[Figure: Θ(t)]

Conclusions from First Experiments
• There is definitely a positive correlation between τ+(t) and τ−(t) in the evaluation setting, even
  – with significant gain
  – always achieved
• Although we cannot infer from this that there is always a correlation between τ+(t) and τ−(t) here,
  – it is very likely that there are scenarios where some assumed Θ holds despite the fact that some assumed τ+ is violated
  – the Θ-model is very likely to have higher coverage than a synchronous solution
• More thorough experimental and theoretical evaluation [of more suitable systems] will follow

Applications

„Exotic“ Application: VLSI Chips
• Interconnect delays dominate over switching delays
• Shrinking feature size
• Increasing complexity
• Increasing clock speed
• Signals cannot traverse entire chip within a single clock cycle
• Increasing susceptibility to transient failures (particles, cross-talk, …)
• High power-consumption

Clock Generation in Systems-on-a-Chip
• Illusion of chip-wide synchrony increasingly difficult to maintain
• Extend every functional unit with simple local CS algorithm
[Figure: functional units fu1, fu2, fu3 on a data bus; clock tree with a single clock versus distributed clock with CS algs and CS network]
• CS algorithms communicate via dedicated clocking signals
• CS algs guarantee
  – |Ci(t) – Cj(t)| ≤ π(Θ)
  – Next tick happens every max delay
  – Data sent by fui by tick k available at fuj by tick k+Ξ(Θ) at latest
• Division by Ξ provides global macro tick abstraction
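The macro tick abstraction can be illustrated as an integer division of the local tick counter. This is a sketch under the assumption that "division by Ξ" means exactly that; the function name is hypothetical and Ξ is taken to be a positive integer.

```python
def macro_tick(tick, xi):
    """Macro tick index obtained by dividing the local tick counter by Ξ."""
    if xi < 1:
        raise ValueError("Ξ must be a positive integer")
    return tick // xi
```

Under this reading, data sent at tick k is available by tick k + Ξ, which always lies exactly one macro tick later, so a sender's macro tick M maps to availability no later than macro tick M + 1.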

Benefits
• CS algs simulate global clock
  – Synchronous design abstraction maintained
  – Self-clocking feature: Chip runs as fast as routing delays allow
  – Θ is estimated by place and route tools
  – Explicit dependence upon routing only via Θ [required for determining macro-tick division factor Ξ(Θ) only]
• Distributed clocks tolerate transient failures
  – Need n > 6 fl FUs for tolerating up to fl transient failures (affecting clocking signals) per FU in every tick
  – Additional (data) fault-tolerance possible via replicated FUs employing synchronous Byzantine agreement algorithms etc.
• [WS 03]: CS algs also work for non-simultaneous reset

Thanks!
http://www.ecs.tuwien.ac.at/~widder/Theta/