TDDD 07 Realtime Systems Lecture 4 Distributed Systems

TDDD 07 Real-time Systems
Lecture 4: Distributed Systems
Simin Nadjm-Tehrani
Real-time Systems Laboratory, Department of Computer and Information Science, Linköping University
Undergraduate course on Real-time Systems, Linköping, 43 pages, Autumn 2009

Reading material
• Course book: Chapter 6.8 and 9.1 of Burns & Wellings (not language specific parts)
• E-books at LiU library:
– See web page links

This lecture
• Overview of some basic notions in timing and how distributed systems are affected by them
• Time, clock synchronisation, order of events, and logical clocks …

Applications
• Banking systems
• On-line access & electronic services
• Peer-to-Peer networks
• Distributed control
– Cars, Airplanes
• Sensor and ad hoc networks
– Buildings, Environment
• Grid computing

Common in all these?
Distributed model of computing:
• Multiple processes
• Disjoint address spaces
• Inter-process communication
• Collective goal

Synchrony vs. asynchrony
• Model for distributed computations depends on
– the rate at which computations are done at each node (process)
– the expected delay for transmission of messages
• Synchronous: There is a bound on message delays, and the rates of computation at different processes can be related
• Asynchronous: No bounds on message delays and no known relation among the processing speeds at different nodes

The choice
• Which model is harder to use?
• What does it mean to be hard or easy to use?
• How do implementations of real systems relate to the various models?

Implications
• Synchronous:
– Local clocks can be used to implement timeouts
– Lack of response from another node can be interpreted as detection of failure
• Asynchronous:
– In the absence of global (synchronised) time the only system-wide abstraction of time is order of events
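The timeout idea for the synchronous model can be sketched in a few lines. This is only an illustration: the function name `suspected_failed`, the node names and the readings are hypothetical, and the timeout is assumed to be derived from the known bound on message delay.

```python
def suspected_failed(last_heard, now, timeout):
    """Return the nodes whose most recent message is older than the timeout.

    last_heard maps a node name to the local-clock time (seconds) at which
    the last message from that node arrived. In a synchronous system the
    timeout can be chosen from the bound on message delay, so silence
    beyond it can be read as a failure.
    """
    return sorted(node for node, t in last_heard.items() if now - t > timeout)


# Hypothetical readings of the local clock, in seconds.
last_heard = {"A": 9.8, "B": 7.1, "C": 9.5}
print(suspected_failed(last_heard, now=10.0, timeout=1.0))  # ['B']
```

In an asynchronous model the same check would only yield a suspicion, since a slow message cannot be told apart from a failed node.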

Reasons for distribution
• Locality
– Engine control, brake system, gearbox control, airbag, …
• Organisation
– An extension of modularisation, and means for fault containment
• Load sharing
– Web services, search, parallelisation of heavy duty computations

Local control
Simplistic view:
• It is all about data: each local controller can perform its computations properly if the data it needs is accessed locally
• Design modules with high cohesion and low interaction!
• But when data needs to be shared, how do we ensure that nodes have fresh data and act in concert with other nodes?

Organisation and containment
Simplistic view:
• If module interactions are well-defined they do not affect each other even if things go wrong
• But fault tolerance is a much harder problem in distributed systems, and timing has a big role in it
More on this in the dependability lecture

Sharing the load
Simplistic view:
• Guarantee that a node can deal with what it accepts
• Spread the load so that tasks are (globally) serviced in a best effort manner
• But communication and cooperation overheads affect the global distributed service

Common issues
• Time: Sharing data may require knowledge of local time at the generating node, and comparison with the time at the consuming node
• State: Sometimes nodes need to agree on a common state/value in order to achieve a globally correct behaviour
• Faults in the system affect both

Major requirements
• In distributed systems:
– Interoperability
– Transparency
– Scalability
– Dependability
• This course focuses on dependability: fault tolerance and timing related issues

Brake-by-wire

Contributing to safety
• Redundancy: Having distributed sensors and actuators makes brake control more fault-tolerant
• Central decision or distributed decision?
• Central decision:
– what if one node gets the signal incorrectly or late?
• Distributed decision:
– what if one node is acting differently?

Time in Distributed Systems
• The role of time in distributed systems
• Logical time vs. physical time
• Clock synchronisation algorithms
• Vector clocks

Time matters…
• Inaccurate local clocks can be a problem if the result of computations at different nodes depends on time
– Calculation of trajectories: if a missile was at a given point at some time before a computation, where will it be after the computation?
– If the brake signal is issued separately in different wheels, will the car stop, and when?

Banking and finance
• The rate of interest is applied to funds
– at a given point in time
– to a balance that reflects related transactions prior to that point
• The gain/loss on sales of stocks is dependent on dynamic values of stocks at a given time (the time of sale/purchase)

Local vs. global clock
• Most physical (local) clocks are not perfectly accurate
• What is meant by accurate?
– Agreement with UTC
– Coordinated Universal Time (UTC) is based on International Atomic Time (IAT), with adjustments (leap seconds) that keep it in step with the variations in the rotation of the earth
• Local clocks need to be synchronised regularly
• An atomic global clock accurately measures IAT
• If local clocks are synchronised with an (accurate) global clock we may be able to use a synchronous model in the application

Clock synchronisation
Two types of algorithms:
• Internal synchronisation
– Tries to keep a set of clock values close to each other, with a maximum skew of δ
• External synchronisation
– Tries to keep the values of a set of clocks in agreement with an accurate clock, with a maximum skew of δ
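The two notions of skew can be stated as simple checks. The following is a minimal sketch; the function names and the clock readings are hypothetical and only illustrate the definitions.

```python
def max_skew(clocks):
    """Largest difference between any two clock values in the set."""
    return max(clocks) - min(clocks)


def internally_synchronised(clocks, delta):
    """True if all clocks are within delta of each other."""
    return max_skew(clocks) <= delta


def externally_synchronised(clocks, reference, delta):
    """True if every clock is within delta of the accurate reference clock."""
    return all(abs(c - reference) <= delta for c in clocks)


clocks = [100, 103, 104]   # hypothetical local clock readings
reference = 110            # hypothetical reading of the accurate clock

print(internally_synchronised(clocks, delta=5))             # True
print(externally_synchronised(clocks, reference, delta=5))  # False
```

The example shows that a set of clocks can agree well with each other while all of them are far from the accurate clock. In the other direction, clocks that are each within δ of the reference are within 2δ of each other.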

Lamport/Melliar-Smith Algorithm
• Internal synchronisation of n clocks
• Each clock reads the value of all other clocks at regular intervals
– If the value of some clock differs from the own clock by more than δ, that clock value is replaced by the own clock value
– The average of all clock values is computed
– Own clock value is updated to the average value
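One resynchronisation round at a single node, following the three steps on the slide, might look as follows. This is a sketch of the steps as listed here, not a complete implementation of the published algorithm; the name `resynchronise` and the sample readings are hypothetical, and reading delays are ignored.

```python
def resynchronise(own, readings_of_others, delta):
    """Return the new value of the own clock after one round.

    Any reading that differs from the own clock by more than delta is
    replaced by the own clock value; the own clock is then set to the
    average of all (possibly replaced) values, including its own.
    """
    values = [own]
    for value in readings_of_others:
        values.append(value if abs(value - own) <= delta else own)
    return sum(values) / len(values)


# The reading 250 is more than delta away and is replaced by 100.
print(resynchronise(100, [102, 98, 250], delta=5))  # 100.0

# All readings are within delta, so a plain average is taken.
print(resynchronise(100, [104, 102], delta=5))      # 102.0
```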

Does it work?
• After each synchronisation interval the clocks get closer to each other
• If the drifts are within δ, and the clocks are initially synchronised, then they are kept within δ from each other
• But what if some clocks give faulty values?

Faulty clocks
• If a clock drifts by more than δ its value is eliminated
– does not “harm” other clocks
• What if it drifts by exactly δ?
– check it as an exercise!
• What is the worst case?

A two-faced faulty clock
[Figure: clocks i, j and k; labelled values c, c+d, c-d and c-2d]
k will be considered as correct by i and j…

Bound on the faulty clocks
• To guarantee that the set of clocks stays within δ we need an assumption on the number of faulty clocks
• For t faulty clocks the algorithm works if the number of clocks n > 3t
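A small numeric experiment illustrates why the bound matters. It uses n = 3 clocks with t = 1 faulty clock, which violates n > 3t. The assignment of the values c, c+d, c-d and c-2d to the clocks is one possible reading of the two-faced clock figure, and should be treated as an assumption; `resynchronise` is a hypothetical helper implementing the averaging round described earlier.

```python
def resynchronise(own, readings_of_others, delta):
    """One averaging round: out-of-range readings are replaced by own value."""
    values = [own]
    for value in readings_of_others:
        values.append(value if abs(value - own) <= delta else own)
    return sum(values) / len(values)


c, d = 100, 10                 # hypothetical clock value and delta
clock_i, clock_j = c, c - d    # two correct clocks, exactly d apart

# Two-faced k: shows c + d to i and c - 2d to j, both within d of the reader.
new_i = resynchronise(clock_i, [clock_j, c + d], delta=d)
new_j = resynchronise(clock_j, [clock_i, c - 2 * d], delta=d)
print(new_i, new_j)            # 100.0 90.0 (no convergence)

# A consistent k showing the same value (95) to both readers.
k = c - d // 2
ok_i = resynchronise(clock_i, [clock_j, k], delta=d)
ok_j = resynchronise(clock_j, [clock_i, k], delta=d)
print(ok_i, ok_j)              # 95.0 95.0 (clocks converge)
```

With the two-faced clock, i and j accept both readings yet end the round as far apart as they started, so any further drift pushes them beyond δ.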

Logical time
• Sometimes order will do
• In the absence of exact synchronisation we may use order that is intrinsic in an application
[Figure: Client A and Client B exchanging messages with a Server: Req. A, Rep. A, Req. B]

Logical clocks
• Based on event counts at each node
• May reflect causality
• Sending a message always precedes receiving it
• Messages sent in a sequence by one node are (potentially) causally related to each other
– I do not pay for an item if I do not first check the item’s availability

Happened before (→)
• Assume each process has a monotonically increasing physical clock
• Rule 1: if a and b are events at the same process and the time for event a is before the time for event b then a → b
• Rule 2: if a denotes sending a message and b denotes receiving the same message then a → b
• Rule 3: → is transitive

A partial order
• Any events that are not in the “happened before” relation are treated as concurrent
• Logical clock: An event counter that respects the “happened before” ordering
• Sometimes referred to as Lamport’s clocks (author of the first paper on this topic: 1978)

What do we know here?
[Figure: space-time diagram of processes P, Q and R with events a, b, c, d, e, f, g and h]

Implementing a logical clock
• LC1: Each time a local event takes place increment LC by 1
• LC2: Each time a message m is sent the LC value at the sender is appended to the message (m_LC)
• LC3: Each time a message m is received set LC to max(LC, m_LC)+1
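The three rules translate almost directly into code. The sketch below makes one assumption that the slide leaves open: sending is itself counted as a local event, so the counter is incremented (LC1) before its value is attached to the message (LC2). The class and method names are hypothetical.

```python
class LogicalClock:
    """A Lamport-style event counter for one process."""

    def __init__(self):
        self.value = 0

    def local_event(self):
        # LC1: every local event increments the counter.
        self.value += 1
        return self.value

    def send(self):
        # Assumption: a send counts as a local event (LC1).
        # LC2: the returned value is the timestamp m_LC carried by the message.
        self.value += 1
        return self.value

    def receive(self, m_lc):
        # LC3: jump past both the local counter and the message timestamp.
        self.value = max(self.value, m_lc) + 1
        return self.value


p, q = LogicalClock(), LogicalClock()
p.local_event()          # LC = 1 at P
m_lc = p.send()          # LC = 2 at P, message carries 2
q.local_event()          # LC = 1 at Q
print(q.receive(m_lc))   # 3, since max(1, 2) + 1 = 3
```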

Exercise
• Calculate LC for all events in the given example

What does LC tell us?
• a → b implies LC(a) < LC(b)
• Note that: LC(d) < LC(h) does not imply d → h

Is concurrency transitive?
• e is concurrent with g
• g is concurrent with f
• but e is not concurrent with f!
• Vector clocks bring more…

Vector clocks (VC)
• Every node maintains a vector of counted events (one entry for each node, including itself)
• VC for event e, VC(e) = [c1, …, cn], shows the perceived count of events at nodes 1, …, n
• VC(e)[k] denotes the entry for node k

Example revisited
[Figure: space-time diagram of processes P, Q and R with events a, b, c, d, e, f, g and h]

Implementation of VC
• Rule 1: For each local event increment own entry
• Rule 2: When sending message m, append to m the VC(send(m)) as a timestamp T
• Rule 3: When receiving a message at node i,
– increment own entry: VC[i] := VC[i]+1
– For every entry j in the VC: Set the entry to max(T[j], VC[j])
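The rules can be sketched as a small class. The class and method names are hypothetical, nodes are numbered from 0, and a send is assumed to count as a local event. The message pattern in the usage part (Q to P, P to R, R to Q) is also an assumption, chosen because it yields vectors such as [1, 1, 0], [2, 1, 3] and [2, 2, 4] for three nodes P, Q and R.

```python
class VectorClock:
    """Vector clock kept by one node in a system of n nodes."""

    def __init__(self, node, n):
        self.node = node          # index of the owning node (0-based)
        self.v = [0] * n

    def local_event(self):
        # Rule 1: increment own entry.
        self.v[self.node] += 1
        return list(self.v)

    def send(self):
        # Rule 2 (assuming a send is a local event): the returned copy
        # is the timestamp T attached to the message.
        return self.local_event()

    def receive(self, t):
        # Rule 3: increment own entry, then take the entry-wise maximum.
        self.v[self.node] += 1
        self.v = [max(tj, vj) for tj, vj in zip(t, self.v)]
        return list(self.v)


P, Q, R = VectorClock(0, 3), VectorClock(1, 3), VectorClock(2, 3)
t1 = Q.send()            # [0, 1, 0]
print(P.receive(t1))     # [1, 1, 0]
t2 = P.send()            # [2, 1, 0]
R.local_event()          # [0, 0, 1]
R.local_event()          # [0, 0, 2]
print(R.receive(t2))     # [2, 1, 3]
t3 = R.send()            # [2, 1, 4]
print(Q.receive(t3))     # [2, 2, 4]
```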

Example
[Figure: the example annotated with vector timestamps [0, 0, 0], [1, 1, 0], [2, 1, 0], [0, 1, 0], [2, 2, 4], [0, 0, 1], [0, 0, 2], [2, 1, 3], [2, 1, 4]]

Concurrent events in VC
• Relation < on vector clocks defined by: VC(a) < VC(b) iff
– For all i: VC(a)[i] ≤ VC(b)[i]
– For some i: VC(a)[i] < VC(b)[i]
• An event a precedes another event b if VC(a) < VC(b)
• If neither VC(a) < VC(b) nor VC(b) < VC(a) then a and b are concurrent
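The comparison can be written as two short functions. The names `vc_less` and `concurrent` are hypothetical, and the timestamps in the usage part are taken from the vector clock example; which event each vector belongs to is not assumed here.

```python
def vc_less(a, b):
    """VC(a) < VC(b): every entry is <=, and at least one entry is strictly <."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def concurrent(a, b):
    """Neither timestamp is smaller than the other."""
    return not vc_less(a, b) and not vc_less(b, a)


x, y, z = [0, 1, 0], [0, 0, 1], [1, 1, 0]

print(concurrent(x, y))   # True
print(concurrent(y, z))   # True
print(vc_less(x, z))      # True: x precedes z, so x and z are not concurrent
```

The three timestamps also show that concurrency is not transitive: x is concurrent with y, y is concurrent with z, yet x precedes z.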

Pros and cons
• Vector clocks are a simple means of capturing “known” precedence
• VC(a) < VC(b) implies a → b
• For large systems we have resource issues (bandwidth wasted), and maintainability issues

• Vector clocks help to synchronise at event level
– Consistent snapshots
• But reasoning about response times and fault tolerance needs quantitative bounds

Distribution & Fault tolerance
– Distribution introduces new complications
• no global clock
• richer failure models
+ Replication and group mechanisms
• transparency in treatment of faults
We will come back to faults in lecture 6, and see that synchronisation is needed for tolerating some faults