Alternative system models Tudor David
Outline
§ The multi-core as a distributed system
§ Case study: agreement
§ The distributed system as a multicore
The system model
• Concurrency: several communicating processes executing at the same time.
• Implicit communication: shared memory.
  – Resources – shared between processes;
  – Communication – implicit, through shared resources;
  – Synchronization – locks, condition variables, non-blocking algorithms, etc.
• Explicit communication: message passing.
  – Resources – partitioned between processes;
  – Communication – explicit message channels;
  – Synchronization – message channels.
Whatever can be expressed using shared memory can also be expressed using message passing (and vice versa).
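To make the two models concrete, here is a minimal Go sketch of the same shared counter written both ways: implicit communication through shared memory plus a lock, and explicit communication through a channel to an owning goroutine. The code is illustrative only, not from the lecture.

```go
// Sketch: the same counter under both system models.
package main

import (
	"fmt"
	"sync"
)

// Shared-memory style: the resource is shared, synchronization is a lock.
func sharedMemoryCounter() int {
	var (
		mu    sync.Mutex
		count int
		wg    sync.WaitGroup
	)
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				mu.Lock()
				count++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return count
}

// Message-passing style: the resource is partitioned (owned by one
// goroutine); all access goes through an explicit channel.
func messagePassingCounter() int {
	inc := make(chan struct{})
	done := make(chan int)
	go func() { // owner: the only goroutine that touches count
		count := 0
		for range inc {
			count++
		}
		done <- count
	}()
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				inc <- struct{}{} // explicit message instead of a shared write
			}
		}()
	}
	wg.Wait()
	close(inc)
	return <-done
}

func main() {
	fmt.Println(sharedMemoryCounter(), messagePassingCounter()) // 4000 4000
}
```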
So far – shared memory view
• a set of registers that any process can read or write;
• communication – implicit, through these registers;
• problems:
  – concurrency bugs – very common;
  – scalability – not great when contention is high.
[Figure: throughput (Mops/s, 0–8) of a lock-based queue vs. number of cores (2–80), illustrating poor scalability under contention.]
Alternative model: message passing
• more verbose – but – can better control how information is sent in the multi-core.
[Diagram: two processes P1 and P2 with partitioned resources R1–R4, communicating over explicit channels.]
• how do we get message passing on multicores?
  – dedicated hardware message-passing channels (e.g., Tilera);
  – more common – use dedicated cache lines as message channels (a sketch follows).
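As a rough sketch of the cache-line-channel idea (an assumption-laden simplification, not Tilera's interface, and assuming a 64-byte line), a single-producer/single-consumer mailbox can be built from one padded slot plus an atomic ready flag:

```go
// Sketch: one padded cache line used as a message channel between two cores.
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// mailbox occupies a single (assumed) 64-byte cache line: producer and
// consumer exchange only this line over the coherence protocol.
type mailbox struct {
	flag uint32   // 0 = slot empty, 1 = slot full
	val  uint64   // payload (8-aligned at offset 8)
	_    [48]byte // pad the struct to 64 bytes
}

// send spins until the slot is free, writes the payload, then publishes it.
func (m *mailbox) send(v uint64) {
	for atomic.LoadUint32(&m.flag) == 1 {
		runtime.Gosched() // wait for the consumer to drain the slot
	}
	m.val = v
	atomic.StoreUint32(&m.flag, 1) // publish: val is written before the flag
}

// recv spins until the slot is full, reads the payload, then frees it.
func (m *mailbox) recv() uint64 {
	for atomic.LoadUint32(&m.flag) == 0 {
		runtime.Gosched() // wait for the producer to fill the slot
	}
	v := m.val
	atomic.StoreUint32(&m.flag, 0) // release the slot
	return v
}

func main() {
	var m mailbox
	go func() {
		for i := uint64(1); i <= 3; i++ {
			m.send(i)
		}
	}()
	for i := 0; i < 3; i++ {
		fmt.Println(m.recv()) // 1, 2, 3
	}
}
```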
Programming using message passing
• system design – more similar to distributed systems;
• map concepts from shared memory to message passing;
• a few examples:
  – synchronization, data structures: flat combining (sketched below);
  – programming languages: e.g., Go, Erlang;
  – operating systems: the multikernel (e.g., Barrelfish).
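A loose Go sketch of the flat-combining idea (an assumed simplification, not the original algorithm): threads publish operations to a shared queue, and whichever thread grabs the combiner lock applies all pending operations to the sequential structure on everyone's behalf.

```go
// Sketch: a flat-combining counter (the combiner applies everyone's ops).
package main

import (
	"fmt"
	"runtime"
	"sync"
)

type request struct {
	delta int
	done  chan int // the combiner sends back the resulting value
}

type fcCounter struct {
	combiner sync.Mutex    // whoever holds this is the combiner
	pending  chan *request // the "publication list" of posted operations
	value    int           // sequential structure; touched only under the lock
}

func newFCCounter() *fcCounter {
	return &fcCounter{pending: make(chan *request, 128)}
}

// drain applies every currently pending operation; called with the lock held.
func (c *fcCounter) drain() {
	for {
		select {
		case r := <-c.pending:
			c.value += r.delta
			r.done <- c.value // report back to the posting thread
		default:
			return
		}
	}
}

func (c *fcCounter) add(delta int) int {
	r := &request{delta: delta, done: make(chan int, 1)}
	c.pending <- r // publish the operation
	for {
		if c.combiner.TryLock() { // try to become the combiner
			c.drain()
			c.combiner.Unlock()
		}
		select {
		case v := <-r.done: // served, either by us or by another combiner
			return v
		default:
			runtime.Gosched() // someone else is combining; spin politely
		}
	}
}

func main() {
	c := newFCCounter()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				c.add(1)
			}
		}()
	}
	wg.Wait()
	fmt.Println(c.add(0)) // 8000
}
```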
Barrelfish: All Communication Explicit
• communication – exclusively message passing;
• easier reasoning: know what is accessed when, and by whom;
• asynchronous operations – eliminate wait time;
  – pipelining, batching;
• more scalable.
Barrelfish: OS Structure – Hardware Neutral
• separate the OS structure from the hardware;
• machine-dependent components:
  – messaging;
  – HW interface;
• better layering, modularity;
• easier porting.
Barrelfish: Replicated State
• no shared memory => replicate shared OS state;
• reduce interconnect load;
• decrease latencies;
• asynchronous client updates;
• possibly NUMA-aware replication.
A sketch of this pattern follows.
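A minimal sketch of the replicated-state pattern, assuming a strongly simplified model of Barrelfish's design: each "core" is a goroutine owning a private replica, updates are broadcast asynchronously to every replica's queue, and reads are always local. All names are illustrative.

```go
// Sketch: per-core replicas updated by asynchronous broadcast, read locally.
package main

import (
	"fmt"
	"time"
)

type readReq struct {
	key   string
	reply chan int
}

type replica struct {
	updates chan func(map[string]int) // queued state changes
	reads   chan readReq              // local read requests
}

func newReplica() *replica {
	r := &replica{
		updates: make(chan func(map[string]int), 64),
		reads:   make(chan readReq),
	}
	go func() { // the only goroutine that ever touches this replica's state
		state := map[string]int{}
		for {
			select {
			case apply := <-r.updates:
				apply(state)
			case req := <-r.reads:
				req.reply <- state[req.key]
			}
		}
	}()
	return r
}

// read is always served from the local replica: no interconnect traffic.
func (r *replica) read(key string) int {
	reply := make(chan int)
	r.reads <- readReq{key, reply}
	return <-reply
}

// broadcast enqueues the same update on every replica and returns immediately.
func broadcast(replicas []*replica, key string, val int) {
	for _, r := range replicas {
		r.updates <- func(s map[string]int) { s[key] = val }
	}
}

func main() {
	replicas := []*replica{newReplica(), newReplica(), newReplica()}
	broadcast(replicas, "cores", 48)
	// Updates are asynchronous: until its queue drains, a replica may still
	// serve the old value; keeping replicas consistent is exactly the
	// agreement problem of the next slide. The sleep just lets the demo settle.
	time.Sleep(10 * time.Millisecond)
	fmt.Println(replicas[0].read("cores"), replicas[2].read("cores")) // 48 48
}
```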
Consistency of replicas: agreement protocol
• implicit communication (shared memory): locally cached data;
• explicit communication (message passing): replicated state;
  – state machine replication: total ordering of updates => agreement;
  – high availability, high scalability.
How should we do message-passing agreement in a multi-core?
Outline
§ The multi-core as a distributed system
§ Case study: agreement
§ The distributed system as a multicore
First go – a blocking protocol
Two-Phase Commit (2PC):
1. broadcast Prepare;
2. wait for Acks;
3. broadcast Commit/Rollback;
4. wait for Acks.
Blocking; all messages go through the coordinator (sketched below).
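As a rough illustration, here is a minimal Go sketch of one 2PC round, with goroutines as participants and channels as message links; the message encoding and names are assumptions for the example, not taken from the lecture.

```go
// Sketch: one 2PC round; every message flows through the coordinator.
package main

import "fmt"

type msg struct {
	kind  string    // "prepare", "commit", or "rollback"
	reply chan bool // vote (for prepare) or ack (for the decision)
}

// participant votes on Prepare and acknowledges the decision.
func participant(in chan msg, voteYes bool) {
	for m := range in {
		switch m.kind {
		case "prepare":
			m.reply <- voteYes // 2. Ack: send our vote
		default:
			m.reply <- true // 4. Ack: decision applied
		}
	}
}

// coordinator blocks twice: once on all votes, once on all acks.
func coordinator(parts []chan msg) string {
	reply := make(chan bool)

	// 1. broadcast Prepare, then 2. wait for all Acks (votes).
	for _, p := range parts {
		p <- msg{"prepare", reply}
	}
	allYes := true
	for range parts {
		if !<-reply {
			allYes = false
		}
	}

	// 3. broadcast Commit/Rollback, then 4. wait for all Acks.
	decision := "rollback"
	if allYes {
		decision = "commit"
	}
	for _, p := range parts {
		p <- msg{decision, reply}
	}
	for range parts {
		<-reply
	}
	return decision
}

func main() {
	parts := make([]chan msg, 3)
	for i := range parts {
		parts[i] = make(chan msg)
		go participant(parts[i], true) // all vote yes
	}
	fmt.Println(coordinator(parts)) // commit
}
```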
Is a blocking protocol appropriate?
Blocking agreement – only as fast as the slowest participant.
• Scheduling? • I/O?
“Latency numbers every programmer should know”:
• L1 cache reference – 0.5 ns
• Branch mispredict – 5 ns
• L2 cache reference – 7 ns
• Mutex lock/unlock – 25 ns
• Main memory reference – 100 ns
• Compress 1 KB – 3,000 ns
• Send 1 KB over a 1 Gbps network – 10,000 ns
• Read 4 KB randomly from SSD – 150,000 ns
• Read 1 MB sequentially from memory – 250,000 ns
• Round trip within a datacenter – 500,000 ns
• Read 1 MB sequentially from SSD – 1,000,000 ns
=> Use a non-blocking protocol.
Non-blocking agreement protocols
Consensus ~ non-blocking agreement between distributed processes on one out of possibly multiple proposed values.
Paxos:
• tolerates non-malicious faults or unresponsive nodes – in multi-cores: slow cores;
• needs a majority of responses to progress (tolerates partitions);
• Phase 1: prepare; Phase 2: accept;
• roles: proposer, acceptor, learner – usually all roles on a physical node (collapsed roles);
• lots of variations and optimizations: CheapPaxos, MultiPaxos, FastPaxos, etc.
A sketch of an acceptor's two-phase logic follows.
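To make the two phases concrete, here is a compact Go sketch of a single-instance Paxos acceptor, in the same message-passing style as the rest of the lecture. The message and field names (prepareReq, promise, acceptReq) are illustrative assumptions, not code from the lecture or from any particular Paxos implementation.

```go
// Sketch: a single-instance Paxos acceptor serving both phases.
package main

import "fmt"

type prepareReq struct {
	pn    int // proposal number
	reply chan promise
}

type promise struct {
	ok          bool
	acceptedPN  int    // highest proposal already accepted, if any
	acceptedVal string // its value; the proposer must re-propose it
}

type acceptReq struct {
	pn    int
	val   string
	reply chan bool
}

// acceptor owns its state in one goroutine; messages arrive on channels.
func acceptor(prepares chan prepareReq, accepts chan acceptReq) {
	promised, acceptedPN, acceptedVal := -1, -1, ""
	for {
		select {
		case p := <-prepares:
			// Phase 1: promise to ignore anything below pn, and report
			// any value this acceptor has already accepted.
			if p.pn > promised {
				promised = p.pn
				p.reply <- promise{true, acceptedPN, acceptedVal}
			} else {
				p.reply <- promise{false, -1, ""}
			}
		case a := <-accepts:
			// Phase 2: accept unless a higher prepare was promised since.
			if a.pn >= promised {
				promised, acceptedPN, acceptedVal = a.pn, a.pn, a.val
				a.reply <- true
			} else {
				a.reply <- false
			}
		}
	}
}

func main() {
	prepares := make(chan prepareReq)
	accepts := make(chan acceptReq)
	go acceptor(prepares, accepts)

	// One proposer's round against a single acceptor.
	pr := make(chan promise)
	prepares <- prepareReq{1, pr}
	fmt.Printf("%+v\n", <-pr) // {ok:true acceptedPN:-1 acceptedVal:}
	ar := make(chan bool)
	accepts <- acceptReq{1, "value-A", ar}
	fmt.Println(<-ar) // true
}
```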
MultiPaxos
• unless it has failed, keep the same leader in subsequent rounds.
[Diagram: proposers P1–P3, acceptors A1–A3, learners L1–L3.]
Does MultiPaxos scale in a multi-core?
[Figure: MultiPaxos throughput vs. number of clients (10 to 100,000), 3 replicas, multi-core vs. local-area network.]
Limited scalability in the multi-core environment.
A closer look at the multi-core environment
Where does the time go when sending a message?
[Figure: % of time spent on propagation vs. processing, for multi-core (< 1 µs) and LAN (~100 µs) messages.]
• large networks: minimize the number of rounds;
• multi-core: minimize the number of messages.
Can we adapt Paxos to this scenario?
[Diagram: proposers P1–P3, acceptors A1–A3, learners L1–L3.]
• proposers – replication of the service (availability): advocate client commands;
• acceptors – resolve contention between proposers; short-term memory (reliability, availability);
• learners – replication of data (reliability): long-term memory.
Using one acceptor significantly reduces the number of messages.
1Paxos: The failure-free case
[Diagram: proposers P1–P3, acceptors A1–A3, learners L1–L3.]
1. P2 obtains the active acceptor A1 and sends prepare_request(pn);
2. A1: if pn > max proposal received, replies to P2 with an ack;
3. P2 -> A1: accept_request(pn, value);
4. A1 broadcasts the value to the learners.
Common case: only steps 3 and 4 (sketched below).
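A minimal Go sketch of this common case, assuming the prepare of steps 1 and 2 has already completed: the single active acceptor turns one accept_request into one broadcast to the learners, with no majority round trip. All names are illustrative.

```go
// Sketch: the 1Paxos common case, steps 3 and 4 only.
package main

import "fmt"

type acceptRequest struct {
	pn    int // proposal number from the already-completed prepare
	value string
}

// activeAcceptor: accept each request and broadcast the value to learners.
func activeAcceptor(in chan acceptRequest, learners []chan string, maxPN int) {
	for req := range in {
		if req.pn >= maxPN { // the prepare (steps 1-2) established this pn
			maxPN = req.pn
			for _, l := range learners { // step 4: broadcast to the learners
				l <- req.value
			}
		}
	}
}

func main() {
	learners := []chan string{
		make(chan string, 1), make(chan string, 1), make(chan string, 1),
	}
	in := make(chan acceptRequest)
	go activeAcceptor(in, learners, 1)

	in <- acceptRequest{pn: 1, value: "update-42"} // step 3: leader -> A1
	for _, l := range learners {
		fmt.Println(<-l) // every learner delivers the value
	}
}
```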
1Paxos: Switching the acceptor
[Diagram: proposers P1–P3, acceptors A1–A3, learners L1–L3.]
1. P2 remains the leader;
2. PaxosUtility: P2 proposes A3 as the active acceptor, together with the uncommitted values;
3. P2 -> A3: prepare_request.
1Paxos: Switching the leader
[Diagram: proposers P1–P3, acceptors A1–A3, learners L1–L3.]
1. A1 remains the active acceptor;
2. PaxosUtility: P3 becomes the new leader, with A1 as the active acceptor;
3. P3 -> A1: prepare_request.
Switching leader and acceptor
[Diagram: proposers P1–P3, acceptors A1–A3, learners L1–L3.]
The trade-off: while the leader and the active acceptor are unresponsive at the same time:
✖ liveness ✔ safety
Why is this reasonable?
• a small-probability event;
• no network partitions;
• if nodes are not crashed but merely slow, the system becomes responsive again.
Latency and throughput
[Figure: latency (seconds, up to ~8e-4) vs. throughput (updates/second, up to ~140,000) for 2PC, MultiPaxos, and 1Paxos; 3 replicas, client counts from 6 to 45.]
A smaller number of messages means smaller latency and increased throughput.
Agreement – summary
Multi-core – a message-passing distributed system, but distributed algorithm implementations differ.
Agreement in multi-cores – use one acceptor (1Paxos):
• non-blocking;
• reduced number of messages;
• reduced latency;
• increased throughput.
Outline
§ The multi-core as a distributed system
§ Case study: agreement
§ The distributed system as a multicore
Remote Direct Memory Access (RDMA)
• read/write remote memory;
• the NIC performs the DMA requests;
• great performance:
  – bypasses the kernel;
  – bypasses the remote CPU.
Source: A. Dragojevic
FaRM: Fast Remote Memory (NSDI'14) – Dragojevic et al.
Source: A. Dragojevic