Distributed Programming in Scala with APGAS
Philippe Suter, Olivier Tardieu, Josh Milthorpe (IBM Research)
Picture by Simon Greig
APGAS: Context
Asynchronous Partitioned Global Address Space
• Model for concurrency + distribution in X10.
• X10, a general-purpose language:
  – Developed at IBM Research for 10+ years.
  – Focus/bias towards distributed HPC tasks.
  – JVM + native back-ends (through Java & C++).
  – Some X10 apps ran on >50K cores.
http://x10-lang.org and X10'15 @ PLDI (tomorrow)
APGAS in Scala
• Goal: expose the concurrent/distributed core of X10 as a library.
  – In Java 8 and as a Scala DSL.
• This contribution:
  – Introduction to programming w/ APGAS in Scala.
  – Illustrated through two benchmarks:
    • K-means clustering
    • Unbalanced Tree Search (see paper)
  – Contrasting the model with Akka (see paper).
  – Preliminary experimental scaling results.
APGAS Primer
• Concurrent tasks run at distributed places.
• The environment exposes the available places.

  def places: Seq[Place]
  def here: Place

• Tasks can be remote or local.
• Tasks are asynchronous by default.

  def asyncAt(p: Place)(body: => Unit): Unit
  def async(body: => Unit): Unit
APGAS Primer
• The termination of tasks is controlled by the finish construct.

  def finish(body: => Unit): Unit

• Blocks until enclosed tasks have completed, including all nested tasks, local or remote.
• Distributed termination detection is challenging; finish is a powerful contribution of APGAS.
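A minimal sketch (not from the original deck) of the nesting behavior just described; p and compute() are hypothetical names:

  // finish waits for the remote task spawned at p, and also for the
  // task that this remote task spawns in turn.
  finish {
    asyncAt(p) {     // runs at place p
      async {        // nested local task at p; finish still tracks it
        compute()
      }
    }
  }
  // reached only once both tasks, remote and nested, have terminated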
Hello World

  finish {
    for (p <- places) {
      asyncAt(p) {
        println(s"Hello from $here.")
      }
    }
  }

asyncAt returns immediately; finish completes when all places have completed their task.

  $> …
  Hello from place(0).
  Hello from place(3).
  Hello from place(1).
  Hello from place(2).
“Academic” Fibonacci

  def fibonacci(i: Int): Long = {
    if (i <= 1) i
    else {
      var a, b = 0L
      finish {
        async { a = fibonacci(i - 2) }
        b = fibonacci(i - 1)
      }
      a + b
    }
  }

finish guards a single async here, but the recursive invocations enclose many more. finish completes exactly when the computation of all dependencies is complete.
Messages and Memory
• The default mechanism for transferring memory between places is to capture it in the closure of the body of asyncAt.
• APGAS lets the programmer define global symbols for memory local to places.

  class Worker(…) extends PlaceLocal
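A small hedged sketch of this default closure-capture mechanism; loadChunk, process, and p are hypothetical names:

  // `data` is captured by the closure passed to asyncAt, so it is
  // serialized and a copy is transferred to place p.
  val data: Array[Double] = loadChunk()
  asyncAt(p) {
    process(data)   // operates on the deserialized copy at p
  }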
Place-local Objects
• All instances of PlaceLocal resolve to objects that are place-specific.

  class Worker(…) extends PlaceLocal

  // one distinct instance is created at each place
  val w: Worker = PlaceLocal.forPlaces(places) {
    new Worker(…)
  }

  for (p <- places) {
    asyncAt(p) {
      w.work()   // here, w resolves to the worker at place p
    }
  }
Global and Shared References
• For objects that cannot extend PlaceLocal, APGAS provides a wrapper (“pointer”):

  trait GlobalRef[T] { def apply(): T }

• Shared references refer to an object at a particular place and can only be dereferenced there.
  – Useful to “call back” from an asynchronous task.

  trait SharedRef[T] { def apply(): T }
Global and Shared References

  // at place p1
  val largeArray: Array[Double] = …
  val ref = SharedRef.make(largeArray)

  asyncAt(p2) {
    // dereferencing ref() here would be an error
    …
    asyncAt(p1) {
      val array = ref()   // dereference at p1 resolves to largeArray
      array(…) = …
    }
    …
  }

largeArray is never captured, therefore never serialized.
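For the GlobalRef side, a hypothetical sketch; the forPlaces-style factory is assumed by analogy with PlaceLocal.forPlaces and may differ from the actual API:

  // wraps a type that cannot extend PlaceLocal; buffer() resolves
  // to a distinct array at each place (factory name assumed)
  val buffer: GlobalRef[Array[Long]] =
    GlobalRef.forPlaces(places) { new Array[Long](1024) }

  for (p <- places) {
    asyncAt(p) {
      val local = buffer()   // this place’s own array
      local(0) = 42L
    }
  }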
Distributed K-means Clustering
• Goal: iteratively divide a set of points into K disjoint clusters.
• Distribute the points among workers.
• In each iteration:
  – each worker:
    • computes the new centroids for its own points,
    • communicates its view of the centroids to the master.
  – the master:
    • aggregates all workers’ data and checks for convergence.
Distributed K-Means: Memory
• Each worker needs to hold:
  – its set of points,
  – its local view of the centroids.
  → GlobalRef[WorkerData]
• In addition, the master holds:
  – the aggregated centroids.
  → SharedRef[MasterData]
• In our implementation, the workers write their results directly into the master’s data.
  – Requires a synchronized data structure (see the sketch below).
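A hypothetical sketch of the two data holders described above; class and field names are illustrative, not taken from the paper’s implementation:

  // per-worker data, published via a GlobalRef
  class WorkerData(val points: Array[Array[Double]],
                   val centroids: Array[Array[Double]])

  // master-side accumulator, published via a SharedRef; merge is
  // synchronized because all workers write into it concurrently
  class MasterData(k: Int, dim: Int) {
    val sums   = Array.ofDim[Double](k, dim)
    val counts = new Array[Long](k)
    def merge(localSums: Array[Array[Double]], localCounts: Array[Long]): Unit =
      synchronized {
        for (i <- 0 until k) {
          counts(i) += localCounts(i)
          for (j <- 0 until dim) sums(i)(j) += localSums(i)(j)
        }
      }
  }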
Distributed K-Means: Structure

  while (!converged) {
    finish {
      for (p <- places) {
        asyncAt(p) {
          // compute new local centroids
          asyncAt(masterRef.home()) {
            // merge local centroids into the master
          }
        }
      }
    }
  }
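A hedged sketch filling in the two comments above, reusing the hypothetical WorkerData/MasterData names from the previous sketch; workerData, step, and converged are illustrative names:

  while (!converged) {
    finish {
      for (p <- places) {
        asyncAt(p) {
          val wd = workerData()             // GlobalRef: this place’s data
          val (sums, counts) = step(wd)     // assign points, accumulate sums
          asyncAt(masterRef.home()) {
            masterRef().merge(sums, counts) // synchronized merge at the master
          }
        }
      }
    }
    // the master recomputes the centroids from sums/counts
    // and re-checks convergence
  }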
Unbalanced Tree Search
• Counts the nodes in a dynamically generated tree.
• Each node:
  – has an associated SHA1 hash,
  – has a number of children determined by a probabilistic law.
• Trees are unbalanced in an unpredictable but deterministic way.
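A hedged sketch of how such a probabilistic law can be driven by the SHA1 hash (binomial-style UTS); q and m are the usual UTS parameters, and the child-index encoding is illustrative, not the paper’s exact scheme:

  import java.security.MessageDigest

  // a node’s hash deterministically decides its number of children
  def numChildren(hash: Array[Byte], q: Double, m: Int): Int = {
    // interpret 4 bytes of the 20-byte SHA1 digest as a value in [0, 1]
    val bits = java.nio.ByteBuffer.wrap(hash, 16, 4).getInt & 0x7fffffff
    if (bits.toDouble / Int.MaxValue < q) m else 0
  }

  def childHash(parent: Array[Byte], i: Int): Array[Byte] = {
    val md = MessageDigest.getInstance("SHA-1")
    md.update(parent)
    md.update(i.toByte)   // illustrative encoding of the child index
    md.digest()
  }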
Unbalanced Tree Search
• The algorithm combines work-stealing and work-dealing among workers.
• Workers are modeled as state machines.
• Termination:
  – in APGAS: a single, top-level finish.
  – in Akka: requires a counting protocol.
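A hedged sketch of the APGAS termination structure: one top-level finish encloses all worker activity, so global termination needs no extra protocol; worker and its run() method are hypothetical names:

  finish {
    for (p <- places) {
      asyncAt(p) {
        worker.run()   // may spawn further async / asyncAt tasks
                       // to deal out or steal work
      }
    }
  }
  // all places are idle here; safe to collect the node counts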
APGAS Implementation
• APGAS implementation:
  – ~2000 lines of Java 8,
  – ~200 lines of Scala (definitions, helpers, serialization).
• Tasks are scheduled using fork/join.
• Distribution is built on top of Hazelcast.
• Benchmarks are ~1200 lines of Scala:
  – 1/3 APGAS, 1/3 Akka, 1/3 common.
Performance Evaluation
• For both benchmarks, we ran a fixed problem using 1, 2, 4, 8, 16, and 32 workers.
• Measured “units of work” per second per worker.
• All experiments ran on a single 48-core machine.
  – Akka benchmarks use akka-remote.
Performance Evaluation
• The experiments are meant to:
  – be a sanity check,
  – provide evidence of scalability potential.
• Please do not interpret them as a claim that X is better than Y.
“Comparable performance and scalability for comparable complexity.”
[Chart: K-Means. Iterations/second/worker vs. number of workers, APGAS and Akka.]
[Chart: Unbalanced Tree Search. Millions of nodes/second/worker vs. number of workers, APGAS and Akka.]
Conclusion
• Made the APGAS programming model accessible to Scala programmers.
• The programming style is different, but a good fit for some problems.
• In particular, finish concisely solves hard distributed termination problems.
• Complexity is similar to equivalent Akka implementations.
• Promising preliminary scaling results.
Thank you!