Atlas: An Infrastructure for Global Computing

People
- Eric Baldeschwieler (UC Berkeley)
- Bobby Blumofe (UT Austin)
- Eric Brewer (UC Berkeley)

Outline
- Introduction
- Programming model
- Architecture
- Examples
- Discussion
- Limitations & Conclusion

Introduction
Properties of an Internet computing infrastructure
- Scalability: to 10^6 nodes
- Heterogeneity: of machines & OSs
- Fault tolerance: completion probability comparable to that of a sequential program
- Adaptive parallelism: dynamic set of resources

Properties...
- Safety: Hosts must be secure
- Anonymity: Privacy of the client's data & program
- Hierarchy: Locality of communication (local bandwidth is typically higher)
- Ease of use: Minimize “costs” of participating
- Reasonable performance: Low overhead; benefit even from a small set of machines

Introduction...
- Atlas combines mechanisms from:
  – Cilk
  – Java
  – together with new mechanisms.
- Java “ensures”:
  – heterogeneity
  – safety

Introduction...
Atlas:
- extends Cilk’s work-stealing scheduler to a hierarchical Internet setting
- uses Cilk-NOW’s mechanisms for:
  – adaptive parallelism
  – fault tolerance

Programming Model
- Applications are written in Java
- When a native library is used, heterogeneity is limited to platforms that support it.
- Programming model is:
  – a Java-based implementation of Cilk:
      - Non-blocking, explicit continuation-passing threads
  – a Unix-like URL-based file system & local caching with coherence.
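To make "non-blocking, explicit continuation-passing threads" concrete, here is a minimal single-machine sketch in plain Java. It is not the Atlas API: the names `CpsFib`, `Sum`, and `send` are hypothetical, and a simple ready queue stands in for the scheduler. Each thread runs to completion without blocking; instead of waiting for its children, it creates a successor closure (`Sum`) that becomes ready once both arguments have been sent to it.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.IntConsumer;

// Hypothetical sketch of Cilk-style continuation passing; not Atlas code.
public class CpsFib {
    // Ready queue of non-blocking threads; each one runs to completion.
    private final Deque<Runnable> ready = new ArrayDeque<>();

    // Successor closure that waits for two integer arguments.
    private final class Sum {
        private final IntConsumer k; // continuation that receives the sum
        private int missing = 2;     // join counter: arguments still missing
        private int total = 0;

        Sum(IntConsumer k) { this.k = k; }

        // Fill one argument slot; when none are missing, the closure becomes ready.
        void send(int value) {
            total += value;
            if (--missing == 0) {
                ready.push(() -> k.accept(total));
            }
        }
    }

    // fib never waits for its children: it passes them a continuation instead.
    private void fib(int n, IntConsumer k) {
        if (n < 2) {
            k.accept(n);
            return;
        }
        Sum sum = new Sum(k);
        ready.push(() -> fib(n - 1, sum::send));
        ready.push(() -> fib(n - 2, sum::send));
    }

    public int run(int n) {
        int[] result = new int[1];
        ready.push(() -> fib(n, v -> result[0] = v));
        while (!ready.isEmpty()) {
            ready.pop().run();
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(new CpsFib().run(24)); // prints 46368
    }
}
```

Because no thread ever blocks, any entry in the ready queue can be handed to another worker, which is what makes this style a natural fit for work stealing.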

Architecture
Basic architecture
[Figure: Client, Manager, and Compute Servers; software layers labeled Application (Java), Runtime library, Java interpreter, Native libraries (C or C++)]

Architecture...
- Client is a Java application
  – connects to compute servers on machines other than its manager’s.
- Idle servers steal work from busy ones.

Architecture
- Compute server:
  – relinquishes control when there is non-Atlas work (a screensaver?)
  – runs as a daemon that is either:
      - working, or
      - pinging its manager & siblings for work to steal

Architecture: Porting Atlas
- A Java runtime system
- Port:
  – natively written URL-based file system
  – some support routines.

Hierarchical Work Stealing
[Figure: hierarchy of Managers and Compute Servers]

Hierarchical Work Stealing...
- Manager keeps track of when its subtree is idle
- If manager’s subtree is idle, manager steals work from its siblings
- If a subtree has “too much” work, it “allows” work stealing from above
What are the definition & implementation of “too much”?
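The stealing rules above can be sketched as a search that starts with the nearest siblings and climbs the manager tree only when the local subtree is idle. This is an illustrative sketch, not the Atlas implementation: `Node`, `takeFromSubtree`, and `steal` are hypothetical names, tasks are plain strings, and the undefined “too much” threshold is ignored (any non-empty subtree may be stolen from).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of hierarchical work stealing; not Atlas code.
public class HierarchicalSteal {
    // A manager (inner node) or compute server (leaf) in the hierarchy.
    static final class Node {
        final String name;
        final Node parent;
        final List<Node> children = new ArrayList<>();
        // Owner pushes and pops at the head; thieves take from the tail (oldest work).
        final Deque<String> work = new ArrayDeque<>();

        Node(String name, Node parent) {
            this.name = name;
            this.parent = parent;
            if (parent != null) parent.children.add(this);
        }

        // Take one task from somewhere in this subtree, or null if the subtree is idle.
        String takeFromSubtree() {
            if (!work.isEmpty()) return work.pollLast();
            for (Node child : children) {
                String task = child.takeFromSubtree();
                if (task != null) return task;
            }
            return null;
        }
    }

    // Steal on behalf of an idle node: try siblings first, then climb one level
    // at a time, so that communication stays as local as possible.
    static String steal(Node thief) {
        for (Node level = thief; level.parent != null; level = level.parent) {
            for (Node sibling : level.parent.children) {
                if (sibling == level) continue;
                String task = sibling.takeFromSubtree();
                if (task != null) return task;
            }
        }
        return null; // the whole hierarchy is idle
    }

    public static void main(String[] args) {
        Node root = new Node("root", null);
        Node mgrA = new Node("mgrA", root);
        Node mgrB = new Node("mgrB", root);
        Node a1 = new Node("a1", mgrA);
        Node a2 = new Node("a2", mgrA);
        Node b1 = new Node("b1", mgrB);
        a2.work.push("t3");
        b1.work.push("t1");
        b1.work.push("t2");
        System.out.println(steal(a1)); // local steal from a2
        System.out.println(steal(a1)); // mgrA's subtree is now idle, so climb to mgrB's subtree
    }
}
```

In this sketch the climb from a compute server to its manager's level plays the role of "the manager steals from its siblings when its subtree is idle".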

Hierarchical Work Stealing
- The authors claim that proven properties of Cilk hold in this hierarchical setting.
- Goals:
  – Localize communication
  – Sub-trees map to domain hierarchy
Administrators can control thread migration:
  – Outflow: Privacy
  – Inflow: Host security

Examples
- Fib: fine-grained threads
- POV-Ray: coarse-grained threads

            Base     1 Node   3 Nodes    8 Nodes
Fib (24)    1.3      80       40 (2.0)   31 (2.6)
POV-Ray     20700    21000    -          2700 (7.8)

Numbers in ( ) are speedups over the 1-node case.
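The parenthesized speedups can be rechecked from the reported times. The sketch below assumes that the value 80 is the 1-node time for Fib (24), which is consistent with the 3-node entry 40 (2.0); the class and method names are illustrative only.

```java
// Recomputes the slide's speedups from its reported times (units as on the slide).
public class Speedups {
    // Speedup of a P-node run relative to the 1-node run.
    static double speedup(double oneNodeTime, double pNodeTime) {
        return oneNodeTime / pNodeTime;
    }

    // Round to one decimal place, matching the precision on the slide.
    static double round1(double x) {
        return Math.round(x * 10.0) / 10.0;
    }

    public static void main(String[] args) {
        // Assumption: 80 is the 1-node Fib (24) time.
        System.out.println("Fib, 3 nodes:     " + round1(speedup(80, 40)));
        System.out.println("Fib, 8 nodes:     " + round1(speedup(80, 31)));
        System.out.println("POV-Ray, 8 nodes: " + round1(speedup(21000, 2700)));
    }
}
```

Under that assumption the three recomputed values agree with the slide's 2.0, 2.6, and 7.8.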

Examples...
- POV-Ray is not written in Java
- Partitioning is done in Java
- 8 nodes: only 2% overhead.
- What about larger P?

Discussion
- Scalable: Yes.
- Heterogeneity: Incomplete until Atlas divorces itself from all native libraries.
- Safety:
  – Java: OK.
  – Native libraries: ?

Discussion...
- Fault tolerance: A timed-out thread is recomputed from a checkpoint maintained by the subtree (manager?)
  – What is the effect of checkpointing on performance?
The subtree rooted at a thread is its subcomputation.

Fault Tolerance...
Subcomputations are transactions:
- Authors claim: side effects can be undone
- How does this relate to hierarchical work stealing?
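The timeout-and-recompute idea can be sketched as a small bookkeeping table. The slides do not say how Atlas detects timeouts or stores checkpoints, so everything here is an assumption for illustration: the `TimeoutRecovery` class, its heartbeat scheme, and the use of a string as a stand-in for checkpointed state are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: track outstanding subcomputations and redo timed-out ones.
public class TimeoutRecovery {
    static final class Outstanding {
        final String checkpoint; // stand-in for the saved state of the subcomputation
        long lastHeard;          // time of the most recent sign of life

        Outstanding(String checkpoint, long now) {
            this.checkpoint = checkpoint;
            this.lastHeard = now;
        }
    }

    private final Map<String, Outstanding> outstanding = new HashMap<>();
    private final long timeoutMillis;

    TimeoutRecovery(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    void started(String id, String checkpoint, long now) {
        outstanding.put(id, new Outstanding(checkpoint, now));
    }

    void heartbeat(String id, long now) {
        Outstanding o = outstanding.get(id);
        if (o != null) o.lastHeard = now;
    }

    void completed(String id) {
        outstanding.remove(id);
    }

    // Returns the checkpoints of subcomputations presumed lost and forgets them;
    // the caller is expected to re-enqueue these for recomputation.
    List<String> expire(long now) {
        List<String> toRedo = new ArrayList<>();
        Iterator<Map.Entry<String, Outstanding>> it = outstanding.entrySet().iterator();
        while (it.hasNext()) {
            Outstanding o = it.next().getValue();
            if (now - o.lastHeard > timeoutMillis) {
                toRedo.add(o.checkpoint);
                it.remove();
            }
        }
        return toRedo;
    }
}
```

Treating a subcomputation as a transaction is what makes the redo safe in this picture: if the lost work had no visible side effects, recomputing it from the checkpoint cannot be observed twice.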

Discussion...
- Anonymity: A host executing a stolen subtree cannot determine the client.
  – Managers are assumed to be trustworthy
- Hierarchy: Yes, via manager hierarchy.
- Ease of use: Interface incomplete.
  – clients submit jobs via a special “shell”

Discussion...
- Adaptive parallelism:
  – “Owner” (?) of compute server sets a policy that defines when server is idle.
  – How?
  – When a compute server becomes unavailable for Atlas work, all its subcomputations are moved to another compute server.

Adaptive Parallelism...
- Moving a subcomputation requires updating the information linking the subcomputation to its:
  – parent
  – children
  – How long does it take to retreat?
  – Is the subcomputation restarted? From a checkpoint?
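The link updates that a move requires can be sketched as follows. This is a single-process illustration under stated assumptions, not the Atlas protocol: `Migration`, `Sub`, and the in-memory `registry` are hypothetical, and in a real deployment each update would be a message to the host holding the parent or child.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: relocating a subcomputation and repairing its links.
public class Migration {
    static final class Sub {
        final String id;
        String host;       // where this subcomputation currently runs
        String parentId;   // null for the root subcomputation
        String parentHost; // where results must be sent
        final Map<String, String> childHosts = new HashMap<>(); // child id -> host

        Sub(String id, String host) {
            this.id = id;
            this.host = host;
        }
    }

    // Stand-in for the distributed state; a real system has no global table.
    private final Map<String, Sub> registry = new HashMap<>();

    Sub create(String id, String host, Sub parent) {
        Sub s = new Sub(id, host);
        if (parent != null) {
            s.parentId = parent.id;
            s.parentHost = parent.host;
            parent.childHosts.put(id, host);
        }
        registry.put(id, s);
        return s;
    }

    // Move a subcomputation, then update the parent's and children's view of it.
    void migrate(String id, String newHost) {
        Sub s = registry.get(id);
        s.host = newHost;
        if (s.parentId != null) {
            registry.get(s.parentId).childHosts.put(id, newHost);
        }
        for (String childId : s.childHosts.keySet()) {
            registry.get(childId).parentHost = newHost;
        }
    }
}
```

The cost of a retreat in this picture is one update per parent and child link, which is one way to frame the question of how long a retreat takes.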

Limitations
- Atlas inherits the tree-structured program limitation from Cilk.
  – But this is still a rich set!
- Generalizing to non-tree-structured programs seems hard.
- No shared variables among threads.
- Global file system is read-only.

Conclusion
- Jicos design goals = those for Atlas.
- Use JXTA to give Jicos a “file system”
  – Then, Jicos becomes Atlas’s heir.