What is System Research? Why does it Matter?
What is System Research? Why does it Matter?
Zheng Zhang, Research Manager, System Research Group, Microsoft Research Asia

Outline
- A perspective on computer systems research, by Roy Levin (Managing Director of MSR-SVC)
- Overview of MSRA/SRG activities
- Example projects

What is Systems Research?
- What makes it research?
  - Something no one has done before.
  - Something that might not work.
- Universities are known for doing systems research, but industry does it too:
  - VAX Clusters (DEC)
  - Cedar (Xerox)
  - System R (IBM)
  - NonStop (Tandem)
  - And, more recently, Microsoft

What is Systems Research?
- What specialties does it encompass?
  - Computer architecture, operating systems, programming languages, distributed applications, security, simulation, networks, protocols, databases, measurement (and others...)
- Design + implementation + validation
  - (implementation includes simulation)

What's Different About Systems Research Now?
- Scale!
  - Geographic extent: "machine room" systems are a research niche.
  - Administrative extent (multiple domains)
- Validation at scale
  - A single organization often can't do it.
  - Consortia, collaborations
  - "Test beds for hire" (e.g., PlanetLab)
  - Industrial systems as data sources, and perhaps as test beds?

What's hot today, and will continue to be?
- Distributed systems! Web 2.0 is all about distributed systems:
  - Protocol/presentation revolution: not just HTML, but XML, DHTML, AJAX, HTTP, RSS/Atom...
  - Service mash-ups are really happening
  - Infrastructure: huge clusters inside (e.g., MSN, Google, Yahoo), and an even bigger network outside (e.g., P2P, social networks)
- Very complex to understand
  - Fertile ground for advancing the theory/algorithmic aspects
- Very challenging to build and test
  - That's what research is about, isn't it?

MSRA/SRG Activity Overview
SRG Research focus: the theory & practice of distributed systems
- Research problems (the brain: basic research)
  - "Beauty contest" problems
  - "Practical" theory work: failure models, membership protocols, distributed data structures, DHT specs, ...
- Solutions (the hand: systems/tools)
  - "Inspector Morse" systems
  - Exploratory systems/experiments: large-scale wide-area P2P file sharing (Maze)
  - Low-hanging fruits: machine-room storage systems and their applications (BitVault, HPC & end users)
  - Distributed system building package (WiDS)
- The two sides feed each other: the theory work is put to use in the systems, and building the systems improves the tools

Summary of Results (last 9 months)

| Branch | Publications | Transfer | Patents |
|---|---|---|---|
| BSR, "beauty contest" (ZRing) | ICNP'05 | N/A | 1 |
| BSR, "detective" (object placement) | ICDCS'05 | MSN, Windows | 2 |
| System, exploratory (Maze) | IPTPS'04, USENIX WORLDS'05 | N/A | |
| System, low-hanging (BitVault, Machine Bank) | FAST'05 (submitted) | MSN, Windows | 1 |
| Tool (WiDS) | HotOS'05, MASCOTS'05 | Windows | 3 |

Some projects in SRG
- Building large-scale systems
  - BitVault, WiDS, and BSR involvement
- Large-scale P2P systems
  - A collaboration project with Beijing University
- My view on Grid and P2P computing
- Which one would you like to hear?

BitVault and WiDS (plus contributions from BSR)

BitVault: brick-based reliable storage for huge amounts of reference data
- Components: SOMO monitor, load balancing, check-in/check-out, catalog, delete, soft-state distributed index, object replication & placement, repair protocol, and a Membership and Routing Layer (MRL) with scalable broadcast, anti-entropy, and leafset protocols
- Runs on commodity bricks at about $400 a piece
- Design points:
  - Low TCO
  - Highly reliable
  - Simple architecture; entirely developed and maintained with WiDS
  - Adequate performance

BitVault: repair speed (performance under failure)

| # of servers | Repair rate (MB/s) |
|---|---|
| 5 | 8.2 |
| 10 | 22.8 |
| 15 | 46.5 |
| 20 | 49.4 |
| 25 | 86.4 |
| 30 | 87.8 |

- Google File System: 440 MB/s in a 227-server cluster
- Comparable or better

The "black art" of building a distributed system
- Typical path: pseudo-code / protocol spec (TLA+, Spec#, SPIN) → small-scale simulation → implementation (v.001) → debug → implementation (v.01) → debug → implementation (v.1) → performance debugging
- Pain points along the way:
  - Unscalable, "distributed" human "log mining"
  - Non-deterministic bugs
  - Code divergence
  - Performance study at 1/1000 of the real deployment scale (especially for P2P)

Goal: a generic toolkit for integrated distributed system/protocol development
- Reduce debugging pain
  - Spend as much of the effort as possible in a single address space
  - Isolate non-deterministic bugs and reproduce them in simulation
  - Take the human out of the log-mining business ...
- Eliminate code divergence
  - Use the same code across development stages (e.g., simulation and executable)
- Scale the performance study
  - Implement an efficient ultra-large-scale simulation platform
- Interface with formal methods
  - Ultimately: TLA+ spec ↔ implementation

WiDS: an API set for the programmer
- Protocol Instance: holds the application logic; state machine, event-driven, object-oriented
- Timer: periodic and one-shot timers
- MsgHandler: PostMessage, asynchronous callbacks
- Isolates the implementation from the runtime (a rough sketch of the programming model follows below)

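As a rough illustration of this event-driven, object-oriented programming model, here is a minimal sketch of what a protocol instance could look like. The class and method names below are hypothetical, chosen for the example; they are not the actual WiDS API.

```cpp
// Hypothetical sketch of a WiDS-style protocol instance (not the real WiDS API).
// A protocol is an object whose state changes only inside message and timer
// handlers; the runtime (simulator or real network) delivers the events.
#include <cstdint>
#include <iostream>
#include <string>

struct Message {            // assumed message envelope
    uint64_t    from;       // sender node id
    std::string type;       // e.g. "PING", "PONG"
    std::string payload;
};

class PingPongProtocol {
public:
    explicit PingPongProtocol(uint64_t self) : self_(self) {}

    void SetPeer(uint64_t peer) { peer_ = peer; }

    // Called by the runtime when a message arrives (event-driven).
    void OnMessage(const Message& msg) {
        if (msg.type == "PING")
            SendMessage(msg.from, {self_, "PONG", msg.payload});
    }

    // Called by the runtime when a periodic timer fires.
    void OnTimer() {
        SendMessage(peer_, {self_, "PING", "hello"});
    }

private:
    // In a real toolkit this would post an asynchronous message to the runtime;
    // here it just logs the send so the sketch stands alone.
    void SendMessage(uint64_t to, const Message& msg) {
        std::cout << "node " << self_ << " -> node " << to
                  << " : " << msg.type << "\n";
    }

    uint64_t self_;
    uint64_t peer_ = 0;
};

int main() {
    PingPongProtocol a(1);
    a.SetPeer(2);
    a.OnTimer();                       // node 1 sends PING to node 2
    a.OnMessage({2, "PING", "hi"});    // node 1 replies with PONG
}
```

Because the handlers never touch the transport directly, the same object can be driven by a simulator or by a real network, which is the point of isolating implementation from runtime.
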
WiDS: as a development environment
- One address space: protocol instances, timers, and message handlers run on top of WiDS-Dev, which provides an event wheel and a network model
- Single-address-space debugging of multiple instances
- Small-scale simulation (~10K instances)

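A minimal sketch of what "one address space, event wheel, network model" can mean in practice. This is a simplified illustration of the idea, not the actual WiDS-Dev internals: all simulated nodes share a single priority queue of timestamped events, so a debugger sees every node in one process.

```cpp
// Minimal discrete-event "event wheel" sketch: many node handlers in one
// process, driven by a single queue of timestamped events (simplified model,
// not the actual WiDS-Dev implementation).
#include <functional>
#include <iostream>
#include <queue>
#include <vector>

struct Event {
    double                time;     // simulated time (seconds)
    std::function<void()> action;   // handler to run at that time
    bool operator>(const Event& o) const { return time > o.time; }
};

class EventWheel {
public:
    void Schedule(double time, std::function<void()> action) {
        queue_.push({time, std::move(action)});
    }
    void Run() {
        while (!queue_.empty()) {
            Event e = queue_.top();
            queue_.pop();
            now_ = e.time;   // advance simulated time, then run the handler
            e.action();
        }
    }
    double Now() const { return now_; }

private:
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> queue_;
    double now_ = 0.0;
};

int main() {
    EventWheel wheel;
    const double link_delay = 0.05;   // assumed 50 ms "network model"

    // Two simulated nodes exchanging one message, all in one address space.
    wheel.Schedule(0.0, [&] {
        std::cout << "t=" << wheel.Now() << "s  node 1 sends PING\n";
        wheel.Schedule(wheel.Now() + link_delay, [&] {
            std::cout << "t=" << wheel.Now() << "s  node 2 receives PING\n";
        });
    });
    wheel.Run();
}
```
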
WiDS: as a deployment environment
- The same Protocol Instance / Timer / MsgHandler code now runs on top of WiDS-Comm, over the real network
- Ready to run!

The WiDS-enabled process
- Spec → protocol (WiDS starts here) → implementation and debugging with WiDS-Dev → performance evaluation, including large scale, with WiDS-Par → deployment and optimization with WiDS-Comm
- No code divergence; large-scale study; the distributed-system debugging process is virtualized
- WiDS-Par has been used to test 2 million real instances using 250+ PCs

What makes a storage system reliable?
- MTTDL: Mean Time To Data Loss
  - "After the system is loaded with data objects, how long, on average, can the system sustain before it permanently loses the first data object?"
- Two factors:
  - Data repair speed
  - Sensitivity to concurrent failures

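A standard back-of-envelope approximation (textbook-style, not taken from the talk) shows how the two factors combine. With $k$-way replication, $N$ independent replica groups, per-node mean time to failure $\mathrm{MTTF}$, and mean repair time $\mathrm{MTTR}$ per lost replica, data loss requires $k$ overlapping failures, so

$$\mathrm{MTTDL} \;\approx\; \frac{\mathrm{MTTF}^{\,k}}{N \cdot k! \cdot \mathrm{MTTR}^{\,k-1}}.$$

Faster repair shrinks $\mathrm{MTTR}$, and placement controls both the effective $\mathrm{MTTR}$ (how parallel the repair is) and how likely concurrent failures are to hit all $k$ replicas of some object, which is exactly the tension the next few slides walk through.
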
Sequential Placement
- Pro: low likelihood of data loss when concurrent failures occur

Repair in Sequential Placement
- Con: low parallel repair degree, leading to a relatively high likelihood of concurrent failures

Random Placement
- Con: sensitive to concurrent failures

Repair in Random Placement
- Pro: high parallel repair degree, leading to a low likelihood of concurrent failures

Comparison
- Random placement is worse with small object sizes, better with large object sizes
- Parameters: MTTF = 1000 days, B = 3 GB/s, b = 20 MB/s, c = 500 GB, 1 PB of user data

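A much-simplified model of the repair-speed side of this tradeoff, using the parameters on the slide. This is my own sketch, not the ICDCS'05 model, and the node count N is an assumption not given on the slide: with sequential placement a failed node's c bytes are rebuilt from a handful of neighbors, while with random placement the rebuild is spread over many source/target pairs until the aggregate bandwidth B caps it.

```cpp
// Simplified repair-time comparison between sequential and random placement.
// Back-of-envelope sketch using the slide's parameters (b = 20 MB/s per node,
// B = 3 GB/s aggregate, c = 500 GB per node); NOT the model from ICDCS'05.
#include <algorithm>
#include <iostream>

int main() {
    const double c = 500e9;   // data per node (bytes)
    const double b = 20e6;    // per-node disk/NIC bandwidth (bytes/s)
    const double B = 3e9;     // aggregate network bandwidth (bytes/s)
    const int    N = 1000;    // assumed number of nodes (not on the slide)

    // Sequential placement: only a few neighbors hold the lost replicas,
    // so the rebuild is limited to roughly one pair's bandwidth.
    double t_sequential = c / b;

    // Random placement: replicas of the lost data are scattered over ~N nodes,
    // so up to N parallel streams run until the aggregate bandwidth B caps them.
    double repair_bw = std::min(static_cast<double>(N) * b, B);
    double t_random  = c / repair_bw;

    std::cout << "sequential repair: " << t_sequential / 3600 << " h\n";
    std::cout << "random repair:     " << t_random / 3600 << " h\n";
}
```

Faster repair shrinks the window in which additional failures can pile up, but random placement also means almost any set of simultaneous failures hits some object, which is the sensitivity the comparison weighs against it.
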
ICDCS'05 result summary
- Established the first framework for analyzing object placement's impact on reliability
- Upshot:
  - Spread your replicas as widely as you can, up to the point where bandwidth is fully utilized for repair
  - Spreading more than that will hurt reliability
- The core algorithm has been adopted by many MSN large-scale storage projects/products

Ongoing work
- More on object placement:
  - We are looking at systems with extreme longevity
  - Heterogeneous capacities and other dynamics have not yet been factored in
- Improving WiDS further:
  - It is still so hard to debug!
  - Idea: a replay facility that takes logs from deployment, time-travels inside the simulation, and uses a model and invariant checker to identify the fault location and path (see the SOSP'05 poster, and the sketch below)

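A rough sketch of the replay-and-check idea. This is my own minimal illustration, not the WiDS replay facility: events logged at deployment are re-applied in timestamp order, and an invariant is evaluated after every step, stopping at the first violation.

```cpp
// Minimal replay-with-invariant-check sketch (illustration only, not the
// actual WiDS replay facility): deployment-log events are re-applied in
// timestamp order, and an invariant is checked after each step.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct LogEvent {
    double      time;
    std::string node;
    std::string op;    // e.g. "STORE", "LOSE" (hypothetical log operations)
    std::string arg;   // object id
};

struct SystemState {
    std::map<std::string, int> replicas;   // object id -> replica count
};

void Apply(SystemState& s, const LogEvent& e) {
    if (e.op == "STORE") s.replicas[e.arg] += 1;
    if (e.op == "LOSE")  s.replicas[e.arg] -= 1;
}

// Invariant: every stored object keeps at least one replica.
bool Invariant(const SystemState& s) {
    for (const auto& [obj, count] : s.replicas)
        if (count < 1) return false;
    return true;
}

int main() {
    // In practice this log would come from deployment traces.
    std::vector<LogEvent> log = {
        {0.0, "n1", "STORE", "obj42"},
        {1.0, "n2", "STORE", "obj42"},
        {2.0, "n2", "LOSE",  "obj42"},
        {3.0, "n1", "LOSE",  "obj42"},
    };

    SystemState state;
    for (const auto& e : log) {
        Apply(state, e);
        if (!Invariant(state)) {
            std::cout << "invariant violated at t=" << e.time << " after "
                      << e.node << " " << e.op << " " << e.arg << "\n";
            return 1;
        }
    }
    std::cout << "replay completed with no violations\n";
}
```
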
Maze (with the Beijing University Maze team)

Maze File Sharing System
- The largest in China
  - Runs on CERNET; popular with college students
  - Population: 1.4 million registered accounts; 30,000+ users online
  - More than 200 million files
  - More than 13 TB (!) transferred every day
- Completely developed, operated, and deployed by an academic team
  - Logging was added after the collaboration with MSRA began last year
  - Enables detailed study from all angles

Rare System for Academic Studies
- WORLDS'04: system architecture
- IPTPS'05: the "free-rider" problem
- AEPP'05: statistics of shared objects and traffic patterns
- Incentives to promote sharing invite collusion and cheating
- Trust and fairness: can we defeat collusion?

Maze Architecture: the Server Side
- Just like Napster...
  - Historical reason: a P2P sharing add-on for the T-net FTP search engine
- Not a DHT!

Maze: Incentive Policies
- New users start with 4096 points
- Point changes:
  - Uploads: +1.5 points per MB
  - Downloads: at most -1.0 point per MB
  - Gives users more motivation to contribute
- Service differentiation:
  - Order download requests by T = Now - 3 log(Points), i.e., first-come-first-served plus large-points-first-served (see the sketch after this slide)
  - Users with P < 512 have a download bandwidth of 200 Kb/s
- Available since Maze 5.0.3; extensively discussed in the Maze forum before it was implemented

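The scheduling rule on this slide is easy to restate in code. The sketch below is my own illustration, not Maze's actual server code; it orders queued requests by the adjusted timestamp T = now - 3·log(points), so high-point users effectively move ahead without starving everyone else, and applies the P < 512 throttle.

```cpp
// Sketch of the Maze incentive rules as stated on the slide
// (illustration only, not the actual Maze server code).
#include <algorithm>
#include <cmath>
#include <iostream>
#include <string>
#include <vector>

struct Request {
    std::string user;
    double      arrival;   // request arrival time (seconds since some epoch)
    double      points;    // user's current point balance
};

// Slide: order download requests by T = Now - 3 * log(Points),
// i.e. first-come-first-served plus a bonus for large point balances.
// (The logarithm base is not given on the slide; natural log is used here.)
double AdjustedTime(const Request& r) {
    return r.arrival - 3.0 * std::log(r.points);
}

// Slide: users with fewer than 512 points are throttled to 200 Kb/s.
// The 1000 Kb/s value for everyone else is an assumed default.
double BandwidthKbps(double points) {
    return points < 512 ? 200.0 : 1000.0;
}

int main() {
    std::vector<Request> queue = {
        {"newbie",   100.0, 4096.0},     // new accounts start with 4096 points
        {"leecher",   95.0,   60.0},
        {"uploader", 105.0, 200000.0},
    };

    std::sort(queue.begin(), queue.end(), [](const Request& a, const Request& b) {
        return AdjustedTime(a) < AdjustedTime(b);
    });

    for (const auto& r : queue)
        std::cout << r.user << "  T=" << AdjustedTime(r)
                  << "  bw=" << BandwidthKbps(r.points) << " Kb/s\n";
}
```

With these numbers, the heavy uploader is served first even though it arrived last, while the low-point account keeps only the throttled bandwidth; that is the "first-come-first-served plus large-points-first-served" behavior in miniature.
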
Collusion Behavior in Maze (partial results)
- 221,000 pairs with duplication degree > 1; the top 100 links carry the most redundant traffic
- The first-ever study of its kind
  - Modeling and simulation are done
  - Deployment and measurement within two months

The Ecosystem of Maze Research
- Cycle: problem/application → System 0 → logs → Model 0 → simulation/development → deploy → logs → Model 1 → ?
- Common to all systems research
  - More difficult in a live system: you can't go back!

Grid and P2P Computing

Knowing the Gap is Often More Important (or you risk falling off the cliff!)
- The gap is often a manifestation of physical law (the speed of light)
- The gap between wide-area (Grid) and cluster/HPC can be just as wide as the gap between HPC and sensor networks
- Many impossibility results exist
  - Negative results are not a bad thing; the bad thing is that many people are unaware of them!
  - Examples:
    - The impossibility of consensus in an asynchronous network
    - The impossibility of achieving consistency, availability, and partition tolerance simultaneously

What is Grid? Or, the problem that I see
- Historically associated with HPC
  - Dressed up when running short on gas
- Problematically borrows concepts from an environment governed by a different law
  - The Internet as one grand JVM is unlikely
  - We need to extract the common system infrastructure after gaining enough application experience
- Sharing and collaboration are labels applied without careful investigation
  - Where/what is the 80-20 sweet spot?
- Likewise, adding the P2P spin should be done carefully

What is Grid? (cont.)
- Grid <= HPC + Web Services
  - HPC isn't done yet; check Google
- Why?
  - You need HPC to run the apps, or to store the data
  - Services have clear boundaries
  - Interoperable protocols bind the services together

P2P computing: inspiration from Cellular Automata [A New Kind of Science, Wolfram, 2002]
- Program → computation over a grid of cells
- Similar to traditional parallel computing logic (see the sketch below):
    read input data
    for a while {
        compute output data region
        input edge regions
    }

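Rewriting that loop as a minimal, single-process sketch. This is my illustration of the pattern, not code from the talk; the rule and sizes are assumptions. Each step computes the local region from its neighbors, and the comment marks where a distributed version would pull in its neighbors' edge cells.

```cpp
// Single-process sketch of the compute-then-exchange-edges loop from the
// slide, using a Game-of-Life-style 2D cellular automaton (illustration
// only; rule and sizes are assumptions, not taken from the talk).
#include <iostream>
#include <vector>

using Grid = std::vector<std::vector<int>>;   // (rows+2) x (cols+2), with ghost border

int NextCell(const Grid& g, int r, int c) {
    int live = 0;
    for (int dr = -1; dr <= 1; ++dr)
        for (int dc = -1; dc <= 1; ++dc)
            if (dr != 0 || dc != 0) live += g[r + dr][c + dc];
    return (live == 3 || (g[r][c] == 1 && live == 2)) ? 1 : 0;
}

int main() {
    const int rows = 16, cols = 16, steps = 4;
    Grid grid(rows + 2, std::vector<int>(cols + 2, 0));  // +2 for the ghost border
    grid[8][7] = grid[8][8] = grid[8][9] = 1;             // a small seed pattern

    for (int s = 0; s < steps; ++s) {
        // "compute output data region": update every interior cell.
        Grid next = grid;
        for (int r = 1; r <= rows; ++r)
            for (int c = 1; c <= cols; ++c)
                next[r][c] = NextCell(grid, r, c);
        grid = next;
        // "input edge regions": in a distributed run the grid would be split
        // into blocks, and each block would receive its neighbors' boundary
        // cells into the ghost border here, before the next step.
    }

    int live = 0;
    for (int r = 1; r <= rows; ++r)
        for (int c = 1; c <= cols; ++c) live += grid[r][c];
    std::cout << "live cells after " << steps << " steps: " << live << "\n";
}
```
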
Many Applications Follow the Same Model
- Enterprise computing
  - MapReduce and similar data-processing tasks
  - Sorting and querying
- Coarse-grain scientific computing
  - Engineering and product design
  - Meteorological simulation
  - Molecular biology simulation
  - Bioinformatics computation
- Hunting for the next low-hanging fruit after SETI@home and Folding@home

Is it feasible? Consider a 2D CA, on a LAN and on a WAN
- [Plot comparing an N x N and an n x n configuration (N >> n), each on LAN and WAN with link parameters C0/B0, against computing density/traffic (instructions/byte)]
- WAN vs. LAN does not matter when C/B is large
- More processes matter!

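A back-of-envelope restatement of why the link type stops mattering when the compute-to-traffic ratio is large. This is my own sketch of the standard argument, with all symbols assumed rather than taken from the slide: for an $n \times n$ block of a 2D CA, one step costs about $n^2 c$ instructions but exchanges only about $4ns$ boundary bytes, so

$$\frac{T_{\mathrm{comm}}}{T_{\mathrm{comp}}} \;\approx\; \frac{4ns/B + L}{n^2 c/C},$$

where $c$ is instructions per cell update, $C$ the node's instruction rate, $s$ bytes per boundary cell, $B$ the link bandwidth, and $L$ its latency. As $n$ (or the per-cell work $c$) grows, this ratio shrinks, and whether $B$ and $L$ come from a LAN or a WAN no longer dominates the completion time.
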
What's the point of Grid, or not-Grid?
- Copying ready-made concepts across contexts is easy
- But it often does not work
  - Each context is governed by the laws of physics
- We need to start building and testing applications
  - Then we can define what a "Grid OS" is truly about

Thanks