Metronome and The NMI Lab: This subtitle included solely to steal the “longest title” award from Ewa, who thought she won it this morning with, “Pegasus and DAGMan: From Concept to Execution Mapping Scientific Workflows onto the National Cyberinfrastructure”
Peter Couvares
Computer Sciences Department
University of Wisconsin-Madison
pfc@cs.wisc.edu
Decision Time
› Past
  • Quick Review: why, what, who
› Present
  • Current status, new this year
› Future
  • Future plans, new next year
CondorProject.org
Why: The Problem
› Good distributed computing (“grid”) software is…
  • badly needed
  • hard to find
  • hard to build and test
The Fix (Part of it, anyway)
› Good build/test cycle
› To be good, build/test process must be…
  • frequent
  • reliable
  • automatic
  • repeatable
The (Next) Problem
› Building and testing distributed computing software requires…
  • Distributed resources
    • Not always in-house, not always dedicated to builds
    • I.e., shared, scheduled resources
    • Unless you have a spare Blue Gene lying around… and an old Alpha running Red Hat 7.2… and an HPUX 11 box… and an Itanium running Scientific Linux 3 (CERN-flavored)… and…
  • Distributed testbeds, tests
    • Not: “the grid works on my machine… ship it!”
Grid Build and Test
› Building and testing distributed computing software brings distributed challenges…
  • Complex workflows, cross-site/project/user scheduling priorities, data management, fault-tolerance, failure recovery
  • A lot like “real” distributed computing
  • Tinderbox or the latest Web 2.0 build system doesn’t cut it
› Deep, integrated software stacks
  • Distributed providers
How We Do It
› Use proven grid software to build and test new grid software
› “Condor works, let’s use Condor”
› Metronome is our second-generation build/test framework built on top of Condor, DAGMan, and other distributed computing technologies
› NSF-funded
Metronome Principles
› Tool-independent
› Lightweight
› Encourage explicit, well-controlled build/test environments
› Central results repository
› Fault-tolerance
› Support platform-neutral and platform-specific tasks
› Build/test separation
[Diagram: Metronome. INPUT: Spec File, Customer Source Code, Customer Build/Test Scripts. NMI Build & Test Software: DAG, DAGMan, Condor Queue. Distributed Build/Test Pool: build/test jobs, results. OUTPUT: Web Portal, Finished Binaries, MySQL Results DB.]
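The pieces named on this slide (a spec file, a DAG of build/test jobs, results coming back) suggest a flow that can be sketched roughly as below. This is not Metronome's spec format or API: the task names, the `SPEC` dictionary, and the `run_workflow` and `fake_runner` helpers are all hypothetical, and a plain in-process loop stands in for DAGMan and the Condor queue.

```python
from graphlib import TopologicalSorter

# Hypothetical spec: each task maps to the set of tasks it depends on.
# Metronome's real spec-file format is not shown on the slides.
SPEC = {
    "fetch_source": set(),
    "build": {"fetch_source"},
    "test": {"build"},
    "package": {"test"},
}

def run_workflow(spec, run_task):
    """Run tasks in dependency order; skip any task whose prerequisites did not succeed."""
    results = {}
    for task in TopologicalSorter(spec).static_order():
        if any(results[dep] != "ok" for dep in spec[task]):
            results[task] = "skipped"
        else:
            results[task] = "ok" if run_task(task) else "failed"
    return results

def fake_runner(task):
    # Stand-in for submitting a job to a pool: pretend the test step fails.
    return task != "test"

results = run_workflow(SPEC, fake_runner)
for task, status in results.items():
    print(f"{task}: {status}")
```

The point of the sketch is only the shape: dependencies are declared up front, and a failure stops downstream work while every outcome is still recorded in one place.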
NMI Lab
› Dedicated, heterogeneous distributed computing facility
  • Opposite extreme from typical “cluster” -- instead of 1000’s of identical CPUs, we have a handful of CPUs each for 50+ platforms.
  • Much harder to manage! You try finding a monitoring tool that works on 50 platforms!
› Carefully-controlled resources
  • No mystery meat
The Team
› Subset of the Condor Team
  • Becky Gietzel, master of all things NMI
  • Todd Miller, new guy on the block
  • Andy Pavlo, part-timer, short-timer
  • Ken Hahn, sysadmin to the stars
  • Me
Dogfood and Hats
› Eating our own dogfood…
  • Condor builds failed last weekend (true!)
  • Condor developers complained to NMI Lab (“your build system failed… fix it!”)
  • NMI Lab discovered Condor bug (“hmm…”)
  • NMI Lab complained to Condor developers (“your software failed… fix it!”)
› Feel the love!
The Past Year: What We Did on Our Summer Vacation
New Name!
› Before:
  • NMI Build & Test System, NMI Build & Test Software, NMI Build & Test Framework, NMI Software, NMI Build & Test Lab, UW-Madison Build & Test Lab, Build & Test Lab at UW-Madison
› After:
  • Metronome + the NMI Lab
› Why?
  • Old names were a mouthful
  • Clear separation between the software framework (Metronome) and the facility (the NMI Lab)
Real Work
› Extremely Productive Collaborations
  • TeraGrid: production Metronome deployment using dynamically provisioned resources
  • ETICS, OMII: building higher-level services to generate and manage build/test jobs across an international federation of Metronome deployments
› Extremely Productive Users
  • Condor, TeraGrid, Open Science Grid / VDT, Globus, NCSA (MyProxy), SDSC (SRB), LIGO, many others in this room…
New Metronome Capabilities
› “Productization”, customization for other sites
› Parallel testing
  • Enables dynamic, co-scheduled, distributed testbeds!
› Automatic cross-site job migration
  • Run your own local Metronome pool with access to ours for exotic platforms
› Many smaller features and extensions for production users -- users drive development
› More bugs fixed than introduced!
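One way to picture the cross-site migration bullet (a local pool, with a remote lab as fallback for exotic platforms) is the routing sketch below. It is an illustration under stated assumptions, not how Metronome decides where jobs run: the platform labels, the two platform sets, and the `choose_pool` and `route_jobs` helpers are all hypothetical.

```python
# Hypothetical platform labels; real Metronome platform names may differ.
LOCAL_PLATFORMS = {"x86_64_linux", "x86_linux"}
REMOTE_PLATFORMS = {"x86_64_linux", "ia64_linux", "hppa_hpux_11"}

def choose_pool(platform, local=LOCAL_PLATFORMS, remote=REMOTE_PLATFORMS):
    """Prefer the local pool; fall back to the remote lab for platforms we lack."""
    if platform in local:
        return "local"
    if platform in remote:
        return "remote"
    raise ValueError(f"no pool offers platform {platform!r}")

def route_jobs(platforms):
    """Map each requested platform to the pool that should run its job."""
    return {p: choose_pool(p) for p in platforms}

routing = route_jobs(["x86_64_linux", "ia64_linux", "hppa_hpux_11"])
for platform, pool in routing.items():
    print(f"{platform} -> {pool}")
```

Preferring the local pool keeps common platforms on hardware the site controls, and only the platforms it cannot supply are sent elsewhere.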
New NMI Lab Capabilities
› More platforms
  • “always with the platforms…”
  • new Itanium platforms, NLOTW (New Linux of the Week), additional vendor Unix machines, etc.
  • Now over 50 (!) platforms
› Improved Lab Management
  • No, not me… better design and automation of systems & their administration
Future
The Plan: Metronome
› “Support, maintain, enhance”
  • VM--I mean slot--no wait, I mean VM support
  • Enhanced parallel testing support
  • Custom testbed environments (network, etc.)
  • Dynamic deployments (glide-in)
  • Advanced scheduling policies
  • Scalability testing enhancements
  • Better docs/installation/management
The Plan: NMI Lab
› “Support, maintain, enhance”
  • More platforms, always with the platforms
  • More capacity
  • VM servers for…
    • Root-level testing
    • On-demand platforms
  • Federation with other Metronome labs
  • Better support, smoother management, less downtime
  • New sysadmin starting in June: take a bow, Ross!
You
› Want to use it?
› Metronome
  The NMI Lab
  http://nmi.cs.wisc.edu/
Feedback
› When we started, the state of the art was unimpressive (almost non-existent)… we had to build our own
› More build tools now exist -- if you know & like one of them, what do you like about it?
› We’d like to better understand what we do well, what we don’t, and how we can integrate with other systems you find useful…