An Introduction to Grid Computing Research at Notre Dame
Prof. Douglas Thain
University of Notre Dame
http://www.cse.nd.edu/~dthain

What is Grid Computing?
Grid computing is the idea that we can attack problems of enormous scale by harnessing lots of machines to work on one problem. When people refer to The Grid, they are imagining a future where computers all over the globe are connected in one colossal system open for use. Today, we have a variety of large, useful grids, but we don’t yet have The Grid.

Campus Scale Grids at Notre Dame
ND BOB: Bunch of Boxes
– A “closet grid” of conventional PCs.
– 212 CPUs in Stepan Hall
– http://bob.nd.edu
ND Center for Research Computing
– A “cluster grid” of dedicated rackmount computers downtown.
– 900 CPUs in Union Station.
– http://crc.nd.edu
ND Condor Pool
– A “workstation grid” of classroom and desktop machines used when idle.
– 405 CPUs in Fitzpatrick/Nieuwland
– http://www.nd.edu/~condor

Volunteer Grids
Simple Idea:
– Most computers are idle 90% of the day.
– Can we harness their unused capacity for real work?
Examples:
– Pioneered by Condor in 1987 at the Univ. of Wisconsin.
– Popularized by SETI@Home in 1999 at Berkeley. Over 300,000 active participants today. Successor is the more general BOINC.
– Folding@Home: About 200,000 CPUs today. Makes use of GPU cards: about 100x faster than CPU!
– Xgrid: deployed with every Macintosh today.
Challenge: The user must be flexible!

National Computing Grids
NSF Teragrid
– Open to any NSF research.
– 21,972 CPUs / 220 TB / 6 sites
Open Science Grid
– Open to any university.
– 21,156 CPUs / 83 TB / 61 sites
Condor Worldwide:
– Anyone can install a pool.
– 96,352 CPUs / 1608 sites
PlanetLab
– Open to CS research sites.
– 753 CPUs / 363 sites

Who Needs Grid Computing?
Anyone with unlimited computing needs!
High Energy Physics:
– Simulating the detector of a particle accelerator before turning it on allows one to understand the output.
Biochemistry:
– Simulate complex molecules under different forces to understand how they fold/mate/react.
Biometrics:
– Given a large database of human images, evaluate matching algorithms by comparing all to all.
Climatology:
– Given a starting global climate, simulate how climate develops under varying assumptions or events.

What are the Challenges?
Why don’t we have The Grid yet?
Technical Challenges:
– Enforcing the wishes of all the owners.
– Automatically negotiating expectations.
– Limiting what resources a user can consume.
– Performance and scalability.
– Debugging and troubleshooting.
– Managing access to data!
– Making it easy to use!

An Example of a Workstation Grid at Notre Dame


Computing Environment
[Figure: the Condor Match Maker assigns jobs to CPUs and disks in four groups of machines: Miscellaneous CSE Workstations, Fitzpatrick Workstation Cluster, CVRL Research Cluster, and CCL Research Cluster. Each machine owner states a policy:]
– “I will only run jobs between midnight and 8 AM.”
– “I will only run jobs when there is no-one working at the keyboard.”
– “I prefer to run a job submitted by a CCL student.”
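The three owner policies on this slide (run only between midnight and 8 AM, run only when nobody is at the keyboard, prefer jobs from CCL students) can be sketched as plain predicates. This is a minimal illustration of the matchmaking idea, not Condor's actual policy language or API; the `Job` and `Machine` classes, the `match` function, and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Job:
    name: str
    owner_group: str  # hypothetical field, e.g. "CCL" or "CVRL"


def no_preference(job: Job) -> int:
    """Default owner preference: every job ranks the same."""
    return 0


@dataclass
class Machine:
    name: str
    # Owner requirement: may this job start now?
    # Arguments: the job, the hour of day (0-23), and whether the keyboard is idle.
    accepts: Callable[[Job, int, bool], bool]
    # Owner preference: a higher rank means the owner would rather run this job.
    rank: Callable[[Job], int] = no_preference


def match(machine: Machine, jobs: List[Job], hour: int,
          keyboard_idle: bool) -> Optional[Job]:
    """Return the job this machine's owner most prefers, or None if none is allowed."""
    allowed = [j for j in jobs if machine.accepts(j, hour, keyboard_idle)]
    if not allowed:
        return None
    return max(allowed, key=machine.rank)


# The three policies quoted on the slide, written as predicates (assumed encoding).
night_only = Machine("night-only", accepts=lambda job, hour, idle: 0 <= hour < 8)
idle_only = Machine("desktop", accepts=lambda job, hour, idle: idle)
ccl_first = Machine("ccl-node",
                    accepts=lambda job, hour, idle: True,
                    rank=lambda job: 1 if job.owner_group == "CCL" else 0)

jobs = [Job("sim", "CVRL"), Job("bio", "CCL")]

afternoon_night_only = match(night_only, jobs, hour=14, keyboard_idle=True)
early_morning_night_only = match(night_only, jobs, hour=3, keyboard_idle=False)
busy_desktop = match(idle_only, jobs, hour=14, keyboard_idle=False)
ccl_choice = match(ccl_first, jobs, hour=14, keyboard_idle=False)
```

Keeping the requirement (`accepts`) separate from the preference (`rank`) mirrors the slide: the first two owners restrict when any job may run, while the third accepts every job but orders them.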

[Figures: CPU History; Storage History]

Flocking Between Universities
– Wisconsin: 1200 CPUs
– Purdue A: 541 CPUs
– Notre Dame: 300 CPUs
– Purdue B: 1016 CPUs
http://www.cse.nd.edu/~ccl/operations/condor/

http://www.cse.nd.edu/~ccl/viz

An Example of Grid Computing Research at Notre Dame


Scalable I/O for Biometrics
Computer Vision Research Lab in CSE
– Goal: Develop robust algorithms for identifying humans from (non-ideal) images.
– Technique: Collect lots of images. Think up a clever new matching function. Compare them.
How do you test a matching function?
– For a set S of images,
– Compute F(Si, Sj) for all Si and Sj in S.
– Compare the result matrix against the results of known functions.
Credit: Patrick Flynn at Notre Dame CSE
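As a small illustration of the test described here, the sketch below computes F(Si, Sj) for every pair in a tiny set and collects the scores in a matrix. The function `toy_similarity` and the byte strings standing in for images are assumptions made for this example; a real matching function would be far more involved.

```python
def all_pairs(S, F):
    """Return the matrix M with M[i][j] = F(S[i], S[j])."""
    return [[F(a, b) for b in S] for a in S]


def toy_similarity(a: bytes, b: bytes) -> float:
    """Hypothetical stand-in for a matching function:
    the fraction of positions at which two equal-length byte strings agree."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Three tiny "images" (4 bytes each), purely illustrative.
images = [b"\x01\x02\x03\x04", b"\x01\x02\x03\x00", b"\x09\x09\x09\x09"]
matrix = all_pairs(images, toy_similarity)

for row in matrix:
    print(row)
```

The work grows with the square of the number of images, which is what makes the full-size problem expensive.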

Computing Similarities
[Figure: matrix of pairwise similarity scores computed by F]

A Big Data Problem
Data Size: 10k images of 1 MB = 10 GB
Total I/O: 10k * 10k * 2 MB * 1/2 = 100 TB
Would like to repeat many times!
In order to execute such a workload, we must be careful to partition both the I/O and the CPU needs, taking advantage of distributed capacity.
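The arithmetic can be checked with a short calculation. It assumes decimal units (1 MB = 10^6 bytes), that every comparison reads both of its images from storage, and that the factor of 1/2 comes from computing only one of F(Si, Sj) and F(Sj, Si).

```python
MB = 10**6
GB = 10**9
TB = 10**12

n_images = 10_000
image_size = 1 * MB

# Size of the data set itself.
dataset_size = n_images * image_size             # 10 GB

# Every ordered pair reads two images.
naive_io = n_images * n_images * 2 * image_size  # 200 TB

# Computing only one of each symmetric pair halves the I/O.
symmetric_io = naive_io // 2                     # 100 TB

print(dataset_size / GB, "GB of data")
print(naive_io / TB, "TB read for all ordered pairs")
print(symmetric_io / TB, "TB read when symmetry is exploited")
```

The I/O volume exceeds the data set by four orders of magnitude, so the data movement, not the storage, is the bottleneck.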

Conventional Solution
[Figure: diagram of disks, jobs, and CPUs]
Move 200 TB at Runtime!

A More Scalable Solution
1. Break array into MB-size chunks.
2. Replicate data to many disks.
3. Jobs find nearby data copy, and make full use before discarding.
Result: Biometric users can accomplish in three days what used to take one month!
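One possible reading of "make full use before discarding" is a blocked decomposition: each job loads two chunks of the data set once and reuses them for every comparison inside its block. The sketch below is a single-process illustration under that assumption. It does not model replication or data placement, and all function names are hypothetical.

```python
from itertools import product


def chunks(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def block_jobs(items, size):
    """One job per (row chunk, column chunk) pair; together they cover all pairs."""
    parts = chunks(items, size)
    return list(product(parts, repeat=2))


def run_job(job, F):
    """Load two chunks once, then reuse them for every comparison in the block."""
    rows, cols = job
    return {(a, b): F(a, b) for a in rows for b in cols}


def images_loaded_naive(n):
    """Each of the n*n comparisons loads its two images separately."""
    return 2 * n * n


def images_loaded_blocked(n, size):
    """Each block job loads 2*size images (assumes size divides n evenly)."""
    blocks = n // size
    return blocks * blocks * 2 * size


# Small demonstration: 6 items in chunks of 2 gives 3 x 3 = 9 block jobs.
items = list(range(6))
jobs = block_jobs(items, 2)
results = {}
for job in jobs:
    results.update(run_job(job, lambda a, b: abs(a - b)))

# Load counts for the 10k-image workload with a chunk size of 100 images.
naive_loads = images_loaded_naive(10_000)
blocked_loads = images_loaded_blocked(10_000, 100)
print(naive_loads // blocked_loads, "times fewer image loads")
```

Under these assumptions the number of image loads falls by a factor equal to the chunk size, since each loaded image is reused for a whole row or column of the block.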

The All-Pairs Abstraction
All-Pairs:
– For a set S and a function F:
– Compute F(Si, Sj) for all Si and Sj in S.
The end user provides:
– Set S: A bunch of files.
– Function F: A self-contained program.
Applies to lots of different problems:
– Comparing proteins for interactions.
– Searching documents for similarities.
– Any kind of optimization problem.

An All-Pairs Facility at Notre Dame
[Figure: an All-Pairs Web Portal in front of 100s-1000s of machines, each with CPU and disk]
1 – User uploads S and F into the system.
2 – Backend decides where to run, how to partition, when to retry failures...
3 – Return result matrix to user.

Research Opportunities
Openings for undergraduate students.
– Research for class credit during the year.
– Research for paycheck during the summer.
– Must enjoy programming and making things work.
Some Project Ideas:
– Build an easy-to-use web front-end for using a grid computing system to process biometric data.
– Find a way to get data from your workstation to 500 other machines as fast as possible.
– Build and manage a filesystem that ties together 500 disks at once to create one gigantic 20 TB system.

For more information...
To learn more about Condor@ND
– http://www.nd.edu/~condor
Prof. Douglas Thain
– dthain@nd.edu
– http://www.cse.nd.edu/~dthain
– 382 Fitzpatrick Hall