Implementing a Central Quill Database in a Large Condor Installation
Preston Smith <psmith@purdue.edu>
Condor Week 2008 - April 30, 2008
Overview
• Background – BoilerGrid
• Motivation
• What works well
• What has been challenging
• What just doesn’t work
• Future directions
BoilerGrid
• Purdue Condor Grid (BoilerGrid)
– Comprised of Linux HPC clusters, student labs, machines from academic departments, and Purdue regional campuses
• 8,900 batch slots today...
• 14,000 batch slots in a few weeks
• 2007 - Delivered over 10 million CPU-hours of high-throughput science to Purdue and the national community through the Open Science Grid and TeraGrid
BoilerGrid - Growth
BoilerGrid - Results
A Central Quill Database
• As of Condor 6.9.4:
– Quill can store information about all the execute machines and daemons in a pool
– Quill is now able to store job history and queue contents in a single, central database
• Since December 2007, we’ve been working to store the state of BoilerGrid in a Quill installation
Motivation
• Why would we want to do such a thing?
– Research into the state of a large distributed system
• Several researchers at Purdue, collaborators at Notre Dame
– Failure analysis/prediction, smart scheduling, interesting reporting for machine owners
– “events” table useful for user troubleshooting?
– And one of our familiar gripes - usage reporting
• Structural biologists (see earlier today) like to submit jobs from their desks, too
• How can we access that job history to complete the picture of BoilerGrid’s usage?
The Quill Server
• Dell 2850
– 2 x 2.8 GHz Xeons (hyperthreaded)
– Postgres on 4-disk Ultra320 SCSI RAID-0
– 5 GB RAM
What works well
• Getting at usage data!

quill=> select distinct scheddname, owner, cluster_id, proc_id, remotewallclocktime
        from jobs_horizontal_history
        where scheddname LIKE '%bio.purdue.edu%' LIMIT 10;
       scheddname       |  owner  | cluster_id | proc_id | remotewallclocktime
------------------------+---------+------------+---------+---------------------
 epsilon.bio.purdue.edu | jiang12 |     276189 |       0 |                 345
 epsilon.bio.purdue.edu | jiang12 |     280668 |       0 |                4456
 epsilon.bio.purdue.edu | jiang12 |     280707 |       0 |                1209
 epsilon.bio.purdue.edu | jiang12 |     280710 |       0 |                1197
 epsilon.bio.purdue.edu | jiang12 |     280715 |       0 |                1064
 epsilon.bio.purdue.edu | jiang12 |     280717 |       0 |                 567
 epsilon.bio.purdue.edu | jiang12 |     280718 |       0 |                 485
 epsilon.bio.purdue.edu | jiang12 |     280720 |       0 |                 480
 epsilon.bio.purdue.edu | jiang12 |     280721 |       0 |                 509
 epsilon.bio.purdue.edu | jiang12 |     280722 |       0 |                 539
(10 rows)
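Beyond listing rows, the same history table supports the usage reporting mentioned earlier by aggregating wallclock time per owner. A standalone sketch, using SQLite purely for illustration (the real Quill database is Postgres; column names follow the jobs_horizontal_history query above, and the third row of sample data is made up):

```python
# Sketch: per-user wallclock totals from a Quill-style job-history table.
# SQLite stands in for Postgres here; this query runs identically on both.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE jobs_horizontal_history (
    scheddname TEXT, owner TEXT, cluster_id INTEGER,
    proc_id INTEGER, remotewallclocktime INTEGER)""")
rows = [
    ("epsilon.bio.purdue.edu", "jiang12", 276189, 0, 345),
    ("epsilon.bio.purdue.edu", "jiang12", 280668, 0, 4456),
    ("other.schedd.example.org", "someone", 101, 0, 7200),  # hypothetical row
]
conn.executemany("INSERT INTO jobs_horizontal_history VALUES (?, ?, ?, ?, ?)",
                 rows)

# Total jobs and wallclock hours per owner, biology schedds only
for owner, jobs, hours in conn.execute("""
        SELECT owner, COUNT(*), SUM(remotewallclocktime) / 3600.0
        FROM jobs_horizontal_history
        WHERE scheddname LIKE '%bio.purdue.edu%'
        GROUP BY owner
        ORDER BY 3 DESC"""):
    print(owner, jobs, round(hours, 2))  # prints: jiang12 2 1.33
```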
What works, but is painful
• Thousands of hosts pounding a Postgres database is non-trivial
– Be sure to turn down QUILL_POLLING_PERIOD
• The default is 10 s - we went down to 1 hour on execute machines
– At some level, this is an exercise in tuning your Postgres server:

top - 13:45:30 up 23 days, 19:59, 2 users, load average: 563.79, 471.50, 428.
Tasks: 804 total, 670 running, 131 sleeping, 3 stopped, 0 zombie
Cpu(s): 94.6% us, 2.9% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.4% hi, 2.2% si
Mem: 5079368k total, 5042452k used, 36916k free, 10820k buffers
Swap: 4016200k total, 68292k used, 3947908k free, 2857076k cached

• Quick diversion into Postgres tuning 101...
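The polling change above is a one-line Condor configuration setting. A sketch, assuming the period is specified in seconds (3600 matching the one-hour value we settled on):

```
# condor_config on execute machines: poll Quill once an hour instead of
# every 10 seconds, so thousands of startds don't hammer Postgres
QUILL_POLLING_PERIOD = 3600
```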
Postgres
• Assuming that there’s enough disk bandwidth...
– To support 2,500 simultaneous connections, one must turn up max_connections
– If you turn up max_connections, you need ~400 bytes of shared memory per connection slot
• Currently we have 2 GB of shared memory allocated
Postgres
• Then you’ll need to turn up shared_buffers
– 1 GB currently
– Don’t forget max_fsm_pages...

WARNING: relation "public.machines_vertical_history" contains more than
         "max_fsm_pages" pages with useful free space
HINT: Consider compacting this relation or increasing the configuration
      parameter "max_fsm_pages".
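Pulled together, the last two slides amount to a short postgresql.conf fragment. The max_connections and shared_buffers values are the ones quoted above; the max_fsm_pages value is a placeholder, not a measured recommendation:

```
# postgresql.conf -- settings from the tuning above (Postgres of this era)
max_connections = 2500      # one backend per connecting Quill daemon
shared_buffers  = 1GB       # plus ~400 bytes of shared memory per connection slot
max_fsm_pages   = 2000000   # placeholder: raise until the WARNING above goes away
```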
What works, but is painful
• So by now we can withstand the worker nodes reasonably well
• Add schedds:
– condor_history returns history from ALL schedds
• Bug fixed in 7.0.2
– The execute machines create enough load that condor_q is sluggish
– Added a second Quill database server just for job information
What works, but is painful
• If your daemons log a lot to their sql.log files but can’t write to the database...
– Database down, etc.
– ...your database is in a world of hurt while it tries to catch up on the backlog
What Hasn’t Worked
• Many Postgres tuning guides recommend a connection pooler if you need scads of connections
– pgpool-II
– PgBouncer
• Tried both; Quill doesn’t seem to like them
– They *did* reduce load... but often locked up the database (connections stuck “idle in transaction”), and we didn’t get anywhere
What can we do about it?
• Throw hardware at the database!
– Spindle count seems OK
• Not I/O bound (any more)
– More memory = more connections
• 16 GB? More?
– More, faster CPUs
• We appear to be CPU-bound now
• Get the latest multi-cores
What can we do about it?
• Contact Wisconsin and call for rescue
– “Hey guys... this is really hard on the old database”
– “Hmm. Let’s take a look.”
What can Wisconsin do about it?
• Todd, Greg, and probably others take a look:
– Quill always hits the database, even for unchanged ads
– The Postgres backend does not prepare SQL queries before submitting them
• Being fixed; Todd is optimistic
– We’ll report the results as soon as we have them
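For readers unfamiliar with the second point: a prepared statement lets Postgres parse and plan a query once and reuse the plan on every execution, instead of re-planning per call. A generic sketch in plain SQL (the table and column names below are made up for illustration, not Quill’s actual schema):

```sql
-- Parse and plan once...
PREPARE update_seen (text, timestamp) AS
    UPDATE machines SET last_seen = $2 WHERE machine_name = $1;

-- ...then each per-ad update reuses the stored plan instead of
-- re-parsing and re-planning the SQL text every time:
EXECUTE update_seen('node001.example.edu', now());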
Future Directions
• Reporting for users
– Easy access to statistics about who ran on “my” machines
• Mashups, web portals
– Diagnostic tools to help users
• Troubleshooting, etc.
The End
• Questions?
Backup slides
BoilerGrid - Results

Year  Pool Size  Jobs       Hours Delivered  Unique Users
2004  1,500      43,551     346,000          14
2005  4,000      210,717    1,695,000        26
2006  6,100      4,251,981  5,527,000        72
2007  7,700      9,611,813  9,524,000        117
2008  14,000+    ??         ??               63 so far...