What's new in HTCondor? What's coming? HTCondor Week
What’s new in HTCondor? What’s coming?
HTCondor Week 2016
Madison, WI -- May 18, 2016
Todd Tannenbaum
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison
Release Timeline
› Stable Series
    HTCondor v8.4.x - introduced Aug 2015 (currently at v8.4.6)
› Development Series
    HTCondor v8.5.5 frozen, in beta test, release to web later this month
    HTCondor v8.6.0 expected summer 2016
Source: https://www.openhub.net/p/condorproject
Some enhancements in HTCondor v8.4
› Scalability and stability
    Goal: 200k slots in one pool, 10 schedds managing 400k jobs
    Resolved developer tickets: 240 bug fix issues (v8.2.x tickets), 234 enhancement issues (v8.3 tickets)
› Docker Job Universe
› Tool improvements, esp. condor_submit
› IPv6 mixed mode
› Encrypted Job Execute Directory
› Periodic application-layer checkpoint support in Vanilla Universe
› Submit requirements
› New packaging
Scalability Enhancement Examples
condor_shadow resources
    Reduce memory footprint of the shadow
    Eliminate need for authentication step to schedd, startd (on execute host)
    v7.8.7: 860 KB / 1860 KB
    v8.4.0: 386 KB
Authentication Speedups
› FS (file system) and GSI authentication are now performed asynchronously
    So now a Condor daemon can perform many authentications in parallel
    CMS pool went from 200 execute nodes (glideins) per collector to 2000
› Can cache mapping of GSI certificate name to user name
    Mapping can be heavyweight, esp. if HTCondor has to contact an external service (LCMAPS…)
    Knob name is GSS_ASSIST_GRIDMAP_CACHE_EXPIRATION
Faster assignment of resources from central manager to schedd
› Negotiator can ask the schedd for more than one resource request per network round trip.
    NEGOTIATOR_RESOURCE_REQUEST_LIST_SIZE = 20
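A rough way to see why batching helps: if each network round trip carries up to NEGOTIATOR_RESOURCE_REQUEST_LIST_SIZE resource requests, the number of round trips is about the request count divided by the list size, rounded up. The Python sketch below is a simplified model, not HTCondor code; the function name is hypothetical and it ignores everything else that happens during negotiation.

```python
import math


def negotiation_round_trips(num_requests: int, list_size: int) -> int:
    """Round trips needed if each trip carries at most `list_size` requests.

    Simplified model (an assumption, not HTCondor's implementation).
    """
    if num_requests < 0 or list_size < 1:
        raise ValueError("num_requests must be >= 0 and list_size >= 1")
    return math.ceil(num_requests / list_size)


# Compare one request per trip with the list sizes mentioned on the slides.
for size in (1, 20, 100):
    trips = negotiation_round_trips(1000, size)
    print(f"list size {size:>3}: {trips} round trips for 1000 requests")
```

Under this model the saving matters most when each round trip is slow, which is consistent with the WAN measurements on the next slide.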
Impact of multiple resource requests
[Chart: negotiation times for a 1000-slot pool versus number of job autoclusters. Series: 8.2.8 LAN, 8.2.8 WAN, 8.3.5 LAN 20 reqs, 8.3.5 LAN 100 reqs, 8.3.5 WAN 20 reqs, 8.3.5 WAN 100 reqs]
Eliminate CCB service pauses
Query Responsiveness
› Improvement: Collector will not fork for queries to small tables
    Load Collector with 100k machine ads
    Before change: ~4.5 queries/second
    After change: ~24.4 queries/second
› Improvement: Schedd condor_q quantum adjusted (to 100 ms)
    Load schedd with 100k job ads, 40 Hz job throughput
    Before change: ~135 seconds per condor_q
    After change: ~22 seconds per condor_q
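The two measurements use different units (a rate for the collector, a latency for the schedd), so the improvement factors are computed in opposite directions. A small Python sketch of that arithmetic, using only the numbers on the slide; the helper name is hypothetical.

```python
def improvement_factor(before: float, after: float, higher_is_better: bool) -> float:
    """Return how many times better `after` is than `before`."""
    return after / before if higher_is_better else before / after


# Collector: queries per second, so higher is better.
collector = improvement_factor(4.5, 24.4, higher_is_better=True)
# Schedd: seconds per condor_q, so lower is better.
schedd = improvement_factor(135, 22, higher_is_better=False)

print(f"collector query rate improved about {collector:.1f}x")
print(f"condor_q latency improved about {schedd:.1f}x")
```

Both changes come out at roughly a five- to six-fold improvement under the stated test loads.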
Container Support (Black Box Applications)
› HTCondor cgroup support now manages swap space in addition to CPU, Memory
› New job universe to support Docker Containers
Please talk to us if you have interest in using Docker with HTCondor!
Docker Universe Job Is still a job
› Docker containers have the job-nature
    condor_submit
    condor_rm
    condor_hold
    Write entries to the job event log(s)
    condor_dagman works with them
    Policy expressions work
    Matchmaking works
    User prio / job prio / group quotas all work
    Stdin, stdout, stderr work
    Etc. etc.*
Many condor_submit improvements
"You submit your jobs with that script??!? You’re braver than I thought!"
More ways to Queue 'foreach'
    Queue <N> <var> in (<item-list>)
    Queue <N> <var> matching (<glob-list>)
    Queue <N> <vars> from <filename>
    Queue <N> <vars> from <script> |
› Iterate <items>, creating <N> jobs for each item
› In/from/matching keywords control how we get <items>
› There's more. See the manual for details.
Example: Queue matching files
    Executable = foo.exe
    Arguments = -inputdata $(Item)
    Queue 1 Item matching (*.dat, m*)
› Produces a job for each file that matches *.dat or m* (or both)
› $(Item) holds each filename in turn
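To make the "or both" case concrete, the Python sketch below imitates the selection described on the slide: a file that matches either glob yields one item, even if it matches both. It uses Python's fnmatch module as a stand-in for HTCondor's own matching, and the file names are made up; details such as ordering and directory handling in condor_submit may differ.

```python
from fnmatch import fnmatchcase


def items_matching(filenames, patterns):
    """Return each filename that matches at least one glob pattern, once."""
    return [
        name
        for name in filenames
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# Hypothetical directory listing.
files = ["a.dat", "b.dat", "model.dat", "map.txt", "notes.txt"]
items = items_matching(files, ["*.dat", "m*"])

# One job per item, with $(Item) replaced by the filename.
for item in items:
    print(f"foo.exe -inputdata {item}")
```

Here model.dat matches both patterns but appears only once, and notes.txt matches neither, so four jobs would be described.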
condor_q new arguments
› -dag <dagman-job-id>
    Show all jobs in the dag
› -limit <num>
    Show at most <num> records
› -totals
    Show only totals
› -autocluster -long
    Group and count jobs that have same requirements
    …perfect for provisioning systems
IPv6 Support
› New in 8.4 is support for “mixed mode,” using IPv4 and IPv6 simultaneously.
› A mixed-mode pool’s central manager and submit nodes must each be reachable on both IPv4 and IPv6.
› Execute nodes and (other) tool-hosting machines may be IPv4, IPv6, or both.
    ENABLE_IPV4 = TRUE
    ENABLE_IPV6 = TRUE
Encrypted Execute Directory
› Jobs can request (or admins can require) that their scratch directory be encrypted in realtime
    /tmp and /var/tmp output also encrypted
    Put encrypt_execute_directory=True in job submit file (or condor_config)
› Only the condor_starter and job processes can see the cleartext
    Even a root ssh login / cron job will not see the cleartext
    Batch, interactive, and condor_ssh_to_job works
Periodic Application-Level Checkpointing in the Vanilla Universe
› Experimental feature!
› If requested, HTCondor periodically sends the job its checkpoint signal and waits for the application to exit.
› If it exits with code 0, HTCondor considers the checkpoint successful and does file transfer, and re-executes the application.
› Otherwise, the job is requeued.
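The decision described in the bullets above can be written as a small function. This Python sketch only restates the slide's rule (exit code 0 means the checkpoint succeeded); the enum and function names are hypothetical and this is not how HTCondor implements it.

```python
from enum import Enum


class NextStep(Enum):
    TRANSFER_AND_REEXECUTE = "transfer checkpoint files, then re-execute the application"
    REQUEUE = "requeue the job"


def after_checkpoint_signal(exit_code: int) -> NextStep:
    """Pick the next step once the application exits after the checkpoint signal."""
    if exit_code == 0:
        # Exit code 0: the checkpoint is considered successful.
        return NextStep.TRANSFER_AND_REEXECUTE
    # Any other exit code: the job goes back to the queue.
    return NextStep.REQUEUE


for code in (0, 1):
    print(f"exit code {code}: {after_checkpoint_signal(code).value}")
```

The practical consequence for application authors is that the process should exit with 0 only after its checkpoint files are completely written.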
Submit Requirements
› Allow administrator to decide which jobs enter the queue via a SUBMIT_REQUIREMENTS constraint
› Rejection (error) message may be customized
HTCondor RPM Packaging
› More Standard Packaging
    Matches OSG and Fedora package layout
    Built with rpmbuild
    Source RPM is released
      • Can rebuild directly from the source RPM
      • Build requirements are enforced by rpmbuild
    Partitioned into several binary RPMs
      • Pick and choose what you need
HTCondor Binary RPM Packages
    RPM                     Description
    condor                  Base package
    condor-all              Includes all the packages in a typical installation
    condor-bosco            BOSCO – Manage jobs on remote clusters via ssh
    condor-classads         HTCondor classified advertisement library
    condor-classads-devel   Development support for classads
    condor-debuginfo        Symbols for libraries and binaries
    condor-externals        External programs and scripts
    condor-externals-libs   External libraries
    condor-kbdd             HTCondor Keyboard Daemon
    condor-procd            HTCondor Process Tracking Daemon
    condor-python           Python Bindings for HTCondor
    condor-static-shadow    Static Shadow (Use 32-bit shadow on 64-bit system)
    condor-std-universe     Standard Universe Support
    condor-vm-gahp          VM Universe Support
HTCondor Debian Packaging
› More Standard Packaging
    Matches Debian package layout
    Built with pbuilder
    Source package is released

    deb              Description
    condor           Base Package
    condor-dbg       Symbols for libraries and programs
    condor-dev       Development files for HTCondor
    condor-doc       HTCondor documentation
    libclassad-dev   Development files for Classads
    libclassad7      Classad runtime libraries
What to do with all these statistics?
› Aggregate and send them to Ganglia!
    condor_gangliad introduced in v8.2
    See manual or my talk at http://bit.ly/1YBBO3P
› In addition to (or instead of) sending to Ganglia, aggregate and make available in JSON format over HTTP
    condor_gangliad rename to condor_metricd
› View some basic historical usage out-of-the-box by pointing web browser at central manager (modern CondorView)…
› Or upload to influxdb, graphite for Grafana
Enabled by default and/or easier to configure
› Enabled by default: shared port, cgroups, IPv6
    Have both IPv4 and v6? Prefer IPv4 for now
› Configured by default: Kernel tuning
› Easier to configure: Enforce slot sizes
    use policy: preempt_if_cpus_exceeded
    use policy: hold_if_cpus_exceeded
    use policy: preempt_if_memory_exceeded
    use policy: hold_if_memory_exceeded
New condor_q default output
› Only show jobs owned by the user
    disable with -allusers
› Batched output (-batch, -nobatch)
› Proposed new default output of condor_q will show summary of current user's jobs.

    -- Submitter: adam   Schedd: submit-3.chtc.wisc.edu
    [Sample output: one row per batch of jobs owned by adam, with columns OWNER, IDLE, RUNNING, HELD, SUBMITTED, DESCRIPTION, JOBIDs]
    SUBMITTED     DESCRIPTION     JOBIDs
    3/22 07:20    DAG: 221546     230864.0
    3/23 08:57    AtlasAnlysis    263203.0
    3/27 09:37    matlab.exe      307333.0
    3/27 11:46    DAG: 311986     312342.0 ... 313304.0

    In the last 20 minutes:
      0 Job(s) were Completed
      5 Job(s) were Started     312690.0 ... 312695.0
      1 Job(s) were Held        263203.0
          5/11 07:22 Error from slot1@eee.chtc.wisc.edu: out of disk
New condor_status default output
› Only show one line of output per machine
› Can try now in v8.5.4+ with "-compact" option
› The "-compact" option will become the new default once we are happy with it

    [Sample -compact output: one row per machine, with columns Machine, Platform, Slots, Cpus, Gpus, TotalGb, FreCpu, FreeGb, CpuLoad, ST]
    Machine        Slots  Cpus  FreCpu  FreeGb  CpuLoad  ST
    gpu-1          8      8     0       0.44    1.90     Cb
    gpu-2          8      8     0       0.57    1.87     Cb
    gpu-3          8      8     0       16.13   0.85     Cb
    matlab-build   1      12    11      23.33   0.00     **
    mem1           32     80    0       160.17  1.00     Cb
HTCondor and Kerberos
› HTCondor currently allows you to authenticate users and daemons using Kerberos
› However, it does NOT currently provide any mechanism to provide a Kerberos credential for the actual job to use on the execute slot
HTCondor and Kerberos/AFS
› So we are adding support to launch jobs with Kerberos tickets / AFS tokens
› Details
    HTCondor 8.5.X allows an opaque security credential to be obtained by condor_submit and stored securely alongside the queued job (in the condor_credd daemon)
    This credential is then moved with the job to the execute machine
    Before the job begins executing, the condor_starter invokes a call-out to do optional transformations on the credential
Grid Universe
› Reliable, durable submission of a job to a remote scheduler
› Popular way to send pilot jobs
› Supports many “back end” types:
    HTCondor
    PBS
    LSF
    Grid Engine
    Google Compute Engine
    Amazon EC2
    OpenStack
    Deltacloud
    Cream
    NorduGrid ARC
    BOINC
    Globus: GT2, GT5
    UNICORE
Improved Scalability of Amazon EC2 grid jobs
[Chart: number of jobs running on Spot instances in Amazon AWS over time, from about 13:00 to 18:30; vertical axis from 0 to 120000]
Elastically grow your pool into the Cloud: condor_annex
› Leverage efficient AWS APIs such as Auto Scaling Groups and Spot Fleets
    Implement a “lease” so charges cease if lease expires
› Secure mechanism for cloud instances to join the HTCondor pool at home institution

    condor_annex --set-size 2000 --lease 24 --project "144PRJ22"
Grid Universe support for SLURM, OpenStack, Cobalt
› Speak native SLURM protocol
    No need to install PBS compatibility package
› Speak OpenStack’s NOVA protocol
› Speak to Cobalt Scheduler
    Argonne Leadership Computing Facilities
Jaime: Grid Jedi
Transformation of job ad upon submit
› Allow admin to have the schedd add/edit job attributes upon submission (use case: insert trusted group attributes based upon owner)
› In v8.5.1+ can also set attributes as immutable by the user
› Prevent user from editing protected attributes with condor_qedit or chirp
Docker Universe Enhancements
› Docker jobs get usage updates (i.e. network usage) reported in job classad
› Admin can add additional volumes that all docker universe jobs get
    Why?
      • CVMFS
      • Large shared data
    Details: https://htcondorwiki.cs.wisc.edu/index.cgi/tktview?tn=5308
Potential Future Docker Universe Features?
› Advertise images already cached on machine?
› Support for condor_ssh_to_job?
› Package and release HTCondor into Docker Hub?
› Network support beyond NAT?
› Run containers as root??!?!?
› Automatic checkpoint and restart of containers! (via CRIU)
SELinux and systemd
› SELinux (On by default in RHEL 7)
› Systemd Integration
    Port Reservation - Systemd will reserve 9618 for HTCondor
    Watchdog - If the master stops responding, systemd will restart it
    Status messages - display via systemctl status
    Logging - Daemon log messages can go to systemd-journald
Draining jobs from execute nodes
› Add ability to backfill with pre-emptable jobs while draining
    Specifically, ability to specify a new startd START expression when entering drain state
› Add ability to shut down when fully drained
    Alternative to condor_off -peaceful
› Investigating ability to upgrade HTCondor on execute nodes without restarting jobs
DAGMan Improvements
    Splice Pin connections
        Allows more flexible parent/child relationships between nodes within splices
        Parsed when DAGMan starts up
    INCLUDE directive
    Set ClassAd attributes in DAG
    Set Batch Name
Seeking ideas to help users and admins learn
› Move HOWTO recipes on wiki to stackoverflow?
› Sub-reddit instead of email list?
› YouTube videos?
Smarter and Faster Schedd
› User accounting information moved into ads in the Collector
    Enable schedd to move claims across users
› Non-blocking authentication, smarter updates to the collector, faster ClassAd processing
› Late materialization of jobs in the schedd to enable submission of very large sets of jobs
    More jobs materialized once number of idle jobs drops below a threshold (like DAGMan throttling)
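The late-materialization idea can be illustrated with a lazy job source that is only drawn from when the idle count falls below a threshold. This Python sketch is a toy model under that assumption; the function name, the refill-to-threshold rule, and the job names are hypothetical and do not describe the schedd's actual behavior.

```python
from itertools import islice


def late_materialize(job_source, idle_count: int, threshold: int) -> list:
    """Create only enough new jobs to bring the idle count back up to the threshold."""
    shortfall = max(0, threshold - idle_count)
    return list(islice(job_source, shortfall))


# A very large submission is described lazily; nothing is created up front.
job_source = (f"job-{i}" for i in range(1_000_000))

first_batch = late_materialize(job_source, idle_count=0, threshold=3)
# Later, two of the idle jobs have started running, so only one is still idle.
second_batch = late_materialize(job_source, idle_count=1, threshold=3)
# If the idle count is already at the threshold, nothing new is created.
no_batch = late_materialize(job_source, idle_count=3, threshold=3)

print(first_batch, second_batch, no_batch)
```

Because the source is a generator, the queue never holds more than a threshold's worth of idle jobs at once, which is the property that makes very large submissions cheap in this model.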
Thank You!
P.S. Interested in working on HTCondor full time? Talk to me! We are hiring!
htcondor-jobs@cs.wisc.edu