Grids and Condor
Barcelona, 2006

Condor Project
Computer Sciences Department
University of Wisconsin-Madison
condor-admin@cs.wisc.edu
http://www.cs.wisc.edu/condor

Agenda
§ Extended user’s tutorial
§ Advanced Uses of Condor
    § Java programs
    § DAGMan
    § Stork
    § MW
    § Grid Computing
§ Case studies, and a discussion of your application’s needs

Resources
§ There are many resources (machines) in the world, and many are or can be made available!
§ Groups of machines may be labeled as grids
§ Welcome to the power of the grid!

Condor and Grids
§ Condor has always been a tool to harness grid computing
§ Condor’s mechanisms have evolved as technologies have evolved. Roughly categorized:
    § Flocking
    § Glidein
    § The grid universe

Flocking
• A way for jobs to run within a different, separate Condor pool
• Condor runs here, and Condor runs there

Connect Condor Pools with Flocking
§ Flocking is a Condor-specific technology
§ Flocking is enabled with configuration
§ Jobs flock from here to there when they cannot be run here due to lack of available machines

Configuration
§ Configuration files contain much of the administrative information used by Condor
§ Format is like that in submit description files:
    AttributeName = Value

Configuration here
§ For jobs to be able to flock from here to there
§ In the configuration file on the pool where jobs flock from:
    FLOCK_TO = <central manager machine name>
    FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
    FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
    HOSTALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)

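The $(NAME) references in these settings are macro substitutions: one setting reuses the value of another. The following Python sketch illustrates that idea only; it is not Condor's configuration parser. The host names are hypothetical, and treating an undefined macro as empty is an assumption of the sketch.

```python
import re

# Matches a macro reference such as $(FLOCK_TO).
MACRO = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")


def parse_config(text):
    """Parse 'Name = Value' lines into a dict, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        settings[name.strip()] = value.strip()
    return settings


def expand(name, settings, _seen=()):
    """Return the value of `name` with every $(OTHER) reference substituted."""
    if name in _seen:
        raise ValueError("circular macro reference: " + name)
    # Assumption of this sketch: an undefined macro expands to an empty string.
    value = settings.get(name, "")
    return MACRO.sub(
        lambda m: expand(m.group(1), settings, _seen + (name,)), value
    )


# Hypothetical host names, for illustration only.
EXAMPLE = """
COLLECTOR_HOST = cm.here.example.org
FLOCK_TO = cm.there.example.org
FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
HOSTALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
"""

if __name__ == "__main__":
    cfg = parse_config(EXAMPLE)
    print(expand("HOSTALLOW_NEGOTIATOR_SCHEDD", cfg))
```

Expanding HOSTALLOW_NEGOTIATOR_SCHEDD this way shows that the local schedd ends up accepting negotiators from both the local and the remote central manager.
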
Configuration there
§ In the configuration file on the pool where jobs flock to:
    FLOCK_FROM = <submit machine name>, ..., <submit machine name>
§ To make security work:
    HOSTALLOW_WRITE_COLLECTOR = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
    HOSTALLOW_WRITE_STARTD = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
    HOSTALLOW_READ_COLLECTOR = $(HOSTALLOW_READ), $(FLOCK_FROM)
    HOSTALLOW_READ_STARTD = $(HOSTALLOW_READ), $(FLOCK_FROM)

Submit Description File
Enable file transfer:
    universe = vanilla
    executable = myjob.exe
    input = myjob.input
    output = myjob.output
    log = myjob.log
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    queue

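A submit description file is a list of name = value commands followed by queue, so it is easy to generate from a script. The Python sketch below renders the commands from this slide. The helper name render_submit is hypothetical and is not part of Condor.

```python
def render_submit(commands):
    """Render (name, value) pairs as submit description text ending in 'queue'."""
    lines = [f"{name} = {value}" for name, value in commands]
    lines.append("queue")
    return "\n".join(lines) + "\n"


# The commands from the slide, including the two that enable file transfer.
VANILLA_JOB = [
    ("universe", "vanilla"),
    ("executable", "myjob.exe"),
    ("input", "myjob.input"),
    ("output", "myjob.output"),
    ("log", "myjob.log"),
    ("should_transfer_files", "YES"),
    ("when_to_transfer_output", "ON_EXIT"),
]

if __name__ == "__main__":
    print(render_submit(VANILLA_JOB), end="")
```

The generated text can be saved to a file and handed to condor_submit in the usual way.
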
The Glidein Concept
§ Assume: We need more machines, and we have permission to use a set of machines
§ Glidein temporarily adds a set of machines to the local pool

Glidein
§ In addition, Glidein solves the problem: “My job needs to run on that particular resource, and my job needs Condor.”
§ For example: a job that must run under the standard universe

Glidein
§ Condor sends and runs its own executables on the resource
§ The needed resource appears to temporarily join the local Condor pool!

Glidein
§ Run condor_glidein to add the remote resource to the local pool
§ The master and startd daemons become grid universe jobs using gt2
[Diagram labels: local pool, remote resource]

Making Glidein Work
§ Change the configuration to give access permission (HOSTALLOW_WRITE) to the remote resource
§ No changes to jobs’ submit description files!
§ But, do enable file transfer in the submit description file:
    universe = vanilla
    executable = myjob.exe
    input = myjob.input
    output = myjob.output
    log = myjob.log
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    queue

Force Job to Glidein Resource
In the submit description file:
    universe = standard
    executable = ajob.exe
    input = ajob.input
    output = ajob.output
    log = ajob.log
    requirements = ( machine == "example.mcs.anl.gov" ) && Arch != "" && OpSys != ""
    queue

The Grid Universe
Most useful when
1. We want to send a job off to a faraway machine
2. We want to hand a job to another batch processing system on the local machine
3. We want to send a job off to a faraway machine, in order to hand that job to another batch processing system on that machine

The Grid Universe
§ All handled in the submit description file
§ Supports several back end types:
    § Globus: GT2, GT3, GT4
    § NorduGrid
    § UNICORE
    § Condor
    § PBS
    § LSF

Condor-G
§ Condor-G describes jobs to be handed off to a machine that uses Globus middleware
    § gt2: Globus Toolkit 1 or 2, or the pre-web services GRAM
    § gt3: Globus Toolkit 3
    § gt4: Globus Toolkit 4, or WS GRAM

Submit Description File
For gt2:
    universe = grid
    input = job1.input
    output = job1.result
    log = job1.log
    grid_resource = gt2 example.wisc.edu/jobmanager
    queue
One of: jobmanager-condor, jobmanager-pbs, jobmanager-lsf, jobmanager-sge

Submit Description File
For gt3:
    universe = grid
    input = job2.input
    output = job2.result
    log = job2.log
    grid_resource = gt3 http://198.51.254.40:8080/osga/services/base/gram/XXXManagedJobFactoryService
    queue
§ XXX is one of: Fork, Condor, PBS, LSF, SGE
§ 198.51.254.40:8080 is IP address:Port number

Submit Description File
For gt4:
    universe = grid
    input = job3.input
    output = job3.result
    log = job3.log
    grid_resource = gt4 https://198.51.254.40:8080/wsrf/service/ManagedJobFactoryService XXX
    queue
§ XXX is one of: Fork, Condor, PBS, LSF, SGE
§ 198.51.254.40:8080 is IP address:Port number OR Host name:Port number

Nordugrid and the Submit Description File
    universe = grid
    input = job4.input
    output = job4.result
    log = job4.log
    grid_resource = nordugrid ngexample.com
    queue

Unicore and the Submit Description File
    universe = grid
    input = job5.input
    output = job5.result
    log = job5.log
    grid_resource = unicore usite.example.com vsite
    keystore_file = /frieda/certificates/
    keystore_alias = "frieda"
    keystore_passphrase_file = /frieda/private/passphrase
    queue
§ vsite is the name of the Unicore virtual resource

PBS and the Submit Description File
§ Details of the PBS installation are in $(GLITE_LOCATION)/etc/batch_gahp.config
    universe = grid
    input = job6.input
    output = job6.result
    log = job6.log
    grid_resource = pbs
    queue

LSF and the Submit Description File
§ Details of the LSF installation are in $(GLITE_LOCATION)/etc/batch_gahp.config
    universe = grid
    input = job7.input
    output = job7.result
    log = job7.log
    grid_resource = lsf
    queue

Condor-C
§ Condor is running here, and Condor is running over there
§ For the case where we want to send a job off to a faraway machine, in order to hand that job to another batch processing system on that machine

Condor-C and the Submit Description File
    universe = grid
    input = job8.input
    output = job8.result
    log = job8.log
    grid_resource = condor joe@remotemachine.example.com remotecentralmanager.example.com
    +remote_jobuniverse = 5
    +remote_requirements = True
    +remote_ShouldTransferFiles = "YES"
    +remote_WhenToTransferOutput = "ON_EXIT"
    queue
§ joe@remotemachine.example.com is the schedd name
§ remotecentralmanager.example.com is the collector machine name
§ 5 is the vanilla universe

Credentials
§ Not just anybody can use any resource at any time...
§ Key concepts:
    § Authentication: verification of an identity
    § Authorization: permission to do something

Authentication
If Frieda says “I am Frieda.”, how do we distinguish this from Frieda saying “I am George Bush.”?

Authentication
§ Bush can do whatever he pleases
§ If Frieda claims to be Bush (and this is accepted), then Frieda can do whatever she pleases
§ Authentication attempts to verify the identity of the entity that is communicating

Authorization
§ Who is allowed (permitted) to do what
    § Frieda may run gt4 jobs on the Open Science Grid machines
    § Fred may write to files in /usr/bin
    § The Unix user root may do anything!
§ Can be implemented with a list of those authorized

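A list of those authorized can be as simple as a set of (who, what) pairs. The Python sketch below encodes the three examples from this slide. The action strings and the names ALLOWED, SUPERUSERS and is_authorized are hypothetical; this is not how Condor stores its authorization settings.

```python
# Hypothetical access list: each entry is a (principal, action) pair.
ALLOWED = {
    ("frieda", "run gt4 jobs on Open Science Grid"),
    ("fred", "write /usr/bin"),
}

# Principals that may do anything, like the Unix user root.
SUPERUSERS = {"root"}


def is_authorized(principal, action):
    """Return True if the (already authenticated) principal may perform the action."""
    return principal in SUPERUSERS or (principal, action) in ALLOWED


if __name__ == "__main__":
    for who, what in [
        ("frieda", "run gt4 jobs on Open Science Grid"),
        ("frieda", "write /usr/bin"),
        ("root", "write /usr/bin"),
    ]:
        print(who, what, is_authorized(who, what))
```

The check assumes the principal's identity has already been verified; authorization without authentication would let anyone claim to be root.
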
Condor and Authentication
Authentication within Condor comes in many forms. Here are three.
1. File system: Have the entity write a file. The OS attaches a name to the file owner. Condor checks that the entity’s claim is the same as the file owner.
2. GSI (Grid Security Infrastructure)
3. Kerberos

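The file system method can be sketched in a few lines of Python: the verifier picks a path, the claimant creates a file there, and the verifier compares the owner recorded by the OS with the claimed identity. This is an illustration of the idea only, not Condor's implementation. It compares numeric Unix user IDs rather than names, runs verifier and claimant in one process, and the function names are hypothetical.

```python
import os
import tempfile


def fs_authenticate(claimed_uid, write_file):
    """Verify a claimed user ID by having the claimant create a file.

    write_file(path) stands in for the claimant. In a real deployment the
    claimant is a separate process, and the challenge directory must be
    writable by it.
    """
    with tempfile.TemporaryDirectory() as challenge_dir:
        path = os.path.join(challenge_dir, "challenge")
        write_file(path)
        try:
            # The OS, not the claimant, records who owns the new file.
            owner_uid = os.lstat(path).st_uid
        except FileNotFoundError:
            return False  # the claimant never created the file
    return owner_uid == claimed_uid


def claimant_writes(path):
    """A cooperative claimant: create an empty file at the requested path."""
    with open(path, "w"):
        pass


if __name__ == "__main__":
    print(fs_authenticate(os.getuid(), claimant_writes))
```

The method only works when verifier and claimant share a file system, which is why GSI and Kerberos are needed between machines.
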
Authentication Idea
• A centralized certificate authority (CA) verifies an entity’s identity.
• When satisfied, the CA issues a signed certificate (also called a credential)
[Diagram: CA; certificate reading “I am Frieda”]

Authentication
• To authenticate, the entity presents the certificate
• All is well, if we trust the CA and the remote machine
[Diagram: CA; certificate reading “I am Frieda”]

GSI Authentication
§ GSI uses X.509 certificates
§ Grid universe jobs submitted to back end types that use Globus middleware (gt2, gt3, gt4), as well as nordugrid and unicore, use X.509 certificates
§ Condor can also use GSI

Revocation, Trust, and Proxies
§ The CA may revoke a credential
§ Frieda gives the signed credential to the remote machine. If the remote machine is malicious, it could impersonate Frieda. Therefore, a password protects the credential.
§ A proxy is a credential that includes the password, but is only valid for a specific (short) time period.
§ MyProxy software enables GSI proxy management

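The point of a proxy is that a stolen one stops working soon. The Python sketch below shows a signed credential with an expiry time and the check a remote machine would make. It is a simplified illustration, not GSI: real proxies are X.509 certificates with public-key signatures, while this sketch uses a shared-secret HMAC as a stand-in so that it needs only the standard library. The key and the function names are hypothetical.

```python
import hashlib
import hmac

# Stand-in for the issuer's signing key (hypothetical). Real GSI uses
# public-key signatures, so verifiers never hold the signing key.
ISSUER_KEY = b"hypothetical-signing-key"


def _sign(subject, not_after):
    """Sign the subject together with its expiry time."""
    message = f"{subject}|{not_after}".encode()
    return hmac.new(ISSUER_KEY, message, hashlib.sha256).hexdigest()


def issue_proxy(subject, now, lifetime_seconds):
    """Issue a short-lived credential valid until now + lifetime_seconds."""
    not_after = now + lifetime_seconds
    return {
        "subject": subject,
        "not_after": not_after,
        "signature": _sign(subject, not_after),
    }


def is_valid(proxy, now):
    """Accept the proxy only if it is unmodified and has not expired."""
    expected = _sign(proxy["subject"], proxy["not_after"])
    untampered = hmac.compare_digest(expected, proxy["signature"])
    return untampered and now <= proxy["not_after"]


if __name__ == "__main__":
    proxy = issue_proxy("frieda", now=1000, lifetime_seconds=3600)
    print(is_valid(proxy, now=1060))   # within the lifetime
    print(is_valid(proxy, now=4601))   # after expiry
```

Because the expiry time is covered by the signature, a holder cannot extend the lifetime without invalidating the credential.
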