High-Performance Grid Computing and Research Networking: How to Use the Cluster?

  • Slides: 23

High-Performance Grid Computing and Research Networking
How to Use the Cluster?
Presented by David Villegas
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/
sadjadi At cs Dot fiu Dot edu

Acknowledgements
• The content of many of the slides in these lecture notes has been adapted from online resources prepared previously by the people listed below. Many thanks!
  ◦ Henri Casanova
    ▪ Principles of High Performance Computing
    ▪ http://navet.ics.hawaii.edu/~casanova
    ▪ henric@hawaii.edu

Is MPI enough?
• MPI submits the jobs using rsh/ssh
• There is no control of who runs what!
• For multiple users in the cluster, we want to have privileges, authentication, fair share…

Introducing Batch Schedulers
• A job scheduler provides more features to control job execution:
  ◦ Interfaces to define workflows and/or job dependencies
  ◦ Automatic submission of executions
  ◦ Interfaces to monitor the executions
  ◦ Priorities and/or queues to control the execution order of unrelated jobs

Batch Schedulers
• Most production clusters are managed via a batch scheduler:
  ◦ You ask the batch scheduler to give you X nodes for Y hours to run program Z
  ◦ At some point, the program will be started
  ◦ Later on you can look at the program output
• This is really different from what you're used to, and honestly is sort of painful
  ◦ No interactive execution
• Necessary because:
  ◦ Since most applications are in this for high performance, they'd better be alone on their compute nodes
  ◦ There are not enough compute nodes for everybody at all times

Scheduling criteria
• Job priority
• Compute resource availability
• License key if job is using licensed software
• Execution time allocated to user
• Number of simultaneous jobs allowed for a user
• Estimated execution time
• Elapsed execution time
• Availability of peripheral devices
• Occurrence of prescribed events
• …

The case of GCB
• Rocks allows us to install different job schedulers: SGE, PBS, LSF, Condor…
• Currently we have SGE installed.
• Sun Grid Engine is an open source DRM (Distributed Resource Manager) sponsored by Sun Microsystems and CollabNet. It can be downloaded from http://gridengine.sunsource.net

Our Cluster
• You have (or soon will get) an account on the cluster
• Question: once I am logged in, what do I do?
• Clusters are always organized as
  ◦ A front-end node
    ▪ To compile code (and do minimal testing)
    ▪ To submit jobs
  ◦ Compute nodes
    ▪ To run the code
    ▪ You don't ssh to these directly
    ▪ In our case they are dual-proc Pentiums

How to use SGE as a user?
• You need to learn how to do three basic things
  ◦ Check the status of the platform
  ◦ Submit a job
  ◦ Check on job status
• All can be done from the command line
  ◦ Read the man pages
  ◦ Google "SGE commands"
• Checking on platform and job status
  ◦ qhost: information about nodes
  ◦ qstat -f: information about queues
  ◦ qstat -F [resource]: detailed information about resources
  ◦ qstat: lists pending/running/done jobs
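The status commands above can be tried in one pass. The following is a minimal sketch, not a site-specific recipe: the helper name run_if_present is hypothetical, the resource name mem_free is only an example, and the fallback branch exists solely so the script is harmless on a machine without SGE.

```shell
#!/bin/sh
# Run a command if it is installed; otherwise report that it was skipped.
# (run_if_present is a hypothetical helper, not part of SGE.)
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "skipped (not on an SGE host): $*"
    fi
}

run_if_present qhost                          # one line per node: load, memory, CPUs
run_if_present qstat -f                       # every queue instance and the jobs in it
run_if_present qstat -F mem_free              # queues annotated with one resource value
run_if_present qstat -u "${USER:-$(id -un)}"  # only your own pending/running jobs
```

On the front-end node each line prints the real command output; elsewhere it prints a "skipped" notice instead of failing.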

How to use SGE as a user? (contd.)
• Submitting and controlling jobs
  ◦ qsub
    ▪ We can pass the path to a binary or a script
    ▪ qsub -b yes: submits a binary
    ▪ qsub -q queue_list: specifies which queue(s) the job may be sent to
    ▪ qsub -pe parallel-env n: submits a parallel job
  ◦ qdel: attempts to terminate a range of jobs
• But for those of you who don't like the command line…
  ◦ qmon
    ▪ Be sure that you are forwarding X11 and that you have an X server on your client machine!
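A short sketch of the submit/monitor/delete cycle described above. The queue name all.q is an assumption (check what your cluster actually offers with qconf -sql), and the else branch is only there so the sketch can be read or run off-cluster.

```shell
#!/bin/sh
# Assumed queue name; replace it with one listed by `qconf -sql`.
queue="all.q"

# Submit the binary /bin/hostname directly (-b yes) instead of a script,
# run it in the current directory (-cwd) and restrict it to one queue (-q).
submit_cmd="qsub -b yes -cwd -q $queue /bin/hostname"

if command -v qsub >/dev/null 2>&1; then
    $submit_cmd      # prints the job id assigned by the scheduler
    qstat            # the new job should be listed as pending or running
    # qdel <job_id>  # removes the job; use the id printed by qsub
else
    echo "qsub not found; would run: $submit_cmd"
fi
```

The output of the job lands in files named after the job and its id in the submission directory.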

How to use SGE as a user? (contd.)
• But sending a single command is not very interesting…
  ◦ Submitting scripts
    ▪ Scripts can submit many jobs
    ▪ We can pass options to SGE and consult environment variables.
• Example:
  ◦ #$ -cwd: use the current directory as work directory
  ◦ #$ -j y: join errors and output in the same file
  ◦ #$ -N get_date: give a name to the job
  ◦ #$ -o output.$JOB_ID: use a given file for output
  ◦ $JOB_ID: the job number assigned by the scheduler to your job
  ◦ $JOBDIR: the directory your job is currently running in
  ◦ $USER: the username of the person currently running the job
  ◦ $JOB_NAME: the job name specified by the -N option
  ◦ $QUEUE: current running queue
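Put together, the directives and variables from the slide form a complete job script. This is a sketch under assumptions: the file name get_date.sh and the summary line are illustrative, and the fallback values after ":-" only matter when the script is run by hand outside SGE.

```shell
#!/bin/sh
#$ -cwd
#$ -j y
#$ -N get_date
#$ -o output.$JOB_ID

# Lines starting with "#$" are read by qsub as options and ignored by the shell.
# SGE exports JOB_ID, JOB_NAME and QUEUE to the job; outside SGE they are unset,
# so placeholders are substituted to keep the script runnable by hand.
summary="job ${JOB_ID:-none} (${JOB_NAME:-get_date}) for ${USER:-unknown} in queue ${QUEUE:-none}"
echo "$summary"
date
```

Saved as get_date.sh, it would be submitted with "qsub get_date.sh"; because of -j y and -o, both standard output and errors end up in output.<job id> in the submission directory.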

How to use SGE as a user? (contd.)
• Submitting parallel jobs
  ◦ We can define parallel environments to execute this kind of job. Parallel environments define startup procedures, maximum number of slots, users allowed to submit parallel jobs…
  ◦ Examples: mpich, lam…
  ◦ SGE allows "Tight Integration" with MPICH by intercepting the calls MPICH makes to run your job on other machines, and replacing those calls with SGE calls so that it may better monitor and manage your parallel jobs. (Source: http://rc.usf.edu/sge/submit.php)
  ◦ It is also possible to integrate other MPI flavors with SGE
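A parallel job script might look like the sketch below. Several things here are assumptions rather than facts about this cluster: the parallel environment is taken to be named mpich (list the real ones with qconf -spl), the slot count 4 and the program ./mpi_hello are placeholders, and the machine file location $TMPDIR/machines follows the convention of the MPICH template scripts shipped with SGE, which your site may have changed.

```shell
#!/bin/sh
#$ -cwd
#$ -j y
#$ -N mpi_hello
#$ -pe mpich 4

# Inside a parallel job SGE sets NSLOTS (slots granted) and TMPDIR (per-job
# scratch directory). The defaults after ":-" only apply outside SGE.
nslots="${NSLOTS:-4}"
machines="${TMPDIR:-/tmp}/machines"
launch="mpirun -np $nslots -machinefile $machines ./mpi_hello"

if [ -f "$machines" ] && command -v mpirun >/dev/null 2>&1; then
    $launch
else
    echo "not inside an SGE parallel environment; would run: $launch"
fi
```

Using $NSLOTS instead of a hard-coded process count keeps the mpirun call consistent with whatever the scheduler actually granted.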

How to use SGE as an admin?
• Scheduler configuration. These values are found in /opt/gridengine/default/common/sched_configuration
• These values can only be altered using qconf or qmon

  algorithm                    default
  schedule_interval            0:0:15
  maxujobs                     0
  queue_sort_method            load
  job_load_adjustments         np_load_avg=0.50
  load_adjustment_decay_time   0:7:30
  load_formula                 np_load_avg
  schedd_job_info              true
  flush_submit_sec             0
  flush_finish_sec             0
  params                       none
  reprioritize_interval        0:0:0
  halftime                     168
  usage_weight_list            cpu=1,mem=0,io=0
  compensation_factor          5
  …
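The current scheduler settings can be printed with qconf -ssconf. The sketch below pulls a single value out of that listing; sched_value is a hypothetical helper, and the sample input in the else branch simply repeats two lines from the slide so the script also runs without SGE.

```shell
#!/bin/sh
# Print the value of one parameter from "name value" lines read on stdin.
# (sched_value is a hypothetical helper, not an SGE command.)
sched_value() {
    awk -v key="$1" '$1 == key { $1 = ""; sub(/^ +/, ""); print }'
}

if command -v qconf >/dev/null 2>&1; then
    qconf -ssconf | sched_value schedule_interval
else
    # Sample lines copied from the slide, used only when qconf is unavailable.
    printf 'algorithm default\nschedule_interval 0:0:15\n' | sched_value schedule_interval
fi
```

To change a value, an SGE manager would run "qconf -msconf", which opens the same listing in an editor.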

How to use SGE as an admin? (contd.)
• Queue configuration
  ◦ Queues are created with qmon or qconf
    ▪ qconf -shgrpl: show all host groups
    ▪ qconf -ahgrp group: add a new host group
    ▪ qconf -shgrp group: show details for one group
    ▪ qconf -sq queue: show a queue configuration
    ▪ qconf -Aq file: create a queue from a file
  ◦ We'll output a queue configuration to a file and modify it.
  ◦ Exercise: create a short/test job queue. Which policies are best for this kind of queue?
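One possible way to start the exercise is to dump an existing queue, edit the copy, and load it back. Everything specific below is an assumption: the source queue all.q, the new name short.q, the file name short.q.conf, and the 10-minute hard run-time limit (h_rt) are illustrative choices, not a recommended policy.

```shell
#!/bin/sh
# Rewrite a queue definition read on stdin: rename it and cap the wall-clock time.
# (make_short_queue is a hypothetical helper; short.q and 00:10:00 are examples.)
make_short_queue() {
    sed -e 's/^qname .*/qname short.q/' \
        -e 's/^h_rt .*/h_rt 00:10:00/'
}

if command -v qconf >/dev/null 2>&1; then
    qconf -sq all.q | make_short_queue > short.q.conf
    echo "review short.q.conf, then as an SGE manager run: qconf -Aq short.q.conf"
else
    # Two sample lines stand in for a real queue definition off-cluster.
    printf 'qname all.q\nh_rt INFINITY\n' | make_short_queue
fi
```

Other attributes worth revisiting for a short queue include slots, priority and seq_no; "man queue_conf" describes each of them.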

Queue parameters
  ◦ qname
  ◦ hostlist
  ◦ seq_no
  ◦ load_thresholds
  ◦ suspend_thresholds
  ◦ nsuspend
  ◦ suspend_interval
  ◦ priority
  ◦ min_cpu_interval
  ◦ processors
  ◦ qtype
  ◦ ckpt_list
  ◦ pe_list
  ◦ rerun
  ◦ slots
  ◦ tmpdir
  ◦ shell
  ◦ …
• For the rest, type man queue_conf

But, are local schedulers enough?
• Schedulers allow us to manage jobs in one or more clusters, but there are still some limitations for more complex systems:
  ◦ Centralized job scheduling
  ◦ Computing nodes are in the same location
  ◦ Homogeneous software

Next step: GRID computing
• GRID computing allows us to make distant, heterogeneous clusters work together.
  ◦ Coordinate multiple resources (discovery, access, allocation, monitoring)
  ◦ Allow user authorization to provide secure access to resources
  ◦ Provide open standards to improve interoperability
  ◦ Give local control to organizations

How do we put everything together?

Wrapping up: What do we have?

In a nutshell…

Sending a job to SGE using GRAM
• Create a personal certificate with grid-cert-request
• Have it signed by the local CA
• Create a proxy with grid-proxy-init
• Submit it with globus-job-run localhost/jobmanager-sge /bin/hostname
• This is a very simple example that uses pre-WS GRAM services
• Globus still gives us more advanced features:
  ◦ File staging
  ◦ RSL (Resource Specification Language) and JSDL (Job Submission Description Language)
  ◦ Access across organization boundaries
  ◦ …
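Once the certificate has been requested and signed (a one-time step), the day-to-day part of the sequence above can be sketched as follows. It assumes the Globus client tools are on the PATH and that the gatekeeper on localhost exposes a jobmanager-sge service, as in the slide; the else branch only exists so the sketch is safe to run elsewhere.

```shell
#!/bin/sh
# GRAM contact string taken from the slide: gatekeeper host / job manager.
contact="localhost/jobmanager-sge"

if command -v globus-job-run >/dev/null 2>&1; then
    # Reuse a valid proxy if there is one; otherwise create it (asks for the
    # pass phrase protecting your private key).
    grid-proxy-info -exists || grid-proxy-init
    # Run /bin/hostname through GRAM, which hands the job over to SGE.
    globus-job-run "$contact" /bin/hostname
else
    echo "Globus client tools not found; would run: globus-job-run $contact /bin/hostname"
fi
```

The printed hostname should be that of a compute node, which shows that the job went through the SGE queue rather than running on the front end.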

Some examples
• Hurricane mitigation
• Metascheduling and job flow management

Conclusion
There is still a lot to explore!