Getting Started @ OLCF
Bill Renaud, OLCF User Support
ORNL is managed by UT-Battelle for the US Department of Energy
General Information
• This presentation covers some helpful information for new users of OLCF
  – How to stay informed about OLCF happenings
  – Common error messages
  – How our systems may differ from others you’ve used
  – How to get help
• This is by no means an all-inclusive presentation
• Feel free to ask questions
Staying Informed
Staying Informed
• OLCF provides multiple layers of user notifications about system status and downtimes
  – Email
    • OLCF Weekly Update (sent to “ccs-announce” list)
    • System-specific lists (both “high-volume” and “low-volume” lists)
  – Status indicators on olcf.ornl.gov
  – Twitter (@OLCFStatus)
• A summary of these items can also be found at http://www.olcf.ornl.gov/kb_articles/communications-to-users/
Staying Informed-Weekly Update
• Sent weekly (Thu/Fri)
• Contains
  – Announcements about upcoming training
  – Announcements about system changes
  – Planned outages
• All OLCF users should receive this email
  – Let us know if you’re not receiving it
Staying Informed-System Status
• Logs from monitoring software are parsed to make an educated guess about system status
• Status is sent to multiple destinations
  – Websites (http://www.olcf.ornl.gov)
  – Twitter (@OLCFStatus)
  – Email lists
• Fairly accurate, but still a fully automated process
  – Possibility of both false positives and false negatives
  – We do take some measures to mitigate this
Titan and Eos: Differences from “standard” clusters
Titan/Eos Nodes
• Titan and Eos are heterogeneous systems
  – Compute vs. Service vs. External Login Nodes
    • Different responsibilities
    • Different network configurations
    • Different hardware (CPU/Memory)
    • How accessed (directly vs. aprun only)
  – Node differences can have multiple impacts
    • “Callback” routines to the service node must use the Gemini interface
    • Direct access to/from nodes outside the system
    • Different processor architectures can lead to code failures
Compiling for the XK7 and XC30
• Compiling on Titan/Eos may differ greatly from your previous experience
• Because of different node types, you are actually cross-compiling
  – This can make utilities such as autoconf and cmake challenging to use
• Compiling for batch/login nodes
  – Not common, but occasionally necessary
  – See §7.2 of the Titan User Guide for examples: https://www.olcf.ornl.gov/support/system-user-guides/titan-user-guide/
Compiling for the XK7 and XC30
• Compiler is determined by a mix of craype-* and PrgEnv-* modules
  – PrgEnv loads “real” compiler, math, MPI, etc. modules
  – craype loads compiler wrappers
• Compilers are invoked with “wrappers” rather than vendor-specific names
  – cc for C, CC for C++, and ftn for Fortran
  – No need to remember pgf90, ifort, gcc, craycc, etc.
• MPI, math, & scientific libraries are linked automatically
  – No -lmpi, -lblas, etc.
  – This can be challenging for cmake, autoconf, et al.
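As a minimal sketch of the wrapper convention above, the helper below maps a source file to the Cray wrapper that would compile it. The function name wrapper_for and the list of file extensions are illustrative assumptions, not part of any OLCF tool. The wrapper name stays the same whichever PrgEnv-* module is loaded.

```shell
#!/bin/bash
# Hypothetical helper: pick the Cray compiler wrapper for a source file.
# The extension list is an assumption made for illustration.
wrapper_for() {
    case "$1" in
        *.c)                    echo "cc"  ;;   # C
        *.cpp|*.cxx|*.cc|*.C)   echo "CC"  ;;   # C++
        *.f|*.f90|*.F|*.F90)    echo "ftn" ;;   # Fortran
        *)  echo "unknown source type: $1" >&2; return 1 ;;
    esac
}

# Example (on a login node): compile with whichever compiler PrgEnv-* selected.
# No -lmpi or -lblas is needed; the wrapper adds those libraries itself.
#   "$(wrapper_for solver.f90)" -o solver.x solver.f90
```

Because the wrappers hide the vendor compiler, the same build line keeps working after switching PrgEnv-* modules.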
Running Batch Jobs
• Our batch system is Torque combined with Moab (both from Adaptive Computing)
  – Users interact with Torque with PBS-like commands (qsub, qstat, etc.)
  – Users interact with Moab with other commands (showq, mdiag, etc.)
  – There’s a helpful set of commands on the next slide
• While jobs are submitted with the same commands, the parallel job launcher can differ
  – Titan and Eos: aprun
  – Clusters: mpirun
Helpful Batch System Commands
Command     Description/Use
qsub        Submit a batch job
qdel        Delete a batch job
qhold       Hold a job (keep it from entering a run state)
showq       Display the current queue status (preferred over qstat)
checkjob    Show details about a given job, including why it isn’t running
mdiag       Show diagnostic information about the queue
showres     Show current reservations
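To tie the submission commands to the launcher, here is a hedged sketch that prints a small Torque batch script for Titan or Eos. The function name make_job_script and every argument value shown (project ID, node count, walltime, task count, executable) are example assumptions; check the system User Guide for the options your job actually needs.

```shell
#!/bin/bash
# Sketch: print a Torque batch script for Titan/Eos to stdout.
# Arguments (all example values): project ID, node count,
# walltime (HH:MM:SS), MPI task count, executable path.
make_job_script() {
    local project="$1" nodes="$2" walltime="$3" tasks="$4" exe="$5"
    cat <<EOF
#!/bin/bash
#PBS -A ${project}
#PBS -l nodes=${nodes}
#PBS -l walltime=${walltime}

# Start in the directory the job was submitted from
# (submit from a Lustre directory so compute nodes can see it).
cd "\$PBS_O_WORKDIR"

# Titan and Eos launch parallel work with aprun; a cluster would use mpirun.
aprun -n ${tasks} ${exe}
EOF
}

# Example: write the script, then submit it and check the queue.
#   make_job_script stf007 100 01:00:00 1600 ./a.out > job.pbs
#   qsub job.pbs
#   showq -u "$USER"
```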
Scheduling Policies
• Each system has its own batch queue structure and scheduling policy
  – Specifics are given in the system User Guides on our website
  – Note that Titan’s queue policy favors large jobs
• Jobs are charged based on what you make unavailable to others, not what you use
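One way to read the charging rule above: a job that reserves whole nodes is charged for all of them for as long as it holds them, whether or not every node or core does useful work. The sketch below assumes the charge scales with requested nodes multiplied by elapsed time. The function name and the node-hour unit are illustrative; the actual charge factors are in the system User Guides.

```shell
#!/bin/bash
# Sketch: node-hours held by a job, rounded up to a whole node-hour.
# ASSUMPTION: charge is proportional to nodes requested x elapsed time;
# the real per-system charge factor is not modelled here.
held_node_hours() {
    local nodes_requested="$1" elapsed_seconds="$2"
    echo $(( (nodes_requested * elapsed_seconds + 3599) / 3600 ))
}

# Example: a job that requested 104 nodes and ran for 2 hours holds
# 104 nodes for that time, even if aprun only used 100 of them.
#   held_node_hours 104 7200
```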
(Un)Common Error Messages
Common Error Messages
• relocation truncated to fit: R_X86_64_PC32
  – Program is using too much static memory
  – Limit on Titan is 2 GB
  – Solution: use dynamic memory allocation as much as possible
• Illegal Instruction
  – A code compiled for compute nodes was executed on a non-compute node
  – Solution: run on compute nodes or recompile for login nodes, as appropriate
Common Job Error Messages
• request exceeds max nodes alloc
  – Your aprun command requires more nodes than you can access/have been allocated to you
    • The aprun request requires more nodes than the job requested
    • Request was correct, but at launch time some nodes were discovered to be down (see potential fix later in this presentation)
  – Solution
    • Make sure you requested enough nodes
    • Consider “over-requesting” nodes (details later)
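A quick arithmetic check can catch the first cause before the job is submitted. The sketch below rounds the aprun task count up to whole nodes and compares it with the node request. The default of 16 tasks per node is an assumption (one task per core on a 16-core node); pass a different value if your layout differs. Both function names are hypothetical.

```shell
#!/bin/bash
# Sketch: number of nodes an aprun request needs, rounding up.
# tasks_per_node defaults to 16, an ASSUMPTION matching one task per
# core on a 16-core node; adjust it for your layout.
nodes_needed() {
    local tasks="$1" per_node="${2:-16}"
    echo $(( (tasks + per_node - 1) / per_node ))
}

# Succeeds when the job's node request covers the aprun request.
request_fits() {
    local tasks="$1" nodes_requested="$2" per_node="${3:-16}"
    [ "$(nodes_needed "$tasks" "$per_node")" -le "$nodes_requested" ]
}

# Example: aprun -n 1600 needs 100 nodes, so nodes=104 leaves 4 spare.
#   nodes_needed 1600
#   request_fits 1600 104 && echo "node request is large enough"
```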
Common Job Error Messages
• aprun: [NID 10294]Exec mycode.x failed: chdir /autofs/na1_home/user1 No such file or directory
  – Remember: compute nodes only mount Lustre directories
  – You must be in a directory visible to compute nodes when you run aprun
    • In this example, it was launched from $HOME
  – Any files used by the processes on compute nodes must also be in Lustre
    • The executable itself need not be in Lustre
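A batch script can guard against this error by checking its working directory before calling aprun. This is a minimal sketch: the function name is hypothetical, and the Lustre mount point is passed as an argument because it is site-specific ("/lustre" in the example is only a placeholder).

```shell
#!/bin/bash
# Sketch: succeed if a directory lies under a given filesystem prefix.
# The prefix is an argument because the Lustre mount point is
# site-specific; "/lustre" in the example below is only a placeholder.
is_under_prefix() {
    local dir="${1%/}" prefix="${2%/}"
    case "$dir/" in
        "$prefix"/*) return 0 ;;
        *)           return 1 ;;
    esac
}

# Example guard for a batch script, before calling aprun:
#   is_under_prefix "$PWD" /lustre || { echo "cd to Lustre first" >&2; exit 1; }
```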
Software Notes
Finding & Using Software
• Installation location varies
  – Some is part of the default environment
  – Other software is typically managed via ‘modules’
    • Much of this software is in /sw
    • Compilers, libraries (MPI/GPU), etc. use modules (but aren’t in /sw)
    • “module avail”, “module load”, etc.
    • More information is available on the OLCF website
• Some basic usage information is on the website
• We have local experts for some items
  – Optimization (Vampir, Score-P, CrayPAT)
  – Debugging (DDT)
Requesting/Installing Software
• You are free to install software in your directories
  – Subject to export control, license agreements, etc.
• You can ask us to install software in /sw
  – Typically for software that’s of interest to a number of users
  – To request this,
    • Email us at help@olcf.ornl.gov, or
    • Use the form at http://www.olcf.ornl.gov/support/software-request/
  – Requests are reviewed by our software council
Software Updates
• We have control over some software
  – We’re moving to a model of /sw updates at certain intervals
    • Not all minor versions will be installed
    • We’ll provide build “recipes” in case you want/need a minor version
• We don’t receive all software directly from the software vendor
  – Some goes through testing by the hardware vendor first
    • This affects compilers, CUDA, etc.
  – We may not be able to install it just after it’s released
    • But we work to do so as soon as possible
Other Useful Tips
Receiving OLCF communications
• Many important announcements are sent via email
• Make sure OLCF email addresses aren’t in spam filters
  – help@nccs.gov
  – help@olcf.ornl.gov
  – *-announce@email.ornl.gov
    • Especially ccs-announce
  – *-notice@email.ornl.gov
Determining why a job isn’t running
• Could be one (or more) of any number of reasons
• Use checkjob <jobid> to diagnose
  – Use -v for verbose mode
  – Reason for job not running is usually near the end (although the verbiage may be confusing)
• Use showres to see upcoming reservations/outages
  – Also shows running jobs, so look carefully
    • Reservations for running jobs are usually the (numeric) job ID
    • Reservations for outages are usually alphanumeric
Determining why a job isn’t running
• Insufficient resources available
    NOTE: job cannot run (insufficient available procs: 3264 available)
• Unresolved dependency
    NOTE: job cannot run (dependency 1851375 jobcomplete not met)
• Queue policy issue
    NOTE: job violates constraints for partition titan (job 1854474 violates active HARD MAXJOB limit of 2 for qos smallmaxjobs user partition ALL (Req: 1 InUse: 2))
    BLOCK MSG: job 1854474 violates active HARD MAXJOB limit of 2 for qos smallmaxjobs user partition ALL (Req: 1 InUse: 2) (recorded at last scheduling iteration)
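Since the useful lines are buried near the end of verbose checkjob output, a small filter can pull them out. This sketch keeps only lines that begin with the two prefixes seen in the sample messages above; other message formats are not handled, and the function name is hypothetical.

```shell
#!/bin/bash
# Sketch: keep only the lines of checkjob output that explain why a job
# is not running. Reads checkjob output on stdin.
# The two prefixes come from the sample messages on the slide; other
# message formats are not handled.
blocking_reasons() {
    grep -E '^[[:space:]]*(NOTE:|BLOCK MSG:)'
}

# Example (on a system with Moab):
#   checkjob -v 1854474 | blocking_reasons
```

The filter exits with a non-zero status when no such line is present, which a script can use to tell "no stated reason" apart from "blocked".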
Determining why a job isn’t running
• Pending maintenance period/reservation
    ReservationID   Type  S  Start      End          Duration  N/P           StartTime
    DTNOutage.127   User  -  15:05:13   1:03:05:13   12:00     11/176        Tue Jan 28 08:00
    PM.128          User  -  15:05:13   1:03:05:13   12:00     18688/299008  Tue Jan 28 08:00
    PM_login.129    User  -  15:05:13   1:03:05:13   12:00     16/2048       Tue Jan 28 08:00
  – This is example output from showres
  – Remember that scheduled outages typically have alphanumeric ReservationIDs
  – Note the full system reservation starting in 15 hours. Is the job asking for more than 15 hours?
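The question on the slide reduces to comparing two durations: the job's requested walltime and the Start offset shown by showres. The sketch below does that comparison for times written as [DD:]HH:MM:SS. Both function names are hypothetical, and the check only illustrates the reasoning; the scheduler applies its own rules.

```shell
#!/bin/bash
# Sketch: convert a Moab-style time ([DD:]HH:MM:SS) to seconds.
# The 10# prefix forces base 10 so fields like "08" are not read as octal.
to_seconds() {
    local d=0 h m s
    case "$1" in
        *:*:*:*) IFS=: read -r d h m s <<< "$1" ;;
        *:*:*)   IFS=: read -r h m s <<< "$1" ;;
        *) echo "expected [DD:]HH:MM:SS, got: $1" >&2; return 1 ;;
    esac
    echo $(( 10#$d * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Succeeds if a job of the given walltime would finish before a
# reservation that starts after the given delay.
fits_before_reservation() {
    local walltime delay
    walltime="$(to_seconds "$1")" || return 2
    delay="$(to_seconds "$2")" || return 2
    [ "$walltime" -le "$delay" ]
}

# Example with the Start column from the showres output above:
#   fits_before_reservation 12:00:00 15:05:13 && echo "can finish first"
```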
Determining when a job will start
• Priority is essentially FIFO with certain adjustments
  – Job size
  – Project overallocation
  – Backfill
• Job start time can never be known for sure
  – Jobs yet to be submitted can affect run order
  – We can only guess based on the current queue
Determining when a job will start
• Several commands are available to help
  – showstart <jobid>
    • Not always reliable
  – showq output is sorted by priority (can be helpful)
  – mdiag -p gives detailed priority info, which can help you understand what’s contributing to your job’s priority
    • Shows boosts based on job size
    • Shows overallocation penalties
Dealing With Failed Nodes
• Sometimes nodes fail when your job starts
  – Since node allocation has already happened, the batch system can’t replace them
  – You can work around this by requesting “extra” nodes
    • Place aprun in a loop
    • Keep looping until aprun returns a successful exit code

    #!/bin/bash
    #PBS -lnodes=104
    ...
    APRUN_RETURN_VALUE=1
    # Relaunch until aprun exits with status 0
    while [[ $APRUN_RETURN_VALUE -ne 0 ]]; do
        aprun -n 1600 ./a.out
        APRUN_RETURN_VALUE=$?
    done
    ...
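The loop on the slide retries for as long as aprun keeps failing, which can burn the whole walltime if the failure comes from the application rather than a bad node. A bounded variant is sketched below; the function name retry and the attempt limit are assumptions for illustration, not an OLCF-provided tool.

```shell
#!/bin/bash
# Sketch: run a command until it succeeds, up to max_attempts times.
# Returns 0 on the first success, otherwise the last failing exit code.
retry() {
    local max_attempts="$1"; shift
    local attempt=1 status=1
    while [ "$attempt" -le "$max_attempts" ]; do
        "$@" && return 0
        status=$?
        echo "attempt ${attempt}/${max_attempts} failed (exit ${status})" >&2
        attempt=$(( attempt + 1 ))
    done
    return "$status"
}

# Example inside a batch script that requested spare nodes:
#   retry 3 aprun -n 1600 ./a.out
```

Capping the attempts keeps the workaround for failed nodes while letting a genuinely broken run stop early.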
Finding your project’s ID and allocation
• Use showproj and showusage to list projects and usage, respectively
• Both commands have a help option (run with -h)

    $ showproj
    User1 is a member of the following project(s) on titan:
    stf007

    $ showusage
    titan usage for the project's current allocation period:
                                  Project Totals            user1
     Project     Allocation       Usage      Remaining      Usage
    ___________________________________________________|__________
     stf007      7350000     |    11138      7338862   |       12
Things to know about project allocations
• Projects are NOT disabled for going over allocation
  – Job priority is reduced to facilitate “fairshare” with those that aren’t over their allocation
  – Reduction is based on usage vs. allocation
    • 30 day penalty for slightly over (usage 100-125% of allocation)
    • 365 day penalty for usage >125% of allocation
• For this reason, we don’t issue refunds (per se)
  – If many jobs were affected by system problems, we can delay the priority reduction
  – Same net effect as a refund but easier to manage
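The two thresholds above can be written as a small calculation. This sketch returns the penalty tier in days for a given usage and allocation; the function name is hypothetical, and treating usage of exactly 100% as within the allocation is an assumption, since the slide does not say which side of the boundary it falls on.

```shell
#!/bin/bash
# Sketch: priority-penalty tier (in days) for a project, using the
# thresholds on the slide. Integer arithmetic avoids floating point.
# ASSUMPTION: usage of exactly 100% counts as within the allocation.
penalty_days() {
    local usage="$1" allocation="$2"
    if [ $(( usage * 100 )) -gt $(( allocation * 125 )) ]; then
        echo 365          # usage above 125% of allocation
    elif [ "$usage" -gt "$allocation" ]; then
        echo 30           # usage over 100%, up to 125%
    else
        echo 0            # within allocation: no penalty
    fi
}

# Example with the showusage numbers for stf007:
#   penalty_days 11138 7350000
```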
Getting Help
Requesting a priority boost/higher walltime limit/purge exemption/etc.
• Make request via https://www.olcf.ornl.gov/support/documents-forms/
  – Under the heading: Forms to Request Changes to Computers, Jobs or Accounts
• Reviewed by Resource Utilization Council
  – Please make requests well in advance to allow for review
  – If requesting job priority, make sure you submit the job; jobs often run more quickly than you expect
Documentation
• http://www.olcf.ornl.gov
  – User Guides: https://www.olcf.ornl.gov/support/system-user-guides/
  – Knowledgebase: https://www.olcf.ornl.gov/support/knowledgebase/
• Stack Overflow: http://stackoverflow.com/questions/tagged/cray
• http://docs.cray.com (CrayDocs)
• http://docs.nvidia.com (Nvidia hosted documentation)
Working With User Support
• Email is often the best option to contact us
  – Especially for sending long/complicated error messages
  – Send as many error messages as possible
    • Or place them in a file & direct us to it
• Start a new ticket for new issues instead of replying to an old ticket
  – Gives it greater visibility
  – Helps us in classifying/searching through old tickets
• When sending codes, create a .tar file & direct us to it
  – Include all files necessary to run
  – More efficient than sending via email
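For the last point, a minimal sketch of bundling a test case is shown below. The function name and the example paths are hypothetical; the listing printed at the end is a convenient way to confirm that every file needed to run is included before pointing User Support at the archive.

```shell
#!/bin/bash
# Sketch: bundle a directory into a .tar file and list what went in.
# Arguments: source directory, output archive path (example values only).
bundle_for_support() {
    local src_dir="$1" archive="$2"
    # -C keeps paths in the archive relative to the parent directory.
    tar -cf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" || return 1
    tar -tf "$archive"
}

# Example:
#   bundle_for_support ./mycase ./mycase.tar
```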
Finally…
• We’re here to help you
• Questions/comments/etc. can be sent to the OLCF User Assistance Center
  – Staffed 9 AM – 5 PM US Eastern Time, exclusive of ORNL holidays
  – help@olcf.ornl.gov
  – (865) 241-6536