Local batch systems
Alessandro Italiano, INFN/CNAF
Batch system definition
• A system that executes a series of commands that are all given before the program starts to run, in contrast to an interactive system, which requires the user to give commands during operation.
Basic components
[Diagram: a batch system made up of queues (Queue A, Queue B, Queue C, Queue D), a scheduler, and computing resources (Type A, Type B, Type C, Type D)]
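The three components on this slide (queues, a scheduler, typed computing resources) can be sketched in a few lines of Python. This is only an illustrative toy, not how Torque or Maui work internally: the class names `Job`, `Node`, and `BatchSystem`, the node names, and the first-come-first-served policy are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    queue: str          # queue the job is submitted to
    resource_type: str  # type of computing resource the job needs


@dataclass
class Node:
    name: str
    resource_type: str
    busy: bool = False


class BatchSystem:
    """Toy batch system: named queues, typed nodes, one scheduling pass."""

    def __init__(self, queue_names, nodes):
        self.queues = {name: deque() for name in queue_names}
        self.nodes = nodes

    def submit(self, job):
        # Jobs wait in their queue until the scheduler starts them.
        self.queues[job.queue].append(job)

    def schedule(self):
        # Assumed policy: first come, first served within each queue.
        started = []
        for queue in self.queues.values():
            while queue:
                job = queue[0]
                node = next(
                    (n for n in self.nodes
                     if not n.busy and n.resource_type == job.resource_type),
                    None,
                )
                if node is None:
                    break  # head of the queue cannot run yet
                node.busy = True
                queue.popleft()
                started.append((job.name, node.name))
        return started


# Hypothetical farm: one node of type A and one of type B.
batch = BatchSystem(["Queue A", "Queue B"], [Node("wn1", "A"), Node("wn2", "B")])
batch.submit(Job("job1", "Queue A", "A"))
batch.submit(Job("job2", "Queue A", "A"))
batch.submit(Job("job3", "Queue B", "B"))
started = batch.schedule()
print(started)
```

In this run `job2` stays in Queue A because the only type A node is already taken by `job1`; a real scheduler would retry it on the next pass.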
Batch system in the grid
[Diagram: Computing Element containing GK, GRIS, and LRMS (batch system)]
Common batch systems
• PBS
  – OpenPBS
  – Torque + Maui
  – PBS Pro
• LSF
• Condor
• BQS
Torque + Maui
• Maui is an advanced job scheduler that supports:
  – Large array of scheduling policies
  – Dynamic priorities
  – Extensive reservations
  – Fairshare
• TORQUE (Tera-scale Open-source Resource and QUEue manager)
  – Provides control over batch jobs and distributed compute nodes
  – It is a community effort based on the original *PBS project
  – It has incorporated significant advances in the areas of scalability and fault tolerance
• More info at www.supercluster.org
Directory structure
• Install Torque and Maui as described at http://grid-it.cnaf.infn.it/fileadmin/sysadm/siteinstall-2_3_0.html
• Start the services
    service pbs_server start
    service maui start
    service pbs_mom start
Maui:
    # ls -1 /var/spool/maui/
    maui.cfg
    maui.ck.1
    maui.pid
    stats
    tools
    traces
Torque:
    # ls -1 /var/spool/pbs/
    aux
    mom_logs
    mom_priv
    pbs_environment.rpmnew
    pbs_server.conf
    sched_logs
    sched_priv
    server_logs
    server_name.rpmnew
    server_priv
    spool
Maui configuration for static resource allocation
• MaxJob
    CLASSCFG[cms] MAXJOB=300 MAXPROC=300
    USERCFG[cmssgm] MAXJOB=200 MAXPROC=200
• Common resources
    SRCFG[overflow] HOSTLIST=grid-wn[0-5][0-9].cnaf.infn.it CLASSLIST=cms-,atlas-,lhcb-,alice-,virgo-,cdf-,babar-,dteam-,argo-,magic-,ams-
• Dedicated resources
    SRCFG[cms] HOSTLIST=grid-wn6[0-9].cnaf.infn.it CLASSLIST=cms GROUPLIST=cms
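To see which worker nodes each reservation covers, the two HOSTLIST expressions can be tried against a list of hostnames. The sketch below assumes that the host expressions behave like regular expressions matched anywhere in the hostname, and it uses a hypothetical farm of 70 nodes named `grid-wn00` to `grid-wn69`; neither the farm size nor the helper `hosts_in_reservation` comes from the slides.

```python
import re

# Host expressions copied from the SRCFG lines on the slide.
# Treating them as regular expressions is an assumption of this sketch.
OVERFLOW_HOSTS = r"grid-wn[0-5][0-9].cnaf.infn.it"
CMS_HOSTS = r"grid-wn6[0-9].cnaf.infn.it"


def hosts_in_reservation(expression, hostnames):
    """Return the hostnames that the host expression matches."""
    pattern = re.compile(expression)
    return [host for host in hostnames if pattern.search(host)]


# Hypothetical farm: grid-wn00 ... grid-wn69.
farm = [f"grid-wn{i:02d}.cnaf.infn.it" for i in range(70)]

overflow = hosts_in_reservation(OVERFLOW_HOSTS, farm)
cms_only = hosts_in_reservation(CMS_HOSTS, farm)

print(len(overflow), overflow[0], overflow[-1])
print(len(cms_only), cms_only[0], cms_only[-1])
```

With these patterns the shared reservation covers nodes 00 to 59 and the cms reservation covers nodes 60 to 69, so the two host sets do not overlap.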
Maui configuration for dynamic resource allocation
• Fairshare
  http://grid-deployment.web.cern.ch/grid-deployment/documentation/Maui-Cookbook.html
    FSPOLICY       DEDICATEDPS
    FSDEPTH        7
    FSINTERVAL     24:00
    FSDECAY        0.8
    FSWEIGHT       1
    FSUSERWEIGHT   5
    FSGROUPWEIGHT  30
• GROUPCFG entries (the pairing of groups with values is not recoverable from the slide text, so the columns are listed separately)
    Groups:   GROUPCFG[dteam] GROUPCFG[alice] GROUPCFG[atlas] GROUPCFG[cms] GROUPCFG[babar] GROUPCFG[lhcb] GROUPCFG[dzero] GROUPCFG[hone] GROUPCFG[zeus]
    FSTARGET: FSTARGET=1 FSTARGET=50 FSTARGET=20 FSTARGET=10 FSTARGET=11 FSTARGET=5
    MAXPROC:  MAXPROC=10,1000 MAXPROC=500,1000 MAXPROC=200,1000 MAXPROC=10,1000 MAXPROC=100,1000 MAXPROC=110,1000 MAXPROC=50,1000
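The effect of FSDEPTH and FSDECAY can be illustrated with a small calculation. The sketch below assumes the usual decayed-window model: usage is recorded per fairshare window, window i (0 is the most recent) is weighted by FSDECAY to the power i, and the group's share is its weighted usage divided by the weighted total. The function name `fairshare_usage` and the usage figures are made up for the example, and the way Maui turns the result into a priority is not modelled.

```python
FSDEPTH = 7    # number of fairshare windows taken into account (from the slide)
FSDECAY = 0.8  # weight multiplier applied per older window (from the slide)


def fairshare_usage(group_usage, total_usage, depth=FSDEPTH, decay=FSDECAY):
    """Return the decayed usage share of a group, in percent.

    Both lists hold usage per window (for example processor-seconds),
    with index 0 being the most recent window.
    """
    weights = [decay ** i for i in range(depth)]
    weighted_group = sum(w * u for w, u in zip(weights, group_usage))
    weighted_total = sum(w * t for w, t in zip(weights, total_usage))
    if weighted_total == 0:
        return 0.0
    return 100.0 * weighted_group / weighted_total


# Hypothetical usage figures: the farm delivers 100 units per window.
total = [100] * FSDEPTH
steady = fairshare_usage([50] * FSDEPTH, total)                   # half the farm, every window
burst = fairshare_usage([100] + [0] * (FSDEPTH - 1), total)       # whole farm, latest window only

fstarget = 50  # example FSTARGET value
print(f"steady usage: {steady:.1f}%  (target {fstarget}%)")
print(f"burst usage:  {burst:.1f}%  (target {fstarget}%)")
```

A group whose decayed usage is below its FSTARGET is under its share; how strongly that raises its job priority depends on the FSWEIGHT, FSUSERWEIGHT, and FSGROUPWEIGHT settings.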
Common problems
• Understand the problem
• Check log files
  – maui.log
  – server_logs/20050221
  – mom_logs/20050221
• Useful commands
  – tracejob -n <n day> <jobid> : job history
  – qstat -n : job status and execution node
  – qstat -Q : farm status
  – pbsnodes -l : nodes down
• Check the website for solutions to common problems: http://goc.grid.sinica.edu.tw/gocwiki/
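When the list of down nodes has to be checked regularly, the output of `pbsnodes -l` can be parsed with a short script. The sketch below assumes the output has one node per line, the hostname followed by its state (possibly several states separated by commas); the sample text and the helper `parse_node_states` are invented for illustration and are not real output from the CNAF farm.

```python
def parse_node_states(pbsnodes_output):
    """Map each node name to its list of states.

    Assumes one node per line: '<hostname> <state[,state...]>'.
    Blank or malformed lines are skipped.
    """
    nodes = {}
    for line in pbsnodes_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        name, states = fields[0], fields[1]
        nodes[name] = states.split(",")
    return nodes


# Hypothetical sample of 'pbsnodes -l' output.
SAMPLE_OUTPUT = """\
grid-wn12.cnaf.infn.it   down
grid-wn37.cnaf.infn.it   offline
grid-wn41.cnaf.infn.it   down,offline
"""

problem_nodes = parse_node_states(SAMPLE_OUTPUT)
down = sorted(name for name, states in problem_nodes.items() if "down" in states)
print(down)
```

On a real server the text could be captured with `subprocess.run(["pbsnodes", "-l"], capture_output=True, text=True).stdout` instead of the sample string.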
LSF
• Why LSF?
  – CERN uses it
    • More support for GRID integration
  – It should be more
    • Scalable
    • Fault tolerant
    • Well supported
  – Same functionalities for resource allocation
  – Administration
    • More complicated
    • More configuration files
    • More processes