The Grid: An ATLAS Physicist's Experience (M. Hodgkinson)

The Grid - An ATLAS Physicist's Experience
M. Hodgkinson, University of Sheffield
18 April 2006

Contents
• What do I use the grid for?
• Why do I need the grid?
• Job Submission/Control
• Running on the grid
• How to get help
• Further reading
• Conclusions

Why do I use the grid?
• I am a software developer/physicist
• Most of my time is spent developing an energy-flow-based algorithm for ATLAS
• I need lots of CPU time, and the grid is the only place I can find enough

What do I use the grid for?
• My interest is to see whether an ALEPH-based energy flow algorithm is useful for ATLAS
• It combines calorimeter and tracking information
• Low-pT tracks are better measured than low-pT calorimeter energy deposits
• Potential use in analyses that depend on jet resolution or ETmiss resolution
• Also b-tagging and SUSY background rejection

What do I use the grid for?
• I study the performance of this algorithm using QCD dijet events
• I can reconstruct 6000 events in 2 hours… but I need 120 CPUs!
• lxplus at CERN has a batch system… but I often wait 3-4 days before a job starts to run
• Jobs on the grid almost always start straight away, so it is the solution for me

Job Submission
• Standard batch systems:
  1. qsub -eo test.log test.sh
  2. bsub -eo test.log MooseApp test.tcl
  etc.
• The grid is more complicated: one needs to tell the system what kind of machine the job can run on, e.g. name of collaboration, CPU, memory, software release, etc.
• This is controlled from a *.jdl file

Getting Started
• Must log into a grid User Interface (UI), e.g. I use lxplus.cern.ch
• Set up the grid environment. I use /afs/cern.ch/project/gd/LCGshare/sl3/etc/profile.d/grid_env.sh
• If you are at CERN then use this file; otherwise ask a local expert how to set up your account on your UI (and find out which machine it is)

Job Options: JDL Files
[
  Executable = "test.sh";
  InputSandbox = {"<full_path>/test.sh"};
  OutputSandbox = {"stdout", "stderr"};
  StdOutput = "stdout";
  StdError = "stderr";
  Arguments = "11.0.41 lcgse0.shef.ac.uk <input_location> <input_file> <output_file> <nevent>";
]
• In Arguments, 11.0.41 is the ATLAS release number
• In Arguments, lcgse0.shef.ac.uk is the Storage Element
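As a rough illustration of how the attributes in a .jdl file fit together, the Python sketch below renders a dictionary of attributes into JDL-style text. The helper name `render_jdl` and its formatting rules (strings double-quoted, lists in braces, one semicolon-terminated attribute per line) are assumptions made for this example and are not part of any grid middleware.

```python
def render_jdl(attributes):
    """Render a dict of attributes as a bracketed JDL-style block.

    Hypothetical helper for illustration: strings are double-quoted
    and lists/tuples are written as {"a", "b"}.
    """
    def fmt(value):
        if isinstance(value, (list, tuple)):
            return "{" + ", ".join(fmt(v) for v in value) + "}"
        return '"' + str(value) + '"'

    lines = ["["]
    for name, value in attributes.items():
        lines.append("  %s = %s;" % (name, fmt(value)))
    lines.append("]")
    return "\n".join(lines)


# Attribute values taken from the slide; placeholders are left as-is.
job = {
    "Executable": "test.sh",
    "InputSandbox": ["<full_path>/test.sh"],
    "OutputSandbox": ["stdout", "stderr"],
    "StdOutput": "stdout",
    "StdError": "stderr",
    "Arguments": "11.0.41 lcgse0.shef.ac.uk <input_location> "
                 "<input_file> <output_file> <nevent>",
}
jdl_text = render_jdl(job)
print(jdl_text)
```

Generating the file this way is mainly useful when many near-identical jobs differ only in their Arguments string.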

What do the options mean?
• Release number:
  • ATLAS: 11.x.0, 11.0.x, 10.x.0, 10.0.x
  • BaBar: analysis-14, analysis-13a, 12.4.0e
• Storage Element:
  • Where output files will be stored
  • e.g. lcgse0.shef.ac.uk, castorgrid.cern.ch, etc.
• JobOptions:
  • ATLAS: what goes in the Python file
  • BaBar: what goes in the tcl file
  • Concrete example (ATLAS) later on
• Need more than this though
  • Need to specify where the job should run

Further JDL
[
  Environment = {"T_LCG_GFAL_INFOSYS=atlasbdii.cern.ch:2170"};
  VirtualOrganisation = "atlas";
  Requirements = Member("VO-atlas-release-11.0.41", other.GlueHostApplicationSoftwareRunTimeEnvironment)
    && other.GlueHostNetworkAdapterOutboundIP == True
    && other.GlueHostMainMemoryRAMSize >= 512
    && other.GlueCEPolicyMaxCPUTime > 1000
    && (!RegExp("ce1.pp.rhul.ac.uk:2119", other.GlueCEUniqueID));
  Rank = (other.GlueCEStateWaitingJobs == 0 ? other.GlueCEStateFreeCPUs : -other.GlueCEStateWaitingJobs);
]
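The Rank expression prefers sites with an empty queue, ordered by free CPUs, and otherwise penalises a site by the number of waiting jobs. A minimal Python sketch of that logic follows; the site names and numbers are invented for illustration, and the real ranking is done by the grid's resource broker, not by user code.

```python
def rank(free_cpus, waiting_jobs):
    """Mirror the JDL Rank: free CPUs if nothing is queued, else -waiting."""
    return free_cpus if waiting_jobs == 0 else -waiting_jobs


# Hypothetical sites: name -> (free CPUs, waiting jobs)
sites = {
    "ce-a.example.org": (40, 0),
    "ce-b.example.org": (5, 0),
    "ce-c.example.org": (100, 12),
    "ce-d.example.org": (0, 3),
}

# Highest rank first, as a rank-ordered site list would be
ranked = sorted(sites, key=lambda ce: rank(*sites[ce]), reverse=True)
for ce in ranked:
    print(ce, rank(*sites[ce]))
```

Note that a site with many free CPUs but a non-empty queue (ce-c here) still ranks below every site with an empty queue.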

Executables 1
export T_RELEASE="$1"
export T_LCN="$2"
export T_SE="$3"
export T_INFN="$4"
export T_OUTFN="$5"
export T_NEVT="$6"
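The script picks up its inputs purely by position, so the order of the words in the JDL Arguments string has to match the order of $1 to $6 here. The Python sketch below shows that mapping and a simple count check; the function name `map_arguments` and the example values are hypothetical. Note that the Arguments string on the JDL slide lists the storage element second, whereas this script reads the storage element from $3, so one of the two orderings presumably needs adjusting in practice.

```python
import shlex

# Variable names in the order the shell script reads $1..$6
ARG_NAMES = ["T_RELEASE", "T_LCN", "T_SE", "T_INFN", "T_OUTFN", "T_NEVT"]


def map_arguments(arguments):
    """Split an Arguments string and pair each word with its variable name."""
    values = shlex.split(arguments)
    if len(values) != len(ARG_NAMES):
        raise ValueError(
            "expected %d arguments, got %d" % (len(ARG_NAMES), len(values))
        )
    return dict(zip(ARG_NAMES, values))


# Hypothetical values, ordered to match the script rather than the JDL slide
env = map_arguments(
    "11.0.41 /grid/atlas/example lcgse0.shef.ac.uk in.pool.root out.root 100"
)
print(env["T_SE"], env["T_NEVT"])
```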

Executables 2
# Set up the ATLAS release
export ATLAS_ROOT=$VO_ATLAS_SW_DIR/software/${T_RELEASE}
source ${ATLAS_ROOT}/setup.sh
[ "$GCC_SITE" == "" ] && export GCC_SITE=${ATLAS_ROOT}/gcc-alt-3.2
export PATH=${GCC_SITE}/bin:${PATH}
export LD_LIBRARY_PATH=${GCC_SITE}/lib:${LD_LIBRARY_PATH}
export T_DISTREL=${SITEROOT}/dist/${T_RELEASE}
source ${T_DISTREL}/Control/AthenaRunTime/*/cmt/setup.sh
# Set the grid catalog type (ask a local expert in your collaboration)
export LCG_CATALOG_TYPE="lfc"
export LFC_HOST="lfc-atlas-test.cern.ch"

Executables 3
# Get input file
current_dir=`pwd`
lcg-cp --vo atlas lfn:${T_LCN}/${T_INFN} file:${current_dir}/${T_INFN}
# Set up athena and T_HOMEDIR for later use
T_HOMEDIR=${PWD}
ulimit -Sv 1300000
RecExCommon_links.sh
WORKING_DIR=$PWD
tar xzf 11Patches.tgz
cmt config
source setup.sh
# Compile code
cd TestRelease/*/cmt
cmt broadcast cmt config
source setup.sh
cmt broadcast gmake
# Move input files to run location
cp ${T_HOMEDIR}/${T_INFN} .
cp ${T_HOMEDIR}/bmagatlas03_test.data .

Executables 4
# Write jobOptions on the fly
cat > myParams.py << EOF
EvtMax=$T_NEVT
PoolRDOInput=["$T_INFN"]
RootNtupleOutput="$T_OUTFN"
doWriteAOD=False
doWriteESD=False
DetDescrVersion='Rome-Initial-v00'
include("RecExCommon/RecExCommon_topOptions.py")
EOF
# Run athena
athena.py myParams.py
# Copy and register file on grid
lcg-cr -v -l /grid/atlas/datafiles/mhodgkin/${T_OUTFN} -n 8 -d ${T_SE} -t 300 --vo atlas file:`pwd`/${T_OUTFN}
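Writing the jobOptions on the fly is plain text substitution: the shell here-document fills the event count and file names into a fixed Python fragment. The sketch below does the same substitution with Python's `string.Template`, which can be handy for checking the generated text before submitting. The function name `render_job_options` and the example file names are assumptions for illustration.

```python
from string import Template

# Same fragment as the here-document, with $-placeholders for the job inputs
JOB_OPTIONS = Template('''EvtMax=$T_NEVT
PoolRDOInput=["$T_INFN"]
RootNtupleOutput="$T_OUTFN"
doWriteAOD=False
doWriteESD=False
DetDescrVersion='Rome-Initial-v00'
include("RecExCommon/RecExCommon_topOptions.py")
''')


def render_job_options(nevt, infile, outfile):
    """Return the jobOptions text for one job (hypothetical helper)."""
    return JOB_OPTIONS.substitute(T_NEVT=nevt, T_INFN=infile, T_OUTFN=outfile)


# Hypothetical file names
text = render_job_options(100, "in.pool.root", "out.root")
print(text)
```

`substitute` raises `KeyError` if a placeholder is left unfilled, which catches a forgotten input earlier than a failed athena run would.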

Control of jobs on the Grid

Find site rankings
• List of rank-ordered sites

• Submit to a specific site with the -r option
• Submit to a site chosen by the ranking expression
• Get status of jobs

• Cancel a job
• Job enters done status
• Get and view output

• List contents of grid directory

• Copy output file
• Delete output file

• Go to www.ggus.org
• Need to register
• Once registered, can submit tickets

• Convert grid certificate into a browser-readable file
• New file!

• Pick correct email address

• Fill in details

Some typical problems
• You do a grid-proxy-init and 11 hours later submit a job that lasts 2 hours.
• It will fail because your proxy expires before the job finishes!
• Solution: renew your proxy, or change the default lifetime of 12 hours via:
  – grid-proxy-init -valid H:M (e.g. 24 hours is 24:00)
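The arithmetic behind this failure is simple: 12 hours of proxy lifetime minus 11 hours already used leaves 1 hour, which is less than the 2-hour job. A small Python sketch of that check follows, together with a helper that formats a lifetime in the H:M form quoted above. Both function names are hypothetical, and the check ignores any time the job spends queued, which is what the optional margin is for.

```python
from datetime import timedelta


def proxy_outlives_job(lifetime, elapsed, job_duration, margin=timedelta(0)):
    """True if the proxy's remaining time covers the job plus a safety margin."""
    remaining = lifetime - elapsed
    return remaining >= job_duration + margin


def valid_argument(lifetime):
    """Format a lifetime as H:M, e.g. 24 hours -> '24:00'."""
    total_minutes = int(lifetime.total_seconds() // 60)
    return "%d:%02d" % divmod(total_minutes, 60)


hours = lambda n: timedelta(hours=n)

# The case from the slide: default 12 h proxy, 11 h old, 2 h job
print(proxy_outlives_job(hours(12), hours(11), hours(2)))
# The same job with a 24 h proxy
print(proxy_outlives_job(hours(24), hours(11), hours(2)))
print(valid_argument(hours(24)))
```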

Typical Problems
• If you run lcg-cp without the timeout (-t) option, your job will hang when the transfer stalls, and you cannot get the stdout/stderr to see what went wrong
• Missing libraries (e.g. X11) or missing setup scripts for athena: in both cases all one can do is use the GGUS system to complain, and hopefully the site admin will fix the problem
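The same idea, never waiting on an external step without an upper time limit, can be applied to any command a job runs. The Python sketch below is a generic safeguard using the standard library's `subprocess` timeout, not a substitute for the -t option of lcg-cp. The function name `run_with_timeout` is an assumption, and the child commands are stand-ins that only sleep or print.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return (returncode, stdout, stderr), or None if it timed out."""
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left hanging
        return None
    return done.returncode, done.stdout, done.stderr


# Stand-in for a stalled transfer: a child process that sleeps far too long
stalled = [sys.executable, "-c", "import time; time.sleep(30)"]
# Stand-in for a transfer that completes promptly
quick = [sys.executable, "-c", "print('copied')"]

print(run_with_timeout(stalled, 1))
print(run_with_timeout(quick, 30))
```

Because the function returns instead of blocking, the job script can log the failure to stdout and exit, so there is something to inspect afterwards.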

Further Reading
• LCG User Guide:
  • http://egee.itep.ru/User_Guide.html
• ATLAS grid job submission interfaces:
  • https://uimon.cern.ch/twiki/bin/view/Atlas/LJSF
  • https://twiki.cern.ch/twiki/bin/view/Atlas/DAonPanda
Beware! ATLAS uses 3 grids. I have talked about using the LCG grid, but Panda does not use the LCG grid. You cannot see a file registered on LCG from Grid3 or NorduGrid.

Conclusions
• Have seen how to make basic JDL files and a shell script to control the job
• Have seen how to submit jobs and do the usual kinds of job control
• Have seen how to get help
• Everything in this talk should work, so if you run into problems send me a mail…
• …though for non-ATLAS people some translation to <your_collaboration> may be required!

Backup Slides

Compiling ATLAS software on the grid
mkdir Patches
• Need to make files called requirements.template and setup.sh

cmt co TestRelease
• Then in the TestRelease requirements file add:

Final Step
• Finally one should check out whatever packages one wants to compile (in the Patches directory)