Basic troubleshooting & common problems Marco Cecchi – Daniele Cesini INFN – CNAF

Misuses of the Grid
• First of all, try not to misuse the Grid
  – Your typical needs are covered by the infrastructure and the various services it is made of
• Sandboxes exist for small amounts of data. Use data management when dealing with huge files.
• Likewise, do not put excessive load on the infrastructure
  – Submitting a single collection of 10000 nodes would probably not be a good move
• The JDL should be kept as 'grid-oriented' as possible, that is, without explicit references to specific resources we happen to know something about
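The sandbox advice above can be turned into a quick check before submission. The sketch below is only an illustration: the 10 MB limit and the function names are assumptions, not values taken from these slides or from any WMS configuration.

```python
import os

# Hypothetical limit: the slides only say sandboxes are for "small amounts of data".
SANDBOX_LIMIT_BYTES = 10 * 1024 * 1024


def oversized_sandbox_files(sizes, limit=SANDBOX_LIMIT_BYTES):
    """Return the names of files larger than `limit`, sorted by name.

    `sizes` maps a file name to its size in bytes.
    """
    return sorted(name for name, size in sizes.items() if size > limit)


def sizes_on_disk(paths):
    """Build the {name: size} mapping for the files that exist locally."""
    return {p: os.path.getsize(p) for p in paths if os.path.isfile(p)}
```

Calling `oversized_sandbox_files(sizes_on_disk(paths))` on the files listed in the InputSandbox gives the candidates that are better handled through data management.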

Troubleshooting

Cannot even submit
Many other problems during job submission can be due to the GSI handshake performed while authenticating.
• Error while calling the "NSClient::multi" native api AuthenticationException: Failed to establish security context...
  – Ask your admin to check whether your DN is in /etc/grid-security/grid-mapfile on the WMS.
  – It is worth trying a globus-url-copy in debug mode to get a clearer message.
  – This problem can occur for a number of reasons. On the user's side, what you can do is check the validity of your proxy, permissions, etc.
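For the admin-side check mentioned above, a minimal sketch of looking up a DN in grid-mapfile content follows. It assumes the usual entry layout (a double-quoted DN followed by a local account name). The function name and the example DN are hypothetical.

```python
import re

# One entry per line: a double-quoted DN followed by a local account name.
ENTRY = re.compile(r'^\s*"([^"]+)"\s+(\S+)')


def mapped_account(gridmap_text, dn):
    """Return the account mapped to `dn`, or None if the DN is not listed."""
    for line in gridmap_text.splitlines():
        match = ENTRY.match(line)
        if match and match.group(1) == dn:
            return match.group(2)
    return None


# Hypothetical content, for illustration only.
EXAMPLE = '''
# comment lines are ignored because they do not match the entry pattern
"/C=IT/O=Example/CN=Jane Doe" .dteam
"/C=IT/O=Example/CN=John Roe" atlas001
'''
```

On a real host the text would come from reading the grid-mapfile itself, which normally requires access to the WMS machine.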

Cannot even submit
• Unable to register the job to the service: https://:7443/glite_wms_wmproxy_server
  – LB or LBProxy has become unresponsive.
• Connection failed: Connection refused connect failed in tcp_connect() Error code: SOAP-ENV:Client
  – WMProxy is down.

Some (even) more cryptic errors
[mcecchi@cert-ui-01 mcecchi]$ glite-wms-job-status https://albalonga.cnaf.infn.it:9000/0YSsU4vOv9_0XIhO5ajvBQ
*******************************
BOOKKEEPING INFORMATION:
Status info for the Job : https://albalonga.cnaf.infn.it:9000/0YSsU4vOv9_0XIhO5ajvBQ
Current Status: Aborted
Status Reason: hit job retry count (0)
Destination: atlasce01.na.infn.it:2119/jobmanager-lcgpbs-cert
Submitted: Tue Nov 20 12:56:26 2007 CET
*******************************
At this point you are more interested in the next-to-last error than in this one. A logging-info would surely help...
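When many jobs have to be checked, the status block above can be read programmatically. The following is a minimal sketch that assumes each field is printed on its own line as `Name: value`. The function names are hypothetical and the sample reuses the values shown on the slide.

```python
import re

# Fields of interest in the bookkeeping block printed by glite-wms-job-status.
FIELD = re.compile(
    r"^\s*(Current Status|Status Reason|Destination|Submitted)\s*:\s*(.*)$"
)


def parse_status(output):
    """Return a dict of the recognised 'Name: value' lines in `output`."""
    fields = {}
    for line in output.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def needs_logging_info(fields):
    """True when the job aborted only because retries ran out, so the
    underlying error has to be looked up in the logging info."""
    return (fields.get("Current Status") == "Aborted"
            and fields.get("Status Reason", "").startswith("hit job retry count"))


SAMPLE = """\
Current Status: Aborted
Status Reason: hit job retry count (0)
Destination: atlasce01.na.infn.it:2119/jobmanager-lcgpbs-cert
Submitted: Tue Nov 20 12:56:26 2007 CET
"""
```

`needs_logging_info(parse_status(SAMPLE))` flags exactly the situation described above, where the reported reason is not the interesting one.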

Some (even) more cryptic errors
glite-job-logging-info shows: "Got a job held event, reason: The PeriodicHold expression 'Matched = TRUE && CurrentTime > QDate + 900' evaluated to TRUE"
* Condor could not submit the job to the CE within 900 seconds.
* Probably Condor-C on the CE could not be launched:
  o because the authentication failed;
  o because previous launcher jobs failed but are still in the Condor queue (remove them with condor_rm);
  o because the IP address is incorrect in /etc/hosts.
* Possibly because of a firewall.

Some (even) more cryptic errors
• "Got a job held event, reason: Spooling input data files"
* It may fail with "Globus error 7: authentication with the remote server failed".
* Caused by a race condition between the gridmanager on machine A, which queries the status of the job on machine B, and the schedd on machine B, which releases the job after file stage-in. Fixed in a later version of Condor (which version?).

Some (even) more cryptic errors
• glite-job-logging-info shows the error message "Cannot take token!"
* Check that the glite/edg-gridftp-clients or uberftp package is installed on the WNs.
* The proxy expired before the job started executing and could not be renewed.

Some (even) more cryptic errors
• glite-job-logging-info shows: "Cannot read JobWrapper output, both from Condor and from Maradona"
The user job exit status failed to be delivered to the WMS, even though two independent methods should have been tried:
1. The user job exit status is written into an extra "Maradona" file that is copied to the WMS with globus-url-copy (or htcp).
2. The job wrapper script writes the user job exit status to stdout, which is supposed to be sent back to the WMS by Globus.
When both methods fail, it usually means that the job did not run to completion: either it did not start at all, or it stopped before finishing.
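The two-channel delivery described above can be pictured with a small sketch. This is not the WMS code: the payload format (a bare integer exit code), the order in which the channels are tried, and all names are assumptions made for illustration.

```python
class JobWrapperOutputError(Exception):
    """Raised when neither channel delivered the exit status."""


def _parse(text):
    # Hypothetical format: the channel carries just the integer exit code.
    try:
        return int(text.strip())
    except (AttributeError, ValueError):
        # None (nothing delivered) or unparsable content.
        return None


def exit_status(maradona_text, stdout_text):
    """Return (source, code) from the first channel that delivered a code.

    Each argument is the text received through that channel, or None if
    nothing arrived.
    """
    for source, text in (("maradona", maradona_text), ("stdout", stdout_text)):
        code = _parse(text)
        if code is not None:
            return source, code
    raise JobWrapperOutputError(
        "Cannot read JobWrapper output, both from Condor and from Maradona"
    )
```

The error is raised only when both channels come back empty, which is why the message usually points at a job that never finished rather than at a transfer problem.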

Some (even) more cryptic errors
• glite-job-logging-info shows "Got a job held event, reason: Error connecting to schedd..."
* Condor timed out while connecting to the schedd on the gLite CE.
* Possibly because of an unstable network, or because a disk filled up somewhere.

Some (even) more cryptic errors
• glite-job-logging-info shows "Got a job held event, reason: Attempts to submit failed"
* It means that the job could not be successfully handed over to the batch system by the non-privileged user that resulted from the GRAM/LCMAPS mapping.
* For example, the BLParser is not running on the batch system head node.
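The messages covered so far can be collected into a small lookup table for scanning logging-info output. The sketch below only restates the hints from the slides above. The function name and the substring matching are assumptions, not part of any gLite tool.

```python
# (substring of the message, hint from the slides)
LIKELY_CAUSES = [
    ("PeriodicHold expression",
     "Condor could not submit the job to the CE within 900 seconds"),
    ("Spooling input data files",
     "race condition between gridmanager and schedd during file stage-in"),
    ("Cannot take token",
     "gridftp clients missing on the WNs, or proxy expired before execution"),
    ("Cannot read JobWrapper output",
     "the job did not run to completion"),
    ("Error connecting to schedd",
     "timeout reaching the schedd on the CE: unstable network or full disk"),
    ("Attempts to submit failed",
     "job could not be handed over to the batch system, e.g. BLParser not running"),
]


def likely_cause(message):
    """Return the first matching hint for `message`, or None if unknown."""
    lowered = message.lower()
    for needle, hint in LIKELY_CAUSES:
        if needle.lower() in lowered:
            return hint
    return None
```

A return value of `None` simply means the message is not one of those discussed here.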

Some (even) more cryptic errors
• Some error messages do not reflect the real cause of the trouble. Example:
  – A job fails with the status reason “Got a job held event, reason: Globus error 3: an I/O operation failed”
  – You might think you have a network problem or a communication problem between grid elements.
  – Not necessarily. This error is mostly due to a shortage of memory on the RB, CE or WN. From the ROLLOUT mailing list:
    • “The problem was that memory was very low. queue_submit() in Helper.pm of GRAM checks for memory and returns a NORESOURCES error if the free memory is less than 2% of the total, NORESOURCES is GRAM error 3, not necessarily IO. The reason for that was that interlogd was using 717 MB of RAM, so I restarted it with: /etc/init.d/locallogger restart”
  – The problem can also be due to a lack of disk space or quota, or to a permission problem with the pool account home directory.
  – Admittedly, it could also be a hardware I/O error.
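The 2% rule quoted above is easy to reproduce when checking a suspect node by hand. The sketch below is not the Perl code from Helper.pm. It assumes that "free memory" corresponds to the MemFree value in Linux /proc/meminfo, and the function names are hypothetical.

```python
def meminfo_kb(text):
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def low_memory(free, total, min_free_percent=2):
    """True when free memory is below `min_free_percent` of the total."""
    # Integer arithmetic avoids floating-point surprises at the boundary.
    return free * 100 < total * min_free_percent


# Hypothetical figures, for illustration only.
SAMPLE_MEMINFO = """\
MemTotal:        1000000 kB
MemFree:           15000 kB
"""
```

On a live node the text would be read from /proc/meminfo. With the sample figures the free memory is 1.5% of the total, which is under the threshold.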

Some more readable messages
• Load limiter is kicking you out
[cesini@lcg-ui corso]$ glite-wms-job-submit -a -c wms_rb03.conf first.jdl
Connecting to the service https://cert-rb-03.cnaf.infn.it:7443/glite_wms_wmproxy_server
Error - Operation failed
Unable to register the job to the service: https://cert-rb-03.cnaf.infn.it:7443/glite_wms_wmproxy_server
System load is too high:
Threshold for Load Average(1 min): 0 => Detected value for Load Average(1 min): 0.25
Threshold for Load Average(5 min): 0 => Detected value for Load Average(5 min): 0.14
Threshold for Load Average(15 min): 0 => Detected value for Load Average(15 min): 0.10
Method: jobRegister
Error code: 1228
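The load limiter message above has a regular shape, so the thresholds and detected values can be pulled out of it. The sketch below is a minimal parser written against the wording shown on the slide. The function names are hypothetical and the sample reuses the slide's numbers.

```python
import re

PAIR = re.compile(
    r"Threshold for Load Average\((\d+) min\):\s*(\d+(?:\.\d+)?)\s*=>\s*"
    r"Detected value for Load Average\(\d+ min\):\s*(\d+(?:\.\d+)?)"
)


def parse_load_limiter(message):
    """Return a list of (minutes, threshold, detected) tuples."""
    return [(int(m), float(t), float(d)) for m, t, d in PAIR.findall(message)]


def exceeded_windows(message):
    """Return the averaging windows (in minutes) whose load is above threshold."""
    return [m for m, t, d in parse_load_limiter(message) if d > t]


SAMPLE = (
    "System load is too high: "
    "Threshold for Load Average(1 min): 0 => Detected value for Load Average(1 min): 0.25 "
    "Threshold for Load Average(5 min): 0 => Detected value for Load Average(5 min): 0.14 "
    "Threshold for Load Average(15 min): 0 => Detected value for Load Average(15 min): 0.10"
)
```

In the sample every threshold is 0, so all three windows are reported as exceeded.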

Some more readable messages
• Job has been terminated by the batch system
  – For some reason (most likely because the job lasted too long), a termination signal was generated by the batch system.
  – This message is a sufficient indication that the batch system killed the job, but not a necessary one: jobs may be killed without the Grid knowing about it, in which case the job is aborted after some timeout.