JRA1 IT-CZ cluster meeting, January 24-25



JRA1 IT-CZ cluster meeting, January 24-25, 2004
www.eu-egee.org
LCG and gLite open issues
Massimo Sgaravatto, INFN Padova
EGEE is a project funded by the European Union under contract IST-2003-508833

LCG problems hopefully already addressed
• The bugs below are still open in the LCG Savannah, but they have already been addressed
  - Patches provided (by us, or by LCG)
• Still open because the patches are under test or still to be tested
• #2716, #3252, #3546, #3807, #3848, #3883, #3884, #3895, #3896, #3900, #3916, #4009, #4047, #4070, #4098, #4109, #4126, #4127, #4144, #4318, #4378, #4836, #4891, #4909, #5237, #5238, #5244, #5261, #5269, #5274, #5351, #5471, #6163

LCG issues not addressed yet
• #3302: On a RB+SE node there is a GridFTP problem
  - Asked LCG for clarifications: no answer
  - Not considered a high-priority problem
• #3671: To drain an RB
  - They would like to be able to disallow new submissions while still allowing the other commands
  - Asked LCG about the idea discussed last time: the NS looks for a given file, created by the admin on the Broker, and if the file exists the NS will drain all submissions
    - No feedback so far
• #3724: LogMonitor should be resilient to a full file system
  - Still to be understood why irepository.dat could not be recovered
  - Not investigated further so far
• #3808: NetworkServer must log from which UI the job was submitted
  - A patch was provided, but it logs the UI address and the user DN in *separate* messages (and it is not possible to unambiguously connect them)
  - Asked whether they could use the LB info instead: no answer
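The drain-file idea for #3671 can be sketched as follows. This is only an illustration of the proposal, not the NS code: the file path and the command names are hypothetical, since the slides name neither.

```python
import os

# Hypothetical location of the flag file created by the admin on the Broker.
DRAIN_FILE = "/tmp/ns_drain"

# Hypothetical command name standing for any request that submits a new job.
SUBMISSION_COMMANDS = {"submit"}


def is_draining(drain_file=DRAIN_FILE):
    """Return True if the admin has created the drain file."""
    return os.path.exists(drain_file)


def accept_command(command, drain_file=DRAIN_FILE):
    """Refuse new submissions while draining; let every other command through."""
    if command in SUBMISSION_COMMANDS and is_draining(drain_file):
        return False
    return True
```

With this shape the admin starts draining by creating the file and stops by removing it, without restarting the daemon.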

LCG issues not addressed yet
• #3871: edg-wl-bkserverd: Terminating after 500 connections
  - 'event_store_recover': likely an inter-thread locking bug, which must be investigated
• #4319: Suggestion for change of policy for resubmitted jobs
  - Basically they (D. Smith) think that if the job doesn't even start its execution on a WN, this should not be counted as a (re)submission
  - Fix applied by David Smith, under test:
    - The logging of "running" by the LRMS is treated as a priority event and its return code is checked: if the logging doesn't appear successful, the job script returns an indication of the error in the output/maradona and exits without starting the job
    - If no events were logged by the LRMS, the job didn't start, so a shallow resubmission can be performed
    - The maximum number of these new-type resubmissions per job is a broker-side configuration option
    - The new resubmissions won't be done if doing so would send the job back to a previously tried destination
  - No feedback on the testing/certification of these changes
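The shallow-resubmission rules listed for #4319 can be summarised as a single decision. The function and parameter names below are hypothetical and this is not David Smith's patch, only a sketch of the three conditions stated on the slide.

```python
def can_shallow_resubmit(running_logged, shallow_count, max_shallow,
                         candidate, tried_destinations):
    """Decide whether a failed job may be shallow-resubmitted to `candidate`.

    running_logged     -- True if the LRMS logged the "running" event
    shallow_count      -- shallow resubmissions already done for this job
    max_shallow        -- broker-side configured maximum
    candidate          -- destination the job would be sent to next
    tried_destinations -- destinations already tried for this job
    """
    if running_logged:
        # The job did start on a WN, so this is not the shallow case.
        return False
    if shallow_count >= max_shallow:
        # Broker-side limit reached.
        return False
    if candidate in tried_destinations:
        # Never send the job back to a previously tried destination.
        return False
    return True
```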

LCG issues not addressed yet
• #4570: Multiple cancel requests can crash WM (and possibly PR)
  - Addressed for PR
  - For WM already discussed (it would require major modifications)
• #4894: NS can become unresponsive during dialogue with client
  - Marco agreed with D. Smith to review that part of the code
• #5347: FD limit for LM
  - D. Smith changed the system hard limit on file descriptors for the LM (to 16384) because of the large number of condorG logfiles (and associated state files)
  - This was not sufficient: at some points in the code (e.g. in dgssl.c) select()s are done on fd sets of type 'fd_set', which are only large enough for 1024 descriptors
  - Asked for clarifications: no feedback so far
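For #5347, the 1024 limit comes from the fixed size of 'fd_set', not from the process limit. A common way around it in general is poll(), which takes a list of descriptors instead of a fixed-size bitmap. The sketch below shows that technique in Python on a Unix system. It is an assumption about a possible direction, not the fix chosen for the WMS code, and the helper name is hypothetical.

```python
import select


def wait_readable(fds, timeout_ms):
    """Return the descriptors in `fds` that become readable within timeout_ms.

    Uses poll(), so descriptor values are not bounded by the size of fd_set.
    """
    poller = select.poll()
    for fd in fds:
        poller.register(fd, select.POLLIN)
    ready = poller.poll(timeout_ms)  # list of (fd, event mask) pairs
    return [fd for fd, events in ready if events & select.POLLIN]
```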

LCG issues not addressed yet
• #5404: JC/LM id repository
  - Inconsistency between the JC (memory-resident) id repository and the LM (disk-resident) version
  - This happened when a daemon was down for a while
  - Each daemon needs to know if its partner is live or dead
  - Proposal submitted to LCG for feedback: each daemon writes a file with an epoch and updates it every m seconds; if the date in the partner's file is older than a threshold, the partner is considered dead and a more or less drastic action can be taken. No feedback so far
• #5442: Setting output path for LCG GUI Job Monitor
  - Actually the problem was that the user didn't read the doc
  - The only problem that needs to be fixed is that the GUI always tries to use the home directory for the retrieval of the OSB (it doesn't remember the previous choice)
• #5549: NS cannot handle being addressed through RB host alias
  - Necessary to use libresolv and rely on the 'real' name
  - MarcoP
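The heartbeat-file proposal for #5404 can be sketched as below. File names, the threshold and the function names are hypothetical; the slides fix only the idea (an epoch written every m seconds and compared against a threshold by the partner).

```python
import os
import time


def write_heartbeat(path, now=None):
    """Write the current epoch (in seconds) to this daemon's heartbeat file."""
    now = time.time() if now is None else now
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(int(now)))
    # Rename over the old file so the partner never reads a half-written epoch.
    os.replace(tmp, path)


def partner_is_alive(path, threshold, now=None):
    """Return True if the partner's epoch is at most `threshold` seconds old."""
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            stamp = int(f.read().strip())
    except (OSError, ValueError):
        # Missing or unreadable file: treat the partner as dead.
        return False
    return (now - stamp) <= threshold
```

Each daemon would call write_heartbeat on its own file every m seconds and partner_is_alive on the other daemon's file; the threshold should be comfortably larger than m.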

LCG issues not addressed yet
• #6134: Using Python 2.3.4 with WMS UI
  - As far as I understand, only two issues are in our domain
    - UIutils.py and UIchecks.py should be in .../lib/python and not in .../bin
    - Warning to be removed
• #6210: Unable to Register the Job through RB host alias
  - Alias pointing to 2 RBs and the DNS alternates between the two nodes
  - Significant changes would be needed
  - LCG people classified this as a feature request (lowest priority)
• #6295: RB problem if the OutputData attribute is too big
  - Job submission hangs with a long OutputData (JDL?) attribute
  - Same problem (in socket++?) with JDL > 9K seen by Luciana Carota testing DAG?

Issues addressed by LCG that we didn't integrate yet
• #3931: Suggest a local proxy expiration check for WMS jobs
  - Proxy expiry check in the jobwrapper
• #4318: Matchmaking policy for resubmitted jobs
  - Remove previously matched sites in resubmission
  - Now we remove only previously matched CEs
• #4365: WL libraries/daemons must retry BDII queries
  - When the first query fails, it sleeps 5 seconds and retries; when the second attempt fails, it sleeps another 5 seconds and tries a third, final time
• #4892: NS can (partially) crash with 'unable to receive'
  - Uncaught exception
• #5109: WMS daemon memory leaks
  - Memory leaks in JC, ldif2classad, LM, LB, NS
  - Fixes integrated only for JC and LM (as far as I know)
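The retry policy described for #4365 (three attempts, 5 seconds apart) has roughly this shape. The function name and the injectable sleep are hypothetical and the query is passed in as a callable; this is not the LCG patch itself.

```python
import time


def query_with_retry(query, attempts=3, delay=5.0, sleep=time.sleep):
    """Call `query` up to `attempts` times, sleeping `delay` seconds between tries.

    Returns the first successful result; re-raises the last error if all fail.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return query()
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                # No sleep after the final attempt.
                sleep(delay)
    raise last_error
```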

gLite problems hopefully already addressed
• The bugs below are still open in the gLite Savannah, but they have already been addressed
• Still open because the patches are under test or still to be tested
• #4588, #4630, #4631, #4893, #4938, #5071, #5089, #5094, #5115, #5325, #5361, #5378, #5383, #5489, #5582, #5802, #5832, #5869, #5938, #5977, #6059, #6081, #6083

gLite issues not addressed yet
• #5278: lack of logging information for the workload_manager daemon
  - Discussed between Mario and FrancescoG
• #5582: Unable to get voms proxy info from a voms proxy
  - Submitted by PeppeGrid
• #5833: all jobs in SUBMITTED after a job storm
  - SUBMITTED status for approx. 3 hours because most LB events did not arrive at the bookkeeping server in a timely fashion
  - Being investigated by CESNET + FrancescoP
• #5849: LCMAPS VOMS plugin 0.0.30 crashes on SLC3
  - Just reassigned to Valerio: still under investigation
  - Looks like a duplicate of another bug
• #6253: voms-proxy-info has a different format to grid-proxy-info
  - Main difference is in the expiration date
  - Valerio is almost ready to commit the changes