Ever tried. Ever failed. No matter. Try again.

+ Ever tried. Ever failed. No matter. Try again. Fail again. Fail better. (S. Beckett)
  CPass0/CPass1 on LHC12f/e/d/c
  Updated at 10:00 on 27/08
  C. Zampolli

+ Some diagnostics
  • During the last 2 months of CPass0/CPass1 processing, quite some manual intervention was needed:
      • Fixing steering macros/scripts
      • Restarting CPass0 and/or CPass1
      • Triggering CPass0 and/or CPass1 manually
  • Main reasons (from memory; I might be forgetting something):
      • Wrong AddTaskTPCcalib.C committed to the release by mistake during synchronization
      • Merging of syswatch trees not properly tested and consuming too much memory
      • Wrong TPC OCDB update in the makeOCDB.C macro for CPass1
      • Wrong TPC gain threshold used for validation
  C. Zampolli, 8/20/12

+ Some diagnostics (II)
  • Reprocessing of LHC12d due to a bug in the TRD reconstruction
  • Re-reprocessing of LHC12d due to a problem with the TRD code in Rev-23
  • Some LHC12e runs to be reprocessed after a fix in the aliases files, due to “miscommunication” (mis = missing + wrong) between TRD, RC, Trigger and calibration
  • CPass1 manual triggering for runs that failed in T0 at CPass0 (1 done, 20 to be done)
  • CPass1 manual triggering for a run for which CPass0 was merged manually (Raphaelle)
  • CNAF disk full
  • ALICE::CERN::T0 issue

+ Two more comments…
  • As already said in July, no modification in AliRoot that may affect the calibration should be requested to be ported to the release if it has not been properly tested in the calibration train on the grid
      • I cannot know whether changes may affect the calibration; the detector experts should
  • Since apparently it is not enough to show updates at the usual meetings (Monday Offline, Tuesday RC, Thursday Offline Calibration Readiness and Friday Calibration), I think it would be important that:
      • One person representing all the detectors taking part in CPass0/CPass1 should always be present at the calibration meetings
      • If the person directly responsible is not available, someone representing the corresponding detector should participate anyway, to propagate the information discussed there

+ LHC12f

+ Summary table, on 27/08 at ~10:00: LHC12f
  • 64 in logbook
      • Filters used: LHC12f, PHYSICS, Good Run, GRP ok, at least one of [SDD, TPC, TRD, TOF, T0], with Beam
  • CPass0:
      • Snapshot: 63
      • Reco+CalibTrain: 63
      • Merging+OCDB: 61
  • CPass1:
      • Snapshot: 41
      • Reco+CalibTrain: 41
      • Merging+OCDB: 37

+ Summary table, on 27/08 at ~10:00: CPass0, LHC12f
  • COSMICS: 0 (failure expected)
  • EMCAL/PHOS/MUON: 13 (failure expected)
  • No triggers: 0 (failure expected, too short run)
  • EE/EV/Expired: 0 (memory issue during the merging, under investigation)
  • Running: 1
  • Others (detectors): 5 (but all short runs)
  • Successful: 42
  • 42/(42+5) = 89.4% success rate
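The success rate on this slide counts only runs that were expected to pass: expected failures (COSMICS, EMCAL/PHOS/MUON, no triggers) and the run still in progress are left out of the denominator. A minimal Python sketch of that arithmetic follows. The function name `success_rate` and the dictionary layout are illustrative assumptions, not part of the CPass machinery.

```python
def success_rate(successful, failed):
    """Fraction of successful runs among the runs expected to succeed."""
    total = successful + failed
    if total == 0:
        raise ValueError("no runs were expected to succeed")
    return successful / total


# CPass0 counts for LHC12f, copied from the slide.
lhc12f = {
    "COSMICS": 0,             # failure expected, excluded
    "EMCAL/PHOS/MUON": 13,    # failure expected, excluded
    "No triggers": 0,         # failure expected, excluded
    "EE/EV/Expired": 0,
    "Running": 1,             # not finished yet, excluded
    "Others (detectors)": 5,  # unexpected failures
    "Successful": 42,
}

rate = success_rate(lhc12f["Successful"], lhc12f["Others (detectors)"])
print(f"{100 * rate:.1f}% success rate")
```

With the slide's numbers this reproduces 42/(42+5), i.e. 89.4%.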

+ Summary table, on 27/08 at ~10:00: CPass0, LHC12f
  Failure reason: TRD (5)
    Run number   Duration   Events   Tracks
    186694       12 min      4874     43825
    186816        6 min     11111    114733
    186855        7 min     11242    107505
    187147        7 min     11089    138408
    187148        5 min     11138    154946
  All failures are due to too short runs (number of events/tracks in terms of events used by TRD calibration)

+ Summary table, on 27/08 at ~10:00: CPass0, LHC12f
  Failure reason: EMCAL/MUON/PHOS runs (13)
    186805, 186834, 186926, 186962, 186980, 186981, 187046, 187064, 187081, 187117, 187133, 187198

+ Summary table, on 27/08 at ~10:00: CPass0, LHC12f
  Failure reason: TOF, T0, GRP (1)
    187203: Merge.log file looks suspicious; seems a problem with the grid, to be retried

+ Summary table, on 27/08 at ~10:00: CPass1, LHC12f
  • Of the 44 successful runs:
      • 41 at CPass1 reco+CalibTrain
      • 37 at CPass1 merging+OCDB

+ LHC12e

+ Summary table, on 27/08 at ~10:00: LHC12e
  • 27 in logbook
      • Filters used: LHC12e, PHYSICS, Good Run, GRP ok, at least one of [SDD, TPC, TRD, TOF, T0]
  • CPass0, completed:
      • Snapshot: 27
      • Reco+CalibTrain: 27
      • Merging+OCDB: 27, 21 useful, 11 ok
  • CPass1, completed:
      • Snapshot: 14
      • Reco+CalibTrain: 13
      • Merging+OCDB: 11

+ Summary table, on 27/08 at ~10:00: CPass0, LHC12e
  • COSMICS: 0 (failure expected)
  • EMCAL/PHOS/MUON: 6 (failure expected)
  • No triggers: 0 (failure expected, too short run)
  • EE/EV/Expired: 0 (memory issue during the merging, under investigation)
  • Running: 0
  • Others (detectors): 10, of which 3 recovered so far for TRD, 7 remaining
  • Successful: 11, became 14
  • 11/(11+10) = 52.4% success rate, became 14/(14+7) = 66.7%

+ Summary table, on 27/08 at ~10:00: CPass0, LHC12e
  Failure reason: TRD (8)
    186428 (*), 186429 (*), 186453 (*), 186456 (**) (14 min, events), 186459 (**) (14 min, events), 186507 (*), 186508 (**), 186598 (*)
  Failure reason: TRD + T0 (1)
    186600 (**) (14 min, events)
  Failure reason: T0 (1)
    186601
  • TRD:
      • (*) suffered from a missing class (CSPI8WU-S-NOPF-ALL) in the configuration during data taking
          • Fixed manually using CINT8WU-S-NOPF-ALL; CPass0/1 should be re-run
      • (**) suffered from statistics (186459 has CSPI8WU-S-NOPF-ALL, but with zero triggers)
  • T0 suffers from high background, but limits will be increased
      • Re-running will be ok (but CPass1 should be triggered manually if Rev < Rev-23 will be used)

+ Summary table, on 27/08 at ~10:00: CPass0, LHC12e, REPROCESSING
  Failure reason: TRD (5)
    186428, 186429, 186453, 186507, 186598
    Status labels on the slide: Failed (statistics), Ok, Problem
  Failure reason: T0 (1)
    186601: CPass1 re-run! Failing again in CPass1 as expected, but T0 experts already fixed the OCDB

+ Summary table, on 27/08 at ~10:00: CPass0, LHC12e
  Failure reason: EMCAL/MUON/PHOS runs (6)
    186383, 186405, 186425, 186448, 186503, 186589

+ Summary table, on 27/08 at ~10:00: CPass1, LHC12e
  • Of the 14 successful runs, 15 at CPass1 (one more since 186601 was inserted manually!):
      • 15 at the snapshot
      • 15 at CPass1 reco+CalibTrain
      • 15 at CPass1 merging+OCDB

+ Actions
  • COMPLETED
  • Since the period was too short, the manual update should be done together with LHC12d, waiting for this period to be completed

+ LHC12d

+ Summary table, on 27/08 at ~10:00: LHC12d
  • 224 in logbook
      • Filters used: LHC12d, PHYSICS, Good Run, GRP ok, at least one of [SDD, TPC, TRD, TOF, T0]
  • CPass0 completed:
      • Snapshot: 220
      • Reco+CalibTrain: 220
      • Merging+OCDB: 220, 176 needed, 147 ok
  • CPass1 completed:
      • Snapshot: 148 (1 more than CPass0, triggered manually after CPass0)
      • Reco+CalibTrain: 148
      • Merging+OCDB: 148, 148 needed

+ Difference between logbook and snapshot in MonALISA
  • In logbook, but not in MonALISA:
      • 184370 (EMCAL), 184645 (EMCAL), 185345 (ACORDE trigger), 185347 (ACORDE trigger), 185467 (still in the migration process, checking with offline)
  • In MonALISA, but not in the logbook:
      • 185190 (short run, the quality flag was changed)
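This kind of cross-check between two run lists is a plain set difference. The sketch below is one hedged way to do it in Python: the helper name `compare_run_lists` is hypothetical, and the two "common" runs are placeholders picked only to make the example non-trivial. Only the six differing run numbers come from the slide.

```python
def compare_run_lists(logbook, monalisa):
    """Return (runs only in the logbook, runs only in MonALISA), both sorted."""
    logbook, monalisa = set(logbook), set(monalisa)
    return sorted(logbook - monalisa), sorted(monalisa - logbook)


# Placeholder runs assumed to be present in both lists (illustration only).
common = [184443, 184481]

# Differences as reported on the slide.
logbook = common + [184370, 184645, 185345, 185347, 185467]
monalisa = common + [185190]

missing_in_monalisa, missing_in_logbook = compare_run_lists(logbook, monalisa)
print("In logbook, not in MonALISA:", missing_in_monalisa)
print("In MonALISA, not in logbook:", missing_in_logbook)
```

The differences are consistent with the LHC12d summary counts: 224 runs in the logbook, minus 5 missing in MonALISA, plus 1 extra, gives the 220 runs seen at the snapshot stage.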

+ Summary table, on 27/08 at ~10:00: CPass0, LHC12d
  • COSMICS: 9 (failure expected)
  • EMCAL/PHOS/MUON: 33 (failure expected)
  • No triggers: 2 (failure expected, too short run)
  • EE/EV/Expired: 1 (memory issue during the merging, but then merged manually)
  • Running: 0
  • Others (detectors): 28
  • Successful: 147
  • 147/(147+28+1) = 83.5% success rate

+ Summary table, on 27/08 at ~10:00: CPass0, LHC12d
  Failure reason: TPC Gain Threshold (1)
    185460 (also TRD)
    16 recovered by rerunning with looser constraints for validation (run 185460 not retried, since it failed anyway in TRD)
  Failure reason: COSMICS (9)
    184880, 184882, 184885, 184886, 184889, 184910, 184914, 184918, 186264

+ Summary table, on 27/08 at ~10:00: CPass0, LHC12d
  Failure reason: T0 (20)
    185687, 185692, 185697, 185698, 185699, 185700, 185701, 185734, 185735, 185738, 185756, 185757, 185764, 185765, 185768, 185775, 185776, 185778, 185784
    Hardware problem, fixed now

+ Summary table, on 27/08 at ~10:00: CPass0, LHC12d
  Failure reason: EMCAL/MUON/PHOS runs (33)
    184443, 184481, 184663, 184664, 184709, 184716, 184719, 184762, 184780,
    185024, 185148, 185186, 185341, 185456, 185559, 185560, 185562, 185631,
    185647, 185677, 185731, 185934, 185994, 185998, 186036, 186062, 186063,
    186159, 186192, 186224, 186225, 186232, 186316

+ Summary table, on 27/08 at ~10:00: CPass0, LHC12d
  Failure reason: No triggers (2)
    183915, 185190
  Failure reason: TRD (8)
    184190, 185133, 185378, 185460 (also TPC), 185915, 185916, 186319, 186320
  Failure reason: EV (1)
    184673 (merged manually)

+ Summary table, on 27/08 at ~10:00: CPass1, LHC12d
  • Of the 147 successful runs:
      • 148 at CPass1 reco+CalibTrain
          • 1 more than CPass0, since CPass0 was merged manually and the objects were uploaded manually in the OCDB (184673)
      • 148 at CPass1 merging+OCDB…
          • …of which 147 successful (ignore the red TPC color)…
          • …1 failed in TRD (184145)…
              • Different statistics for CPass0 and CPass1
              • 480/480 chunks at CPass0
              • 472/480 chunks at CPass1

+ TRD issue
  • Due to a problem in the TRD reconstruction, some wrong OCDB entries were produced at CPass0; it is not possible to get the correct ones without re-running CPass0
  • Some manual OCDB update is needed (after LHC12d is fully processed, ongoing for completed runs): DONE
  • Then CPass0/CPass1 should be re-run with a Rev > Rev-18
      • Rev-23 (the latest) was used
      • Changes in the TRD code made the calibration not work properly
      • More tests, new re-running with Rev-22
  • Will the failed runs be recovered? Waiting for the experts’ reply; still not known

+ Actions
  • CPass0 completed
  • 20 runs failed at CPass0 due to T0 hardware problems
      • CPass1 should be triggered manually for these runs
      • To be done after reprocessing, since now it would be useless (they all contain TRD)
  • Re-running with Rev-22… ongoing

+ LHC12c

+ Summary table, on 27/08 at ~10:00: LHC12c
  • 205 in logbook
      • Filters used: LHC12c, PHYSICS, Good Run, GRP ok, at least one of [SDD, TPC, TRD, TOF, T0]
      • Do not coincide with those in MonALISA, since runs were queued manually for CPass0
  • CPass0 completed:
      • Snapshot: 208, 1 should be ignored (179444)
      • Reco+CalibTrain: 207
      • Merging+OCDB: 207, 109 needed, 93 ok
  • CPass1 completed:
      • Snapshot: 93
      • Reco+CalibTrain: 93
      • Merging+OCDB: 93

+ Summary table, on 27/08 at ~10:00: CPass0, LHC12c
  • COSMICS: 37 (failure expected)
  • EMCAL/PHOS/MUON: 58 (failure expected)
  • No triggers: 3 (failure expected: too short, or not the right trigger configuration)
  • EE/EV/Expired: 0
  • Others (detectors): 16
  • Successful: 93
  • 93/(93+16) = 85.3% success rate
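For a side-by-side view, the CPass0 success rates of the four periods can be recomputed from the counts on their summary slides. The short Python sketch below does this; the `periods` layout is an assumption made for illustration, and the pooled rate over all periods is derived here and does not appear on the slides. For LHC12e the counts after the TRD recovery (14 successful, 7 failed) are used, and for LHC12d the manually merged EV run is counted as a failure, as on the slide.

```python
# (successful, unexpected failures) per period, taken from the summary slides.
periods = {
    "LHC12f": (42, 5),
    "LHC12e": (14, 7),        # after recovering 3 TRD runs
    "LHC12d": (147, 28 + 1),  # 28 detector failures + 1 EV merged manually
    "LHC12c": (93, 16),
}

# Per-period success rate in percent.
rates = {
    name: 100 * ok / (ok + failed) for name, (ok, failed) in periods.items()
}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")

# Pooled rate over all four periods (derived, not quoted in the slides).
total_ok = sum(ok for ok, _ in periods.values())
total_failed = sum(failed for _, failed in periods.values())
pooled = 100 * total_ok / (total_ok + total_failed)
print(f"All periods: {pooled:.1f}%")
```

The per-period values agree with the figures quoted on the slides once rounded to one decimal place.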

+ Summary table, on 27/08 at ~10:00: CPass0, LHC12c
  Failure reason: COSMICS (37)
    179658, 179712, 179713, 179717, 179723, 179725, 179730, 179736, 179740,
    179742, 179743, 179746, 179747, 179758, 179766, 179941, 179943, 179944,
    179946, 179948, 179950, 179951, 179960, 180164, 180979, 180980, 180981,
    180983, 180984, 180985, 180986, 180987, 180988, 180991, 180992, 182749,
    182750

+ Summary table, on 27/08 at ~10:00: CPass0, LHC12c
  Failure reason: EMCAL/MUON/PHOS runs (58)
    179595, 179603, 179604, 179685, 179687, 180552, 180559, 180616, 180643,
    180644, 180692, 180704, 181026, 181040, 181046, 181328, 181339, 181344,
    181360, 181546, 181558

+ Summary table, on 27/08 at ~10:00: CPass0, LHC12c
  Failure reason: EMCAL/MUON/PHOS runs (58), continued
    181580, 181625, 181631, 181954, 181956, 181984, 182003, 182094, 182100,
    182103, 182195, 182198, 182200, 182226, 182316, 182403, 182405, 182410,
    182449, 182451, 182452, 182470, 182471, 182475, 182477

+ Summary table, on 27/08 at ~10:00: CPass0, LHC12c
  Failure reason: EMCAL/MUON/PHOS runs (58), continued
    182499, 182502, 182504, 182609, 182610, 182612, 182640, 182641, 182681, 182712, 182717, 182721

+ Summary table, on 27/08 at ~10:00: CPass0, LHC12c
  Failure reason: No triggers (3)
    180934, 181609, 182639
  Failure reason: TRD (7)
    180716 (*), 180717 (*), 182325 (*), 182508 (*), 182509 (*), 182513 (*), 182724 (*)
  Failure reason: TPC+TRD (9)
    181617 (**), 181618 (**), 181619 (**), 181620 (**), 181652 (**), 181694 (**), 181698 (**), 181701 (**), 181703 (**)
  (*) Low statistics, recoverable
  (*) Low statistics, not recoverable
  (**) No SSD/SDD: number of contributors to Vertex Track = 0, TRD calibration failing, TRD fix in place; what about TPC?

+ Summary table, on 27/08 at ~10:00: CPass1, LHC12c
  • Of the 93 successful runs:
      • 93 at CPass1 reco+CalibTrain
      • 93 at CPass1 merging+OCDB…
          • …of which 84 successful in CPass1 (ignore the red TPC color)…
          • …and 9 failed in T0, but are MUON runs: they should not have gone through (different AliRoot, some changes in T0)
  • As soon as CPass1 is completed, 1 week will be given for the manual update. If that is too little (QM, holidays), we’ll increase it. Then Vpass should start

+ Actions
  • CPass0 completed; 9 runs failed in TPC and TRD
      • Not recoverable, no CPass1
  • 7 runs failed in TRD due to low statistics
      • TRD can recover them manually, but no CPass1 would be run after those; how will the other detectors mark these runs?
          • TOF, T0: bad
          • Mean Vertex: good
          • TRP? TRD?
  • CPass1 completed on the available runs
  • In summary, ready for the manual update window
  • 1 week for the manual update announced: deadline on Friday 31 Aug (so far; possibly extended to Monday)

+ Further comments

+ Interdependencies
  • Under discussion: do EMCAL runs need calibration triggers? (PHOS does not)
      • Seems not!

+ Further issues
  • Some reconstruction jobs fail with bad_alloc: under investigation
      • Grid tests with gdb ongoing: not much information retrievable, the jobs ran successfully
      • Valgrind test ongoing: did not show anything significant
      • Trying with Rev-21 on LHC12c, LHC12e
          • Many errors, but FPE, not bad_alloc
          • Stack trace available
      • I could not reproduce the problem, still investigating

+ PPass
  • LHC12a and LHC12b: Vpass validated, ready for Ppass
  • A patched Rev-16 was created to fix the TRD QA issue, to be used to run Ppass
  • LHC12a completed, QA feedback last week
  • LHC12b completed, QA feedback last week

+ Calibration of old data
  • GRP/CTP/Aliases entries to be created, after defining the classes to be used for the reconstruction
  • It might be necessary to apply some downscale
      • min(max(nevents/10, 30000), nevents)/nevents, but we need to define nevents
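To make the proposed downscale concrete, here is a minimal Python sketch of the formula on the slide: keep one tenth of the events, but at least 30000, and never more than the run contains. The function name `downscale_fraction` and its parameter names are illustrative assumptions. The slide leaves the definition of nevents open, so the sketch simply takes it as an input.

```python
def downscale_fraction(nevents, divisor=10, floor=30000):
    """Fraction of events to keep: min(max(nevents/divisor, floor), nevents)/nevents."""
    if nevents <= 0:
        raise ValueError("nevents must be positive")
    kept = min(max(nevents / divisor, floor), nevents)
    return kept / nevents


# Three regimes of the formula (event counts chosen for illustration only).
for n in (1_000_000, 100_000, 20_000):
    print(n, downscale_fraction(n))
```

The three sample values cover the three regimes: for large runs one tenth is kept (fraction 0.1), for intermediate runs the floor of 30000 events applies (fraction 0.3 at 100000 events), and runs with fewer than 30000 events are kept in full (fraction 1.0).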

+ pA
  • Since MB will be the main trigger, we propose to use that and downscale
  • For the pA pilot run, all data are asked to be reconstructed, keeping ESDs, friends, and ITS RecPoints
  • Tests on LHC11f2 ongoing; feedback will be asked