Ever tried Ever failed No matter Try Again

+ Ever tried. Ever failed. No matter. Try Again. Fail again. Fail better. (S. Beckett)
  CPass0/CPass1 on LHC12e/d/c
  Updated at 10:00 on 20/08
  C. Zampolli

+ LHC12f
  C. Zampolli, 8/20/12

+ Summary table – on 20/08 at ~10:00
  LHC12f
  • 25 in logbook
    - Filters used: LHC12f, PHYSICS, Good Run, GRP ok, at least one of [SDD, TPC, TRD, TOF, T0]
  • CPass0, completed:
    - Snapshot: 26 (run 186687, 2 min, marked as bad later)
    - Reco+Calib. Train: 26
    - Merging+OCDB: 25 (186845 still running in the reco), 23 needed, 19 ok
  • CPass1, completed:
    - Snapshot: 19
    - Reco+Calib. Train: 19
    - Merging+OCDB: 19

+ Summary table – on 20/08 at ~10:00
  CPass0 – LHC12f
  • COSMICS: 0 (failure expected)
  • EMCAL/PHOS/MUON: 2 (failure expected)
  • No triggers: 0 (failure expected, too short run)
  • EE/EV/Expired: 0 (memory issue during the merging, under investigation)
  • Running: 0
  • Others (detectors): 4 (186855, 186816, 186694, 186687)
  • Successful: 19
  • 19/(19+4) = 82.6% success rate

+ Summary table – on 20/08 at ~10:00
  CPass0 – LHC12f

  Failure reason   Run number   Duration
  TRD (4)          186687       2 min
                   186694       12 min
                   186816       6 min
                   186855       7 min

  All failures due to too short runs

+ Summary table – on 20/08 at ~10:00
  CPass0 – LHC12f

  Failure reason             Run numbers
  EMCAL/MUON/PHOS runs (2)   186805, 186834

+ Summary table – on 20/08 at ~10:00
  CPass1 – LHC12f
  • Of the 19 successful runs:
    - 19 at CPass1 reco+Calib. Train
    - 19 at CPass1 merging+OCDB

+ LHC12e

+ Summary table – on 20/08 at ~10:00
  LHC12e
  • 27 in logbook
    - Filters used: LHC12e, PHYSICS, Good Run, GRP ok, at least one of [SDD, TPC, TRD, TOF, T0]
  • CPass0, completed:
    - Snapshot: 27
    - Reco+Calib. Train: 27
    - Merging+OCDB: 27, 21 useful, 11 ok
  • CPass1, completed:
    - Snapshot: 11
    - Reco+Calib. Train: 11
    - Merging+OCDB: 11

+ Summary table – on 20/08 at ~10:00
  CPass0 – LHC12e
  • COSMICS: 0 (failure expected)
  • EMCAL/PHOS/MUON: 6 (failure expected)
  • No triggers: 0 (failure expected, too short run)
  • EE/EV/Expired: 0 (memory issue during the merging, under investigation)
  • Running: 0
  • Others (detectors): 10
  • Successful: 11
  • 11/(11+10) = 52.4% success rate

+ Summary table – on 20/08 at ~10:00
  CPass0 – LHC12e

  Failure reason   Run numbers
  TRD (8)          186428 (*), 186429 (*), 186453 (*), 186456 (**),
                   186459 (**), 186507 (*), 186508 (**), 186598 (*)
  TRD + T0 (1)     186600 (**)
  T0 (1)           186601

  • TRD:
    - (*) suffered from a missing class (CSPI8WU-S-NOPF-ALL) in the configuration during data taking
      - Fixed manually using CINT8WU-S-NOPF-ALL
      - CPass0/1 should be re-run
    - (**) suffered from low statistics (186459 has CSPI8WU-S-NOPF-ALL, but with zero triggers)
  • T0 suffers from high background, but limits will be increased
    - Re-running will be ok (but CPass1 should be triggered manually if Rev < Rev-23 is used)

+ Summary table – on 20/08 at ~10:00
  CPass0 – LHC12e

  Failure reason             Run numbers
  EMCAL/MUON/PHOS runs (6)   186383, 186405, 186425, 186448, 186503, 186589

+ Summary table – on 20/08 at ~10:00
  CPass1 – LHC12e
  • Of the 11 successful runs:
    - 11 at CPass1 reco+Calib. Train
    - 11 at CPass1 merging+OCDB

+ Actions
  • CPass0 completed on the available runs
  • 10 runs failed
    - 2 in T0 (1 in common with TRD)
      - CPass1 can be triggered manually at any time
      - If everything is re-run with Rev > Rev-23 (the next to come), everything should be ok; otherwise CPass0 will fail again and CPass1 will need to be triggered manually
    - 9 in TRD (1 in common with T0)
      - 5 runs did not have the right class in the configuration
        - Fixed manually, waiting for OCDB update to re-run
      - 4 runs have too little statistics
  • CPass1 completed on the available runs
  • In summary, 6 runs can be recovered

+ LHC12d

+ Summary table – on 20/08 at ~10:00
  LHC12d
  • 224 in logbook
    - Filters used: LHC12d, PHYSICS, Good Run, GRP ok, at least one of [SDD, TPC, TRD, TOF, T0]
  • CPass0 completed:
    - Snapshot: 220
    - Reco+Calib. Train: 220
    - Merging+OCDB: 220, 176 needed, 147 ok
  • CPass1 completed:
    - Snapshot: 148 (1 more than CPass0, triggered manually after CPass0)
    - Reco+Calib. Train: 148
    - Merging+OCDB: 148, 148 needed

+ Difference between logbook and snapshot in MonALISA
  • In logbook, but not in MonALISA:
    - 184370 (EMCAL), 184645 (EMCAL), 185345 (ACORDE trigger), 185347 (ACORDE trigger), 185467
    - still in the migration process, checking with offline
  • In MonALISA, but not in the logbook:
    - 185190 (short run, the quality flag was changed)

+ Summary table – on 20/08 at ~10:00
  CPass0 – LHC12d
  • COSMICS: 9 (failure expected)
  • EMCAL/PHOS/MUON: 33 (failure expected)
  • No triggers: 2 (failure expected, too short run)
  • EE/EV/Expired: 1 (memory issue during the merging, but then merged manually)
  • Running: 0
  • Others (detectors): 28
  • Successful: 147
  • 147/(147+28+1) = 83.5% success rate
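
The success rates quoted in the CPass0 summary tables all follow the same pattern: successful runs divided by successful runs plus unexpected failures, with the "failure expected" categories (COSMICS, EMCAL/PHOS/MUON, No triggers) left out of the denominator. A minimal sketch of that arithmetic, using the counts from the four summary slides, is shown here; the function name `success_rate` and the `periods` dictionary are illustrative only and not part of any ALICE tool.

```python
def success_rate(successful: int, *unexpected_failures: int) -> float:
    """Return successful / (successful + unexpected failures).

    Expected failures (COSMICS, EMCAL/PHOS/MUON, no triggers) are assumed
    to be excluded by the caller, as on the slides.
    """
    attempted = successful + sum(unexpected_failures)
    if attempted == 0:
        raise ValueError("no runs to evaluate")
    return successful / attempted


# (successful, unexpected failure counts...) as quoted on the summary slides
periods = {
    "LHC12f": (19, 4),        # 4 failures in "Others (detectors)"
    "LHC12e": (11, 10),       # 10 failures in "Others (detectors)"
    "LHC12d": (147, 28, 1),   # 28 detector failures + 1 EE/EV/Expired
    "LHC12c": (93, 16),       # 16 failures in "Others (detectors)"
}

for name, (ok, *failed) in periods.items():
    print(f"{name}: {100 * success_rate(ok, *failed):.1f}%")
```

The denominators reproduce the "needed" (or "useful") run counts given for Merging+OCDB on the per-period summary slides: 23, 21, 176 and 109.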

+ Summary table – on 20/08 at ~10:00
  CPass0 – LHC12d

  Failure reason           Run numbers
  COSMICS (9)              184880, 184882, 184885, 184886, 184889,
                           184910, 184914, 184918, 186264
  TPC Gain Threshold (1)   185460 (also TRD)

  16 recovered by rerunning with looser constraints for validation (run 185460 not retried, since it failed anyway in TRD)

+ Summary table – on 20/08 at ~10:00
  CPass0 – LHC12d

  Failure reason   Run numbers
  T0 (20)          185687, 185692, 185697, 185698, 185699, 185700,
                   185701, 185734, 185735, 185738, 185756, 185757,
                   185764, 185765, 185768, 185775, 185776, 185778,
                   185784

  Hardware problem, fixed now

+ Summary table – on 20/08 at ~10:00
  CPass0 – LHC12d

  Failure reason              Run numbers
  EMCAL/MUON/PHOS runs (33)   184443, 184481, 184663, 184664, 184709, 184716,
                              184719, 184762, 184780, 185024, 185148, 185186,
                              185341, 185456, 185559, 185560, 185562, 185631,
                              185647, 185677, 185731, 185934, 185994, 185998,
                              186036, 186062, 186063, 186159, 186192, 186224,
                              186225, 186232, 186316

+ Summary table – on 20/08 at ~10:00
  CPass0 – LHC12d

  Failure reason    Run numbers
  No triggers (2)   183915, 185190
  TRD (8)           184190, 185133, 185378, 185460 (also TPC),
                    185915, 185916, 186319, 186320
  EV (1)            184673 (merged manually)

+ Summary table – on 20/08 at ~10:00
  CPass1 – LHC12d
  • Of the 147 successful runs:
    - 148 at CPass1 reco+Calib. Train
      - 1 more than CPass0, since CPass0 was merged manually and the objects were uploaded manually in the OCDB (184673)
    - 148 at CPass1 merging+OCDB…
      - …of which 147 successful (ignore the red TPC color)…
      - …1 failed in TRD (184145)…
  • Different statistics for CPass0 and CPass1:
    - 480/480 chunks at CPass0
    - 472/480 chunks at CPass1

+ TRD issue
  • Due to a problem in the TRD reconstruction, some wrong OCDB entries were produced at CPass0; it is not possible to get the correct ones without re-running CPass0
  • Some manual OCDB update is needed (after LHC12d is fully processed, ongoing for completed runs)
  • Then CPass0/CPass1 should be re-run with a Rev > Rev-18
  • Will the failed runs be recovered? Waiting for experts' reply

+ Actions
  • CPass0 completed
  • 20 runs failed at CPass0 due to T0 hardware problems
    - CPass1 should be triggered manually for these runs
      - To be done after reprocessing, since now it would be useless (they all contain TRD)
  • 8 runs failed in TRD
    - TRD needs LHC12d reprocessing (only for the runs it was in)
    - Will these 8 runs be recovered, or is the failure reason something that won't be fixed when re-running?
  • Run 184673 failed in CPass0 merging (EV); its CPass0 entries were produced manually by Raphaelle and uploaded in the OCDB
    - CPass1 run, everything seems ok

+ Actions – II
  • CPass1 completed
  • 1 run failed in TRD due to lower statistics at CPass1 reconstruction
    - Should we try to recover it?
    - Will the TRD people fix it manually before VPass?
    - Probably not needed, since we should re-run everything for TRD anyway
  • In summary, we are waiting to re-run for TRD

+ LHC12c

+ Summary table – on 20/08 at ~10:00
  LHC12c
  • 205 in logbook
    - Filters used: LHC12c, PHYSICS, Good Run, GRP ok, at least one of [SDD, TPC, TRD, TOF, T0]
    - Do not coincide with those in MonALISA, since runs were queued manually for CPass0
  • CPass0 completed:
    - Snapshot: 208, 1 should be ignored (179444)
    - Reco+Calib. Train: 207
    - Merging+OCDB: 207, 109 needed, 93 ok
  • CPass1 completed:
    - Snapshot: 93
    - Reco+Calib. Train: 93
    - Merging+OCDB: 93

+ Summary table – on 20/08 at ~10:00
  CPass0 – LHC12c
  • COSMICS: 37 (failure expected)
  • EMCAL/PHOS/MUON: 58 (failure expected)
  • No triggers: 3 (failure expected: too short, or not the right trigger configuration)
  • EE/EV/Expired: 0
  • Others (detectors): 16
  • Successful: 93
  • 93/(93+16) = 85.3% success rate

+ Summary table – on 20/08 at ~10:00
  CPass0 – LHC12c

  Failure reason   Run numbers
  COSMICS (37)     179658, 179712, 179713, 179717, 179723, 179725,
                   179730, 179736, 179740, 179742, 179743, 179746,
                   179747, 179758, 179766, 179941, 179943, 179944,
                   179946, 179948, 179950, 179951, 179960, 180164,
                   180979, 180980, 180981, 180983, 180984, 180985,
                   180986, 180987, 180988, 180991, 180992, 182749,
                   182750

+ Summary table – on 20/08 at ~10:00
  CPass0 – LHC12c

  Failure reason              Run numbers
  EMCAL/MUON/PHOS runs (58)   179595, 179603, 179604, 179685, 179687, 180552,
                              180559, 180616, 180643, 180644, 180692, 180704,
                              181026, 181040, 181046, 181328, 181339, 181344,
                              181360, 181546, 181558

+ Summary table – on 20/08 at ~10:00
  CPass0 – LHC12c

  Failure reason              Run numbers
  EMCAL/MUON/PHOS runs (58)   181580, 181625, 181631, 181954, 181956, 181984,
                              182003, 182094, 182100, 182103, 182195, 182198,
                              182200, 182226, 182316, 182403, 182405, 182410,
                              182449, 182451, 182452, 182470, 182471, 182475,
                              182477

+ Summary table – on 20/08 at ~10:00
  CPass0 – LHC12c

  Failure reason              Run numbers
  EMCAL/MUON/PHOS runs (58)   182499, 182502, 182504, 182609, 182610, 182612,
                              182640, 182641, 182681, 182712, 182717, 182721

+ Summary table – on 20/08 at ~10:00
  CPass0 – LHC12c

  Failure reason    Run numbers
  No triggers (3)   180934, 181609, 182639
  TRD (7)           180716 (*), 180717 (*), 182325 (*), 182508 (*),
                    182509 (*), 182513 (*), 182724 (*)
  TPC+TRD (9)       181617 (**), 181618 (**), 181619 (**), 181620 (**),
                    181652 (**), 181694 (**), 181698 (**), 181701 (**),
                    181703 (**)

  (*) Low statistics, recoverable
  (*) Low statistics, not recoverable
  (**) No SSD/SDD, number of contributors to Vertex Track = 0, TRD calibration failing, TRD fix in place; what about TPC?

+ Summary table – on 20/08 at ~10:00
  CPass1 – LHC12c
  • Of the 93 successful runs:
    - 93 at CPass1 reco+Calib. Train
    - 93 at CPass1 merging+OCDB…
      - …of which 84 successful in CPass1 (ignore the red TPC color)…
      - …and 9 failed in T0, but are MUON runs; they should not have gone through (different AliRoot, some changes in T0)
  • As soon as CPass1 is completed, 1 week will be given for manual update. If that is too little (QM, holidays), we'll increase it. Then Vpass should start

+ Actions
  • CPass0 completed
  • 9 runs failed in TPC and TRD
    - TRD failed due to missing SDD/SSD; what about TPC?
    - TRD provided a code fix
    - Would TPC try to recover these runs?
    - If both TPC and TRD can recover, should we wait for this and then run CPass0/CPass1 again?
  • 7 runs failed in TRD due to low statistics
    - TRD can recover them manually, but no CPass1 would be run after those
    - How will the other detectors mark these runs?
      - TOF, T0 bad
      - Mean Vertex good
      - TRP? TRD?
  • CPass1 completed on the available runs
  • In summary, we need to know whether the 9 runs that failed in TRD+TPC should be reprocessed
    - Need a statement from TPC

+ Further comments

+ Interdependencies
  • Under discussion: do EMCAL runs need calibration triggers? (PHOS does not)
    - Seems not!

+ Further issues
  • Some reconstruction jobs fail with bad_alloc: under investigation
    - Grid tests with gdb ongoing: not much information retrievable, the jobs ran successfully
    - Valgrind test ongoing: did not show anything significant
    - Trying with Rev-21 on LHC12c, LHC12e
      - Many errors, but FPE, not bad_alloc
      - Stack trace available
    - I could not reproduce the problem, still investigating

+ PPass
  • LHC12a and LHC12b Vpass validated: ready for Ppass
  • A patched Rev-16 was created to fix the TRD QA issue, to be used to run Ppass
    - LHC12a completed, waiting for QA feedback
    - LHC12b completed, waiting for QA feedback

+ Calibration of old data
  • GRP/CTP/Aliases entries to be created, after defining the classes to be used for the reconstruction
  • Some downscaling might be needed
    - min(max(nevents/10, 30000), nevents)/nevents, but we need to define nevents
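
To make the proposed downscale expression concrete, here is a small sketch that evaluates it for a few event counts. It assumes nevents is the number of events available in a run, which the slide explicitly leaves undefined; the function name `downscale_fraction` and its parameter names are hypothetical.

```python
def downscale_fraction(nevents: int, divisor: int = 10, floor: int = 30000) -> float:
    """Evaluate min(max(nevents/divisor, floor), nevents) / nevents.

    Keeps one event in `divisor`, but never fewer than `floor` events,
    and never more than the events that exist.
    Assumption: nevents is the event count of one run (not defined on the slide).
    """
    if nevents <= 0:
        raise ValueError("nevents must be positive")
    kept = min(max(nevents / divisor, floor), nevents)
    return kept / nevents


for n in (20_000, 100_000, 1_000_000):
    print(f"nevents={n}: keep fraction {downscale_fraction(n):.2f}")
```

With these defaults the expression has three regimes: runs with up to 30000 events are kept in full, runs between 30000 and 300000 events keep a fixed 30000 events, and larger runs keep one event in ten.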