Valencia Cluster status Gang Qin Nov 25 2011

New Items
  • condor & proof monitoring
    • Service Availability Monitoring (SAM): every condor slave in the cluster receives a test job every hour; the results are merged into the web monitoring page, and an alarm mail is sent if any of them fails.
    • Similar idea for proof
    • SAM jobs have no priority
    • They add system load even when the system load is already quite high
  • NFS failing on some WNs
    • Some jobs will fail directly
    • Common problem with NFS, usually fixed via crond.
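
The hourly SAM check described above could be sketched roughly as follows. This is only an illustration in Python, not the cluster's actual monitoring code: the node names, the `results` mapping and the function names are hypothetical, and job submission and mail delivery are left out.

```python
from typing import Dict, List, Tuple


def summarize_sam_results(results: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Merge per-node test results into a summary line and a list of failed nodes.

    `results` maps a node name to True (test job succeeded) or False (failed).
    """
    failed = sorted(node for node, ok in results.items() if not ok)
    passed = len(results) - len(failed)
    summary = f"{passed}/{len(results)} nodes passed"
    return summary, failed


def alarm_needed(failed: List[str]) -> bool:
    """An alarm mail is warranted as soon as any node failed its test job."""
    return len(failed) > 0


# Hypothetical results for one hourly round (node names are made up).
example = {"wn01": True, "wn02": False, "wn03": True}
summary, failed = summarize_sam_results(example)
```

The summary line is what would go to the web monitoring page; the list of failed nodes is what an alarm mail would report.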

Items with improvement
  • condor upgrade on valtical cluster
    • condor-7.6.4-1.x86_64 has been installed on all machines in the valtical cluster and the twiki has been updated; users do not need any special environment setting to run condor commands.
    • Configuration files for the condor master and slaves are different; they are to be unified in a future scheduled maintenance.
  • Optimization of crontab to restart the xrootd & proofd services
    • Deployed to all machines in the valtical cluster.
  • High CPU overload (>100) on Valtical00 (NFS server)
    • Caused by xrootd: around 50% of the xrootd data is stored on this machine (12 TB)
    • Possible solutions
      • Data rebalancing between data servers, which means adding more disks to other WNs. This requires changing the chassis; Carlos has ordered one and it arrived today. Further tests will be organized.
      • File size regulation: currently the size of xrootd files in the cluster ranges from ~20 M to ~1 G. The general idea is that disk I/O will benefit from larger files; tests to be done.
      • Adding a RAID controller at the beginning? (not possible now)
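
The rebalancing idea can be illustrated with a small calculation. The sketch below assumes, purely for illustration, four data servers of equal capacity; only the 12 TB on one machine and the roughly 50% share come from the slide, while the server names and the other figures are hypothetical.

```python
from typing import Dict


def rebalance_plan(used_tb: Dict[str, float]) -> Dict[str, float]:
    """Return, per server, how many TB to move to reach an even spread.

    Negative values mean data should leave that server; positive values
    mean the server should receive data.
    """
    target = sum(used_tb.values()) / len(used_tb)
    return {server: round(target - used, 2) for server, used in used_tb.items()}


# Hypothetical layout: one server holds 12 TB, i.e. half of 24 TB in total.
usage = {"server_a": 12.0, "server_b": 4.0, "server_c": 4.0, "server_d": 4.0}
plan = rebalance_plan(usage)
```

An even spread is only one possible target; in practice the plan depends on how much disk the other WNs actually get.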

Load Balancing
  • Balance data importing and proof jobs
    • When importing data to the cluster with xrdcp, proof jobs become very slow or sometimes crash
    • Coordinate the data importing & proof job running times? Data importing before 9:00 and after 20:00?
    • Send mail to the mailing list when data importing starts and ends?
  • Load balance between Condor & Proof in the cluster
    • Prevent the condor daemon on a client from starting when the non-condor CPU load is > 0.3 (further tests needed)
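
The two rules proposed above (an import window outside 9:00 to 20:00, and a load threshold of 0.3 for starting condor) could be expressed as simple checks. This is a sketch under assumptions: the function names are hypothetical, the treatment of the exact boundary times is a guess, and how the non-condor load is measured is left open.

```python
from datetime import time


def import_allowed(now: time,
                   block_start: time = time(9, 0),
                   block_end: time = time(20, 0)) -> bool:
    """Data import is allowed before 9:00 and from 20:00 onwards."""
    return now < block_start or now >= block_end


def condor_may_start(non_condor_load: float, threshold: float = 0.3) -> bool:
    """The condor daemon may start only while the non-condor CPU load is not above the threshold."""
    return non_condor_load <= threshold
```

For example, `import_allowed(time(12, 0))` is false, while `import_allowed(time(21, 0))` is true.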

Pending Items
  • Evaluate filesystem migration from XRootd to EOS
    • To be done.
  • Find cause of regular IOwait problems in NFS share
    • The problem is not in the NFS service, but we can still do some NFS optimization
      • nfsd number adjustment: 8 is fine
      • Linux kernel optimization: no big improvement observed in a quick check; longer tests to be done.
    • Better use of NFS?
      • The disk I/O situation will be even worse when xrootd is accessing files on the NFS server.
      • Separate WN, NFS & UI with limited machines?
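
For the longer-term IOwait tests, one possible way to quantify the problem is to compare two samples of the aggregate `cpu` line from `/proc/stat` on Linux. The sketch below is an assumption about how such a check might be done, not the method used on the cluster; the sample lines are made up.

```python
from typing import List


def parse_cpu_line(line: str) -> List[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into a list of counters."""
    fields = line.split()
    if len(fields) < 6 or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line with at least 5 counters")
    return [int(x) for x in fields[1:]]


def iowait_fraction(before: List[int], after: List[int]) -> float:
    """Fraction of CPU time spent in iowait between two samples.

    Only the first eight counters (user, nice, system, idle, iowait, irq,
    softirq, steal) are used, so guest time is not counted twice.
    """
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return deltas[4] / total  # index 4 is the iowait counter


# Hypothetical samples taken some seconds apart.
s1 = parse_cpu_line("cpu  100 0 50 800 50 0 0 0")
s2 = parse_cpu_line("cpu  150 0 70 900 130 0 0 0")
fraction = iowait_fraction(s1, s2)
```

Logging this fraction at regular intervals would show whether the IOwait peaks coincide with xrootd activity on the NFS server.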

Finished old items
  • Revive valtical15 as SLC 5 workstation
    • Done; it now provides NFS service to the whole cluster (/data2, /data3, /data4)

Thank you
