Status and plans of central CERN Linux facilities

Thorsten Kleinwort, IT/FIO-FS
For PH/SFT Group
10.06.2005

Introduction
• 2 years ago: Post-C5 on migration from RH6 to RH7
• Now: Migration from RH7 to SLC3
• Achievements:
  • Scalability
  • Tools framework
  • Scope
• Conclusions & outlook

• Operating System
• Scalability
• Tools framework
• Scope

Operating System
• SLC3 new default platform:
  • LXPLUS fully migrated, new h/w; small rest on RH7, O(5)
  • LXBATCH 95% on SLC3
  • Rest to be migrated soon (even old h/w)
• Other clusters are migrated now as well:
  • LXGATE, LXBUILD, LXSERV, …
• Still some problems on special clusters with special hardware (disk, tape server)

Operating System
• Besides this 'main' OS, we have RHES:
  • RH ES 2 as well as RH ES 3
  • Needed for ORACLE
• Now supporting also other architectures:
  • ia64, {x86_64}
  • Needed for Service Challenge (CASTORGRID)
• No major problems, but:
  • Additional work to provide/maintain those
  • Minor differences, e.g. no AFS on ES, lilo on ia64


Scalability
• Already reached 1000 nodes with RH7
  • Automated node installation
• Now at 2200 Quattor-managed machines
  • Machines arrive in bunches, O(100)
  • Installed/stress-tested/moved
• Now, cluster management automated
  • Kernel upgrade on LXBATCH
  • Vault move/renumbering
  • Cluster upgrade to new version of OS
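Cluster-level operations such as the kernel upgrade on LXBATCH are naturally handled in bunches of machines rather than node by node. The following is a minimal sketch of that batching logic only. The node names and the bunch size of 100 are assumptions made for illustration, and nothing here reflects the actual Quattor tooling.

```python
from typing import Iterator, List


def bunches(nodes: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `nodes`, each with at most `size` entries."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(nodes), size):
        yield nodes[start:start + size]


# Hypothetical node names; real LXBATCH naming may differ.
cluster = [f"lxb{n:04d}" for n in range(1, 251)]

# Walk through the cluster in bunches of about 100 machines.
plan = list(bunches(cluster, 100))
for i, bunch in enumerate(plan, start=1):
    print(f"bunch {i}: {len(bunch)} nodes, {bunch[0]} .. {bunch[-1]}")
```

Working bunch by bunch keeps most of the cluster in service while one slice is being upgraded.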

[Figure: Cluster upgrade workflow]

Scalability
• Batch System LSF:
  • We are up to 50000 jobs in ~2500 slots
  • So far o.k., except for AFS copy -> NFS
• Infrastructure has to scale as well:
  • Power, cooling, space, network, …
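As a rough back-of-the-envelope reading of the LSF figures, the ratio of jobs to slots can be worked out directly. This assumes that the job count covers everything held by the batch system at once (pending plus running), which the slide does not state.

```python
jobs = 50_000   # "up to 50000 jobs" from the slide
slots = 2_500   # "~2500 slots" from the slide (approximate)

# Average number of jobs the batch system holds per available slot.
jobs_per_slot = jobs / slots
print(f"about {jobs_per_slot:.0f} jobs per slot")
```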


Tools framework
• We adapted EDG-WP4 tools for our needs
• With RH7 still hybrid with old tools (SUE, ASIS), now clean on SLC3
• Improved and strengthened them in ELFms:
  • Quattor, with SPMA and NCM configuration framework
  • CDB configuration database with SQL interface
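A SQL interface to the configuration database makes cluster-wide reports a matter of a single query. The sketch below illustrates the idea only: an in-memory SQLite database stands in for the real backend, and the table name, columns, and node names are all hypothetical, not the actual CDB schema.

```python
import sqlite3

# In-memory SQLite stands in for the real database; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (name TEXT PRIMARY KEY, cluster TEXT, os TEXT)")
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?)",
    [
        ("lxplus001", "lxplus", "slc3"),
        ("lxb0001", "lxbatch", "slc3"),
        ("lxb0002", "lxbatch", "rh7"),
    ],
)

# Count nodes per cluster and OS, the kind of report a SQL interface allows.
rows = conn.execute(
    "SELECT cluster, os, COUNT(*) FROM node "
    "GROUP BY cluster, os ORDER BY cluster, os"
).fetchall()
for cluster, os_name, count in rows:
    print(cluster, os_name, count)
```

A query of this shape would, for example, show how many machines of a cluster still run the old OS during a migration.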

[Figure: CDB: Web access tool]

Tools framework
• We adapted EDG-WP4 tools for our needs
• With RH7 still hybrid with old tools (SUE, ASIS), now clean on SLC3
• Improved and strengthened them in ELFms:
  • Quattor, with SPMA and NCM configuration framework
  • CDB configuration database with SQL interface
  • Lemon Monitoring, including web interface

[Figure: Lemon Start Page]

[Figure: Lemon: e.g. LXBATCH]

Tools framework
• We adapted EDG-WP4 tools for our needs
• With RH7 still hybrid with old tools (SUE, ASIS), now clean on SLC3
• Improved and strengthened them in ELFms:
  • Quattor, with SPMA and NCM configuration framework
  • CDB configuration database with SQL (r/o) interface
  • Lemon Monitoring, including web interface
  • LEAF, the SMS and HMS framework

[Figure: LEAF: CCTracker & HMS]

Tools framework
We rely on other tools/groups:
• All Linux versions come from Linux Support:
  • Need for new versions increases their workload, too
• AIMS: our boot/installation service
• LANDB: now SOAP interface instead of the web
Good collaboration
The scale increases the pressure for robust tools on their side as well


Scope
• Original scope for the framework was LXBATCH/LXPLUS
• Framework adapted to other clusters:
  1. LXGATE, LXBUILD:
     • Similar to LXPLUS/LXBATCH
  2. Disk Server, Tape Server:
     • Different h/w, larger variety, more special configuration
  3. Non-FIO cluster:
     • LXPARC
     • GM (EGEE): several clusters, used for tests and prototyping
     • GD (LCG test clusters)

Scope
• These new clusters:
  • Increase the scale even further
  • Enlarge the requirements for the tools, e.g.
    • New NCM components
    • New SMS/HMS states/workflows
    • Additional local users, …
  • Come with new OS requirements, e.g.
    • RH ES for ORACLE servers
    • ia64 support for new CASTORGRID machines
  • Proper testing for new s/w, OS, kernel has to be done on the cluster level
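New states and workflows for a state management system can be pictured as additions to a table of allowed transitions. The sketch below is illustrative only: the state names and the transition table are assumptions chosen for the example and are not taken from the SMS/HMS implementation.

```python
from enum import Enum


class State(Enum):
    # Hypothetical state names, chosen for illustration only.
    PRODUCTION = "production"
    STANDBY = "standby"
    MAINTENANCE = "maintenance"


# Hypothetical allowed transitions between states.
ALLOWED = {
    State.PRODUCTION: {State.STANDBY, State.MAINTENANCE},
    State.STANDBY: {State.PRODUCTION, State.MAINTENANCE},
    State.MAINTENANCE: {State.STANDBY},
}


def move(current: State, target: State) -> State:
    """Return `target` if the transition is allowed, else raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"{current.value} -> {target.value} not allowed")
    return target


state = State.PRODUCTION
state = move(state, State.MAINTENANCE)  # e.g. taken out for an OS upgrade
state = move(state, State.STANDBY)      # upgraded, waiting to be re-enabled
state = move(state, State.PRODUCTION)   # back in service
print(state.value)
```

In this picture, supporting a new cluster type means adding entries to the table rather than rewriting the workflow logic.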

Fabric Services as part of the GRID
• Additional LCG s/w was incorporated into our Framework
• All SLC3 LXBATCH nodes (>800 MHz) are WN
  • CERN-PROD biggest site, >1800 CPUs
• UI available on LXPLUS
• CE on LXGATE, 2 at the moment
• SE, cluster of 6 machines, running SRM and CASTORGRID
• All upgraded to LCG_2_4_0

[Figure: GOC Entry for CERN-PROD]

[Figure: GRID Monitoring]

[Figure: GRID Resource Infos]

Conclusions & outlook
• Not only migrated to one new OS
  • Next one: SLC4 or SLC5?
  • Tools are ready, no major problems foreseen
• We have overcome scalability issues
  • Prepared to go to LHC scale
• Tools: Gone from machine automation to cluster automation
  • Improve usability
  • Increase robustness
  • Decrease necessary expert level
• Scope: From LXBATCH/LXPLUS to many different clusters
  • How to manage non-FIO clusters?