The GRID and the Linux Farm at the RCF
HEPiX, Amsterdam, May 19-23, 2003
A. Chan, R. Hogue, C. Hollowell, O. Rind, J. Smith, T. Throwe, T. Wlodek, D. Yu
RHIC Computing Facility, Brookhaven National Laboratory

Outline
• Background
• Hardware
• Software
• Security
• GRID-like capabilities
• Near-term plans

Background
• Used for mass processing of RHIC data
• U.S. Tier 1 Center for ATLAS
• Listed as 3rd largest cluster at http://clusters.top500.org
• Currently staffed with 5 FTE

Growth of the Linux Farm

Hardware
• Built with commercially available Intel-based servers
• 1097 rack-mounted, dual-CPU servers
• 917,728 SPECint2000
• Reliable (0.0052 hardware failures per machine-month, about 6 failures/month)
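The monthly failure figure follows from simple arithmetic on the two numbers quoted on the slide. A minimal sketch (the function and constant names are illustrative, not from the talk):

```python
# Expected hardware failures per month, using only the figures quoted above.
SERVERS = 1097          # rack-mounted, dual-CPU servers
FAILURE_RATE = 0.0052   # hardware failures per machine-month


def expected_failures_per_month(servers: int, rate: float) -> float:
    """Expected number of hardware failures per month across the whole farm."""
    return servers * rate


if __name__ == "__main__":
    failures = expected_failures_per_month(SERVERS, FAILURE_RATE)
    print(f"{failures:.1f} failures/month")
```

The product is roughly 5.7, which the slide rounds to "about 6 failures/month".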

Breakdown of Hardware Failures

The Linux Farm in the RCF

The IBM servers

The VA Linux servers

Hardware (cont.)

Brand      CPU       RAM        Storage     Quantity
VA Linux   450 MHz   0.5-1 GB   9-120 GB    154
VA Linux   700 MHz   0.5 GB     9-36 GB     48
VA Linux   800 MHz   0.5-1 GB   18-480 GB   168
IBM        1.0 GHz   0.5-1 GB   18-144 GB   315
IBM        1.4 GHz   1 GB       36-144 GB   160
IBM        2.4 GHz   1 GB       240 GB      252
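As a consistency check, the quantities in the hardware table can be summed and compared with the 1097 servers quoted earlier in the talk. A short sketch (the data structure and function names are illustrative; the numbers are transcribed from the table):

```python
from collections import defaultdict

# (brand, CPU clock, quantity), transcribed from the hardware table.
INVENTORY = [
    ("VA Linux", "450 MHz", 154),
    ("VA Linux", "700 MHz", 48),
    ("VA Linux", "800 MHz", 168),
    ("IBM", "1.0 GHz", 315),
    ("IBM", "1.4 GHz", 160),
    ("IBM", "2.4 GHz", 252),
]


def total_servers(inventory) -> int:
    """Total number of servers across all models."""
    return sum(quantity for _, _, quantity in inventory)


def servers_by_brand(inventory) -> dict:
    """Number of servers per vendor."""
    counts = defaultdict(int)
    for brand, _, quantity in inventory:
        counts[brand] += quantity
    return dict(counts)


if __name__ == "__main__":
    print(total_servers(INVENTORY))
    print(servers_by_brand(INVENTORY))
```

The rows add up to 1097, matching the server count on the Hardware slide.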

Software
• Red Hat 7.2 (RHIC) and 7.3 (ATLAS)
• Image installed with Kickstart
• Support for an array of compilers (gcc, PGI, Intel) and debuggers (gdb, TotalView, Intel)
• Support for network file systems (AFS, NFS)

Software (cont.)
• Support for LSF and RCF-designed batch software
• Mix of open-source, RCF-built and vendor-supplied software to monitor and control hardware, software and infrastructure
• Scalability an important operational requirement

Linux Farm Monitoring

Batch Control & Monitoring

Linux Farm Usage

Remote Power Management

Infrastructure Monitoring

Security
• Firewall to minimize unauthorized access
• User access via SSH through security-enhanced gateway systems
• Most servers closed to direct external access
• Other security measures being developed (Kerberos 5?)

Security (cont.)

GRID-like capabilities
• Ganglia (monitoring & job scheduler)
• Condor (batch software)
• GLOBUS & LSF batch

Ganglia
• Open-source monitoring software (http://sourceforge.net/projects/ganglia)
• Can create federation of clusters
• Historical data information
• Can be used as job scheduler in GRID-like environment
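To illustrate how monitoring data could feed a scheduling decision, the sketch below parses a Ganglia-style XML report and picks the host with the lowest one-minute load. The sample document is hand-written and heavily abbreviated: it mimics the CLUSTER/HOST/METRIC layout that Ganglia's gmond daemon publishes, but the cluster name, host names, addresses and values are made up, and this is not how the RCF prototype necessarily works.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated sample in the style of a gmond XML report.
# Cluster name, host names, IPs and load values are hypothetical.
SAMPLE_XML = """<GANGLIA_XML SOURCE="gmond">
  <CLUSTER NAME="example-farm">
    <HOST NAME="node01" IP="10.0.0.1">
      <METRIC NAME="load_one" VAL="1.75" TYPE="float" UNITS=""/>
    </HOST>
    <HOST NAME="node02" IP="10.0.0.2">
      <METRIC NAME="load_one" VAL="0.10" TYPE="float" UNITS=""/>
    </HOST>
    <HOST NAME="node03" IP="10.0.0.3">
      <METRIC NAME="load_one" VAL="0.90" TYPE="float" UNITS=""/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def load_by_host(xml_text: str) -> dict:
    """Map each host name to its one-minute load average."""
    root = ET.fromstring(xml_text)
    loads = {}
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == "load_one":
                loads[host.get("NAME")] = float(metric.get("VAL"))
    return loads


def least_loaded(xml_text: str):
    """Return the host with the lowest load, or None if no load data exists."""
    loads = load_by_host(xml_text)
    if not loads:
        return None
    return min(loads, key=loads.get)


if __name__ == "__main__":
    print(least_loaded(SAMPLE_XML))
```

A real scheduler would weigh more than a single load metric (memory, disk, queue state); the point here is only that the monitoring data is machine-readable enough to drive placement.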

Ganglia (cont.)
• Web interface
• Prototype running for RHIC and ATLAS
• Scalability issues
• Downside: cannot (yet) restrict data access easily, not easily customized

Ganglia in the GRID

Ganglia at the RCF

Ganglia at the RCF (1)

Ganglia at the RCF (2)

Condor
• Open-source software (http://www.cs.wisc.edu/condor)
• Supported by Univ. of Wisconsin
• Full-feature batch software with job-queuing mechanism, scheduling policy, priority scheme, checkpoint capability, resource monitoring & management
• Can connect together multiple remote clusters
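For readers unfamiliar with Condor's job-queuing interface: jobs are described in a small submit description file that is handed to `condor_submit`. The sketch below generates such a file; the helper name, the script name `reco.sh` and its argument are hypothetical, and the choice of the vanilla universe is an assumption rather than something stated in the talk.

```python
def render_submit(executable: str, arguments: str = "", n_jobs: int = 1,
                  universe: str = "vanilla", log: str = "job.log") -> str:
    """Render a minimal Condor submit description for n_jobs identical jobs."""
    lines = [
        f"universe   = {universe}",
        f"executable = {executable}",
    ]
    if arguments:
        lines.append(f"arguments  = {arguments}")
    lines += [
        # $(Process) is expanded by Condor to 0, 1, 2, ... for each queued job.
        "output     = job.$(Process).out",
        "error      = job.$(Process).err",
        f"log        = {log}",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical reconstruction script and argument, for illustration only.
    print(render_submit("reco.sh", arguments="run42", n_jobs=3))
```

The resulting text would be saved to a file and passed to `condor_submit`; checkpointing and priority settings mentioned on the slide are configured separately and are not shown.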

Condor (cont.)
• Interface to GLOBUS
• Prototype for Linux Farm batch access in GRID-like environment
• Scalability: not yet tested in very large-scale environment?
• MDS-compatible in RCF environment?

Condor & the batch software

GLOBUS & LSF
• GLOBUS tools allow remote users to submit jobs on local cluster
• Prototypes at the RCF
• Gatekeeper acts as interface between GLOBUS and local batch system
• LSF job submitted from gatekeeper
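The gatekeeper's role can be pictured as a translation step: a generic job request arrives from the GRID side and is turned into a submission to the local batch system. The sketch below builds an LSF `bsub` command line from such a request. The `JobRequest` class, the queue names and the file names are hypothetical; the actual GLOBUS job manager is not reproduced here, and only the standard `bsub` options `-q`, `-o` and `-e` are used.

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class JobRequest:
    """Hypothetical, simplified view of a remote job request."""
    executable: str
    arguments: list = field(default_factory=list)
    queue: str = "normal"      # assumed queue name, not from the talk
    stdout: str = "job.out"
    stderr: str = "job.err"


def to_bsub_argv(job: JobRequest) -> list:
    """Translate a job request into an argument vector for LSF's bsub."""
    return ["bsub", "-q", job.queue, "-o", job.stdout, "-e", job.stderr,
            job.executable, *job.arguments]


def to_bsub_command(job: JobRequest) -> str:
    """Shell-quoted command line, suitable for logging or inspection."""
    return " ".join(shlex.quote(arg) for arg in to_bsub_argv(job))


if __name__ == "__main__":
    job = JobRequest("/bin/hostname", queue="grid")
    print(to_bsub_command(job))
```

On a real gatekeeper the argument vector would be executed (for example with `subprocess.run`) under the local account mapped to the remote user; that mapping and the authentication step are outside this sketch.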

The USATLAS GRID Testbed

STAR GRID Testbed

GLOBUS & LSF (cont.)

GLOBUS & LSF (cont.)

Near-term plans
• Deploy Ganglia on a wide-scale basis
• Roll out Condor as part of new batch software
• Upgrade to LSF v5.x (GRID-like features)
• Upgrade OS to Red Hat 8? Red Hat 9?
• Security issues