
RAL Tier 1/A Report
HepSysMan, July 2004
Martin Bly / Andrew Sansum


Overview
• Hardware
• Network
• Experiences / Challenges
• Management issues

1/2 July 2004 Martin Bly RAL Tier 1/A


Tier 1 in GRIDPP2 (2004-2007)
• The Tier-1 Centre will provide GRIDPP2 with a large computing resource of a scale and quality that can be categorised as an LCG Regional Computing Centre
• January 2004
  – GRIDPP2 confirms RAL as host for the Tier 1 service
  – GRIDPP2 to commence September 2004
• Tier 1 hardware budget
  – £2.3M over 3 years
• Staff
  – Increase from 12.1 to 16.5 by September


Current Tier 1 Hardware
• CPU
  – 350 dual-processor Intel PIII and Xeon servers, mainly rack mounts
  – About 400 kSI2K
  – Red Hat 7.3
  – P2/450 tower units decommissioned April 04
  – RH 7.2 and Solaris batch services to be phased out this year
• Disk Service: mainly “standard” configuration
  – Dual-processor server
  – Dual-channel SCSI interconnect
  – External IDE/SCSI RAID arrays (Accusys and Infortrend)
  – ATA drives (mainly Maxtor)
  – About 80 TB disk
  – Cheap and (fairly) cheerful
• Tape Service
  – STK Powderhorn 9310 silo with 8 x 9940B drives


New Hardware
• 256 x dual Xeon HT 2.8 GHz @ 533 MHz FSB: 8 racks
  – 2 GB RAM (32 with 4 GB RAM), 120 GB HDD, 1 Gb NIC
• 20 disk servers, each with two 4 TB IDE/SCSI arrays: 5 racks
  – Infortrend EonStor A16U-G1A units, each with 16 x WD 250 GB SATA HDD
  – 4 TB/array raw capacity
  – Servers: dual Xeon HT 2.8 GHz @ 533 MHz FSB, 2 GB RAM, dual 120 GB SATA system disks, dual 1 Gb/s NIC
  – 160 TB raw, ~140 TB available (RAID 5)
• Delivered June 15th, now running commissioning tests
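The disk capacity figures on this slide can be checked with a little arithmetic. The sketch below is illustrative only: the slide says RAID 5 but does not give the array layout, so the hot-spare count is an assumption. Under that assumption, one hot spare per array reproduces the quoted ~140 TB, while RAID 5 with no spare would give 150 TB.

```python
# Capacity check for the new disk servers (figures from the slide).
DRIVE_GB = 250          # WD 250 GB SATA drives
DRIVES_PER_ARRAY = 16   # drives per array
ARRAYS_PER_SERVER = 2   # two arrays per disk server
SERVERS = 20            # disk servers delivered


def raid5_usable_gb(drives, drive_gb, hot_spares=0):
    """Usable capacity of one RAID 5 set: one drive's worth of space goes to parity.

    hot_spares is an ASSUMPTION; the slide does not state the array layout.
    """
    data_drives = drives - hot_spares - 1
    if data_drives < 2:
        raise ValueError("RAID 5 needs at least three active drives")
    return data_drives * drive_gb


arrays = SERVERS * ARRAYS_PER_SERVER
raw_tb = arrays * DRIVES_PER_ARRAY * DRIVE_GB / 1000
usable_no_spare_tb = arrays * raid5_usable_gb(DRIVES_PER_ARRAY, DRIVE_GB, 0) / 1000
usable_one_spare_tb = arrays * raid5_usable_gb(DRIVES_PER_ARRAY, DRIVE_GB, 1) / 1000

print(f"raw: {raw_tb} TB")
print(f"RAID 5, no hot spare: {usable_no_spare_tb} TB")
print(f"RAID 5, one hot spare per array (assumed): {usable_one_spare_tb} TB")
```

Decimal units (1 TB = 1000 GB) are used throughout, matching the 4 TB/array raw figure on the slide.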



Next Procurement
• Need in production by January 2005
  – Original schedule of December delivery seems late
  – Will have to start very soon
  – Less chance for testing / new technology
• Exact proportions not agreed, but …
  – 400 kSI2K (300-400 CPUs)
  – 160 TB disk
  – 120 TB tape??
  – Network infrastructure?
  – Core servers (H/A??)
  – Red Hat?
• Long-range plan needs reviewing; also need long-range experiment requirements so as to plan environment updates.

CPU Capacity
[Chart not reproduced]

Tier 1 Disk Capacity (TB)
[Chart not reproduced]


High Impact Systems
• Looking at replacement hardware for high impact systems:
  – /home/csf, /rutherford file systems
  – MySQL servers
  – AFS cell
  – Front end / UI hosts
  – Data movers
  – NIS master, mail server
• Replacing mix of Solaris, Tru64 Unix and AIX servers with Linux: consolidation of expertise
• Migrate AFS to OpenAFS and then K5.


Network
[Diagram not reproduced. Labels: SuperJanet; test network (e.g. MBNG); firewall; site router; rest of site; site routable network; production VLAN and production subnet (servers, workers); test VLAN and test subnet (servers, workers).]


Network
[Diagram not reproduced. Labels: SuperJanet; test network (e.g. MBNG); firewall; router; rest of site; Tier 1 network; production VLAN (servers, workers); test VLAN (servers, workers).]


UKLight
• Connection to RAL in September
• Funded to end 2005, after which it probably merges with SuperJanet 5
• 2.5 Gb/s now, 10 Gb/s from 2006
• Effectively a dedicated light path to CERN
• Probably not for Tier 1 production, but suitable for LCG Data Challenges etc., building experience for the SuperJanet upgrade
• UKLight -> Starlight
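To put the two UKLight link speeds in context, the sketch below estimates ideal bulk transfer times. The 1 TB dataset size and the `efficiency` parameter are illustrative assumptions, not figures from the slides, and real transfers will be slower because of protocol overhead and disk limits.

```python
def transfer_seconds(data_tb, link_gbps, efficiency=1.0):
    """Ideal time to move data_tb terabytes over a link of link_gbps gigabits/s.

    efficiency is a hypothetical fraction of line rate actually achieved.
    Decimal units: 1 TB = 10**12 bytes, 1 Gb/s = 10**9 bits/s.
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    bits = data_tb * 8e12
    return bits / (link_gbps * 1e9 * efficiency)


# Illustrative dataset size (assumption): 1 TB at each quoted link speed.
t_now = transfer_seconds(1, 2.5)    # 2.5 Gb/s link
t_2006 = transfer_seconds(1, 10)    # 10 Gb/s link

print(f"1 TB at 2.5 Gb/s: {t_now / 60:.1f} min")
print(f"1 TB at 10 Gb/s:  {t_2006 / 60:.1f} min")
```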


Forthcoming Challenges
• Simplify service: less “duplication”
• Improve storage management
• Deploy new Fabric Management
• Red Hat Enterprise 3 upgrade
• Network upgrade/reconfigure??
• Another procurement/install
• Meet challenge of LCG: professionalism
• LCG Data Challenges
• …


Clean up Spaghetti Diagram
• How to phase out “Classic” service…
• Simplify interfaces: fewer Grids
“More is not always better”


Storage: Plus and Minus
• ATA and SATA drives: 2.5% failure per annum, OK
• External RAID arrays: good architecture, choose well
• SCSI interconnect: surprisingly unreliable, change
• Ext2 file system: OK, but need journal (XFS?)
• Linux O/S: move to Enterprise 3
• NFS/Xrootd/http/gridftp/bbftp/srb/…: must have SRM
• NO SAN: need SAN (Fibre or iSCSI …)
• No management layer: need virtualisation/dCache…??
• NO HSM


Benchmarking
• Work by George Prassas on various systems, including a 3ware/SATA RAID 5 system
• Tuning gains extra performance on RH variants
• Performance of RHEL 3 NFS servers and disk I/O is nothing special despite tuning, compared with RH 7.3
• Considering buying the SPEC suite to benchmark everything


Fabric Management
• Currently run:
  – Kickstart: cascading config files, implementing PXE
  – SURE: exception monitoring
  – Automate: automatic interventions
• Running out of steam with old systems …
  – “Only” 800 systems, but many, many flavours
  – Evaluating Quattor: no obvious alternatives, probably deploy
  – Less convinced by Lemon: bit early; running Nagios in parallel


Yum / Yumit
• Kickstart scripts now use Yum to bootstrap systems to latest updates
• Post-install config now uses Yum wherever possible for local additions
• Yumit:
  – Nodes use Yum to check their status every night and report to a central database
  – Web interface to show farm status
  – Easy to see which nodes need updating
• Machine ownership tagging, port monitoring project
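The nightly check-and-report idea behind Yumit can be sketched as follows. This is not Yumit's actual code or data format: the function names, the report fields and the sample text are all hypothetical, and the parser assumes simple three-column lines (`name.arch version repo`) in the style of `yum check-update` output, without handling wrapped lines.

```python
import json


def parse_check_update(output):
    """Parse `yum check-update`-style text into a list of package records.

    ASSUMPTION: each pending update is one line of three whitespace-separated
    columns, "name.arch version repo". Other lines are ignored.
    """
    updates = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 3 or "." not in fields[0]:
            continue  # header, blank or unrecognised line
        name, arch = fields[0].rsplit(".", 1)
        updates.append(
            {"name": name, "arch": arch, "version": fields[1], "repo": fields[2]}
        )
    return updates


def build_report(hostname, output):
    """Build the (hypothetical) record a node would send to the central database."""
    updates = parse_check_update(output)
    return {
        "host": hostname,
        "pending": len(updates),
        "needs_update": bool(updates),
        "packages": updates,
    }


# Made-up sample text for illustration; not captured from a real node.
SAMPLE = """
kernel.i686        2.4.21-15.EL    updates
openssh.i386       3.6.1p2-33      updates
"""

report = build_report("node001", SAMPLE)
print(json.dumps(report, indent=2))
```

On a real node the text would come from running `yum check-update`, and the resulting record would be sent to whatever central store the site uses; both steps are left out here because the slides do not describe them.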


Futures
• Storage Architectures
  – iSCSI, Fibre, dCache
  – Need to be more sophisticated to allow reallocation of available space
• CPUs
  – Xeon, Opteron, Itanium, Intel 64-bit x86 architecture
• Network
  – Higher speed interconnect, iSCSI


Conclusions
• After several years of relative stability, must start re-engineering many Tier 1 components.
• Must start to rationalise: support a limited set of interfaces, operating systems, testbeds … simplify so we can do less, better
• LCG becoming a big driver
  – Service commitments
  – Increase resilience and availability
  – Data challenges and move to steady state
• Major reality check in 2007!