Computing & Networking User Group Meeting
Roy Whitney, Andy Kowalski, Sandy Philpott, Chip Watson
17 June 2008
Users and JLab IT
• Ed Brash is the User Group Board of Directors' representative on the IT Steering Committee.
• Physics Computing Committee (Sandy Philpott)
• Helpdesk and CCPR requests and activities
• Challenges
  – Constrained budget
    • Staffing
    • Aging infrastructure
  – Cyber Security
Computing and Networking Infrastructure
Andy Kowalski
CNI Outline
• Helpdesk
• Computing
• Wide Area Network
• Cyber Security
• Networking and Asset Management
Helpdesk
• Hours: 8 am-12 pm, M-F
  – Submit a CCPR via http://cc.jlab.org/
  – Dial x7155
  – Send email to helpdesk@jlab.org
• Supported desktops: Windows XP, Vista, and RHEL 5
  – Migrating older desktops
• Mac support?
Computing
• Email servers upgraded
  – Dovecot IMAP server (indexing)
  – New file server and IMAP servers (farm nodes)
• Servers migrating to virtual machines
• Printing
  – Centralized access via jlabprt.jlab.org
  – Accounting coming soon
• Video conferencing (working on EVO)
Wide Area Network
• Bandwidth
  – 10 Gbps WAN and LAN backbone
  – Offsite data transfer servers (usage sketch below)
    • scigw.jlab.org (bbftp)
    • qcdgw.jlab.org (bbcp)
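For a sense of how these gateways are used, here is a minimal sketch of pushing a file offsite through qcdgw.jlab.org with bbcp. The host name comes from the slide; the flag values, user name, and paths are illustrative assumptions, not site-documented settings.

# Minimal sketch: push a file through the bbcp gateway listed above.
# -s sets the number of parallel TCP streams and -P the progress-report
# interval in seconds; the values and paths here are placeholders.
import subprocess

def push_with_bbcp(local_path, user, dest_path):
    cmd = ["bbcp", "-s", "8", "-P", "5",
           local_path, f"{user}@qcdgw.jlab.org:{dest_path}"]
    subprocess.run(cmd, check=True)

push_with_bbcp("run1234.dat", "someuser", "/scratch/incoming/")

A bbftp transfer through scigw.jlab.org follows the same pattern with bbftp's own command syntax.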
Cyber Security Challenge
• The threat: the sophistication and volume of attacks continue to increase.
  – Phishing attacks
    • Spear phishing/whaling are now being observed at JLab.
• Federal requirements, including DOE's, call for additional measures to meet the cyber security challenges.
• JLab uses a risk-based approach that balances achieving the mission with addressing the threat.
Cyber Security
• Managed desktops
  – Skype allowed from managed desktops on certain enclaves
• Network scanning
• Intrusion detection
• PII/SUI (CUI) management
Networking and IT Asset Management
• Network segmentation/enclaves
  – Firewalls
• Computer registration
  – https://reggie.jlab.org/user/index.php
• Managing IP addresses
  – DHCP
    • Assigns all IP addresses (most static)
    • Integrated with registration
• Automatic port configuration
  – Rolling out now
  – Uses the registration database
Scientific Computing
Chip Watson & Sandy Philpott
Farm Evolution Motivation
• Capacity upgrades
  – Re-use of HPC clusters
• Movement to open source
  – O/S upgrade
  – Change from LSF to PBS
Farm Evolution Timetable
Nov 07: Auger/PBS available – RHEL 3, 35 nodes
Jan 08: Fedora 8 (F8) available – 50 nodes
May 08: Friendly-user mode; IFARML4, 5
Jun 08: Production – F8 only; IFARML3 + 60 nodes from LSF; IFARML alias
Jul 08: IFARML2 + 60 nodes from LSF
Aug 08: IFARML1 + 60 nodes from LSF
Sep 08: RHEL 3/LSF -> F8/PBS migration complete – no renewal of LSF or RHEL for cluster nodes
Farm F8/PBS Differences
• Code must be recompiled
  – 2.6 kernel
  – gcc 4
• Software installed locally via yum
  – cernlib
  – MySQL
• Time limits: 1 day default, 3 days max
• stdout/stderr go to ~/farm_out
• Email notification (see the submission sketch below)
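As an illustration, submitting a job under the new conventions might look like the sketch below. The helper function, the wrapped command, and the email address are hypothetical; the walltime default, the ~/farm_out location, and the mail option mirror the bullets above.

# Sketch of a PBS farm-job submission following the conventions on
# this slide: 24-hour default walltime, stdout/stderr in ~/farm_out,
# and email notification. Illustrative only, not a JLab-provided tool.
import os
import subprocess

def submit(command, walltime="24:00:00", email="user@jlab.org"):
    farm_out = os.path.expanduser("~/farm_out")
    script = "\n".join([
        "#!/bin/sh",
        f"#PBS -l walltime={walltime}",  # 1 day default; 3 days max allowed
        f"#PBS -o {farm_out}/",          # stdout lands in ~/farm_out
        f"#PBS -e {farm_out}/",          # stderr lands in ~/farm_out
        "#PBS -m ae",                    # mail on abort and on exit
        f"#PBS -M {email}",
        command,
    ])
    # qsub accepts the job script on stdin and prints the job id.
    result = subprocess.run(["qsub"], input=script, text=True,
                            capture_output=True, check=True)
    return result.stdout.strip()

print(submit("echo hello from the F8 farm"))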
Farm Future Plans
• Additional nodes from HPC clusters
  – CY08: ~120 4g nodes
  – CY09-10: ~60 6n nodes
  – Purchased as budgets allow
• Support for 64-bit systems when feasible & needed
Storage Evolution
• Deployment of Sun X4500 "thumpers"
• Decommissioning of Panasas (old /work server)
• Planned replacement of old cache nodes
Tape Library
• Current STK "Powderhorn" silo is nearing end-of-life
  – Reaching capacity & running out of blank tapes
  – Doesn't support an upgrade to higher-density cartridges
  – Is officially end-of-life in December 2010
• Market trends (arithmetic check below)
  – The LTO (Linear Tape Open) standard has proliferated since 2000
  – LTO-4 offers 4x the density, capacity/$, and bandwidth of 9940B: 800 GB/tape, $100/TB, 120 MB/s
  – LTO-5, out next year, will double capacity and raise bandwidth 1.5x: 1600 GB/tape, 180 MB/s
  – LTO-6 will be out prior to the 12 GeV era: 3200 GB/tape, 270 MB/s
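A quick sanity check of the slide's comparison. The 9940B baseline of 200 GB/tape and 30 MB/s is the vendor spec the 4x figures imply, not a number on the slide, and the cartridge count anticipates the 2 PB target on the next slide.

# Back-of-the-envelope arithmetic for the LTO-4 vs. 9940B comparison.
lto4_gb, lto4_mbps   = 800, 120   # from the slide
b9940_gb, b9940_mbps = 200, 30    # assumed 9940B native specs

print(lto4_gb / b9940_gb)      # 4.0 -> "4x density", as stated
print(lto4_mbps / b9940_mbps)  # 4.0 -> "4x bandwidth", as stated

# LTO-4 cartridges needed for a 2 PB library (1 PB = 1e6 GB):
print(2 * 1e6 / lto4_gb)       # 2500 tapes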
Tape Library Replacement
• Competitive procurement now in progress
  – Replace the old system; support 10x growth over 5 years
• Phase 1 in August
  – System integration, software evolution
  – Begin data transfers, re-use 9940B tapes
• Tape swap through January
• 2 PB capacity by November
• DAQ to LTO-4 in January 2009
• Old silo gone in March 2009
End result: break even on cost by the end of 2009!
Long Term Planning
• Continue to increase compute & storage capacity in the most cost-effective manner
• Improve processes & planning
  – PAC submission process
  – 12 GeV planning…
E.g.: Hall B Requirements

Event Simulation                    2012      2013      2014      2015      2016
SPECint_rate2006 sec/event           1.8       1.8       1.8       1.8       1.8
Number of events                 1.00E+12  1.00E+12  1.00E+12  1.00E+12  1.00E+12
Event size (KB)                       20        20        20        20        20
% Stored Long Term                   10%       25%       25%       25%       25%
Total CPU (SPECint_rate2006)     5.7E+04   5.7E+04   5.7E+04   5.7E+04   5.7E+04
Petabytes / year (PB)                  2         5         5         5         5

Data Acquisition                    2012      2013      2014      2015      2016
Average event size (KB)               20        20        20        20        20
Max sustained event rate (kHz)         0        10        10        10        20
Average event rate (kHz)               0        10        10        10        10
Average 24-hour duty factor (%)        0%       50%       50%       60%       65%
Weeks of operation / year              0         0         0        30        30
Network (n * 10 gigE)                  1         1         1         1         1
Petabytes / year                     0.0       0.0       0.0       2.2       2.4

1st Pass Analysis                   2012      2013      2014      2015      2016
SPECint_rate2006 sec/event           1.5       1.5       1.5       1.5       1.5
Number of analysis passes              0         2         2         2         2
Event size out / event size in         2         2         2         2         2
Total CPU (SPECint_rate2006)     0.0E+00   0.0E+00   0.0E+00   7.8E+03   8.4E+03
Silo Bandwidth (MB/s)                  0       900       900       900      1800
Petabytes / year                     0.0       0.0       0.0       4.4       4.7

Totals                              2012      2013      2014      2015      2016
Total CPU (SPECint_rate2006)     5.7E+04   5.7E+04   5.7E+04   5.7E+04   5.7E+04
SPECint_rate2006 / node              600       900      1350      2025      3038
# nodes needed (current year)         95        63        42        28        19
Petabytes / year                       2         5         5        12        12
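The key entries in this table can be reproduced from one another, which makes a useful sanity check. The sketch below redoes the arithmetic; the formulas are the obvious ones implied by the rows, my reading of the table rather than anything stated on the slide.

# Cross-checks of the Hall B requirements table.
SEC_PER_WEEK = 7 * 24 * 3600

def daq_petabytes(event_kb, rate_khz, duty, weeks):
    """Raw data volume = event size x avg rate x duty factor x beam time."""
    bytes_per_year = event_kb * 1e3 * rate_khz * 1e3 * duty * weeks * SEC_PER_WEEK
    return bytes_per_year / 1e15

# 2015 DAQ column: 20 KB events, 10 kHz average, 60% duty, 30 weeks.
print(round(daq_petabytes(20, 10, 0.60, 30), 1))  # 2.2 PB, matching the table

# Simulation CPU: 1e12 events/year at 1.8 SPECint_rate2006-sec each,
# averaged over a calendar year.
print(1e12 * 1.8 / (365 * 24 * 3600))             # ~5.7e4, matching Total CPU

# Nodes needed = Total CPU / per-node rating (2012 column).
print(round(5.7e4 / 600))                         # 95 nodes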
LQCD Computing
• JLab operates 3 clusters with nearly 1100 nodes, primarily for LQCD plus some accelerator modeling
• National LQCD Computing Project (2006-2009: BNL, FNAL, JLab; USQCD Collaboration)
• The LQCD II proposal (2010-2014) would double the hardware budget to enable key calculations
• JLab Experimental Physics and LQCD computing share staff (operations & software development) and the tape silo, providing efficiencies for both