Dolphin Wulfkit and Scali software: The Supercomputer Interconnect

Dolphin Wulfkit and Scali software: The Supercomputer Interconnect
Amal D'Silva (amal@summnet.com)
Summation Enterprises Pvt. Ltd. Preferred System Integrators since 1991.

Agenda
• Dolphin Wulfkit hardware
• Scali Software / some commercial benchmarks
• Summation profile

Interconnect Technologies
Application areas: WAN, LAN, I/O, Memory, Cache, Processor
Application requirements: distance, bandwidth, latency
[Chart: design space for different technologies, ranging from Ethernet, ATM and Fibre Channel through SCSI, Myrinet/cLAN and proprietary busses to Dolphin SCI in the bus/cluster-interconnect region; numeric axis values not recoverable]

Interconnect impact on cluster performance: some real-world examples from the Top 500 May 2004 list
• Intel, Bangalore cluster: 574 Xeon 2.4 GHz CPUs, GigE interconnect; Rpeak: 2755 GFLOPS, Rmax: 1196 GFLOPS, efficiency: 43%
• Kabru, IMSc, Chennai: 288 Xeon 2.4 GHz CPUs, Wulfkit 3D interconnect; Rpeak: 1382 GFLOPS, Rmax: 1002 GFLOPS, efficiency: 72%
Simply put, Kabru gives 84% of the performance with HALF the number of CPUs!
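These efficiency figures are simply the ratio of sustained to peak performance; a quick check using only the numbers quoted on this slide:

```latex
\text{Efficiency} = \frac{R_{\text{max}}}{R_{\text{peak}}}:\qquad
\frac{1196}{2755} \approx 43\%\ \text{(Intel, Bangalore)},\qquad
\frac{1002}{1382} \approx 72\%\ \text{(Kabru)}
```

The headline claim is the ratio of the two sustained figures: 1002/1196 ≈ 84% of the performance, with 288/574 ≈ half the CPUs.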

Commodity interconnect limitations
• Cluster performance depends primarily on two factors: bandwidth and latency
• Gigabit Ethernet: speed limited to 1000 Mbps (approx. 80 MB/s in the real world). This is fixed irrespective of processor power
• With increasing processor speeds, latency (the time taken to move data from one node to another) plays an increasing role in cluster performance
• Gigabit typically gives an internode latency of 120 to 150 microseconds. As a result, CPUs in a node are often idle, waiting for data from another node
• In any switch-based architecture, the switch becomes a single point of failure. If the switch goes down, so does the cluster.
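Bandwidth and latency figures like these are normally measured with an MPI ping-pong test (the same kind of comparison shown later for SCI vs. Myrinet): rank 0 bounces a message off rank 1, half the round-trip time gives the one-way latency for small messages, and large messages show the sustained bandwidth. A minimal sketch, assuming any MPI implementation; the message size and repetition count are arbitrary illustrative choices, not values from the slides:

```c
/* pingpong.c: one-way latency / bandwidth between MPI ranks 0 and 1.
 * Build with e.g. `mpicc pingpong.c -o pingpong` and run one rank per node. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int reps = 1000;            /* arbitrary repetition count */
    const int nbytes = 1 << 20;       /* 1 MiB for bandwidth; use ~1 byte for latency */
    char *buf = malloc(nbytes);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double oneway = (MPI_Wtime() - t0) / reps / 2.0;   /* seconds, one direction */

    if (rank == 0)
        printf("one-way time: %.2f us, bandwidth: %.1f MB/s\n",
               oneway * 1e6, nbytes / oneway / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Run it with small messages to see the latency gap the slides describe, and with megabyte messages to see the bandwidth ceiling of the interconnect.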

Dolphin Wulfkit advantages
• Internode bandwidth: 260 MB/s on Xeon (over three times faster than Gigabit)
• Latency: under 5 microseconds (over TWENTY FIVE times quicker than Gigabit)
• Matrix-type internode connections: no switch, hence no single point of failure
• Cards can be moved across processor generations, which protects the investment

Dolphin Wulfkit advantages (contd.)
• Linear scalability: e.g. adding 8 nodes to a 16-node cluster involves known, fixed costs: eight nodes and eight Dolphin SCI cards. With any switch-based architecture there are additional issues, such as "unused ports" on the switch, to consider; for Gigabit, one has to "throw away" the 16-port switch and buy a 32-port switch
• Real-world performance on par with or better than proprietary interconnects like Memory Channel (HP) and NUMAlink (SGI), at cost-effective price points

Wulfkit: The Supercomputer Interconnect
• Wulfkit is based on the Scalable Coherent Interface (SCI); the ANSI/IEEE 1596-1992 standard defines a point-to-point interface and a set of packet protocols
• Wulfkit is not a networking technology but a purpose-designed cluster interconnect
• The SCI interface has two unidirectional links that operate concurrently
• Bus-imitating protocol with packet-based handshakes and guaranteed data delivery
• Up to 667 MB/s internode bandwidth

PCI-SCI Adapter Card: 1 slot, 2 dimensions
• SCI adapters (64-bit, 66 MHz)
  - PCI/SCI adapter (D335)
  - D330 card with LC3 daughter card
  - Supports 2 SCI ring connections
  - Switching over B-Link
  - Used for WulfKit 2D clusters
  - PCI 64/66
  - D339: 2-slot version
[Diagram: 2D adapter card with a PSB PCI bridge and two LC link controllers, each with an SCI connection]

System Interconnect
High-performance interconnect:
• Torus topology
• IEEE/ANSI std. 1596 SCI
• 667 MB/s per segment per ring
• Shared address space
Maintenance and LAN interconnect:
• 100 Mbit/s Ethernet (out-of-band monitoring)

System Architecture
[Diagram: a remote workstation runs the GUI, connected via a TCP/IP socket to the server daemon on the control node (frontend); the control node communicates with the node daemons of a 4 x 4 2D torus SCI cluster]

3D Torus topology (for more than 64 to 72 nodes)
[Diagram: 3D adapter card components: PCI, PSB66, LC-3]

Linköping University - NSC - SCI Clusters (also in Sweden: Umeå University, 120 Athlon nodes)
• Monolith: 200 nodes, 2x Xeon 2.2 GHz, 3D SCI
• INGVAR: 32 nodes, 2x AMD 900 MHz, 2D SCI
• Otto: 48 nodes, 2x P4 2.26 GHz, 2D SCI
• Commercial cluster under installation: 40 nodes, 2x Xeon, 2D SCI
• Total: 320 SCI nodes

MPI Connect middleware and MPI Manage cluster setup/management tools (http://www.scali.com)

Scali Software Platform
• Scali MPI Manage: cluster installation/management
• Scali MPI Connect: high-performance MPI libraries

Scali MPI Connect
• Fault tolerant
• High bandwidth
• Low latency
• Multi-thread safe
• Simultaneous inter-/intra-node operation
• UNIX command line replicated
• Exact message size option
• Manual/debugger mode for selected processes
• Explicit host specification
• Job queuing: PBS, DQS, LSF, CCS, NQS, Maui
• Conformance to MPI-1.2 verified through 1665 MPI tests

Scali MPI Manage features
• System installation and configuration
• System administration
• System monitoring
• Alarms and event automation
• Workload management
• Hardware management
• Heterogeneous cluster support

Fault Tolerance
• A 2D torus topology offers more routing options
• With the plain XY routing algorithm, if node 33 fails, the nodes on node 33's ringlets become unavailable: the cluster is fractured with the current routing setting
[Diagram: 4x4 grid of nodes 11-44 with node 33 failed and its row and column ringlets cut off]

Fault Tolerance
• Scali advanced routing algorithm: from the Turn Model family of routing algorithms
• All nodes but the failed one can be utilised as one big partition
[Diagram: the same 4x4 torus with node 33 failed; all remaining nodes stay reachable as a single partition]
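To make the failure scenario concrete, here is a small sketch of plain dimension-order (XY) routing on a 2D torus: a packet first travels along the X ring to the destination column, then along the Y ring. Any route whose intermediate position passes through the failed node is lost, which is why node 33's ringlets fracture the cluster; the turn-model routing on the slide above relaxes the fixed X-then-Y order so traffic can route around the fault. This is an illustrative model only, not Scali's actual algorithm; the grid size and node numbering follow the slides, with 0-based coordinates (node "33" maps to (2,2)).

```c
/* xy_route.c: illustrative dimension-order (XY) routing on a 4x4 2D torus.
 * Counts how many healthy node pairs lose their route when one node fails. */
#include <stdio.h>

#define N 4                      /* torus is N x N */
#define FAILED_X 2
#define FAILED_Y 2               /* node "33" on the slides, 0-based (2,2) */

/* one wrap-around step from 'from' towards 'to' along a ring of size N */
static int ring_step(int from, int to)
{
    int fwd = (to - from + N) % N;           /* hops in the increasing direction */
    return (fwd <= N - fwd) ? (from + 1) % N : (from - 1 + N) % N;
}

/* returns 1 if the XY route from (sx,sy) to (dx,dy) avoids the failed node */
static int xy_route_ok(int sx, int sy, int dx, int dy)
{
    int x = sx, y = sy;
    while (x != dx) {                        /* phase 1: route in X */
        x = ring_step(x, dx);
        if (x == FAILED_X && y == FAILED_Y) return 0;
    }
    while (y != dy) {                        /* phase 2: route in Y */
        y = ring_step(y, dy);
        if (x == FAILED_X && y == FAILED_Y) return 0;
    }
    return 1;
}

int main(void)
{
    int broken = 0, total = 0;
    for (int sx = 0; sx < N; sx++) for (int sy = 0; sy < N; sy++)
        for (int dx = 0; dx < N; dx++) for (int dy = 0; dy < N; dy++) {
            if ((sx == FAILED_X && sy == FAILED_Y) ||
                (dx == FAILED_X && dy == FAILED_Y) ||
                (sx == dx && sy == dy))
                continue;                    /* skip the failed node and self-routes */
            total++;
            if (!xy_route_ok(sx, sy, dx, dy)) broken++;
        }
    printf("%d of %d healthy node pairs lose their XY route when one node fails\n",
           broken, total);
    return 0;
}
```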

Scali MPI Manage GUI [screenshot]

Monitoring (contd.) [screenshot]

System Monitoring
• Resource monitoring: CPU, memory, disk
• Hardware monitoring: temperature, fan speed
• Operator alarms on selected parameters at specified thresholds

SCI vs. Myrinet 2000: Ping-Pong comparison

Itanium vs. Cray T3E: Bandwidth

Itanium vs. Cray T3E: Latency

Some Reference Customers: Spacetec/Tromsø Satellite Station, Norway; Norwegian Defense Research Establishment; Parallab, Norway; Paderborn Parallel Computing Center, Germany; Fujitsu Siemens Computers, Germany; Spacebel, Belgium; Aerospatiale, France; Fraunhofer Gesellschaft, Germany; Lockheed Martin TDS, USA; University of Geneva, Switzerland; University of Oslo, Norway; Uni-C, Denmark; Paderborn Parallel Computing Center; University of Lund, Sweden; University of Aachen, Germany; DNV, Norway; DaimlerChrysler, Germany; AEA Technology, Germany; BMW AG, Germany; Audi AG, Germany; University of New Mexico, USA; Max Planck Institute für Plasmaphysik, Germany; University of Alberta, Canada; University of Manitoba, Canada; Etnus Software, USA; Oracle Inc., USA; University of Florida, USA; deCODE Genetics, Iceland; Uni-Heidelberg, Germany; GMD, Germany; Uni-Giessen, Germany; Uni-Hannover, Germany; Uni-Düsseldorf, Germany; Linux NetworX, USA; Magmasoft AG, Germany; University of Umeå, Sweden; University of Linköping, Sweden; PGS Inc., USA; US Naval Air, USA

Some more Reference Customers: Rolls Royce Ltd., UK; Norsk Hydro, Norway; NGU, Norway; University of Santa Cruz, USA; Jodrell Bank Observatory, UK; NTT, Japan; CEA, France; Ford/Visteon, Germany; ABB AG, Germany; National Technical University of Athens, Greece; Medasys Digital Systems, France; PDG Linagora S.A., France; Workstations UK, Ltd., England; Bull S.A., France; The Norwegian Meteorological Institute, Norway; Nanco Data AB, Sweden; Aspen Systems Inc., USA; Atipa Linux Solution Inc., USA; California Institute of Technology, USA; Compaq Computer Corporation Inc., USA; Fermilab, USA; Ford Motor Company Inc., USA; General Dynamics Inc., USA; Intel Corporation Inc., USA; Iowa State University, USA; Los Alamos National Laboratory, USA; Penguin Computing Inc., USA; Times N Systems Inc., USA; University of Alberta, Canada; Manash University, Australia; University of Southern Mississippi, Australia; Jacusiel Acuna Ltda., Chile; University of Copenhagen, Denmark; Caton Sistemas Alternativos, Spain; Mapcon Geografical Inform, Sweden; Fujitsu Software Corporation, USA; City Team OY, Finland; Falcon Computers, Finland; Link Masters Ltd., Holland; MIT, USA; Paralogic Inc., USA; Sandia National Laboratory, USA; Sicorp Inc., USA; University of Delaware, USA; Western Scientific Inc., USA; Group of Parallel and Distr. Processing, Brazil

Application Benchmarks with Dolphin SCI and Scali MPI

NAS parallel benchmarks (16 CPUs/8 nodes)

Magma (16 CPUs/8 nodes)

Eclipse (16 CPUs/8 nodes)

FEKO: Parallel Speedup

Acusolve (16 CPUs/8 nodes)

Visage (16 CPUs/8 nodes)

CFD scaling, MM5: linear to 400 CPUs

Scaling: Fluent on the Linköping cluster

Dolphin Software
• All Dolphin software is free open source (GPL or LGPL)
• SISCI
• SCI-SOCKET
  - Low-latency socket library
  - TCP and UDP replacement
  - User- and kernel-level support
  - Release 2.0 available
• SCI-MPICH (RWTH Aachen)
  - MPICH 1.2 and some MPICH 2 features
  - New release is being prepared; beta available
• SCI Interconnect Manager
  - Automatic failover recovery
  - No single point of failure in 2D and 3D networks
• Other: SCI Reflective Memory, Scali MPI, Linux Labs SCI Cluster (Cray-compatible shmem) and Clugres PostgreSQL, MandrakeSoft Clustering HPC solution, Xprime X1 Database Performance Cluster for Microsoft SQL Servers, ClusterFrame from Qlusters, SunCluster 3.1 (Oracle 9i), MySQL Cluster
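Since SCI-SOCKET is described above as a TCP/UDP replacement with user- and kernel-level support, the relevant programming model is ordinary Berkeley sockets: an application written against the standard socket API, like the minimal TCP client below, is the kind of code expected to run unmodified over it. This is plain POSIX socket code, not an SCI-specific API; the host name and port are placeholders.

```c
/* tcp_client.c: ordinary Berkeley-sockets TCP client.
 * Nothing here is SCI-specific; SCI-SOCKET is positioned as a drop-in
 * TCP/UDP replacement, so unmodified code like this is the target. */
#include <stdio.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

int main(void)
{
    struct addrinfo hints = {0}, *res;
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;

    /* "node01" and port 5000 are placeholders for a peer in the cluster */
    int err = getaddrinfo("node01", "5000", &hints, &res);
    if (err != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(err));
        return 1;
    }

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0) {
        perror("connect");
        return 1;
    }

    const char msg[] = "hello over the cluster interconnect\n";
    write(fd, msg, sizeof msg - 1);          /* ordinary socket I/O */

    close(fd);
    freeaddrinfo(res);
    return 0;
}
```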

Summation Enterprises Pvt. Ltd. Brief Company Profile

• Our expertise: clustering for High Performance Technical Computing, clustering for High Availability, terabyte storage solutions, SANs
• O.S. skills: Linux (Alpha 64-bit; x86: 32- and 64-bit), Solaris (SPARC and x86), Tru64 UNIX, Windows NT/2K/2003 and the QNX real-time O.S.

Summation milestones
• Working with Linux since 1996
• First in India to deploy/support 64-bit Alpha Linux workstations (1999)
• First in India to spec, deploy and support a 26-processor Alpha Linux cluster (2001)
• Only company in India to have worked with Gigabit, SCI and Myrinet interconnects
• Involved in the design, setup and support of many of the largest HPTC clusters in India

Exclusive Distributors / System Integrators in India
• Dolphin Interconnect AS, Norway: SCI interconnect for supercomputer performance
• Scali AS, Norway: cluster management tools
• Absoft, Inc., USA: FORTRAN development tools
• Steeleye Inc., USA: High Availability clustering and disaster recovery solutions for Windows & Linux. Summation is the sole distributor, consulting services & technical support partner for Steeleye in India

Partnering with industry leaders
• Sun Microsystems, Inc.
  - Focus on Education & Research segments
  - High Performance Technical Computing; Grid Computing initiative with Sun Grid Engine (SGE/SGEE)
  - HPTC Competency Centre

Wulfkit / HPTC users
• Institute of Mathematical Sciences, Chennai
  - 144-node dual-Xeon Wulfkit 3D cluster
  - 9-node dual-Xeon Wulfkit 2D cluster
  - 9-node dual-Xeon Ethernet cluster
  - 1.4 TB RAID storage
• Bhabha Atomic Research Centre, Mumbai
  - 64-node dual-Xeon Wulfkit 2D cluster
  - 40-node P4 Wulfkit 3D cluster
  - Alpha servers / Linux OpenGL workstations / rackmount servers
• Harish Chandra Research Institute, Allahabad
  - 42-node dual-Xeon Wulfkit cluster
  - 1.1 TB RAID storage

Wulfkit / HPTC users (contd.)
• Intel Technology India Pvt. Ltd., Bangalore: eight-node dual-Xeon Wulfkit clusters (ten nos.)
• NCRA (TIFR), Pune: 4-node Wulfkit 2D cluster
• Bharat Forge Ltd., Pune: nine-node dual-Xeon Wulfkit 2D cluster
• Indian Rare Earths Ltd., Mumbai: 26-processor Alpha Linux cluster with RAID storage
• Tata Institute of Fundamental Research, Mumbai: RISC/UNIX servers, four-node Xeon cluster
• Centre for Advanced Technology, Indore: Alpha/Sun workstations

Questions? Amal D'Silva, email: amal@summnet.com, GSM: 98202 83309