Case study • IBM Blue Gene/L system • InfiniBand Interconnect

Interconnect family share in the Top 500 supercomputers, 06/2011 and 11/2012:

  06/2011                          11/2012
  Gigabit Ethernet      232        InfiniBand            224
  InfiniBand            206        Gigabit Ethernet      189
  Proprietary Network    29        Custom interconnect    53
  Custom interconnect    23        Cray interconnect      15
  Myrinet                 4        Proprietary Network    15
  NUMAlink                2        Myrinet                 3
Overview of the IBM Blue Gene/L System Architecture
• Design objectives
• Hardware overview
  – System architecture
  – Node architecture
  – Interconnect architecture
Highlights
• A 64K-node highly integrated supercomputer based on system-on-a-chip technology
  – Two ASICs: Blue Gene/L Compute (BLC) and Blue Gene/L Link (BLL)
• Distributed-memory, massively parallel processing (MPP) architecture
• Uses the message passing programming model (MPI); see the sketch below
• 360 Tflops peak performance
• Optimized for cost/performance
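As a concrete illustration of the message passing model (ordinary MPI in C, not BG/L-specific code), a minimal two-rank exchange:

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal MPI point-to-point example: rank 0 sends an integer to
 * rank 1.  On a distributed-memory machine like BG/L this is how
 * ranks share data -- there is no shared address space. */
int main(int argc, char **argv)
{
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```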
Design objectives
• Objective 1: 360-Tflops supercomputer
  – Earth Simulator (Japan, fastest supercomputer from 2002 to 2004): 35.86 Tflops
• Objective 2: power efficiency
  – Performance/rack = performance/watt * watt/rack
    • Watt/rack is roughly constant at around 20 kW
    • Performance/watt therefore determines performance/rack
• Power efficiency:
  – 360 Tflops => 20 megawatts with conventional processors
  – Need a low-power processor design (2-10 times better power efficiency)
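To make the rack equation concrete, a worked example with illustrative numbers (the 20 kW/rack figure is from the slide; the 0.28 Gflops/W efficiency is an assumed value chosen to land on the machine's scale):

  0.28 Gflops/W * 20 kW ≈ 5.6 Tflops/rack, so about 64 racks reach 360 Tflops.
  At conventional efficiency (360 Tflops / 20 MW = 18 Mflops/W), the same
  20 kW rack yields only 0.36 Tflops, i.e. about 1000 racks.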
Design objectives (continued)
• Objective 3: extreme scalability
  – Optimizing for cost/performance means using low-power, less powerful processors, which in turn means needing a lot of processors
    • Up to 65536 processors
  – Interconnect scalability
Blue Gene/L system components
Blue Gene/L Compute ASIC
• 2 PowerPC 440 cores with floating-point enhancements
  – 700 MHz
  – Everything of a typical superscalar processor
    • Pipelined microarchitecture with dual instruction fetch, decode, and out-of-order issue, dispatch, execution, and completion, etc.
  – 1 W each, through extensive power management
Blue Gene/L Compute ASIC
Memory system on a BG/L node
• BG/L only supports the distributed memory paradigm
• No need for efficient hardware support for cache coherence on each node
  – Coherence enforced by software if needed
• The two cores operate in two modes:
  – Communication coprocessor mode
    • Needs coherence, managed in system-level libraries
  – Virtual node mode
    • Memory is physically partitioned (not shared)
Blue Gene/L networks
• Five networks
  – 100 Mbps Ethernet control network for diagnostics, debugging, and some other things
  – 1000 Mbps Ethernet for I/O
  – Three high-bandwidth, low-latency networks for data transmission and synchronization:
    • 3-D torus network for point-to-point communication
    • Collective network for global operations
    • Barrier network
• All network logic is integrated in the BG/L node ASIC
  – Memory-mapped interfaces from user space
3-D torus network
• Supports point-to-point (p2p) communication
• Link bandwidth 1.4 Gb/s; 6 bidirectional links per node (1.2 GB/s)
• 64 x 32 x 32 torus: diameter 32+16+16 = 64 hops, worst-case hardware latency 6.4 us
• Cut-through routing
• Adaptive routing
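The diameter follows from the torus shape: the worst-case distance in each dimension is half that dimension's size, since a packet can travel either way around the ring. Dividing the quoted worst-case latency by the hop count gives the implied per-hop cost (our inference, not a separately quoted figure):

  64/2 + 32/2 + 32/2 = 32 + 16 + 16 = 64 hops
  6.4 us / 64 hops = 100 ns per hop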
Collective network
• Binary tree topology, static routing
• Link bandwidth: 2.8 Gb/s
• Maximum hardware latency: 5 us
• With arithmetic and logic hardware: can perform integer operations on the data
  – Efficient support for reduce, scan, global sum, and broadcast operations (see the sketch below)
  – Floating-point operations can be done with 2 passes
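The global operations this network accelerates are exactly the MPI collectives. A sketch of a global integer sum in standard MPI; on BG/L the mapping onto the collective network happens inside the MPI library:

```c
#include <mpi.h>
#include <stdio.h>

/* Global sum across all ranks: an integer reduction of the kind
 * the collective network's ALU hardware can execute directly,
 * instead of emulating it with point-to-point messages. */
int main(int argc, char **argv)
{
    int rank, local, global;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local = rank + 1;                       /* each rank contributes */
    MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %d\n", global);
    MPI_Finalize();
    return 0;
}
```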
Barrier network
• Hardware support for global synchronization
• 1.5 us for a barrier on 64K nodes
IBM Blue Gene/L summary
• Optimized for cost/performance, at the cost of limiting the applications
  – Uses a low-power design
    • Lower frequency, system-on-a-chip
    • Great performance-per-watt metric
• Scalability support
  – Hardware support for global communication and barriers
  – Low-latency, high-bandwidth support
• Case 2: InfiniBand architecture
  – Specification (InfiniBand architecture specification release 1.2.1, January 2008/Oct. 2006) available from the InfiniBand Trade Association (http://www.infinibandta.org)
• InfiniBand architecture overview
• InfiniBand architecture overview
  – Components:
    • Links
    • Channel adapters
    • Switches
    • Routers
  – The specification allows an InfiniBand wide area network, but it is mostly adopted as a system/storage area network
  – Topology:
    • Irregular
    • Regular: fat tree, hypercube, torus, etc.
  – Link speed:
    • Single data rate (SDR): 2.5 Gbps (1X), 10 Gbps (4X), and 30 Gbps (12X)
    • Double data rate (DDR): 5 Gbps (1X), 20 Gbps (4X)
    • Quad data rate (QDR): 40 Gbps (4X)
    • Fourteen data rate (FDR): 56 Gbps (4X)
• Layers: somewhat similar to TCP/IP
  – Physical layer
  – Link layer
    • Error detection (CRC checksum)
    • Flow control (credit based, sketched below)
    • Switching, virtual lanes (VL), forwarding table computed by the subnet manager
      – Single-path deterministic routing (not adaptive)
  – Network layer: across subnets
    • Not used in the cluster environment
  – Transport layer
    • Reliable/unreliable, connection/datagram
  – Verbs: the interface between adapters and OS/users
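A minimal sketch of the credit-based flow control idea on one virtual lane; the names and structure here are ours for illustration, not the IBA wire protocol:

```c
#include <stdint.h>
#include <stdbool.h>

/* Credit-based flow control: the sender may transmit only while it
 * holds credits; the receiver grants credits as it frees buffers.
 * The link therefore stalls instead of dropping packets. */
struct vl_state {
    uint32_t credits;           /* buffer units granted by the receiver */
};

/* Sender side: transmit only if a credit is available. */
static bool try_send(struct vl_state *vl)
{
    if (vl->credits == 0)
        return false;           /* no downstream buffer: stall */
    vl->credits--;              /* consume one buffer unit */
    /* ... put the packet on the wire ... */
    return true;
}

/* Receiver side: a buffer was freed, return a credit to the sender
 * (carried in link-level flow control packets in real IBA). */
static void grant_credit(struct vl_state *vl)
{
    vl->credits++;
}
```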
• InfiniBand link layer packet format:
  • Local Route Header (LRH): 8 bytes. Used for local routing by switches within an IBA subnet
  • Global Route Header (GRH): 40 bytes. Used for routing between subnets
  • Base Transport Header (BTH): 12 bytes, for the IBA transport
  • Extended transport headers:
    – Reliable Datagram Extended Transport Header (RDETH): 4 bytes, only for reliable datagrams
    – Datagram Extended Transport Header (DETH): 8 bytes
    – RDMA Extended Transport Header (RETH): 16 bytes
    – Atomic, ACK, and Atomic ACK extended transport headers
  • Immediate data extended transport header: 4 bytes, optimized for small packets
  • Invariant CRC and variant CRC:
    – CRCs over the fields that do not change in transit and over those that do, respectively
• Local Route Header:
  – Switching is based on the destination port address (LID)
  – Multipath switching by allocating multiple LIDs to one port (see the sketch below)
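A sketch of what LID-based forwarding amounts to; the struct and field names are illustrative, not the spec's format, but IBA switches do hold a linear forwarding table of this shape, programmed by the subnet manager:

```c
#include <stdint.h>

#define LFT_SIZE 49152          /* unicast LID space (0x0001..0xBFFF) */

/* Illustrative switch state: destination LID -> output port. */
struct ib_switch {
    uint8_t lft[LFT_SIZE];      /* linear forwarding table */
};

/* Forwarding is a pure table lookup on the 16-bit destination LID
 * from the Local Route Header -- no per-packet path computation.
 * Giving a port several LIDs yields several table entries, and
 * hence several distinct paths to the same physical port. */
static uint8_t forward(const struct ib_switch *sw, uint16_t dest_lid)
{
    return sw->lft[dest_lid];
}
```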
Subnet management
• Initialize the network
  – Discover the subnet topology and topology changes, compute the paths, assign LIDs, distribute the routes, configure devices
  – Related devices and entities:
    • Devices: channel adapters (CAs), host channel adapters, switches, routers
    • Subnet manager (SM): discovers, configures, activates, and manages the subnet
    • A subnet management agent (SMA) in every device generates and responds to control packets (subnet management packets, SMPs) and configures local components for subnet management
    • The SM exchanges control packets with SMAs through the subnet management interface (SMI)
• Subnet management phases:
  – Topology discovery: sending direct-routed SMPs to every port and processing the responses
  – Path computation: computing valid paths between each pair of end nodes
  – Path distribution: configuring the forwarding tables
• Base transport header (field layout figure)
• Verbs
  – OS/users access the adapter through verbs
  – Communication mechanism: the Queue Pair (QP)
    • Users can queue up a set of instructions that the hardware executes
    • A pair of queues in each QP: one for send, one for receive
    • Users post send requests to the send queue and receive requests to the receive queue
    • Three types of send operations: SEND, RDMA (WRITE, READ, ATOMIC), MEMORY-BINDING
    • One receive operation (matching SEND)
• To communicate (sketched below):
  – Make system calls to set up everything (open the QP, bind the QP to a port, bind completion queues, connect the local QP to the remote QP, register memory, etc.)
  – Post send/receive requests as user-level instructions
  – Check completion
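A minimal sketch of these steps using the Linux libibverbs API (one real implementation of IBA verbs). Connection establishment, exchanging QP numbers/LIDs with the peer and moving the QP through INIT -> RTR -> RTS via ibv_modify_qp(), is elided, and all error handling is omitted:

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no IB device\n"); return 1; }

    /* Setup phase: system calls through the verbs library. */
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                      /* protection domain */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);  /* completion queue */

    /* Register a buffer so the adapter can DMA into/out of it. */
    char *buf = malloc(4096);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096, IBV_ACCESS_LOCAL_WRITE);

    /* Create the queue pair: one send queue, one receive queue. */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq, .recv_cq = cq,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,          /* reliable connection */
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);

    /* ... connect the local QP to the remote QP here (elided) ... */

    /* Post a SEND work request: a user-level operation, no syscall. */
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = 4096,
                           .lkey = mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_SEND,
                              .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad;
    ibv_post_send(qp, &wr, &bad);

    /* Check completion by polling the completion queue. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;   /* busy-poll: still no OS involvement */
    printf("send completed with status %d\n", wc.status);
    return 0;
}
```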
• InfiniBand has an almost perfect software/network interface:
  – The network subsystem realizes most user-level functionality
    • The network supports in-order delivery and fault tolerance
    • Buffer management is pushed out to the user
  – OS bypass: user-level access to the network interface. A few machine instructions accomplish a transmission without involving the OS