Experience using Infiniband for computational chemistry Mark Moraes
![Experience using Infiniband for computational chemistry Mark Moraes D. E. Shaw Research, LLC Experience using Infiniband for computational chemistry Mark Moraes D. E. Shaw Research, LLC](https://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-1.jpg)
Experience using Infiniband for computational chemistry Mark Moraes D. E. Shaw Research, LLC
![Molecular Dynamics (MD) Simulation The goal of D. E. Shaw Research: Single, millisecond-scale MD Molecular Dynamics (MD) Simulation The goal of D. E. Shaw Research: Single, millisecond-scale MD](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-2.jpg)
Molecular Dynamics (MD) Simulation The goal of D. E. Shaw Research: Single, millisecond-scale MD simulations Why? That’s the time scale at which biologically interesting things start to happen
![A major challenge in Molecular Biochemistry § Decoded the genome § Don’t know most A major challenge in Molecular Biochemistry § Decoded the genome § Don’t know most](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-3.jpg)
A major challenge in Molecular Biochemistry § Decoded the genome § Don’t know most protein structures § Don’t know what most proteins do – Which ones interact with which other ones – What is the “wiring diagram” of • Gene expression networks • Signal transduction networks • Metabolic networks § Don’t know how everything fits together into a working system
![Molecular Dynamics Simulation Compute the trajectories of all atoms in a chemical system Iterate Molecular Dynamics Simulation Compute the trajectories of all atoms in a chemical system Iterate](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-4.jpg)
Molecular Dynamics Simulation Compute the trajectories of all atoms in a chemical system Iterate § 25, 000+ atoms § For 1 ms (10 -3 seconds) § Requires 1 fs (10 -15 seconds) timesteps 1012 timesteps Iterate with 108 operations per timestep => 1020 operations per simulation Years on the fastest current supercomputers & clusters
![Our Strategy Parallel Architectures – Designing a specialized supercomputer (Anton, ISCA 2007, Shaw et Our Strategy Parallel Architectures – Designing a specialized supercomputer (Anton, ISCA 2007, Shaw et](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-5.jpg)
Our Strategy Parallel Architectures – Designing a specialized supercomputer (Anton, ISCA 2007, Shaw et al) – Enormously parallel architecture – Based on special-purpose ASICs – Dramatically faster for MD, but less flexible – Projected completion: 2008 Parallel Algorithms – Applicable to • Conventional clusters (Desmond, SC 2006, Bowers et al) • Our own machine – Scale to very large # of processing elements
![Why RDMA? § Today: Low-latency, high-bandwidth interconnect for parallel simulation • Infiniband § Tomorrow: Why RDMA? § Today: Low-latency, high-bandwidth interconnect for parallel simulation • Infiniband § Tomorrow:](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-6.jpg)
Why RDMA? § Today: Low-latency, high-bandwidth interconnect for parallel simulation • Infiniband § Tomorrow: – High bandwidth interconnect for storage – Low-latency, high-bandwidth interconnect for parallel analysis • Infiniband • Ethernet
![Basic MD simulation step Each process in parallel: § Compute forces (if own the Basic MD simulation step Each process in parallel: § Compute forces (if own the](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-7.jpg)
Basic MD simulation step Each process in parallel: § Compute forces (if own the midpoint of the pair) § Export forces § Sum forces for particles in homebox § Compute new positions for homebox particles § Import updated positions for particles in import region
![Parallel MD simulation using RDMA Implemented with the Verbs interface. RDMA writes are used Parallel MD simulation using RDMA Implemented with the Verbs interface. RDMA writes are used](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-8.jpg)
Parallel MD simulation using RDMA Implemented with the Verbs interface. RDMA writes are used (faster than RDMA reads). § For iterative exchange patterns, two sets of send and receive buffers are enough to guarantee that send and receive buffers are always available § Send and receive buffers are fixed and registered beforehand § The sender knows all the receiver’s receive buffers and use them alternately
![Refs: Kumar, Almasi, Huang et al. 2006; Fitch, Rayshubskiy, Eleftheriou et al. 2006 Refs: Kumar, Almasi, Huang et al. 2006; Fitch, Rayshubskiy, Eleftheriou et al. 2006](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-9.jpg)
Refs: Kumar, Almasi, Huang et al. 2006; Fitch, Rayshubskiy, Eleftheriou et al. 2006
![Apo. A 1 Production Parameters Apo. A 1 Production Parameters](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-10.jpg)
Apo. A 1 Production Parameters
![Apo. A 1 Production Parameters Apo. A 1 Production Parameters](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-11.jpg)
Apo. A 1 Production Parameters
![](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-12.jpg)
![](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-13.jpg)
![Management & Diagnostics: A view from the field § Design & Planning § Deployment Management & Diagnostics: A view from the field § Design & Planning § Deployment](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-14.jpg)
Management & Diagnostics: A view from the field § Design & Planning § Deployment § Daily Operations § Detection of Problems § Diagnosis of Problems
![Design § HCAs, Cables and Switches: OK, that’s kinda like Ethernet. § Subnet Manager. Design § HCAs, Cables and Switches: OK, that’s kinda like Ethernet. § Subnet Manager.](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-15.jpg)
Design § HCAs, Cables and Switches: OK, that’s kinda like Ethernet. § Subnet Manager. Hmm, it’s not Ethernet. § IPo. IB: Ah, it’s IP. § SDP: Nope, not really IP. § SRP: Wait, that’s not FC. § i. SER: Oh, that’s going to talk i. SCSI. § u. DAPL: Uh-oh. . . § Awww, doesn’t connect to anything else I have! § But it’s got great bandwidth and latency. . . § I’ll need new kernels everywhere. § And I should *converge* my entire network with this? !
![Design § Reality-based performance modeling. – The myth of N usec latency. – The Design § Reality-based performance modeling. – The myth of N usec latency. – The](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-16.jpg)
Design § Reality-based performance modeling. – The myth of N usec latency. – The myth of NNN Mi. B/sec bandwidth. § Inter-switch connections. § Reliability and availability. § A cost equation: – Capital – Install (cables and bringup) – Failures (app crashes, HCAs, cables, switch ports, subnet manager) – Changes and growth – What’s the lifetime?
![Deployment § De-coupling for install and upgrade: – Drivers [Understandable cost, grumble, mumble] – Deployment § De-coupling for install and upgrade: – Drivers [Understandable cost, grumble, mumble] –](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-17.jpg)
Deployment § De-coupling for install and upgrade: – Drivers [Understandable cost, grumble, mumble] – Kernel protocol modules [Ouch!] – User-level protocols [Ok] § How do I measure progress and quality of an installation? – Software – Cables – Switches
![Gratuitous sidebar: Consider a world in which. . . § The only network that Gratuitous sidebar: Consider a world in which. . . § The only network that](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-18.jpg)
Gratuitous sidebar: Consider a world in which. . . § The only network that matters to Enterprise IT is TCP and UDP over IP. § Each new non-IP acronym causes a louder, albeit silent scream of pain from Enterprise IT. § Enterprise IT hates having anyone open a machine and add a card to it. § Enterprise IT really, absolutely, positively hates updating kernels. § Enterprise IT never wants to recompile any application (if they even have, or could find the source code)
![Daily Operation § Silence is golden. . – If it’s working well. § Capacity Daily Operation § Silence is golden. . – If it’s working well. § Capacity](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-19.jpg)
Daily Operation § Silence is golden. . – If it’s working well. § Capacity utilization. – Collection of stats • SNMP • RRDtool, MRTG and friends (ganglia, drraw, cricket, . . . ) – Spot checks for errors. Is it really working well?
![Detection of problems Reach out and touch someone. . . § SNMP Traps. § Detection of problems Reach out and touch someone. . . § SNMP Traps. §](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-20.jpg)
Detection of problems Reach out and touch someone. . . § SNMP Traps. § Email. § Syslog. A GUID is not a meaningful error message. Hostnames, card numbers, port numbers, please! Error counters: Non-zero BAD, Zero GOOD. Corollary: It must be impossible to have a network error where all error counters are zero.
![Diagnosis of problems § What path caused the error? § Where are netstat, ping, Diagnosis of problems § What path caused the error? § Where are netstat, ping,](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-21.jpg)
Diagnosis of problems § What path caused the error? § Where are netstat, ping, ttcp/iperf and traceroute equivalents? § What should I replace? – Remember the sidebar – Software? HCA? Cable? Switch port? Switch card? § Will I get better (any? ) support if I’m running – the vendor stack(s)? – or the latest OFED?
![A closing comment. Our research group uses our Infiniband cluster heavily. Continuously. For two A closing comment. Our research group uses our Infiniband cluster heavily. Continuously. For two](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-22.jpg)
A closing comment. Our research group uses our Infiniband cluster heavily. Continuously. For two years and counting. Our Infiniband interconnect has had fewer total failures than our expensive, enterprise-grade Gig. E switches.
![Questions? moraes@deshaw. com Questions? moraes@deshaw. com](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-23.jpg)
Questions? moraes@deshaw. com
![Leaf switches and core switches and cables, oh my! Core switch Leaf switch Core Leaf switches and core switches and cables, oh my! Core switch Leaf switch Core](http://slidetodoc.com/presentation_image/be8f8694a976280874cd1ef95457a700/image-24.jpg)
Leaf switches and core switches and cables, oh my! Core switch Leaf switch Core switch . . . 10 more Leaf switch . . . 86 more 12 servers
- Slides: 24