Host Side Dynamic Reconfiguration with InfiniBand™
By Wei Lin Guay*, Sven-Arne Reinemo*, Olav Lysne*, Tor Skeie*, Bjørn Dag Johnsen^ and Line Holen^ *Simula Research Laboratory ^Sun Microsystems
Introduction • The quest for ever-increasing computing power drives state-of-the-art large-scale clusters. • In the Top500 list, more than 20 sites have supercomputers with more than 10k processors. • The increased cluster size challenges the reliability of the interconnect (InfiniBand).
Introduction • What are the available fault tolerance mechanisms? § Checkpoint/restart: the application is halted and restarted from the last checkpoint. Disadvantage: not application transparent. § Deadlock-free re-routing: application transparent. Disadvantage: inflexible. § Network dynamic reconfiguration is the trend!
Network Dynamic Reconfiguration • Network dynamic reconfiguration. § Move from one routing function to another while the system is up and running. § Application transparent. § More flexible. • Challenges of network dynamic reconfiguration. § Deadlock freedom in the transition phase. § Assumes that the network interface attributes have not been changed.
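The deadlock-freedom challenge in the transition phase can be illustrated with a channel dependency graph. An acyclic dependency graph is the usual sufficient condition for deadlock freedom, yet the union of two acyclic graphs (old and new routing functions active at the same time) may contain a cycle. The sketch below is only an illustration under stated assumptions: the four-switch ring, the routes, and the helper names `channel_dependencies` and `has_cycle` are hypothetical and do not come from the presentation.

```python
from collections import defaultdict


def channel_dependencies(routes):
    """Return the (channel, next_channel) pairs used by a set of routes.

    A channel is a directed link (u, v); each route is a list of nodes.
    """
    deps = set()
    for path in routes:
        channels = list(zip(path, path[1:]))
        deps.update(zip(channels, channels[1:]))
    return deps


def has_cycle(deps):
    """Detect a cycle in the dependency graph with a depth-first search."""
    graph = defaultdict(list)
    for a, b in deps:
        graph[a].append(b)

    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)  # every channel starts WHITE

    def visit(node):
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:
                return True  # back edge: a cyclic dependency exists
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in list(graph))


# Hypothetical ring of four switches 0-1-2-3-0 (an assumption for illustration).
old = channel_dependencies([[0, 1, 2], [1, 2, 3]])  # old routing function
new = channel_dependencies([[2, 3, 0], [3, 0, 1]])  # new routing function

print("old deadlock free:", not has_cycle(old))
print("new deadlock free:", not has_cycle(new))
print("transition deadlock free:", not has_cycle(old | new))
```

Each routing function is acyclic on its own, but packets following the old and the new routes at the same time close the ring of dependencies, which is why the transition phase needs special care.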
Host Side Dynamic Reconfiguration • Host side dynamic reconfiguration. § Migrate the attributes of the connection (Queue Pair) from the old routing structure to the new one. § Fault tolerance mechanism. § Live migration. § Policy changes, cluster maintenance. • Challenges of host side dynamic reconfiguration. § Which component triggers the change of routing path when a fault happens? § Should alternative paths be set up in advance? § Is the network manager responsible for finding the new path?
Challenges of Dynamic Reconfiguration • An RC connection is established between A and B. • During the transmission, a link fails! • The SM (subnet manager) regenerates a deadlock-free routing table. Predefining a deadlock-free shortest path for every <src, dst> pair is very difficult!
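The link-failure scenario above can be sketched as a path recomputation on a small graph. This is a simplified illustration, not the subnet manager's algorithm: it uses a plain breadth-first search for a shortest path and ignores deadlock freedom entirely. The topology (hosts `A` and `B`, switches `S1` to `S3`) and the function name `shortest_path` are assumptions made for the example.

```python
from collections import deque


def shortest_path(links, src, dst):
    """Breadth-first search over undirected links; returns a node list or None."""
    adj = {}
    for u, v in links:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the parent pointers back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(adj.get(node, ())):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical topology: A and B are hosts, S1..S3 are switches.
links = [("A", "S1"), ("S1", "S2"), ("S2", "B"), ("S1", "S3"), ("S3", "S2")]
before = shortest_path(links, "A", "B")

# The S1-S2 link fails during transmission; recompute without it.
surviving = [link for link in links if link != ("S1", "S2")]
after = shortest_path(surviving, "A", "B")

print("before failure:", before)
print("after failure: ", after)
```

The connection between A and B survives only if the host is told to use the recomputed path, which is the gap host side reconfiguration is meant to close.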
Host Side Dynamic Reconfiguration
Host Side Dynamic Reconfiguration 1
Host Side Dynamic Reconfiguration 2
Host Side Dynamic Reconfiguration 3
Host Reconfiguration • Keep track of the active QPs created in each host stack. • Modify the QP's context in the RTS state. § Reset Queue Pair § Send Queue Drain (SQD) § Automatic Path Migration (APM)
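The bookkeeping described above can be sketched as a small model of the queue pair state machine. This is a pure-Python illustration, not code against libibverbs or any real host stack: the classes `QueuePair` and `HostStack`, the single `dlid` path attribute, and the reduced transition table are assumptions, and whether a given adapter allows path changes while in SQD is likewise assumed. Automatic Path Migration is not modelled.

```python
# Simplified subset of queue pair transitions; any state may return to RESET.
LEGAL = {
    ("RESET", "INIT"), ("INIT", "RTR"), ("RTR", "RTS"),
    ("RTS", "SQD"), ("SQD", "RTS"),
}


class QueuePair:
    def __init__(self, qp_num, dlid):
        self.qp_num = qp_num
        self.dlid = dlid          # destination LID, standing in for the path
        self.state = "RESET"
        self.history = ["RESET"]

    def modify(self, new_state):
        if new_state != "RESET" and (self.state, new_state) not in LEGAL:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.history.append(new_state)

    def connect(self):
        for state in ("INIT", "RTR", "RTS"):
            self.modify(state)


class HostStack:
    """Tracks the active QPs so that they can be updated on a path change."""

    def __init__(self):
        self.active = {}

    def create_qp(self, qp_num, dlid):
        qp = QueuePair(qp_num, dlid)
        qp.connect()
        self.active[qp_num] = qp
        return qp

    def reconfigure_sqd(self, old_dlid, new_dlid):
        # Drain the send queue, swap the path, resume sending.
        for qp in self.active.values():
            if qp.state == "RTS" and qp.dlid == old_dlid:
                qp.modify("SQD")
                qp.dlid = new_dlid
                qp.modify("RTS")

    def reconfigure_reset(self, old_dlid, new_dlid):
        # Heavier alternative: tear the QP down and bring it up again.
        for qp in self.active.values():
            if qp.dlid == old_dlid:
                qp.modify("RESET")
                qp.dlid = new_dlid
                qp.connect()


stack = HostStack()
qp1 = stack.create_qp(1, dlid=5)
qp2 = stack.create_qp(2, dlid=7)
stack.reconfigure_sqd(old_dlid=5, new_dlid=9)
print(qp1.dlid, qp1.history)
print(qp2.dlid, qp2.history)
```

In this model only the queue pair that used the old destination is touched, and the SQD variant takes two transitions where the reset variant takes four.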
Performance Evaluation • Synthetic traffic patterns. § 6-3 : 5-2 : 4-1 : 3-6 : 2-5 : 1-4 • Application traffic patterns. § HPCC b_eff
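If each entry of the synthetic pattern above is read as a source-destination pair of node numbers (an assumption, since the slide does not define the notation), the pattern is a cyclic shift by three over six nodes. A minimal sketch that generates it, with the hypothetical helper name `shift_pattern`:

```python
def shift_pattern(nodes, shift):
    """Pair every node (numbered from 1) with the node `shift` positions ahead."""
    return [(src, (src - 1 + shift) % nodes + 1) for src in range(nodes, 0, -1)]


pairs = shift_pattern(6, 3)
print(" : ".join(f"{src}-{dst}" for src, dst in pairs))
```

Under that reading every node sends and receives exactly once, so all six flows are active at the same time.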
Performance Evaluation • Micro-benchmark, setup phase: no additional overhead!
Performance Evaluation • Synthetic traffic patterns
Performance Evaluation • HPCC b_eff • Without dynamic reconfiguration: § The benchmark does not complete once the first fault occurs. § Deadlock occurs!
Conclusion • Novel fault tolerance mechanism. § Feedback from the SM. § Application transparent.
Future Work • Evaluation of scalability. § Event notification. • Live migration in virtualized environments.
Thanks!