Developing a Scalable Coherent Interface (SCI) device for MPJ Express
Guillermo López Taboada
14th October, 2005
Dept. of Electronics and Systems, University of A Coruña (Spain) http://www.des.udc.es
Visitor at the Distributed Systems Group http://dsg.port.ac.uk
Outline • Introduction • Design of scidev • Implementation issues • Benchmarking • Future work • Conclusions
Introduction • The interconnection network and its associated software libraries play a key role in high performance clustering technology • Cluster interconnection technologies: Gb & 10 Gb Ethernet, Myrinet, SCI, InfiniBand, QsNet (Quadrics), GSN (HIPPI), Giganet – Latencies are low (usually under 10 µs) – Bandwidths are high (usually above 1 Gbps)
Introduction • SCI (Scalable Coherent Interface) • Latency: 1.42 µs (theoretical) • Bandwidth: 5333 Mbps (bidirectional) – Usually used without a switch (small clusters) – Topologies: 1D (ring) / 2D (torus)
Introduction • Example of a 2D torus SCI cluster with FE (admin)
Introduction • Software available from Dolphinics • Software available from Scali: – ScaIP: IP emulation – ScaSISCI: SISCI (Software Infrastructure for SCI) – ScaMPI: proprietary MPI implementation
Introduction • In networking, Java’s portability means that only the widely adopted TCP/IP is supported by the JDK • Previously, IP emulations were used (ScaIP & SCIP), but their performance is similar to FE • Now there is a high performance socket implementation, SCI SOCKETS • Similar to other interconnection technologies, e.g. Myrinet (IPoGM -> GM Sockets)
Introduction • Several research projects have tried to provide Java support for these System Area Networks, mainly on Myrinet: – KaRMI/GM (JavaParty, Univ. of Karlsruhe) – Manta/LFC/Panda/Ibis (Vrije Univ., Netherlands) – Java GM Sockets – RMIX Myrinet – mpiJava/MPICH-GM or MPICH-MX – … • But nothing for SCI
Introduction • My Ph.D. project: “Designing Efficient Mechanisms for Java Communications on SCI Systems” • The motivation is to fill the gap between Java and this high-speed interconnect, which lacks software support for Java – SCI Java Fast Sockets – An SCI communication device, the base of a messaging system – An SCI channel for Java NIO – Wrappers for some libraries – An RMI optimized for high-speed networks – A low-level Java buffering and communication system
Introduction • MPJ Express, a reference implementation of the MPI bindings for the Java language, has been released – The C, C++ and Fortran bindings are already mature, but the Java binding is an ongoing effort at DSG • A good opportunity to provide SCI support in a messaging system
Design of scidev • Use of the Java Native Interface (JNI) is unavoidable – To provide SCI support with good performance we have to rely on specific low-level libraries – In the presence of SCI hardware, the device should use it – Portability is lost in exchange for higher performance – Differences between mpiJava and scidev: • mpiJava: a thin wrapper providing a large number of Java MPI primitives • scidev: a thicker layer providing a small API
Design of scidev • Implementing the xdev API: – init() – finish() – id() – iprobe(ProcessID srcID, int tag, int context) – irecv(Buffer buf, ProcessID srcID, int tag, int context, Status status) – isend(Buffer buf, ProcessID destID, int tag, int context) – and the blocking counterparts of these functions: probe, recv, send, plus issend & ssend
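The xdev contract above can be sketched as a Java interface. This is only an illustration: the helper types (Buffer, ProcessID, Status, Request) and the exact signatures are placeholders inferred from the slide, not MPJ Express's real declarations.

```java
// Placeholder types standing in for the real xdev/mpjbuf classes.
interface Buffer {}
interface ProcessID {}
interface Status {}
interface Request {}

// Hypothetical sketch of the xdev device API that scidev implements.
interface Device {
    void init(String[] args);   // set up SCI resources and message queues
    void finish();              // release native resources
    ProcessID id();             // identifier of this process
    Status iprobe(ProcessID srcID, int tag, int context);
    Request irecv(Buffer buf, ProcessID srcID, int tag, int context, Status status);
    Request isend(Buffer buf, ProcessID destID, int tag, int context);
    // blocking counterparts exist too: probe, recv, send, issend, ssend
}
```

Each concrete device (niodev over NIO sockets, mxdev over Myrinet, scidev over SCI) plugs in behind this small API.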
Design of scidev (layer diagram: mpjdev sits on top of xdev; the xdev devices include mxdev and scidev; scidev goes through JNI to the native libraries on the O.S.; everything above JNI runs inside the JVM)
Design of scidev • Native libraries: SCILib and SISCI
Implementation Issues • Optimizations in the initialization process: – JNI: caching field identifiers and object references – Sending 2 messages in the Long protocol • The first from a 4-byte-aligned address, the second from a 128-byte-aligned address up to a 128-byte boundary (going past the end of the message; the raw buffer has a 2^n length) – Algorithm to initialize the SCILib message queues: • Connect (to the nodes with lower rank) • Create (for all nodes, beginning with the following rank) • Connect (the remaining nodes) • The complexity is O(n)
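The queue-initialization ordering above can be sketched as follows. This is a hypothetical illustration of the three-phase scheme on the slide; the operation labels are invented, and the assumption that the create phase wraps around from the following rank is an interpretation, not SCILib's documented behaviour.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the SCILib message-queue initialization order:
// 1) connect to nodes with lower rank, 2) create queues for all
// nodes beginning with the following rank (wrapping), 3) connect
// to the remaining (higher-rank) nodes. O(n) operations overall.
public class QueueInit {
    public static List<String> initOrder(int myRank, int n) {
        List<String> ops = new ArrayList<>();
        for (int r = 0; r < myRank; r++)          // phase 1: lower ranks
            ops.add("connect:" + r);
        for (int i = 1; i <= n; i++)              // phase 2: create, from rank+1
            ops.add("create:" + ((myRank + i) % n));
        for (int r = myRank + 1; r < n; r++)      // phase 3: remaining nodes
            ops.add("connect:" + r);
        return ops;
    }
}
```

Connecting to lower ranks first avoids the deadlock that would occur if every node tried to create before any peer had connected.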
Implementation Issues • Transport protocols: – 3 native protocols: • Inline: 1-113 B • Short: 114 B-64 KB • Long: 64 KB-1 MB – scidev fragments messages > 1 MB and uses: • Inline for control messages and small messages (< 113 B) • Short with PIO (Programmed Input/Output) for messages < 8 KB • Short with DMA (Direct Memory Access) for messages of 8-64 KB • The Long protocol in the user-level libraries does not use DMA transfers, so it is replaced by our own Long protocol with DMA transfers
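The size-based selection above can be sketched in a few lines. The thresholds come from the slide; the enum, method names and the exact boundary handling are assumptions for illustration, not scidev's actual code.

```java
// Sketch of scidev's transfer-mode selection by message size:
// Inline up to 113 B, Short with PIO below 8 KB, Short with DMA
// from 8 KB to 64 KB, and a DMA-based Long protocol above that;
// messages larger than 1 MB are fragmented first.
public class Protocols {
    public enum Mode { INLINE, SHORT_PIO, SHORT_DMA, LONG_DMA }

    public static final int MAX_FRAGMENT = 1 << 20;   // 1 MB fragment size

    public static Mode select(int size) {
        if (size <= 113)        return Mode.INLINE;
        if (size < 8 * 1024)    return Mode.SHORT_PIO;
        if (size <= 64 * 1024)  return Mode.SHORT_DMA;
        return Mode.LONG_DMA;                         // 64 KB - 1 MB per fragment
    }

    // Number of fragments a message is split into (ceiling division).
    public static int fragments(long size) {
        return (int) ((size + MAX_FRAGMENT - 1) / MAX_FRAGMENT);
    }
}
```

The split matters because PIO wins at small sizes (no DMA setup cost), while DMA amortizes its setup over larger transfers.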
Implementation Issues • Communications: – scidev is based on non-blocking communications – It is coded using niodev as a template – Asynchronous sends for message sizes > 1 MB – Notification strategy: • Following the approach of SCI SOCKET, using the mbox interrupt library • Interrupts are created without transferring the references (SCI interrupt handlers) • Each interrupt (both user interrupts and DMA interrupts) registers a callback method
Implementation Issues • Sending/Receiving: – 2 threads, user and selector, synchronized to reduce latency – 1 message queue in which the control messages of pending communications are kept – Sends go directly from the “Buffer” direct ByteBuffer – If the selector thread receives a message that has not been posted, it creates an intermediate buffer for temporary storage – If the message has been posted, it copies the message directly into the “Buffer” direct ByteBuffer
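The posted-versus-unexpected matching described above can be sketched like this. All class, record and field names here are illustrative stand-ins (strings instead of direct ByteBuffers, no real synchronization), not scidev's actual data structures.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the receive path: the selector thread matches incoming
// messages against posted receives; a match is copied straight into
// the user's buffer, while an unexpected message is parked in an
// intermediate queue until a matching irecv is posted.
public class ReceivePath {
    record Posted(int src, int tag, StringBuilder dest) {}
    record Incoming(int src, int tag, String payload) {}

    final Deque<Posted> posted = new ArrayDeque<>();
    final Deque<Incoming> unexpected = new ArrayDeque<>();

    // Called by the selector thread for every incoming message.
    void onIncoming(Incoming msg) {
        for (Posted p : posted) {
            if (p.src() == msg.src() && p.tag() == msg.tag()) {
                p.dest().append(msg.payload());   // direct copy into user buffer
                posted.remove(p);
                return;
            }
        }
        unexpected.add(msg);                      // park in intermediate buffer
    }

    // Called by the user thread; drains the intermediate queue first.
    void irecv(Posted p) {
        for (Incoming m : unexpected) {
            if (m.src() == p.src() && m.tag() == p.tag()) {
                p.dest().append(m.payload());
                unexpected.remove(m);
                return;
            }
        }
        posted.add(p);
    }
}
```

The intermediate buffer is the price of asynchrony: it costs an extra copy, which is why posting receives early keeps the fast (zero-intermediate-copy) path.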
Implementation Issues (diagram: this scheme is used for each pair of nodes; on each side, a user thread with SBUFFER and RBUFFER and the selector thread, connected over SCI via the ULL, Long, Short and Inline paths, with an intermediate queue for unexpected messages)
Benchmarking • JDK 1.5 on holly. Latency (µs):

        MPJE   mpiJava   C sockets   Java sockets
  SCI     51        12           5             11
  FE     161       145          83            109
  GbE    131       101          65             86

• scidev latency is 33 µs!
Benchmarking • JDK 1.5 on holly. Asymptotic bandwidths (Mbps):

        MPJE   mpiJava   C sockets   Java sockets
  SCI   1200      1480         400            360
  FE      90        92          93             92
  GbE    680       587         900           600*

• scidev throughput is 1280 Mbps!
Future work • Immediately: – Testing of collective communications (so far only point-to-point has been tested) • A design with lower interdependence between xdev and mpjbuf • Reading the different SCI configuration file formats • Benchmarking with MPJ applications and developing MPJ and xdev applications • A new buffering implementation
Future work (diagram: buffering system with SBUFFER and RBUFFER in the ULL, still intermediate; Long, Short and Inline paths over SCI, plus the intermediate queue)
Conclusions • Performance is still a problem – Try to avoid the control message, maybe by integrating its data in the user-level library – Aim: 30 µs latency & 1350 Mbps bandwidth • Current development phase: testing – Multiple initializations in a single thread (restarting the device) are hard to do • The design is somewhat coupled with MPJ – strong interdependence • Needs evaluation and implementation using a kernel-level library (handling threads and spawning processes natively)
Questions?
Appendix • Visitor at the DSG during summer 2005 – Pursuing a Ph.D. at the Univ. of A Coruña (Spain)
Appendix • BS in Computing Tech. in 2002 at the Univ. of A Coruña • Member of the Computer Architecture Group – Areas of interest of the group: – High performance compilers (automatic detection of parallelism) – Cluster computing – Grid applications – Management of parallel/distributed systems – Fault tolerance in MPI – Computer graphics (rendering, radiosity) – Geographical information systems – 12 staff members, 8 Ph.D. students
Appendix • Computer Architecture Group – CrossGrid (EU project within GridStart)
Appendix • The Computer Architecture Group is young, with an average age of 32 • Some achievements (2000-2004): – Papers in international conferences: 102 – Papers in journals: 53 (41 in the JCR/SCI list) – Regional, national and European funded projects (around 1 M€ in 5 years)
Acknowledgements • DSG for providing full support for my work – Especially Aamir and Raz for the late, smoky and caffeinated DSG office hours – Mark for hosting the visit and his valuable support • ICG and UoP for the facilities and services • Bryan Carpenter for his rare but valuable comments, and his help with some JNI problems • DXIDI – Xunta de Galicia, for funding the visit
A Coruña • You will always be welcome in A Coruña!