A Brief Introduction to OpenFabrics Interfaces - libfabric
Sean Hefty
Motivation
OpenFabrics libibverbs middleware
• Widely adopted low-level RDMA API
• Ships with upstream Linux
• Intended as unified API for RDMA
Designed around InfiniBand architecture but…
§ Targets specific hardware implementation
Hardware, not network, abstraction
§ Too low level for most consumers, not designed around HPC
Hardware and fabric features are changing
§ Divergence is driving alternative APIs: UCX, PSM, MXM, CCI, PAMI, uGNI …
More applications require high-performance fabrics
§ Cloud systems, data analytics, virtualization, big data …
Solution
OpenFabrics Interfaces Working Group: libfabric
Open Source: Leverage existing open source community
• Inclusive development effort
• App and HW developers
Application-Centric: Software interfaces aligned with application requirements
• 168 requirements from MPI, PGAS, SHMEM, DBMS, sockets, NVM, …
Scalable: Optimized SW path to HW
• Minimize cache and memory footprint
• Reduce instruction count
• Minimize memory accesses
Implementation Agnostic: Good impedance match with multiple fabric hardware
• InfiniBand, iWarp, RoCE, raw Ethernet, UDP offload, Omni-Path, GNI, others
OpenFabrics Interfaces Working Group
Charter: Develop an extensible, open source framework and interfaces aligned with ULP and application needs for high-performance fabric services
ofiwg@lists.openfabrics.org
github.com/ofiwg
Application-centric interfaces will help foster fabric innovation and accelerate their adoption
Development
• Requirement analysis: ~200 requirements (MPI, PGAS, SHMEM, DBMS, sockets, …)
• Rough conceptual model: Input from wide variety of devices
• Iterative design and implementation: Collective feedback from OFIWG
• Deployment: Quarterly release cycle
Application Requirements
Give us a high-level interface!
Give us a low-level interface!
And this was just the MPI developers!
Try talking to the government!
Implementation Agnostic API Design
EASY
• Enable simple, basic usage
• Move functionality under OFI
GURU
• Advanced application constructs
• Expose abstract HW capabilities
Range of application usage models
Architecture
Note: current implementation focused on enabling applications
Libfabric Enabled Middleware
• Intel MPI, MPICH (Netmod), Open MPI (MTL / BTL), Sandia SHMEM, GASNet, Clang UPC, rsockets, ES-API
libfabric
• Control Services: Discovery (fi_info)
• Communication Services: Connection Management, Address Vectors
• Completion Services: Event Queues, Counters
• Data Transfer Services: Message Queues, Tag Matching, RMA, Atomics, Triggered Ops
Providers
• Sockets (TCP, UDP), Verbs (IB, RoCE, iWarp), Cisco usNIC, Intel Omni-Path, Cray GNI, Mellanox MXM, IBM Blue Gene, A3Cube RONNIEE
• Legend: Supported or in active development / Experimental
Fabric Information (EASY)
Endpoint Types
• MSG - Reliable connected
• DGRAM - Datagram
• RDM - Reliable unconnected, datagram messages
Capabilities
• Message queue - FIFO
• RMA
• Tagged messages
• Atomics
Select desired endpoint type and capabilities
Fabric Information (EASY)
[Figure: OFI Enabled Applications (App 1, App 2, … App n), each using an RDM Message Queue, above a Common Implementation with RDM Message Queue and DGRAM Message Queue]
Fabric Information (GURU)
Capabilities
§ Application desired features and permissions
§ Primary: must be requested by application
§ Secondary: offered by provider (application can request)
§ Communication type: msg, tagged, rma, atomics, triggered
§ Permissions: local R/W, remote R/W, send/recv
§ Features: rma events, directed recv, multi-recv, …
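The capability matching described above can be sketched with bit flags. This is a minimal illustration in Python, not libfabric code: the `Cap` names, the `select_providers` helper, and the provider table are all hypothetical, and the only rule modeled is that a provider qualifies when it covers every capability the application requests.

```python
from enum import IntFlag


class Cap(IntFlag):
    """Hypothetical capability bits named after the communication types on the slide."""
    MSG = 1
    TAGGED = 2
    RMA = 4
    ATOMICS = 8
    TRIGGERED = 16


def select_providers(requested, providers):
    """Return the names of providers whose capabilities cover every requested bit."""
    return [name for name, caps in providers.items()
            if (requested & caps) == requested]


# Hypothetical provider table, for illustration only.
PROVIDERS = {
    "provider_a": Cap.MSG | Cap.TAGGED | Cap.RMA | Cap.ATOMICS,
    "provider_b": Cap.MSG,
}
```

Requesting `Cap.MSG | Cap.RMA` selects only `provider_a` in this table, while requesting `Cap.MSG` alone selects both.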
Fabric Information (GURU)
Attributes
§ Defines the limits and behavior of selected interfaces
§ Progress: provider or application driven
§ Threading: resource synchronization boundaries
§ Resource mgmt: protect against queue overruns
§ Ordering: message processing, data transfers
Expose optimal way to use underlying hardware resources
Fabric Information (GURU)
Mode
§ Provider hints on how it is best used
§ Local MR: must register buffers for local operations
§ Context: app provides 'scratch space' for provider to track request
§ Buffer prefix: app provides space for network headers
Request application take action to improve overall performance
Endpoints (EASY)
Addressable communication portal
• Conceptually similar to a socket or QP
• Conceptual (or real) command queues
• Sequence of request and completion processing
[Figure: endpoint with transmit, receive, and completions]
Shared Tx/Rx Contexts (GURU)
Enable resource manager to direct use of HW resources
• Number of endpoints greater than available resources
• Map to command queues or HW limits (caching)
[Figure: shared transmit and receive contexts]
Scalable Endpoints (GURU)
Multiple Tx/Rx contexts per endpoint
- Multi-threading
- Ordering
- Progress
- Completions
Incoming requests may be able to target a specific receive context
[Figure: endpoint with multiple transmit and receive contexts]
API Performance Analysis
Issues apply to many APIs: Verbs, AIO, DAPL, Portals, NetworkDirect, …

libibverbs with InfiniBand
Structure   Field     Write Size   Branch?
sge                   16
send_wr               60
            next                   Yes
            num_sge                Yes
            opcode                 Yes
            flags                  Yes
Totals                76+8 = 84    4+1 = 5

libfabric with InfiniBand
Type        Parameter   Write Size   Branch?
void *      buf         8
size_t      len         8
void *      desc        8
fi_addr_t   dest_addr   8
void *      context     8
Totals                  40           0

• Generic entry points result in additional memory reads/writes
• Interface parameters can force branches in the provider code
• Move operation flags into initialization code path for optimal SW paths
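The last point, moving operation flags into the initialization path, can be illustrated with a small language-neutral sketch. This is not how libfabric or any provider is implemented: `bind_send`, the two send paths, and the `want_completion` flag are hypothetical names chosen for the example. The idea shown is only that a flag is examined once, at setup, so the per-call path carries no branch on it.

```python
def _send_with_completion(queue, completions, buf):
    """Send path that also records a completion entry (here, the byte count)."""
    queue.append(buf)
    completions.append(len(buf))


def _send_no_completion(queue, completions, buf):
    """Send path that never touches the completion list."""
    queue.append(buf)


def bind_send(want_completion):
    """Pick the send path once, at initialization, instead of testing a flag per call."""
    return _send_with_completion if want_completion else _send_no_completion
```

A caller would do `send = bind_send(True)` during setup and then call `send(queue, completions, data)` in the hot path.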
Memory Footprint
Per peer addressing data

libibverbs with InfiniBand
Type       Data     Size (bytes)
struct *   ibv_ah   8
uint32     QPN      4
uint32     QKey     4
[0]        ibv_ah   24
Total               36

libfabric with InfiniBand
Type       Data        Size (bytes)
uint64     fi_addr_t   8
Total                  8

Index Address Vector:
• minimal footprint
• requires lookup/calculation for peer address
Map Address Vector:
• encodes peer address
• direct mapping to HW command data

IB Data:        DLID   SL   QPN
Size (bytes):   2      1    3
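To make the map address vector idea concrete, here is a minimal sketch of packing the InfiniBand fields listed above (DLID 2 bytes, SL 1 byte, QPN 3 bytes) into a single value that fits in 64 bits. The bit layout, the function names, and the peer count used in the note below are assumptions for illustration; the slide does not specify how a provider actually encodes `fi_addr_t`.

```python
# Field widths from the slide: DLID 2 bytes, SL 1 byte, QPN 3 bytes.
DLID_BITS, SL_BITS, QPN_BITS = 16, 8, 24


def pack_peer(dlid, sl, qpn):
    """Pack DLID, SL and QPN into one integer (assumed layout: DLID | SL | QPN)."""
    if not (0 <= dlid < (1 << DLID_BITS)
            and 0 <= sl < (1 << SL_BITS)
            and 0 <= qpn < (1 << QPN_BITS)):
        raise ValueError("field out of range")
    return (dlid << (SL_BITS + QPN_BITS)) | (sl << QPN_BITS) | qpn


def unpack_peer(addr):
    """Recover (dlid, sl, qpn) from a value produced by pack_peer."""
    qpn = addr & ((1 << QPN_BITS) - 1)
    sl = (addr >> QPN_BITS) & ((1 << SL_BITS) - 1)
    dlid = addr >> (SL_BITS + QPN_BITS)
    return dlid, sl, qpn


def per_peer_bytes(num_peers, bytes_per_peer):
    """Total addressing memory for num_peers peers."""
    return num_peers * bytes_per_peer
```

The six bytes of IB data leave room to spare in an 8-byte value. Using the per-peer totals listed on the slide and a hypothetical one million peers, `per_peer_bytes(1_000_000, 36)` gives 36,000,000 bytes against 8,000,000 bytes for `per_peer_bytes(1_000_000, 8)`.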
OFA Community Release Schedule
1.0 - Q1 2015
• Initial release: support for PSM, Verbs (IB/iWarp), usNIC, socket providers
• Quickly enable applications, mix of native and layered providers
1.1 - Q2 2015
• Bug fixes and provider enhancements
1.1.1 - Q3 2015
• Bug fix only release
1.2 - Q4 2015
• New providers: enhanced verbs, Omni-Path (PSM2), MXM, GNI
2016
• Interface extensions
OFIWG at SC '15
Tutorial - Monday 1:30 - 5:00
§ Detailed look at the libfabric interface
  - Basic examples
§ Middleware implementation reference
  - MPI
  - OpenSHMEM
BoF - Tuesday 1:30 - 3:00
§ OFIWG, including the Data Storage/Data Access subgroup
§ Initial collection of interface extensions
Legal Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance. Intel, the Intel logo, and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © 2015 Intel Corporation.
INTEL® HPC DEVELOPER CONFERENCE