Network Driver Performance

Outline

  • Software features for high performance NICs. Some of the top features include:
      • Scatter-Gather DMA
      • Automatic tuning of resources
      • Task Offloading support for IPv6
  • Hardware features for high performance NICs. Some of the top features include:
      • Task Offloading support
      • Receive-Side Scaling (RSS) support
  • Performance tools
      • NTttcp
      • Kernrate profiler

Goals

  • Use this information to tune your network driver to work with your hardware for the best networking performance
  • Use this information to fine-tune your hardware features so that they operate at their optimal performance
  • Learn how to use NTttcp to isolate network performance problems
  • Learn how to use Kernrate to identify bottlenecks on hot paths

Note: Mentions of packets apply to NDIS 5.x drivers and translate to NetBuffers and NetBufferLists for NDIS 6.0 drivers on Windows codenamed "Longhorn".

Software Optimizations

Network Software Optimizations

  • Scatter-Gather DMA
      • SG DMA yields optimum performance with the NDIS 6.0 model
      • It is highly recommended to pre-allocate the buffer hosting the SCATTER_GATHER_LIST as part of the Transmit Control Block during the initialization phase and reuse it
      • Use the maximum buffer size for the MaximumPhysicalMapping parameter of the NdisMInitializeScatterGatherDma function to avoid buffer allocation and copying
  • Using cached memory to allocate NIC receive buffers
      • x86, IA64, and x64 hardware guarantees DMA coherency, so there is no need to call IoFlushBuffer; it would become a no-op

        NdisMAllocateSharedMemory(
            pMpRxbuf->AllocSize,
            TRUE,                    // CACHED
            &pMpRxbuf->AllocVa,
            &pMpRxbuf->AllocPa);
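The pre-allocate-and-reuse advice for the Transmit Control Block can be pictured with a small user-mode model. This is only a sketch, not driver code: the `tcb`, `sg_element`, and `tcb_pool` types, the pool size, and the fragment limit are all hypothetical names and values chosen for illustration, and a real miniport would size the embedded storage from what NDIS reports for its scatter-gather list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TCB_COUNT 4          /* hypothetical number of transmit control blocks */
#define MAX_SG_ELEMENTS 8    /* hypothetical worst-case fragments per send */

/* Stand-in for one scatter-gather element: physical address plus length. */
typedef struct {
    uint64_t address;
    uint32_t length;
} sg_element;

/* Transmit control block with the scatter-gather storage embedded in it,
 * so that storage is allocated once, together with the block itself. */
typedef struct tcb {
    struct tcb *next_free;
    uint32_t sg_count;
    sg_element sg[MAX_SG_ELEMENTS];
} tcb;

typedef struct {
    tcb blocks[TCB_COUNT];
    tcb *free_list;
} tcb_pool;

/* Initialization phase: thread every block onto the free list. */
void tcb_pool_init(tcb_pool *pool) {
    assert(pool != NULL);
    pool->free_list = NULL;
    for (size_t i = 0; i < TCB_COUNT; i++) {
        pool->blocks[i].sg_count = 0;
        pool->blocks[i].next_free = pool->free_list;
        pool->free_list = &pool->blocks[i];
    }
}

/* Send path: take a ready-made block; returns NULL when none is free. */
tcb *tcb_get(tcb_pool *pool) {
    tcb *t = pool->free_list;
    if (t != NULL) {
        pool->free_list = t->next_free;
        t->next_free = NULL;
    }
    return t;
}

/* Send-complete path: reset the block and return it for reuse. */
void tcb_put(tcb_pool *pool, tcb *t) {
    t->sg_count = 0;
    t->next_free = pool->free_list;
    pool->free_list = t;
}
```

The point of the design is that the send path never allocates: it takes a block whose scatter-gather storage already exists and hands it back on completion.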

More Network Software Optimizations

  • NDIS safe APIs
      • Required for the NDIS 6.0 model!
      • They have shown overall TCP/IP improvements of up to 7% in kernel-mode scenarios (e.g. IIS 6.0)
      • They eliminate the need to call into the kernel for probing and locking buffers
      • Set the NDIS_ATTRIBUTES_USES_SAFE_BUFFER_APIS flag in NdisMSetAttributesEx for NDIS 5.x drivers. The flag does not need to be set for NDIS 6.0 drivers
      • Example: when using NdisQueryBufferSafe, set the VirtualAddress parameter to NULL to avoid mapping buffers sent down by NDIS
  • 64-bit DMA support
      • Avoid copies for addresses above the 4 GB range by setting Dma64Addresses to TRUE in NdisMInitializeScatterGatherDma
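Why 64-bit DMA avoids copies can be shown with a simplified model of the decision involved. This is a sketch under stated assumptions, not NDIS code: `needs_bounce_copy` is a hypothetical helper, and the rule it encodes (a device limited to 32-bit addressing needs data above 4 GB copied to lower memory first) is a general description of the problem rather than the documented behavior of any particular NDIS routine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FOUR_GB (1ULL << 32)

/* Returns true when a buffer would have to be copied below 4 GB before the
 * device could DMA it. dma64 models the driver having enabled 64-bit
 * addressing; with it set, no buffer needs the extra copy. */
bool needs_bounce_copy(uint64_t phys_addr, uint32_t length, bool dma64) {
    assert(length > 0);
    if (dma64) {
        return false;
    }
    /* Copy if the buffer starts at or above 4 GB, or straddles the boundary. */
    return phys_addr >= FOUR_GB || length > FOUR_GB - phys_addr;
}
```

For a 1500-byte buffer at physical address 0x100000000, the model reports a copy with 32-bit addressing and none with 64-bit addressing.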

Locking Mechanisms Optimizations

  • Locks are an expensive hit to system performance if not used properly
      • Measurements show approximately 160 cycles for a lock acquire and 140 cycles for a lock release
      • Spinlocks should be used to protect data, not code
  • Locking at DPC level
      • When already at DPC level, avoid extra code by using the following:
          • NdisDprAcquireSpinLock
          • NdisDprReleaseSpinLock
  • Reader-writer locks
      • To minimize the number of spinlock acquire and release operations, use the NDIS ReadWriteLock functions for scalability:
          • NdisInitializeReadWriteLock
          • NdisAcquireReadWriteLock
          • NdisReleaseReadWriteLock
      • Read-write locks allow multiple concurrent readers to use a single lock and limit write access to a single writer thread. No read access is allowed during a write access
      • They still behave like a spinlock and raise the IRQL to DISPATCH_LEVEL when acquired
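The measured lock costs make it easy to estimate what per-packet locking adds up to. The sketch below uses the 160-cycle and 140-cycle figures from the slide; the function name and the idea of comparing per-packet locking against one lock per batch are illustrative assumptions, not measurements from the presentation.

```c
#include <assert.h>
#include <stdint.h>

enum {
    LOCK_ACQUIRE_CYCLES = 160,  /* approximate cost quoted on the slide */
    LOCK_RELEASE_CYCLES = 140   /* approximate cost quoted on the slide */
};

/* Estimated cycles spent only on acquiring and releasing a lock while
 * handling `packets` packets, taking the lock once per `packets_per_lock`
 * packets (1 = lock around every packet). */
uint64_t lock_overhead_cycles(uint64_t packets, uint64_t packets_per_lock) {
    assert(packets_per_lock > 0);
    /* Round up: a partial batch still needs one acquire/release pair. */
    uint64_t lock_pairs = (packets + packets_per_lock - 1) / packets_per_lock;
    return lock_pairs * (LOCK_ACQUIRE_CYCLES + LOCK_RELEASE_CYCLES);
}
```

With these figures, locking around each of 64 packets costs 64 × 300 = 19,200 cycles, while taking the lock once for the whole batch costs 300.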

Auto Tuning Network Drivers

  • Static: driver and NIC hardware parameters are based on the system configuration, such as whether the machine is a client or a server, its CPU and memory, and the capabilities of the NIC
  • Dynamic: system conditions dictate what type of tuning is necessary for optimum performance. Dynamic tuning uses resource utilization and network load as metrics for determining the best operating points for the NIC and driver
  • Some of the primary auto tuning parameters include:
      • Interrupt moderation
      • Receive buffer allocation
      • Small buffer coalescing
      • Packets processed per DPC
  • Drivers can obtain current processor utilization by using the NdisGetCurrentProcessorCounts function
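Dynamic tuning of interrupt moderation might look like the policy below. This is a minimal sketch: the thresholds, the three moderation levels, and the function name are all hypothetical, and a real driver would derive its inputs from its own counters and from processor utilization rather than from plain arguments.

```c
#include <assert.h>

#define LIGHT_LOAD_PPS   10000u  /* hypothetical: below this, favor latency */
#define BUSY_CPU_PERCENT 80u     /* hypothetical: at or above this, favor CPU savings */

typedef enum {
    MODERATION_OFF,   /* interrupt per packet, lowest latency */
    MODERATION_LOW,   /* light coalescing */
    MODERATION_HIGH   /* aggressive coalescing, fewest interrupts */
} moderation_level;

/* Pick an interrupt moderation level from the two metrics the slide names:
 * resource utilization (CPU) and network load (packet rate). */
moderation_level choose_moderation(unsigned cpu_percent, unsigned packets_per_sec) {
    assert(cpu_percent <= 100);
    if (packets_per_sec < LIGHT_LOAD_PPS) {
        return MODERATION_OFF;
    }
    if (cpu_percent >= BUSY_CPU_PERCENT) {
        return MODERATION_HIGH;
    }
    return MODERATION_LOW;
}
```

A driver could evaluate such a policy periodically and reprogram the NIC only when the chosen level changes.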

Hardware Optimizations

Task Offload Support

  • Checksum Offload
      • It has been shown to improve overall TCP/IP performance by up to 20%
      • It improves caching effect and eliminates churning: 8% increase
      • It reduces code path length: 12% improvement
  • TCP Segmentation Offload
      • It has been shown to improve overall TCP/IP performance by up to 11%
      • It reduces the sender's cycles-per-byte cost by 2x (it goes below 1.5)
      • NDIS 6.0 has support for its successor, Giant Send Offload (> 64K)
      • NDIS 6.0 has IPv6 support for TCP Segmentation Offload
  • NDIS 6.0 offers support for IPsec Offload
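To make concrete what checksum offload takes off the host CPU, here is the standard Internet checksum (RFC 1071) that IP, TCP, and UDP use, written in portable C. It is a sketch of the per-byte work involved, not code from the presentation; the function name is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum: the one's complement of the one's complement
 * sum of the data taken as big-endian 16-bit words. */
uint16_t internet_checksum(const uint8_t *data, size_t len) {
    assert(data != NULL || len == 0);
    uint64_t sum = 0;
    size_t i;
    for (i = 0; i + 1 < len; i += 2) {
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    }
    if (i < len) {
        /* Odd length: pad the final byte with a zero low byte. */
        sum += (uint32_t)data[i] << 8;
    }
    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16) {
        sum = (sum & 0xFFFF) + (sum >> 16);
    }
    return (uint16_t)~sum;
}
```

Every byte of every packet passes through a loop like this when the host computes the checksum, which is why moving it to the NIC shortens the code path and reduces cache churn.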

Message Signaled Interrupts (MSI)

  • MSI has the following attributes:
      • No acknowledgment is necessary for the message
      • No sharing is usually necessary
      • There is support for many interrupts per PCI function
      • Caveat: it only works on P4 and later chipsets
  • Advantages of MSI
      • With no sharing in place, latency is lower because only a single ISR runs
      • Bus utilization goes down because some read operations from the device are eliminated
      • The device can target interrupts at designated processors (e.g. RSS)
      • It guarantees data buffer coherency because the message follows DMA traffic on the bus

Receive Side Scaling (RSS)

  • The existing stack limits receive processing to one CPU
      • This restricts the scalability of a Web server to the number of short-lived connections a single CPU can process (per NIC)
      • It limits transaction throughput to the packet receive processing rate of one CPU
      • Example: a four-processor machine cannot use more than 25% of its overall CPU cycles when hosting a single NIC on the system
  • RSS helps both long and short-lived connections
      • At times when CPU processing is dominated by connection setup, RSS improves performance
      • Connection setup tasks map well to a general purpose CPU
  • RSS gives us parallel receive processing = parallel DPCs
  • Planned availability in Windows Server 2003 Network Scalable Pack Add-on and Longhorn
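The slides do not describe how RSS picks a processor for a packet. As commonly specified, RSS computes a Toeplitz hash over the connection's addresses and ports and uses the low bits of the hash to index an indirection table of CPU numbers. The sketch below follows that scheme; the table size, the round-robin table fill, and all function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RSS_KEY_BYTES 40
#define INDIRECTION_ENTRIES 128  /* hypothetical table size, a power of two */

/* Toeplitz hash: for every set bit of the input (most significant bit
 * first), XOR in the 32-bit window of the key that starts at that bit. */
uint32_t toeplitz_hash(const uint8_t key[RSS_KEY_BYTES],
                       const uint8_t *input, size_t len) {
    assert(len + 4 <= RSS_KEY_BYTES);
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8) | (uint32_t)key[3];
    uint32_t hash = 0;
    size_t next_key_bit = 32;
    for (size_t i = 0; i < len; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            if (input[i] & (1u << bit)) {
                hash ^= window;
            }
            /* Slide the window one bit along the key. */
            window <<= 1;
            if (key[next_key_bit / 8] & (0x80u >> (next_key_bit % 8))) {
                window |= 1u;
            }
            next_key_bit++;
        }
    }
    return hash;
}

/* Spread the table entries across the available CPUs round-robin. */
void rss_fill_table(uint8_t table[INDIRECTION_ENTRIES], unsigned cpu_count) {
    assert(cpu_count > 0 && cpu_count <= 256);
    for (unsigned i = 0; i < INDIRECTION_ENTRIES; i++) {
        table[i] = (uint8_t)(i % cpu_count);
    }
}

/* The low bits of the hash select the table entry, which names the CPU. */
unsigned rss_select_cpu(uint32_t hash, const uint8_t table[INDIRECTION_ENTRIES]) {
    return table[hash & (INDIRECTION_ENTRIES - 1)];
}
```

Because the hash depends only on the connection's addresses and ports, every packet of one connection lands on the same CPU, which keeps per-connection processing in order while different connections run their DPCs in parallel.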

Receive Side Scaling

[Figure: receive processing today (one processor per NIC) compared with Receive Side Scaling (multiple processors per NIC). Diagram labels: NIC, Parallel Receive Packet Queues, ISR, DPC, NDIS, CPU 0, CPU 1, CPU 2.]

Network Performance Tools

  • NTttcp benchmark
      • Uses publicly available Winsock 2.x APIs
      • Uses an overlapped I/O and multithreading model
      • Transfers random data from memory to memory
      • Provides throughput, CPU, and interrupt rate
      • Provides a cycles-per-byte metric, which is key for measuring performance to catch regressions
      • Provides the packet-to-ACK ratio to detect link condition
      • Provides the number of segment retransmits and errors
      • Supports all Windows hardware architectures
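The slides do not say how NTttcp derives its cycles-per-byte figure. One plausible definition, shown here as an assumption, divides the CPU cycles consumed during the run (utilization × processor count × clock rate × duration) by the bytes transferred. The function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles consumed per byte transferred.
 * cpu_utilization is the average across all processors, from 0.0 to 1.0. */
double cycles_per_byte(double cpu_utilization, unsigned cpu_count,
                       double cpu_hz, double seconds, uint64_t bytes) {
    assert(bytes > 0);
    assert(cpu_utilization >= 0.0 && cpu_utilization <= 1.0);
    double cycles_used = cpu_utilization * cpu_count * cpu_hz * seconds;
    return cycles_used / (double)bytes;
}
```

Under this definition, two 2 GHz processors at 50% average utilization moving 10^10 bytes in 10 seconds come to 2 cycles per byte. Because the metric normalizes CPU cost by the work done, it stays comparable across runs with different throughput, which is what makes it useful for catching regressions.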

NTttcp Output for a Single Thread

NTttcp Output for Multiple Threads

More Network Performance Tools

  • Kernrate profiling tool
      • General purpose profiler for tracking CPU utilization
      • Samples periodically (programmable) to see what is executing
      • Adjustable granularity
      • Per-processor, per-process, and total
      • Supports all Windows hardware architectures
      • Supports Windows 2000 and beyond
      • Highly customizable (numerous options)
      • The profiling tool and its viewer (KrView) can be downloaded from: http://www.microsoft.com/whdc/system/sysperf/krview.mspx

Call To Action

  • NDIS 6.0 driver developers need to implement Task Offloading support for IPv6
  • Fine-tune your hardware so it operates at its optimal performance point
  • Fine-tune your network driver to work optimally with your hardware for best performance
  • For questions, please e-mail ndis6fb@microsoft.com. Please include your name, company name, and phone number

Additional Resources

  • Email: ndis6fb@microsoft.com
  • Web resources:
      • Analyzing Driver Performance: http://www.microsoft.com/whdc/driver/perform/drvperf.mspx
      • High Performing Adapters and Drivers whitepaper: http://www.microsoft.com/whdc/device/network/NetAdapters-Drvs.mspx
      • Kernrate is available for download from: http://www.microsoft.com/whdc/system/sysperf/krview.mspx

© 2005 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.