Performance Considerations for Packet Processing on Intel Architecture


Performance Considerations for Packet Processing on Intel® Architecture
Patrick Lu, Intel DCG/NPG/ASE
Acknowledgement: Roman Dementiev, John DiGiglio, Andi Kleen, Maciek Konstantynowicz, Sergio Gonzalez Monroy, Shrikant Shah, George Tkachuk, Vish Viswanathan, Jeremy Williamson

Agenda
• Uncore Performance Monitoring with Hardware Counters
• Measure PCIe Bandwidth
• Measure Memory Bandwidth
• Core Performance Monitoring
• IPC revisited
• Intel® Processor Trace for Performance Optimization

Terminology sync:
• Core = execution units (ALU) + L1 cache + L2 cache
• "Un"core = everything not core: LLC, memory controller, IIO controller, QPI controller, and more!

Traditional performance analysis is core-centric. We need new methodologies to analyze I/O-centric workloads.

Uncore-Side Performance Considerations
• PCIe bandwidth
• Memory bandwidth
• QPI bandwidth (see Backup)

Open-Source Uncore Performance Tool Options
Processor Counter Monitoring (PCM) – on GitHub*
• real-time display of metrics and bandwidth
• user-space tools
• cross-platform (Linux*, Windows*, Mac OS X*)
Linux* perf + pmu-tools – in the Linux* kernel
• standalone Linux* perf utility + wrapper scripts
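For context, a rough sketch of getting both tool sets onto a Linux box (the clone URLs are the ones listed in the References; the build step for PCM is an assumption to verify against that project's README):
    git clone https://github.com/opcm/pcm && make -C pcm    # builds pcm.x, pcm-memory.x, pcm-pcie.x, ...
    git clone https://github.com/andikleen/pmu-tools        # toplev.py, ucevent.py run directly as Python scripts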

Measure PCIe Bandwidth

Measure PCIe Bandwidth
Processor Counter Monitoring (PCM): ./pcm-pcie.x -e
• Network Rx (ItoM/RFO) = inbound PCIe write
• Network Tx (PCIeRdCur) = inbound PCIe read
• MMIO Read (PRd) / MMIO Write (WiL) = outbound CPU read/write
Optimization guidance: optimize batch size, match workload I/O throughput¹, avoid DDIO misses, avoid MMIO reads, reduce RFOs.
1. Results are reported as a number of cache lines (64 bytes).
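Since pcm-pcie.x reports results in cache lines, converting to bandwidth is a single multiplication. As an illustrative (made-up) example: 39,000,000 ItoM/RFO cache lines per second × 64 bytes ≈ 2.5 GB/s of inbound PCIe writes, i.e. roughly 20 Gb/s of packet data being DMA'd toward the LLC/memory.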

Measure Memory Bandwidth

Measure Memory Bandwidth
Processor Counter Monitoring (PCM): ./pcm-memory.x
No memory traffic is the best case. Common reasons for memory traffic, and how to respond:
1) Wrong NUMA allocation → fix it in the code!
2) DDIO miss → receive packets ASAP; descriptor ring sizes need tuning
3) CPU data structures → may be okay, but check for latency!
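For the "wrong NUMA allocation" case, placement can be checked and corrected from the shell before touching the code. A minimal sketch, assuming a Linux system with numactl installed (the PCI address, node number, and application name are placeholders):
    cat /sys/bus/pci/devices/0000:3d:00.0/numa_node   # which socket the NIC is attached to
    numactl --cpunodebind=0 --membind=0 ./my_app      # run threads and allocate memory on that socket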

Uncore Performance Takeaway
Easy:
• Calibrate I/O bandwidth against expected performance
• Check for unexpected cross-NUMA traffic
• Avoid MMIO reads; batch MMIO writes
Advanced:
• Optimize software to make use of DDIO

Core-Side Performance Considerations
• IPC
• Intel® Processor Trace

Which Workload Has the Best Performance?
IPC: VPP A = 2.63, VPP B = 2.12

Workload Performance Reveal: A = un-optimized VPP IPSec, B = Intel® AES-NI accelerated VPP IPSec

          IPC     Instructions/packet    Cycles/packet
VPP A     2.63    15733                  5982
VPP B     2.12    1865                   880

~8x reduction in instructions, ~6x reduction in cycles

Performance Is Determined by Two Factors
The Instructions-per-Cycle (IPC) mystery: Instructions/pkt ÷ Cycles/pkt = IPC

Metric                    Measured/Derived    Value
Instructions / sec (M)    Measured            5229
Cycles / sec (M)          Measured            2103
IPC                       Derived             2.49
CPU freq. (MHz)           Measured            2100
Packets / sec (M)         Measured            9.7
Instructions / packet     Derived             539
Cycles / packet           Derived             217

Unless IPC is really poor, instructions/packet and cycles/packet are the more realistic core metrics.
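A quick sanity check with the measured numbers above: instructions/packet = 5229 M ÷ 9.7 M ≈ 539, cycles/packet = 2103 M ÷ 9.7 M ≈ 217, and IPC = 5229 ÷ 2103 ≈ 539 ÷ 217 ≈ 2.49, so the derived columns are internally consistent.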

Intel® Processor Trace (Intel® PT) Components
[Block diagram] Intel PT hardware in each CPU (0..n) produces a packet log per logical processor. A ring-0 agent (OS, VMM, BIOS, driver, …) configures and enables Intel PT and supplies runtime data, including:
• map linear addresses to image files
• map CR3 values to applications
• log module load/unload and JIT info
A software decoder in Intel PT-enabled tools combines the packet log, the binary image files, and this runtime data to reconstruct the precise execution flow. Intel® Processor Trace can reveal the exact execution path.

Linux* perf and Intel® Processor Trace
• Provides the basic framework for trace collection, mining, and visualization
  • Kernel driver in 4.1, user tools in 4.2/4.3
• Supports full-trace and "snapshot" modes
• Display tools support histograms, event lists, flame graphs, etc.
Linux perf can capture the finest detail with Intel® PT; post-processing is required to reduce the trace data.
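As a minimal collection sketch with Linux perf (the workload name is a placeholder, and the exact --itrace decoding options should be checked against the intel-pt perf documentation linked in the References):
    perf record -e intel_pt//u -- ./my_workload   # capture a user-space Intel PT trace
    perf report                                   # histogram view over the decoded trace
    perf script --itrace=i100ns                   # synthesized instruction events for fine-grained listings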

Life Cycle of a Packet using Intel® PT (VPP L2 bridge domain)
[Bar chart: instructions/packet (0–100) for each function on the packet path, from i40e_recv_scattered_pkts_vec (Rx) through the dpdk/ethernet/l2 input, processing, and output nodes to i40e_xmit_pkts_vec (Tx); bar values include 87, 49, 50, 52, 82, 47, 23, 30]
• Rx: 256 packets = 32-packet rx burst × 8 loops
• Packet processing functions sit between Rx and Tx
• Tx: 256 packets = 16-packet tx burst × 16 loops
• Annotation: "not the most optimal driver path"

Core Performance Takeaway
1. Use the perfmon hardware to baseline workload performance: PCM, Linux* perf (pmu-tools/{toplev.py, ucevent.py}, intel_pt)
2. Reduce the instruction count per packet
   • Create a more direct VPP path to process packets
   • Consider hardware offload
3. Reduce cycles per packet
   • Follow the uncore performance takeaways to reduce latency
   • Measure CPU L1 cache miss latency and look for opportunities to prefetch (not covered)
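For step 1, a hypothetical baseline run with the pmu-tools wrapper might look like this (the application name is a placeholder; option details vary with kernel and CPU, see the pmu-tools README):
    ./toplev.py -l1 ./my_packet_app    # top-level TMAM breakdown: frontend, backend, bad speculation, retiring
Raising the level (e.g., -l3) drills further down the Top-Down Analysis Methodology tree mentioned in the backup slides.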

References
• Xeon® E5 v4 Uncore Performance Monitoring Unit (PMU) guide: https://software.intel.com/en-us/blogs/2014/07/11/documentation-for-uncore-performance-monitoring-units
• Intel® 64 and IA-32 Architectures Optimization Reference Manual: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
• Processor Counter Monitoring (PCM): https://github.com/opcm/pcm
• pmu-tools: https://github.com/andikleen/pmu-tools
• Intel® Processor Trace Linux* perf support: https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/intel-pt.txt

Backup

Measure QPI (Cross-Socket) Bandwidth

Measure QPI (Cross-Socket) Bandwidth
Processor Counter Monitoring (PCM): ./pcm.x -nc
Any QPI traffic will inevitably hurt latency, if not throughput. QPI traffic is determined by the placement of CPU, memory, and I/O. If it is unavoidable, watch the link utilization.
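Before measuring, the socket topology itself can be inspected with standard tools so that CPU, memory, and I/O placement can be reasoned about (a sketch; node numbering and counts vary by platform):
    lscpu | grep -i numa     # NUMA node count and which cores belong to each node
    numactl --hardware       # per-node memory sizes and node distances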

Top-Down Analysis Methodology (TMAM)

Linux perf + pmu-tools
Perfmon device abstraction layers reside in /sys/devices.

Measure memory bandwidth with Linux perf:
    ./ucevent.py --scale MEGA iMC.MEM_BW_READS iMC.MEM_BW_WRITES

Measure PCIe bandwidth with Linux perf (events pending upstream):
    # Events not upstreamed yet
    ./ucevent.py --scale MEGA CBO.CPU_OUTBOUND_MMIO_READ CBO.CPU_OUTBOUND_MMIO_WRITE CBO.PCIE_INBOUND_READ_HIT_DDIO CBO.PCIE_INBOUND_READ_MISS_DDIO CBO.PCIE_INBOUND_WRITE_HIT_DDIO CBO.PCIE_INBOUND_WRITE_MISS_DDIO

Measure QPI bandwidth with Linux perf:
    ./ucevent.py --scale MEGA QPI_LL.QPI_DATA_BW QPI_LL.QPI_LINK_BW
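To see the abstraction layer the slide refers to, the uncore perfmon devices can be listed directly; on a Xeon E5-class server they typically show up as uncore_* entries (exact names vary by CPU generation):
    ls /sys/devices | grep uncore    # e.g. uncore_imc_*, uncore_cbox_*, uncore_qpi_*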