Performance Considerations for Packet Processing on Intel Architecture
- Slides: 26
Performance Considerations for Packet Processing on Intel® Architecture
Patrick Lu, Intel DCG/NPG/ASE
Acknowledgement: Roman Dementiev, John DiGiglio, Andi Kleen, Maciek Konstantynowicz, Sergio Gonzalez Monroy, Shrikant Shah, George Tkachuk, Vish Viswanathan, Jeremy Williamson
Agenda
• Uncore Performance Monitoring with Hardware Counters
• Measure PCIe Bandwidth
• Measure Memory Bandwidth
• Core Performance Monitoring
• IPC revisited
• Intel® Processor Trace for Performance Optimization
“un”core (everything not core): LLC cache, memory controller, IIO controller, QPI controller, and more!
Core: ALU, L1 cache, L2 cache
Terminology sync:
Core = execution units (ALU) + L1 cache + L2 cache
Uncore = everything not core
Traditional performance analysis is core-centric. We need new methodologies to analyze I/O-centric workloads.
Uncore-Side Performance Considerations
PCIe bandwidth
Memory bandwidth
QPI bandwidth (backup)
Open-Sourced Uncore Performance Tool Options
Processor Counter Monitor (PCM) – on GitHub*
• real-time display of metrics, bandwidth
• user-space tools
• cross-platform (Linux*, Windows*, Mac OS X*)
Linux* perf + pmu-tools – in Linux* kernel
• stand-alone Linux* perf utility + wrapper
Measure PCIe Bandwidth 7
Measure PCIe Bandwidth
Processor Counter Monitor (PCM): ./pcm-pcie.x -e
Events: Network Rx (ItoM/RFO) – inbound PCIe write; Network Tx (PCIeRdCur) – inbound PCIe read; MMIO Read (PRd) / MMIO Write (WiL) – outbound CPU read/write
Optimizations: optimize batch size, match workload I/O throughput (1), avoid DDIO miss, avoid MMIO Rd, reduce RFO
1. Results are in number of cache lines (64 bytes)
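Since pcm-pcie.x reports event counts in 64-byte cache lines rather than bytes, converting them to bandwidth takes one multiplication. A minimal sketch of that conversion; the event count and interval used here are hypothetical, not taken from the slides:

```python
# pcm-pcie.x counts are in cache lines; each cache line is 64 bytes.
CACHE_LINE_BYTES = 64

def lines_to_mbps(cache_lines, interval_s=1.0):
    """Cache-line count observed over an interval -> bandwidth in MB/s."""
    return cache_lines * CACHE_LINE_BYTES / interval_s / 1e6

# Hypothetical example: 20 million inbound PCIe write events (ItoM)
# counted over one second on the Rx path.
print(lines_to_mbps(20_000_000))  # 1280.0 (MB/s)
```

Comparing this figure against the NIC's expected line rate is one way to do the "calibrate I/O bandwidth with expected performance" check from the takeaway slide.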
Measure Memory Bandwidth 9
Measure Memory Bandwidth
Processor Counter Monitor (PCM): ./pcm-memory.x
No memory traffic: best.
Reasons for memory traffic, and what to do about them:
1) Wrong NUMA allocation – fix it in the code!
2) DDIO miss – receive packets ASAP; descriptor ring sizes need tuning
3) CPU data structures – may be okay, but check for latency!
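A DDIO miss turns what should be LLC-only traffic into DRAM traffic. A rough sketch of the extra memory bandwidth to expect in that case, assuming (this factor is an assumption, not from the slides) one DRAM write on the miss plus one read when the core later touches the packet:

```python
def ddio_miss_mem_bw_mbps(packets_per_sec, bytes_per_packet):
    """Rough extra memory bandwidth (MB/s) if every Rx packet misses DDIO.

    Assumes 2 DRAM transfers per packet: one write on the DDIO miss,
    one read when the core consumes the data. Actual traffic depends
    on access patterns and evictions.
    """
    return 2 * packets_per_sec * bytes_per_packet / 1e6

# Hypothetical example: 10 Mpps of 64-byte packets, all missing DDIO.
print(ddio_miss_mem_bw_mbps(10_000_000, 64))  # 1280.0 (MB/s)
```

If pcm-memory.x shows bandwidth of this magnitude on an I/O workload that should fit in LLC, DDIO misses (reason 2 above) are a prime suspect.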
Uncore Performance Takeaway
Easy:
• Calibrate I/O bandwidth against expected performance
• Check for unexpected cross-NUMA traffic
• Avoid MMIO Read
• Batch MMIO Write
Advanced:
• Optimize software to make use of DDIO
Core-Side Performance Considerations
IPC
Intel® Processor Trace
Which Workload Has the Best Performance?
IPC: VPP A 2.63, VPP B 2.12
Workload Performance Revealed: A: un-optimized VPP IPSec; B: Intel® AES-NI accelerated VPP IPSec
IPC: VPP A 2.63, VPP B 2.12
Instructions/packet: VPP A 15733, VPP B 1865 (8x reduction in instructions)
Cycles/packet: VPP A 5982, VPP B 880 (6x reduction in cycles)
Performance Is Determined by Two Factors
The Instruction-per-Cycle (IPC) mystery: IPC = Instructions/pkt ÷ Cycles/pkt
Measured Instructions/sec (M): 5229
Measured Cycles/sec (M): 2103
Derived IPC: 2.49
Measured CPU freq. (MHz): 2100
Measured Packets/sec (M): 9.7
Derived Instructions/packet: 539
Derived Cycles/packet: 217
Unless IPC is really poor, instructions/packet or cycles/packet are the more realistic core metrics.
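The derived numbers in the table follow directly from the measured ones; a quick sketch reproducing them:

```python
# Measured values from the table (millions, so units cancel cleanly).
instructions_per_sec = 5229e6
cycles_per_sec       = 2103e6
packets_per_sec      = 9.7e6

ipc            = instructions_per_sec / cycles_per_sec   # ~2.49
instr_per_pkt  = instructions_per_sec / packets_per_sec  # ~539
cycles_per_pkt = cycles_per_sec / packets_per_sec        # ~217

# IPC can equivalently be derived per packet: instr/pkt ÷ cycles/pkt.
assert abs(ipc - instr_per_pkt / cycles_per_pkt) < 1e-9
print(round(ipc, 2), round(instr_per_pkt), round(cycles_per_pkt))  # 2.49 539 217
```

This is why a higher IPC alone (VPP A's 2.63 vs. B's 2.12) says nothing about which workload moves more packets: the per-packet metrics carry the real signal.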
Intel® Processor Trace (Intel® PT) Components
Intel® PT packet log, binaries, and software runtime data are used to reconstruct the precise execution flow – Intel® PT can reveal the exact execution path.
Components: Intel PT HW on each CPU (configured and enabled by Intel PT-enabled tools), a per-logical-processor Intel PT packet log, a software decoder, and binary image files.
Runtime data from the Intel® Processor Trace ring-0 agent (OS, VMM, BIOS, driver, …), including:
• Map linear address to image files
• Map CR3 value to application
• Log module load/unload and JIT info
Linux* perf and Intel® Processor Trace
Provides a basic framework for trace collection, mining, and visualization
§ Kernel driver in 4.1, user tools in 4.2/4.3
Supports full-trace and “snapshot” modes
Display tools support histograms, event lists, flame graphs, etc.
Linux perf can capture the finest detail with Intel® PT; post-processing is required to reduce trace data.
Life Cycle of a Packet using Intel® PT – VPP L2 bridge domain
[Chart: instructions/packet (0–100) per function along the packet path, from the i40e_recv_scattered_pkts_vec Rx functions through dpdk_input, ethernet_input, the l2 input/forwarding/output nodes (avx2 variants), and interface output, to the i40e_xmit_pkts_vec Tx functions; bar values include 87, 49, 50, 52, 82, 47, 23, 30]
Rx: 256 packets = 32-packet rx burst * 8 loops (annotated: not the most optimal driver path)
Packet processing functions
Tx: 256 packets = 16-packet tx burst * 16 loops
Core Performance Takeaway
1. Use perfmon hardware to baseline workload performance: PCM, Linux* perf (pmu-tools/{toplev.py, ucevent.py}, intel_pt)
2. Reduce instruction count per packet: create a more direct VPP path to process packets; consider hardware offload
3. Reduce cycles per packet: follow the uncore performance takeaway to reduce latency; measure CPU L1 cache-miss latency and look for opportunities to prefetch (not covered)
References
• Xeon® E5 v4 Uncore Performance Monitoring Unit (PMU) guide: https://software.intel.com/en-us/blogs/2014/07/11/documentation-for-uncore-performance-monitoring-units
• Intel® 64 and IA-32 Architectures Optimization Reference Manual: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
• Processor Counter Monitor (PCM): https://github.com/opcm/pcm
• pmu-tools: https://github.com/andikleen/pmu-tools
• Intel® Processor Trace Linux* perf support: https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/intel-pt.txt
Backup
Measure QPI (Cross-Socket) Bandwidth
Measure QPI (cross-socket) Bandwidth
Processor Counter Monitor (PCM): ./pcm.x -nc
Any QPI traffic will inevitably hurt latency, if not throughput.
QPI traffic is determined by the placement of CPU, memory, and I/O.
If unavoidable, watch the link utilization.
Top-Down Analysis Methodology (TMAM)
Linux perf + pmu-tools
Perfmon device abstraction layers reside in /sys/devices
Measure memory bandwidth with Linux perf:
./ucevent.py --scale MEGA iMC.MEM_BW_READS iMC.MEM_BW_WRITES
Measure PCIe bandwidth with Linux perf (events pending upstream, not upstreamed yet):
./ucevent.py --scale MEGA CBO.CPU_OUTBOUND_MMIO_READ CBO.CPU_OUTBOUND_MMIO_WRITE CBO.PCIE_INBOUND_READ_HIT_DDIO CBO.PCIE_INBOUND_READ_MISS_DDIO CBO.PCIE_INBOUND_WRITE_HIT_DDIO CBO.PCIE_INBOUND_WRITE_MISS_DDIO
Measure QPI bandwidth with Linux perf:
./ucevent.py --scale MEGA QPI_LL.QPI_DATA_BW QPI_LL.QPI_LINK_BW