Re-architecting the Data Center: Intel® Xeon® Scalable Processor Products
A Glimpse Inside the Intel® Xeon® Scalable Platform
Advancing virtually every aspect of performance, security, and agility: a brand-new core, cache, on-die interconnects, memory controller, and more.
• Fabric: Intel® Omni-Path Architecture
• Networking: Intel® Ethernet
• Accelerators: Intel® QuickAssist Technology, Intel® AVX-512
• SSDs: Intel® Optane™ SSD DC P4800X
• FPGA: Intel® FPGA (complementary and integrated options)
• Workload-optimized frameworks and telemetry (e.g., Caffe*, Intel® DAAL, Intel® MKL, DPDK, SNAP*, SPDK)
Abbreviations: Intel® AVX-512 = Intel® Advanced Vector Extensions 512; Intel® DAAL = Intel® Data Analytics Acceleration Library; DPDK = Data Plane Development Kit; Intel® MKL = Intel® Math Kernel Library; Intel® RDT = Intel® Resource Director Technology; SPDK = Storage Performance Development Kit; Intel® VMD = Intel® Volume Management Device.
Breakthrough CPU Design: Intel® Mesh Architecture
Ring architecture (2009-2017+) vs. mesh architecture (new in 2017). The mesh is designed for modern virtualized and hybrid cloud implementations and for next-generation data centers:
• Maximizes performance
• Enables consistent, low latencies
• Optimized for data sharing and memory access between all CPU cores/threads, for ideal memory bandwidth and capacity
• Data flows scale efficiently for 2-, 4-, and 8+-socket configurations
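One way to see why a mesh helps with latency consistency is to count hops. The sketch below is a simplified topology model, not Intel's actual die layout or routing: it assumes 36 hypothetical stops arranged either as a single bidirectional ring or as a 6x6 mesh with dimension-order (X then Y) routing, and compares the average number of hops between distinct stops.

```python
from itertools import product


def ring_avg_hops(n):
    """Average shortest-path hop count between distinct stops on a bidirectional ring."""
    total = 0
    for a in range(n):
        for b in range(n):
            if a != b:
                d = abs(a - b)
                total += min(d, n - d)  # go whichever way around the ring is shorter
    return total / (n * (n - 1))


def mesh_avg_hops(rows, cols):
    """Average hop count between distinct stops on a 2D mesh with X-then-Y routing."""
    nodes = list(product(range(rows), range(cols)))
    total = 0
    for r1, c1 in nodes:
        for r2, c2 in nodes:
            if (r1, c1) != (r2, c2):
                # dimension-order routing travels exactly the Manhattan distance
                total += abs(r1 - r2) + abs(c1 - c2)
    n = len(nodes)
    return total / (n * (n - 1))


if __name__ == "__main__":
    # Hypothetical 36-stop interconnect: one ring vs. a 6x6 mesh.
    print(f"ring: {ring_avg_hops(36):.2f} average hops")    # ring: 9.26 average hops
    print(f"mesh: {mesh_avg_hops(6, 6):.2f} average hops")  # mesh: 4.00 average hops
```

Under the same assumptions the worst case also shrinks, from 18 hops halfway around the ring to 10 hops corner to corner on the mesh.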
Intel® Xeon® Scalable Processors: The Foundation for Agile, Secure, Workload-Optimized Hybrid Cloud
• Platinum (best): up to 28 cores; 2-, 4-, and 8+-socket support; 3 Intel® UPI links; DDR4-2666 with topline memory channel bandwidth; up to 1.5 TB of memory; hardware-enhanced security; advanced reliability, availability, and serviceability (RAS)
• Gold (better): up to 22 cores; 2- and 4-socket support; scalable, high-throughput mainstream performance; advanced RAS
• Silver (good): efficient performance at low power for moderate workloads; Intel® Turbo Boost Technology and Intel® Hyper-Threading Technology; standard RAS
• Bronze (entry): entry performance for light workloads; standard RAS
Processors: An Agile Solution Stack for Data Center Workloads
• Intel® Xeon® Bronze processor, 31xx series (2 sockets): entry performance and security for price-sensitive deployments
  • Up to 8 cores
  • 2S configuration
  • Improved socket interconnect (Intel® UPI) over the previous generation
  • 48 PCIe 3.0 lanes
  • Intel® AVX-512 enabled
  • Standard RAS features
• Intel® Xeon® Silver processor, 41xx series (2 sockets): efficient performance at low power; more horsepower for single-purpose workloads
  + Up to 12 cores
  + 2S configuration with improved memory channel performance
  + Intel® Turbo Boost Technology for higher frequency capability
  + Intel® Hyper-Threading Technology for hyper-threaded workloads
• Intel® Xeon® Gold processor, 51xx series (2 and 4 sockets): mainstream performance, fast memory, more interconnect engines, advanced reliability
  + Up to 14 cores
  + 2S and 4S configurations for increased scalability
  + Increased interconnect speed to boost data flow in multiprocessor workloads
  + Advanced RAS features
• Intel® Xeon® Gold processor, 61xx series (2 and 4 sockets): better performance, interconnectivity, scalability, and memory
  + Up to 22 cores
  + A third UPI link for increased data flow between sockets
  + Increased performance across memory channels
  + Intel® AVX-512 with an additional FMA unit
  + Node controller support to assist in scaled node management
• Intel® Xeon® Platinum processor, 81xx series (2, 4, and 8 sockets): the best performance, scalability, and core options, plus all hardware-enhanced security features, for the most robust capability
  + Up to 28 cores
  + 2-, 4-, or 8-socket configurations for best performance and scalability
  + Topline memory channel performance (1.5 TB of memory capacity per socket on select SKUs)
  + 3 UPI links across 2S, 4S, and 8S for improved scalability and inter-socket data flow
Intel® Xeon® Scalable Processors: Capability and Feature Support (Pervasive Performance and Security)

                            Bronze      Silver      Gold        Gold        Platinum
                            (3100)      (4100)      (5100)      (6100)      (8100)
Highest core count          8           12          14          22          28
Highest frequency           1.7 GHz     2.2 GHz     3.6 GHz     3.4 GHz     3.6 GHz
  (cores/TDP)               (8C/85W)    (10C/85W)   (4C/105W)   (6C/115W)   (4C/105W)
CPU sockets supported       Up to 2     Up to 2     Up to 4     Up to 4     Up to 8
Intel® UPI links            2           2           2           3           3
Intel® UPI speed (GT/s)     9.6         9.6         10.4        10.4        10.4
AVX-512 FMA units           1           1           1 (1)       2           2
DDR4 speed (MHz)            2133        2400        2400 (1)    2666        2666
Max memory per socket       768 GB      768 GB      768 GB      768 GB      768 GB
  on select SKUs (2)        -           -           -           1.5 TB      1.5 TB
RAS capability              Standard    Standard    Advanced    Advanced    Advanced

(1) Select Gold 5100 processors support DDR4-2666 and two 512-bit FMA units.
(2) Select SKUs support 1.5 TB of memory capacity per socket.

The table also tracks support for the following capabilities across the tiers:
• Fabric, storage, and I/O: Intel® Omni-Path Architecture (discrete and integrated); Intel® QuickAssist Technology (integrated and discrete); Intel® Optane™ technology-based SSDs (3D XPoint™); Intel® SSD Data Center Family (3D NAND); PCIe 3.0 (48 lanes); Intel® QuickData Technology (CBDMA); Non-Transparent Bridge (NTB)
• Performance: Intel® Turbo Boost Technology 2.0; Intel® Hyper-Threading Technology; node controller support
• RAS: Intel® Run Sure Technology
• Agility and efficiency: Intel® Volume Management Device (VMD); Intel® Resource Director Technology (RDT); Intel® Speed Shift Technology; Intel® Node Manager 4.0
• Security: Mode-based Execute Control (MBE); Intel® Key Protection Technology (KPT) with integrated Intel® QAT; Intel® Platform Trust Technology (PTT); Intel® TXT with One-Touch Activation (OTA)
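To make the tier boundaries concrete, the sketch below encodes the main rows of the capability table and picks the lowest tier whose maximums cover a requirement. It is an illustration only: the `Tier` class and `lowest_tier` helper are hypothetical, the 1.5 TB entries are treated as 1536 GB even though they apply only to select SKUs, and footnote exceptions such as two FMA units on select Gold 5100 parts are ignored. Real selection depends on individual SKU specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    max_cores: int
    max_sockets: int
    upi_links: int
    ddr4_mts: int
    fma_units: int
    max_mem_per_socket_gb: int


# Transcribed from the capability table; 1536 GB = 1.5 TB (select SKUs only).
TIERS = [
    Tier("Bronze 3100", 8, 2, 2, 2133, 1, 768),
    Tier("Silver 4100", 12, 2, 2, 2400, 1, 768),
    Tier("Gold 5100", 14, 4, 2, 2400, 1, 768),
    Tier("Gold 6100", 22, 4, 3, 2666, 2, 1536),
    Tier("Platinum 8100", 28, 8, 3, 2666, 2, 1536),
]


def lowest_tier(cores, sockets, mem_per_socket_gb=0, fma_units=1):
    """Return the name of the lowest tier whose table maximums cover the request, or None."""
    for tier in TIERS:  # TIERS is ordered from entry tier to top tier
        if (tier.max_cores >= cores
                and tier.max_sockets >= sockets
                and tier.max_mem_per_socket_gb >= mem_per_socket_gb
                and tier.fma_units >= fma_units):
            return tier.name
    return None


if __name__ == "__main__":
    print(lowest_tier(cores=10, sockets=2))              # Silver 4100
    print(lowest_tier(cores=8, sockets=4))               # Gold 5100
    print(lowest_tier(cores=4, sockets=2, fma_units=2))  # Gold 6100
    print(lowest_tier(cores=24, sockets=8))              # Platinum 8100
```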
Features for HPC, Enterprise, Cloud, and Comms Service Providers
Intel® Xeon® Scalable Processors for Technical Computing (HPC)
Powerful, balanced performance for diverse HPC workloads, an integrated interconnect, and a converged parallel programming environment.
Powerful performance
• Up to 28 cores vs. 24 cores/22 cores (on the Intel® Xeon® processor E7 v4 / Intel Xeon processor E5-2600 v4 families)
• Intel® AVX-512 delivers up to 2x FLOPs per clock cycle of peak performance capability, optimized for HPC, data analytics, and cryptography workloads (1)
• The new Intel® Mesh architecture with 3 Intel® Ultra Path Interconnect (UPI) links provides greater inter-CPU bandwidth for the most data-hungry, latency-sensitive applications
Significantly increased memory and I/O bandwidth
• Up to 1.5x gen-to-gen memory bandwidth increase per CPU (6 memory channels) for extremely large compute- and data-intensive workloads
• More I/O bandwidth with 48 PCIe 3.0 lanes vs. 40 lanes on the Intel Xeon processor E5-2600 v4
• Intel® Optane™ and Intel® 3D NAND solid state drives deliver an industry-leading combination of high throughput, low latency, high quality of service (QoS), and ultra-high endurance (6) to break data access bottlenecks
Compelling efficiency: integrated Intel® Omni-Path Architecture designed for today's HPC systems
• Provides a 100 Gbps high-bandwidth, low-latency fabric for HPC clusters
• Reduces the number of required switches and lowers fabric costs (7), freeing up budget for up to 24% more compute nodes (8)
• A denser 48-port switch chip delivers a 33 percent increase in port count (9) over a traditional InfiniBand switch, resulting in power, space, and maintenance savings
Converged programming environment for Intel® Xeon® Scalable and Intel® Xeon Phi™ processors: a highly integrated portfolio of processors, superior technologies, and optimized software tools ensures code portability across IA solutions
• Intel AVX-512 enables a converged programming environment for Intel Xeon Scalable processor and Intel® Xeon Phi™ processor compute nodes
• The Intel® Modern Code Developer Program enables the next decade of discovery
• Intel® Parallel Studio XE 2017 upgrades the developer toolkit for HPC and technical computing
• Intel® HPC Orchestrator simplifies installation and ongoing maintenance of the HPC system software stack
For footnotes and configurations, see slides 29-30.
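The switch savings follow from switch radix. The sketch below assumes the 36-port baseline implied by the 33 percent figure (48 / 36 = 1.33) and an idealized, non-blocking two-tier fat tree in which each leaf switch splits its ports evenly between hosts and spine switches; real fabric designs, oversubscription ratios, and the cost figures cited in the footnotes are not modeled.

```python
def port_increase(new_ports, old_ports):
    """Fractional increase in ports per switch chip."""
    return (new_ports - old_ports) / old_ports


def two_tier_max_hosts(radix):
    """Max hosts in a non-blocking two-tier fat tree built from radix-port switches.

    Each leaf uses radix/2 ports for hosts and radix/2 for uplinks (one per spine),
    and each spine port reaches a different leaf, so there can be `radix` leaves.
    """
    half = radix // 2
    leaves = radix
    return leaves * half


if __name__ == "__main__":
    print(f"port increase: {port_increase(48, 36):.0%}")  # port increase: 33%
    for radix in (36, 48):
        print(f"{radix}-port switches: up to {two_tier_max_hosts(radix)} hosts in two tiers")
```

Under these assumptions, 36-port switches cap a two-tier fabric at 648 hosts and 48-port switches at 1,152, so a 1,000-node cluster fits in two tiers only with the denser switch.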
Intel® Xeon® Scalable Processors for Enterprise
Performance optimized for workload efficiency and agility across a broad set of enterprise workloads, security without compromise, and a highly versatile, disruptive platform design to drive innovation across business infrastructure.
Per-core performance improvements
• New core microarchitecture with a new, optimized low-latency cache hierarchy design
Socket-level performance improvements
• Up to 28 cores vs. 24 cores/22 cores (on the Intel® Xeon® processor E7 v4 / Intel Xeon processor E5-2600 v4 families)
• The new Intel® Mesh architecture with 3 Intel® UPI links provides greater inter-CPU bandwidth for the most data-hungry, latency-sensitive applications
Workload acceleration
• Intel® AVX-512 delivers up to 2x more FLOPs per clock cycle for HPC, analytics, cryptography, and data compression workloads (1)
• Integrated Intel® Ethernet Connection X722 with iWARP RDMA (up to 4x10 GbE)
Increased bandwidth
• More memory bandwidth, with up to DDR4-2666 and 6 memory channels vs. the Intel Xeon processor E5-2600 v4 product family
• More I/O bandwidth, with 48 PCIe 3.0 lanes vs. 40 lanes on the Intel Xeon processor E5-2600 v4
Security efficiencies reduce ISV and end-customer cost
• Intel® AVX-512 delivers up to 2x increased per-core performance on key encryption algorithms to accelerate cryptography and the delivery of hardware-enhanced security (2)
• Enhanced platform security with Intel® Trusted Execution Technology (TXT) with One-Touch Activation, Intel® Platform Trust Technology (PTT), Intel® Boot Guard, and available Intel® Key Protection Technology (3)
High reliability for maximum server uptime and platform resilience
• Designed for five 9's (99.999%) availability, with over 70 RAS (reliability, availability, serviceability) features, including enhanced Intel® Run Sure Technology (Advanced RAS), to deliver mission-critical platforms
• Natively scalable to 2, 4, and 8 sockets, and to 8S+ with third-party node controllers
Game-changing improvements in performance and latency (4) offer up to 65% lower total cost of ownership (5).
For footnotes and configurations, see slides 29-30.
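For scale, an availability target translates directly into a yearly downtime budget. The sketch below does that arithmetic, assuming a 365-day year; it says nothing about how the RAS features achieve the target.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def max_downtime_minutes_per_year(availability):
    """Annual downtime budget, in minutes, implied by an availability fraction."""
    return (1 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for nines, availability in [(3, 0.999), (4, 0.9999), (5, 0.99999)]:
        minutes = max_downtime_minutes_per_year(availability)
        print(f"{nines} nines: {minutes:.2f} minutes of downtime per year")
```

Five 9's leaves roughly 5.26 minutes of downtime per year, a tenth of the budget at four 9's.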
Intel® Xeon® Scalable Processors for Cloud Service Providers
Workload-optimized performance: disruptive efficiency by design
Socket-level performance improvements
• Up to 28 cores vs. 24 cores/22 cores (on the Intel® Xeon® processor E7 v4 / Intel Xeon processor E5-2600 v4 families)
• The new Intel® Mesh architecture with 3 Intel® UPI links provides greater inter-CPU bandwidth for the most data-hungry, latency-sensitive applications
Workload acceleration
• Intel® AVX-512 delivers up to 2x more FLOPs per core per clock cycle across a range of applications, including crypto/compression, visualization, in-memory databases, deep learning, and modeling and simulation in HPC (1)
• Integrated Intel® Ethernet Connection X722 with iWARP RDMA (up to 4x10 GbE) reduces total system cost and power consumption; iWARP RDMA improves latency for transfers of large storage blocks and for virtual machine migration
• Intel® QuickAssist Technology accelerates critical workloads across server, storage, and network
Increased bandwidth
• More memory bandwidth, with up to DDR4-2666 and 6 memory channels vs. the Intel Xeon processor E5-2600 v4 product family
• Intel® Optane™ and 3D NAND solid state drives deliver high throughput increases, low latency, high quality of service (QoS), and ultra-high endurance (5)
Agile, trusted service delivery: accelerate time to market
Leaps in convergence and advanced performance enable a fully virtualized, software-defined data center that dynamically self-provisions resources based on demand.
• Intel AVX-512 accelerates compute-intensive workloads, such as some encryption algorithms, reducing performance overhead
• Intel® Key Protection Technology with integrated Intel® QuickAssist Technology, together with Intel® Platform Trust Technology, enhances key management capabilities (3)
• Designed for five 9's (99.999%) availability, with over 70 RAS (reliability, availability, and serviceability) features, including enhanced Intel® Run Sure Technology (Advanced RAS), to deliver mission-critical platforms
Differentiated services to expand into new market segments: monetize new consumer experiences
Performance, security, and workload optimization features help CSPs differentiate, while infrastructure improvements help CSPs achieve agility and better total cost of ownership.
• Intel® Trusted Execution Technology (Intel® TXT) with the new One-Touch Activation (OTA) capability offers a differentiated security service for customers in specific geographies
For footnotes and configurations, see slides 29-30.
Intel® Xeon® Scalable Processors for Comms Service Providers
Accelerated, advanced performance for network, cloud, and emerging 5G applications; security without compromise across every network application; and an optimized cloud via a converged platform, tools, and training.
Reduced latency and increased bandwidth
• More memory bandwidth, with up to DDR4-2666 and 6 memory channels vs. the Intel® Xeon® processor E5-2600 v4 product family
• More I/O bandwidth with 48 PCIe 3.0 lanes vs. 40 lanes on the Intel Xeon processor E5-2600 v4
Advanced performance + memory + I/O
• Up to 28 cores vs. 24 cores/22 cores (on the Intel Xeon processor E7 v4 / Intel Xeon processor E5-2600 v4 families)
• The new Intel® Mesh architecture with 3 Intel® UPI links provides greater inter-CPU bandwidth
Crypto and compression workload acceleration
• Intel® AVX-512 delivers up to 2x FLOPs per clock cycle for data compression workloads (1)
• Intel® QuickAssist Technology (QAT) delivers hardware acceleration for critical workloads such as cryptography and data compression
Exceptional processing of encryption algorithms
• Up to 100 Gbps of crypto and public key encryption (PKE) workload acceleration via Intel QuickAssist Technology
• Increased platform security with Intel® Platform Trust Technology, Intel® Trusted Execution Technology with One-Touch Activation, and available Intel® Key Protection Technology (3)
• Intel AVX-512 reduces performance overhead while accelerating compute-intensive workloads, such as cryptography, in distributed environments
Converged platform, tools, and training
Significant leaps in the convergence of compute, memory, network, and storage performance, plus software ecosystem optimizations, enable a fully virtualized network and data center that dynamically self-provisions resources based on workload needs.
• Integrated Intel® Ethernet Connection X722 with iWARP RDMA (up to 4x10 GbE) provides high data throughput and low latency
• Intel® Optane™ and 3D NAND solid state drives deliver high throughput increases, low latency, high quality of service (QoS), and ultra-high endurance (6) to resolve bottlenecks
• The NEW Intel® Builders Program, Intel® Network Builders Fast Track, and Intel® Network Builders University provide solution optimization, tools, and training
For footnotes and configurations, see slides 29-30.
Technologies & Features: Intel® Xeon® Scalable Processors
Intel® Xeon® Scalable Processor: Re-architected from the Ground Up
• Skylake core microarchitecture, with data-center-specific enhancements
• Intel® AVX-512 with 32 DP FLOPs per core per cycle
• Data-center-optimized cache hierarchy: 1 MB L2 per core, non-inclusive L3
• New mesh interconnect architecture
• Enhanced memory subsystem
• Modular I/O with integrated devices
• New Intel® Ultra Path Interconnect (Intel® UPI)
• Intel® Speed Shift Technology
• Security and virtualization enhancements (MBE, PPK, MPX)
• Optional integrated Intel® Omni-Path Fabric (Intel® OPA)
[Diagram: cores on a shared L3, with 6 DDR4 channels, 2 or 3 UPI links, 48 lanes of PCIe* 3.0, an Omni-Path HFI, and DMI3.]

Feature                       Intel® Xeon® E5-2600 v4               Intel® Xeon® Scalable
Cores per socket              Up to 22                              Up to 28
Threads per socket            Up to 44                              Up to 56
Last-level cache (LLC)        Up to 55 MB                           Up to 38.5 MB (non-inclusive)
QPI/UPI speed                 2x QPI @ 9.6 GT/s                     Up to 3x UPI @ 10.4 GT/s
PCIe lanes/controllers/speed  40 / 10 / PCIe 3.0 (2.5, 5, 8 GT/s)   48 / 12 / PCIe 3.0 (2.5, 5, 8 GT/s)
Memory population             4 channels of up to 3 RDIMMs,         6 channels of up to 2 RDIMMs,
                              LRDIMMs, or 3DS LRDIMMs               LRDIMMs, or 3DS LRDIMMs
Max memory speed              Up to DDR4-2400                       Up to DDR4-2666
TDP                           55 W-145 W                            70 W-205 W
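The memory and I/O rows can be turned into theoretical peak bandwidth. The sketch below assumes 8 bytes per transfer per DDR4 channel (64-bit data bus, ECC excluded) and PCIe 3.0 at 8 GT/s per lane with 128b/130b encoding, ignoring protocol overhead. By this measure, six channels of DDR4-2666 offer about 1.67x the peak of four channels of DDR4-2400; the "up to 1.5x" gen-to-gen figure on the HPC slide is presumably a measured result, and sustained bandwidth is normally below peak.

```python
DDR4_BYTES_PER_TRANSFER = 8  # 64-bit data bus per channel; ECC bits excluded


def peak_memory_gbs(channels, mega_transfers_per_s):
    """Theoretical peak DRAM bandwidth per socket, in GB/s (10**9 bytes/s)."""
    return channels * mega_transfers_per_s * 1e6 * DDR4_BYTES_PER_TRANSFER / 1e9


def pcie3_gbs_per_direction(lanes):
    """Approximate PCIe 3.0 bandwidth per direction, in GB/s.

    8 GT/s per lane, 128b/130b line encoding, 8 bits per byte; packet and
    protocol overheads are ignored.
    """
    return lanes * 8e9 * (128 / 130) / 8 / 1e9


if __name__ == "__main__":
    old = peak_memory_gbs(4, 2400)  # E5-2600 v4: 4 channels of DDR4-2400
    new = peak_memory_gbs(6, 2666)  # Xeon Scalable: 6 channels of DDR4-2666
    print(f"memory: {old:.1f} -> {new:.1f} GB/s per socket ({new / old:.2f}x)")
    print(f"PCIe: {pcie3_gbs_per_direction(40):.1f} -> "
          f"{pcie3_gbs_per_direction(48):.1f} GB/s per direction")
```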
Platform Topologies
The Intel® Xeon® Scalable processor supports configurations ranging from 2 sockets with 2 UPI links (2S-2UPI) up to 8 sockets (8S):
• 2S configurations (2S-2UPI and 2S-3UPI shown)
• 4S configurations (4S-2UPI and 4S-3UPI shown)
• 8S configuration
[Diagram: Skylake (SKL) processors joined by Intel® UPI links; each processor provides 3x16 PCIe* lanes and an optional 1x100G Intel® Omni-Path (OP) Fabric port, and connects to a Lewisburg (LBG) PCH over DMI.]
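The number of UPI links per socket determines how many hops a cross-socket access may need. The sketch below computes the worst-case socket-to-socket hop count with a breadth-first search for two assumed 4S wirings: with 2 links per socket the sockets form a ring, and with 3 links per socket every socket connects directly to the other three. These wirings are illustrative assumptions, not board schematics, and the 8S topology is not modeled.

```python
from collections import deque


def worst_case_hops(n_sockets, links):
    """Largest shortest-path hop count between any two sockets (BFS from each socket)."""
    adj = {s: set() for s in range(n_sockets)}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    worst = 0
    for src in range(n_sockets):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < n_sockets:
            raise ValueError("topology is not connected")
        worst = max(worst, max(dist.values()))
    return worst


# Assumed 4S wirings (hypothetical, for illustration only).
RING_4S_2UPI = [(0, 1), (1, 2), (2, 3), (3, 0)]                     # 2 links per socket
FULL_4S_3UPI = [(a, b) for a in range(4) for b in range(a + 1, 4)]  # 3 links per socket

if __name__ == "__main__":
    print("4S-2UPI worst case:", worst_case_hops(4, RING_4S_2UPI), "hops")
    print("4S-3UPI worst case:", worst_case_hops(4, FULL_4S_3UPI), "hops")
```

With the third link, every remote socket in the assumed 4S layout is a single hop away instead of up to two.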
Typical 2-Socket Configuration
Intel® Xeon® E5 v4 (2016):
• Four DDR4 memory channels per socket
• Up to 24 DIMMs
• Up to 80 PCIe lanes
• Two Intel® QPI links (up to 9.6 GT/s)
• DMI2 and PCIe* x4/x8 links to the PCH
Intel® Xeon® Scalable, "Purley" platform (2017):
• Six DDR4 memory channels per socket
• Up to 24 DIMMs
• Up to 96 PCIe lanes
• Two Intel® UPI links (up to 10.4 GT/s); up to 3 UPI links in 4S and 8S configurations
• Integrated Intel® Omni-Path Architecture (fabric), 1x100G per processor
• 3x16 PCIe* per processor; DMI and a PCIe* uplink (**) to the Lewisburg (LBG) PCH
** PCIe* uplink connection for Intel® QuickAssist Technology and Intel® Ethernet.
Re-Architected L2 & L3 Cache Hierarchy
Previous architectures: shared L3 of 2.5 MB/core (inclusive); private L2 of 256 KB per core.
Intel® Xeon® Scalable processor architecture: shared L3 of 1.375 MB/core (non-inclusive); private L2 of 1 MB per core.
• On-chip cache balance shifted from shared-distributed (prior architectures) to private-local (Skylake architecture):
  • Shared-distributed: the shared L3 is the primary cache
  • Private-local: the private L2 becomes the primary cache, with the shared L3 used as an overflow cache
• Shared L3 changed from inclusive to non-inclusive:
  • Inclusive (prior architectures): the L3 has copies of all lines in the L2
  • Non-inclusive (Skylake architecture): lines in the L2 may not exist in the L3
The Skylake-SP cache hierarchy is architected specifically for the data center use case.
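The capacity effect of dropping inclusivity can be estimated from the per-core figures. The sketch below treats an inclusive L3 as duplicating every L2 line, so unique on-die capacity equals the L3 alone, and a non-inclusive L3 as able to hold different lines from the L2, giving an upper bound of L2 + L3 (non-inclusive is not the same as strictly exclusive, so real unique capacity can be lower). Core counts of 22 and 28 are taken as the top E5-2600 v4 and Xeon Scalable configurations.

```python
def unique_cache_mb(cores, l2_mb_per_core, l3_mb_per_core, inclusive):
    """Upper bound on distinct data held in L2 + L3 across a socket, in MB."""
    l2_total = cores * l2_mb_per_core
    l3_total = cores * l3_mb_per_core
    # Inclusive: every L2 line also lives in the L3, so the L2 adds no unique capacity.
    return l3_total if inclusive else l2_total + l3_total


if __name__ == "__main__":
    prior = unique_cache_mb(22, 0.25, 2.5, inclusive=True)       # 256 KB L2, 2.5 MB/core L3
    skylake = unique_cache_mb(28, 1.0, 1.375, inclusive=False)   # 1 MB L2, 1.375 MB/core L3
    print(f"prior: {prior} MB, Skylake-SP: up to {skylake} MB")
```

So although the LLC shrinks from 55 MB to 38.5 MB, the larger private L2 means the socket can hold up to 66.5 MB of distinct data under this model.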
Intel® Xeon® Processor Scalable Family: New Platform Controller Hub (PCH)
Integrates classic server functions in a common footprint:
• Intel® Ethernet Connection X722 with up to 4x10 GbE
• Intel® QuickAssist Technology with processing up to 100 Gb/s
[Block diagram: DMI and a PCIe* x8/x16 uplink to the processor; high-speed I/O (PCIe root ports, SATA, sSATA, xHCI USB, 10 GbE, 1 GbE LAN); ME and IE engines; Intel® QAT; low-speed I/O (SPI, eSPI, SMBus, LPC, HD Audio).]
Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
End-customer value: workload-optimized performance, throughput increases, and hardware-enhanced security improvements for familiar analytics, HPC, video transcode, cryptography, and compression software.
Problems solved, with ultra-wide (512-bit) vector processing capabilities:
1. Achieve more work per cycle (doubles the width of data registers)
2. Minimize latency and overhead (doubles the number of registers)
Two FMA processing engines are available on Intel® Xeon® Platinum processors, Intel® Xeon® Gold 6100 processors, and select Gold 5100 SKUs.
Value pillars: performance, security. Segments: cloud service providers, comms service providers, enterprise.
Proof points: up to 2x FLOPs per clock cycle (1); up to 4x greater throughput (2). Accelerates performance for your most demanding computational tasks.
* FLOPs = floating-point operations
1. Peak performance vs. Intel® AVX2, as measured by the Intel® Xeon® Processor Scalable Family with Intel® AVX-512 compared to an Intel® Xeon® E5 v4 with Intel® AVX2.
2. Vectorized floating-point throughput, as measured by the Intel® Xeon® Processor Scalable Family with Intel® AVX-512 compared to an Intel® Xeon® E5 v4 with Intel® AVX2.
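The 2x FLOPs-per-cycle claim follows from vector width and FMA count. The sketch below uses a simple peak model, which is an assumption rather than a description of the pipeline: with FMA, each FMA unit completes one fused multiply-add (2 FLOPs) per lane per cycle; without FMA, one add pipe and one multiply pipe each complete one operation per lane per cycle. It reproduces the per-generation figures from 64 SP / 32 DP FLOPs per cycle for Skylake AVX-512 down to 8 SP / 4 DP for Nehalem SSE, and shows that, under this model, AVX-512 on a single-FMA-unit part has the same peak as AVX2 with two FMA units, which is why the doubling depends on the two-FMA SKUs.

```python
def flops_per_cycle(vector_bits, element_bits, fma_units):
    """Peak floating-point operations per cycle per core under a simple model."""
    lanes = vector_bits // element_bits
    if fma_units:
        # each FMA = one multiply + one add = 2 FLOPs per lane per unit
        return lanes * 2 * fma_units
    # no FMA: one add pipe and one multiply pipe, one op per lane each
    return lanes * 2


if __name__ == "__main__":
    generations = [
        ("Skylake, AVX-512 & FMA", 512, 2),
        ("Haswell/Broadwell, AVX2 & FMA", 256, 2),
        ("Sandy Bridge, AVX", 256, 0),
        ("Nehalem, SSE", 128, 0),
    ]
    for name, bits, fma in generations:
        sp = flops_per_cycle(bits, 32, fma)
        dp = flops_per_cycle(bits, 64, fma)
        print(f"{name}: {sp} SP / {dp} DP FLOPs per cycle")
```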
Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
• 512-bit-wide vectors
• 32 operand registers
• 8 64-bit mask registers
• Embedded broadcast
• Embedded rounding

Microarchitecture     Instruction set         SP FLOPs/cycle   DP FLOPs/cycle
Skylake               Intel® AVX-512 & FMA    64               32
Haswell / Broadwell   Intel AVX2 & FMA        32               16
Sandy Bridge          Intel AVX (256b)        16               8
Nehalem               SSE (128b)              8                4

Intel AVX-512 instruction types:
• AVX-512-F: AVX-512 Foundation instructions
• AVX-512-VL: Vector Length orthogonality, the ability to operate on sub-512-bit vector sizes
• AVX-512-BW: 512-bit byte/word support
• AVX-512-DQ: additional D/Q/SP/DP instructions (converts, transcendental support, etc.)
• AVX-512-CD: Conflict Detect, used in vectorizing loops with potential address conflicts
A powerful instruction set for data-parallel computation.
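The eight mask registers let each vector lane be enabled or disabled individually. The sketch below emulates, on plain Python lists, the lane-selection semantics of a masked AVX-512 add such as the `_mm512_mask_add_pd` (merge-masking) and `_mm512_maskz_add_pd` (zero-masking) intrinsics: lanes whose mask bit is set receive `a[i] + b[i]`, and the other lanes keep the source value or become zero. The helper names are hypothetical and nothing here executes real AVX-512 instructions. A common use, shown in the example, is handling the leftover elements at the end of a loop without a scalar tail.

```python
LANES_PD = 8  # 512-bit vector / 64-bit doubles


def tail_mask(remaining, lanes=LANES_PD):
    """Mask with the low min(remaining, lanes) bits set, e.g. for a loop remainder."""
    return (1 << min(remaining, lanes)) - 1


def mask_add(src, mask, a, b, zeroing=False):
    """Emulate a masked element-wise add: merge-masking by default, zero-masking if requested."""
    if not len(src) == len(a) == len(b):
        raise ValueError("all vectors must have the same number of lanes")
    out = []
    for i, (s, x, y) in enumerate(zip(src, a, b)):
        if (mask >> i) & 1:   # lane enabled: compute the result
            out.append(x + y)
        elif zeroing:         # lane disabled, zero-masking: write 0
            out.append(0.0)
        else:                 # lane disabled, merge-masking: keep the source lane
            out.append(s)
    return out


if __name__ == "__main__":
    # 11 elements processed 8 at a time: the second pass has only 3 valid lanes.
    a = [float(i) for i in range(11)]
    b = [10.0] * 11
    result = []
    for start in range(0, len(a), LANES_PD):
        chunk_a = a[start:start + LANES_PD]
        chunk_b = b[start:start + LANES_PD]
        pad = [0.0] * (LANES_PD - len(chunk_a))
        m = tail_mask(len(chunk_a))
        lanes = mask_add([0.0] * LANES_PD, m, chunk_a + pad, chunk_b + pad)
        result.extend(lanes[:len(chunk_a)])
    print(result)
```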
Frequency Behavior While Running Intel® AVX Code
• Cores running non-AVX, Intel® AVX2 light/heavy, and Intel® AVX-512 light/heavy code have different turbo frequency limits.
• The frequency of each core is determined independently based on workload demand, so in mixed workloads different cores can run at different frequency levels at the same time.
Code type and the frequency level (base and all-core turbo) it runs at:
• Non-AVX code, SSE, and AVX2-light (without FP & int-mul) → Non-AVX level
• AVX2-heavy (FP & int-mul) and AVX-512-light (without FP & int-mul) → AVX2 level
• AVX-512-heavy (FP & int-mul) → AVX-512 level
[Figure: frequency vs. cores, showing Non-AVX, AVX2, and AVX-512 base and all-core turbo limits for cores not using AVX, cores using AVX2, and cores using AVX-512.]
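The classification above can be expressed as a lookup from code type to frequency level, applied per core. In the sketch below the mapping follows the slide, but the GHz values are hypothetical placeholders rather than specifications for any SKU, and real limits also depend on the number of active cores, power, and thermal headroom, none of which is modeled.

```python
# Code type -> frequency level, as classified on the slide.
LEVEL_FOR_CODE = {
    "non-AVX": "Non-AVX",
    "SSE": "Non-AVX",
    "AVX2-light": "Non-AVX",     # AVX2 without FP or integer multiply
    "AVX2-heavy": "AVX2",        # AVX2 with FP / integer multiply
    "AVX-512-light": "AVX2",     # AVX-512 without FP or integer multiply
    "AVX-512-heavy": "AVX-512",  # AVX-512 with FP / integer multiply
}

# Hypothetical all-core turbo limits in GHz, for illustration only.
HYPOTHETICAL_TURBO_GHZ = {"Non-AVX": 3.0, "AVX2": 2.6, "AVX-512": 2.2}


def core_frequency_limits(code_per_core, limits=HYPOTHETICAL_TURBO_GHZ):
    """Each core's limit depends only on the code type that core is running."""
    return [limits[LEVEL_FOR_CODE[code]] for code in code_per_core]


if __name__ == "__main__":
    mixed = ["SSE", "AVX2-heavy", "AVX-512-light", "AVX-512-heavy"]
    print(core_frequency_limits(mixed))
```

Note that AVX-512-light code lands in the same level as AVX2-heavy code, so only cores running heavy AVX-512 work drop to the lowest level.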
Performance and Efficiency with Intel® AVX-512
LINPACK performance, system power, and core frequency by instruction set (normalized columns relative to SSE4.2):

ISA       Core freq (GHz)  GFLOPs   System power (W)   GFLOPs/Watt (norm.)   GFLOPs/GHz (norm.)
SSE4.2    3.1              669      760                1.00                  1.00
AVX       2.8              1178     768                1.74                  1.95
AVX2      2.5              2034     791                2.92                  3.77
AVX-512   2.1              3259     767                4.83                  7.19

Intel® AVX-512 delivers significant performance and efficiency gains.
Source as of June 2017: Intel internal measurements on a platform with Xeon Platinum 8180, Turbo enabled, UPI = 10.4, SNC1, 6x32 GB DDR4-2666 per CPU, 1 DPC. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
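The normalized efficiency columns can be recomputed from the raw LINPACK values. The sketch below does so and, under the added assumption that the system has two 28-core Xeon Platinum 8180 sockets (56 cores; the configuration line only says memory was populated per CPU), compares achieved FLOPs per core per cycle with the model DP peaks of 4, 8, 16, and 32. The reported core frequencies are treated as the frequencies held during the run.

```python
# (GFLOPs, system power in W, core frequency in GHz) from the LINPACK chart.
RESULTS = {
    "SSE4.2": (669, 760, 3.1),
    "AVX": (1178, 768, 2.8),
    "AVX2": (2034, 791, 2.5),
    "AVX-512": (3259, 767, 2.1),
}
DP_PEAK_PER_CYCLE = {"SSE4.2": 4, "AVX": 8, "AVX2": 16, "AVX-512": 32}
ASSUMED_CORES = 2 * 28  # assumption: two 28-core sockets


def gflops_per_watt(gflops, watts, ghz):
    return gflops / watts


def gflops_per_ghz(gflops, watts, ghz):
    return gflops / ghz


def normalized(metric, baseline="SSE4.2"):
    """Metric for each ISA divided by the baseline ISA's metric, rounded to 2 places."""
    base = metric(*RESULTS[baseline])
    return {isa: round(metric(*vals) / base, 2) for isa, vals in RESULTS.items()}


def flops_per_core_cycle(isa):
    """Achieved DP FLOPs per core per cycle, under the ASSUMED_CORES assumption."""
    gflops, _, ghz = RESULTS[isa]
    return gflops / (ghz * ASSUMED_CORES)


if __name__ == "__main__":
    print(normalized(gflops_per_watt))
    print(normalized(gflops_per_ghz))
    for isa, peak in DP_PEAK_PER_CYCLE.items():
        print(f"{isa}: {flops_per_core_cycle(isa):.1f} of {peak} DP FLOPs/core/cycle")
```

Under that assumption, each instruction set reaches roughly 87% to 96% of its model peak.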
Enhanced Hardware-Based Virtualization Support
Evolution of Intel® Virtualization Technology (VT-x, VT-d):
• 2005-2006: baseline (server) VT-x
• 2007-2008: performance extensions (VT-x/VT-i)
• Further capabilities shown: Extended Page Tables (EPT), APIC-TPR, VPID, ECRR, and faster VM boot; VT-x2; VT-x pause-loop exiting and real-mode virtualization; VT-d interrupt remapping, large pages, queued invalidations, and pass-through DMA
• Intel® Xeon® processor E5 v2 family (2013): Advanced Programmable Interrupt Controller virtualization (APICv), full capability
• Intel® Xeon® processor E5 v3 family (2014): EPT accessed/dirty bits; Virtual Machine Control Structure (VMCS) shadowing
• Intel® Xeon® processor E5 v4 family (2016): posted interrupts; Page Modification Logging (PML); VM enter/exit latency reduction
• Intel® Xeon® Scalable processor (2017): Mode-based Execute Control (MBE); Time Stamp Counter (TSC) scaling
Result: up to 4.2x more VMs supported (1).
1. Up to 4.2x more VMs based on a server virtualization consolidation workload, per Intel® internal estimates. Baseline: 1 node, 2x Intel® Xeon® processor E5-2690 on Romley-EP with 256 GB total memory on VMware ESXi* 6.0 GA, guest OS RHEL 6.4, GlassFish 3.1.2.2, PostgreSQL 9.2; data source: request number 1718, benchmark: server virtualization consolidation, score: 377.6 @ 21 VMs. Compared with: 1 node, 2x Intel® Xeon® Platinum 8180 processor on Wolf Pass SKX with 768 GB total memory on VMware ESXi 6.0 U3 GA, guest OS RHEL 6 64-bit; data source: request number 2563, benchmark: server virtualization consolidation, score: 1580 @ 90 VMs. Higher is better.
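The headline multiplier can be checked against the footnote. The sketch below divides the two reported results: the 4.2x figure matches the ratio of benchmark scores (about 4.18x), while the VM-count ratio is slightly higher (about 4.29x). Note also that the footnote's baseline is a two-socket E5-2690 system on the Romley-EP platform, not the E5 v4 family compared elsewhere in this deck.

```python
# Reported results from the footnote (server virtualization consolidation workload).
BASELINE = {"score": 377.6, "vms": 21}  # 2x Intel Xeon E5-2690, Romley-EP
SCALABLE = {"score": 1580, "vms": 90}   # 2x Intel Xeon Platinum 8180, Wolf Pass SKX


def ratio(key):
    """How many times larger the Xeon Scalable result is for the given metric."""
    return SCALABLE[key] / BASELINE[key]


if __name__ == "__main__":
    print(f"score ratio: {ratio('score'):.2f}x")  # score ratio: 4.18x
    print(f"VM ratio: {ratio('vms'):.2f}x")       # VM ratio: 4.29x
```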
Fabric: Integrated and Discrete Intel® Omni-Path Architecture
Integrated Intel® Omni-Path Architecture
Compute node platform benefits: maximized I/O density per node. Up to TWO additional PCIe x16 slots are available for maximizing I/O density (1), giving significantly more I/O capacity for compute or storage nodes.
[Diagram: an Intel® Xeon® processor-F (SKX-F) with an integrated OPA host fabric interface (HFI) reaches the fabric through an IFP cable and IFT card, leaving x16 slots free for GPUs; a storage node or file system server connects through an OPA HFI.]
SKUs with integrated Intel® Omni-Path Architecture fabric:
• Platinum: 8176F, 8160F
• Gold: 6148F, 6142F, 6138F, 6130F, 6126F
These span 12 to 28 cores, base non-AVX frequencies of 2.0 to 2.6 GHz, and TDPs of 105 W to 173 W; the top part, the Platinum 8176F, has 28 cores, a 2.1 GHz base non-AVX frequency, and a 173 W TDP.
1. For illustrative purposes only. Assumes each CPU socket is configured with all 48 PCIe lanes routed to three x16 slots, or 96 total lanes for a 2S Purley platform. PCIe slot count and PCIe device support will vary by OEM platform, so check with your OEM for more details.
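The footnote's slot arithmetic can be sketched as follows. The model assumes, as the footnote does, 48 lanes per socket routed to three x16 slots, and additionally assumes one discrete Omni-Path HFI card per socket when the integrated fabric is not used; the function name and that per-socket HFI count are hypothetical.

```python
LANES_PER_SOCKET = 48
LANES_PER_SLOT = 16


def free_x16_slots(sockets, integrated_fabric, discrete_hfis_per_socket=1):
    """x16 slots left for GPUs or storage after attaching the fabric.

    A discrete HFI occupies an x16 slot; an -F SKU reaches the fabric through its
    integrated interface and cable, so no slot is consumed.
    """
    total = sockets * (LANES_PER_SOCKET // LANES_PER_SLOT)
    used = 0 if integrated_fabric else sockets * discrete_hfis_per_socket
    return total - used


if __name__ == "__main__":
    discrete = free_x16_slots(2, integrated_fabric=False)
    integrated = free_x16_slots(2, integrated_fabric=True)
    print(f"2S: {discrete} free slots with discrete HFIs, {integrated} with integrated fabric")
```

For a 2S node this gives 4 versus 6 free x16 slots, the two additional slots cited on the slide.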
Intel® Omni-Path Architecture: Demonstrated Performance
Intel® Xeon® Platinum 8170 processor vs. Intel® Xeon® processor E5-2697A v4 running NWChem, workloads Siosi3 and Siosi5.
[Charts: NWChem performance (higher is better) vs. number of nodes (1, 2, 4, 8, 16) for each workload, 1 MPI rank per node.]

                      E5-2697A v4         Platinum 8170
Base frequency        2.6 GHz             2.1 GHz
Memory size/speed     64 GB, 2133 MHz     192 GB, 2666 MHz

1. Intel® Xeon® Platinum 8170 dual-socket servers, 2.10 GHz, 26 cores/socket, 64 GB 2666 MHz DDR4 memory per node. RHEL* 7.3, 3.10.0-514.el7.x86_64 kernel.
2. Intel® Xeon® Processor E5-2697A v4 dual-socket servers, 2.10 GHz, 16 cores/node, 64 GB 2133 MHz DDR4 memory per node. RHEL 7.3. BIOS settings: snoop hold-off timer = 9, early snoop disabled, cluster on die disabled.
Common configuration: Intel® Turbo Boost Technology enabled; Intel® Hyper-Threading Technology disabled; Intel Fabric Suite 10.3.1.0.22; Intel Corporation Device 24f0, Series 100 HFI ASIC; OPA switch: Series 100 Edge Switch, 48 port; OPA parameters: -genv I_MPI_FABRICS shm:tmi -genv I_MPI_TMI_PROVIDER psm2; HFI parameters: krcvqs=4 eager_buffer_size=8388608 max_mtu=10240; NWChem release 6.6; binary: nwchem_armci-mpi_intel-mpi_mkl with MPI-PR run over MPI-1; workloads: siosi3 and siosi5; Intel MPI 2017.1.132; 2 ranks per node, 1 rank for computation and 1 rank for communication. http://www.nwchem-sw.org/index.php/Main_Page
For more complete information visit http://www.intel.com/performance.
Copyright © 2017, Intel Corporation. *Other names and brands may be claimed as the property of others.