Intel Sunny Cove Core
Lihu Rappoport, CPU Core Architect, Intel
May 13, 2019

THIS PRESENTATION INCLUDES FORWARD-LOOKING STATEMENTS RELATING TO INTEL. ALL STATEMENTS THAT ARE NOT HISTORICAL FACTS ARE SUBJECT TO A NUMBER OF RISKS AND UNCERTAINTIES, AND ACTUAL RESULTS MAY DIFFER MATERIALLY. PLEASE REFER TO INTEL’S MOST RECENT EARNINGS RELEASE, 10-Q AND 10-K FILINGS FOR THE RISK FACTORS THAT COULD CAUSE ACTUAL RESULTS TO DIFFER.

Sunny Cove Core
• The Core used in the coming Intel 10 nm processors
• General purpose performance
  – Microarchitecture enhancements improve performance and efficiency across a broad set of applications
    • Wider: allocation and execution of more instructions in parallel
    • Deeper: a deeper out-of-order window exposes more parallelism
    • Smarter: improved algorithms
• Special purpose performance
  – Architecture extensions (new instructions) targeted at specific use-cases and algorithms
  – Building upon the microarchitecture

Sunny Cove Block Diagram
[Block diagram, labels as recovered from the slide]
• Frontend: BPU, I-TLB + I-cache, MSROM, decode, μop Cache, μop Queue
• Allocate / Rename / Move Elimination / Zero Idiom, feeding the Scheduler
• Port 0: INT ALU, LEA, Shift, JMP; VEC FMA, ALU, Shift, fpDIV
• Port 1: INT ALU, LEA, Mul, iDIV; VEC FMA, ALU, Shift, Shuffle
• Port 5: INT ALU, LEA, MulHi; VEC ALU, Shuffle
• Port 6: INT ALU, LEA, Shift, JMP
• P4, P9: Store Data
• P2, P3, P7, P8: AGUs (Load and STA)
• 48 KB DCU, 512 KB ML$, SOC

Sunny Cove Frontend
[Diagram: BPU; I-TLB + I-cache; MSROM (4 µops); decode (5 µops); μop Cache (6 µops); μop Queue]
• Fetches instructions and decodes them into µops
• Smarter: improved branch prediction accuracy
• Wider: larger µop cache → increased hit-rate → increased effective Frontend width
• μop Queue: increased from 64 to 70 entries per thread
  – More loops fit in the μop Queue and can be replayed directly from it

Out-of-Order and Execution Engines
• Out-of-order: allocate and rename the µops, dispatch µops to EXE when ready, retire the µops
• Execution: execute the µops
• Ports 0, 1, and 5 support 256-bit vectors
• Ports 0 and 5 support 512-bit vectors
[Diagram: Allocate / Rename / Move Elimination / Zero Idiom, Scheduler, and execution ports 0 to 9, as in the block diagram]

Out-of-Order and Execution Engines
• Wider: 4 → 5 wide allocation, 8 → 10 execution ports
• Deeper: a significant increase in Reorder-Buffer and Scheduler sizes exposes more parallelism
[Diagram: Allocate / Rename / Move Elimination / Zero Idiom (5 µops), Scheduler, and execution ports 0 to 9, as in the block diagram]

Out-of-Order and Execution Engines
• Extra SIMD shuffle, 1-cycle LEA on 4 ports
• A new iDIV unit significantly reduces the latency and improves the throughput of integer divide operations
[Diagram: Allocate / Rename / Move Elimination / Zero Idiom, Scheduler, and execution ports 0 to 9, as in the block diagram]

Cache and Memory Subsystem
• Handles load/store µops, which access memory
• ML$: mid-level cache, and interaction with the SOC
[Diagram: Store Data; AGUs (Load, STA); 48 KB DCU; 512 KB ML$; SOC]

Cache and Memory Subsystem
• 50% increase in size of the L1 Data Cache
• 2× L1 store bandwidth: 3 → 4 AGUs, 1 → 2 Store Data
• Deeper Load Buffer and Store Buffer expose more parallelism for loads and stores
• Reduced effective load latency
[Diagram: Store Data; AGUs (Load, STA); 48 KB DCU; 512 KB ML$; SOC]

Cache and Memory Subsystem
• Larger 2nd level TLB: 1.5 K entries → 2 K entries
• Enhanced data prefetchers
• L2 cache: size doubled from 256 KB to 512 KB
[Diagram: Store Data; AGUs (Load, STA); 48 KB DCU; 512 KB ML$; SOC]
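To put the 2nd level TLB growth in perspective, the sketch below computes "TLB reach", the amount of memory the TLB can map at once. It is only an illustration: the 4 KiB page size and the reading of "K" as 1024 are assumptions, and the helper name tlb_reach_bytes is hypothetical, not something from the slides.

```python
# Assumption: 4 KiB pages. Larger page sizes would scale the reach accordingly.
PAGE_SIZE_BYTES = 4 * 1024


def tlb_reach_bytes(entries, page_size=PAGE_SIZE_BYTES):
    """Memory covered when every TLB entry maps one page of page_size bytes."""
    return entries * page_size


# Assumption: "1.5 K" and "2 K" mean 1536 and 2048 entries.
old_reach = tlb_reach_bytes(1536)
new_reach = tlb_reach_bytes(2048)

print(old_reach // (1024 * 1024), "MiB ->", new_reach // (1024 * 1024), "MiB")
```

Under these assumptions, a third more entries means a third more memory can be touched before 2nd level TLB misses start to occur.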

Special Purpose Performance
• New instructions, which use new dedicated HW for accelerating
  – AI/Machine Learning
  – Cryptography
  – Compression/Decompression and special SIMD/Vector Processing

AI/Machine Learning
• Intel Deep Learning Boost (DL Boost)
  – VNNI: Vector Neural Net Instructions, new AVX-512 instructions for inference acceleration
  – Multiply & Accumulate quadruples of INT8/INT16 numbers
  – Peak throughput: 3× for INT8 and 2× for INT16

INT8 Multiply & Accumulate
SRC1 (8-bit):         A0 A1 A2 A3 .. A63
SRC2 (8-bit):         B0 B1 B2 B3 .. B63
SRC3 / DEST (32-bit): C0 .. C15

C0  = A0×B0 + A1×B1 + A2×B2 + A3×B3 + C0
..
C15 = A60×B60 + A61×B61 + A62×B62 + A63×B63 + C15
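The INT8 multiply & accumulate on this slide can be modeled in scalar code. The following is a minimal sketch of the data flow for one 512-bit operation, not the hardware implementation. The function names are hypothetical. It assumes the non-saturating form, with SRC1 bytes treated as unsigned and SRC2 bytes as signed (as in the AVX-512 VNNI VPDPBUSD instruction), and with results wrapping to 32 bits.

```python
def to_int32(x):
    """Wrap a Python integer to a signed 32-bit value (no saturation)."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def vnni_int8_mac(a, b, c):
    """Scalar model of one 512-bit INT8 multiply & accumulate.

    a: 64 unsigned 8-bit values (SRC1, assumption: unsigned)
    b: 64 signed 8-bit values (SRC2, assumption: signed)
    c: 16 signed 32-bit accumulators (SRC3 / DEST)
    Returns the 16 updated accumulators.
    """
    assert len(a) == 64 and len(b) == 64 and len(c) == 16
    out = []
    for lane in range(16):
        acc = c[lane]
        # Each 32-bit lane consumes one quadruple of byte pairs.
        for k in range(4):
            i = 4 * lane + k
            acc += a[i] * b[i]
        out.append(to_int32(acc))
    return out


a = list(range(64))   # A0..A63 = 0..63
b = [1] * 64          # B0..B63 = 1
c = [10] * 16         # C0..C15 = 10
result = vnni_int8_mac(a, b, c)
print(result[0], result[15])
```

With these inputs, lane 0 computes 0 + 1 + 2 + 3 + 10 and lane 15 computes 60 + 61 + 62 + 63 + 10, matching the C0 and C15 expressions on the slide.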

Cryptography
• SHA-NI: acceleration of Secure Hash Algorithms
  – SHA-1 and SHA-256 message schedule and rounds
• Big-Number Arithmetic (Integer Fused Multiply Add)
  – Vectorized RSA and other public key crypto algorithms
• Galois Field New Instructions (GFNI)
  – Encryption algorithms, error correction algorithms, and bit matrix multiplications
• Vector AES and Vector carry-less multiply instructions
  – AES (Advanced Encryption Standard) and AES-GCM (AES mode that allows parallel processing)
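Two of the primitives named above, carry-less multiplication and Galois field multiplication, are easy to state in scalar form. The sketch below is a reference model only, with hypothetical function names. It assumes the GF(2^8) field defined by the AES polynomial x^8 + x^4 + x^3 + x + 1 (0x11B), which is the field AES itself uses. The vector instructions apply the same kind of arithmetic to many elements at once.

```python
def clmul(a, b):
    """Carry-less multiply: polynomial product over GF(2).

    Same shift-and-add structure as integer multiplication, but partial
    products are combined with XOR, so no carries propagate.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def gf256_mul(a, b, poly=0x11B):
    """Multiply two bytes in GF(2^8), reducing modulo poly.

    Assumption: poly defaults to the AES polynomial x^8 + x^4 + x^3 + x + 1.
    """
    product = clmul(a & 0xFF, b & 0xFF)  # at most 15 bits wide
    # Clear bits 14..8 from the top down by XORing in a shifted polynomial.
    for bit in range(14, 7, -1):
        if product & (1 << bit):
            product ^= poly << (bit - 8)
    return product


print(bin(clmul(0b11, 0b11)))       # (x + 1)^2 = x^2 + 1 over GF(2)
print(hex(gf256_mul(0x57, 0x83)))   # worked example from the AES specification
```

The second call reproduces the worked multiplication {57} • {83} = {c1} given in the AES specification (FIPS 197).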

Compression/Decompression and Special SIMD/Vector Processing
• Bit Algebra
  – POPCNT: returns the number of bits set to 1 in a byte/word/DW/QW
  – Bit Shuffle: shuffle bits from QW elements
• VBMI: Vector Bit Manipulation Instructions
  – Permute, shift, expand, and compress operations
  – Used for columnar database access, discrete mathematics, dictionary based decompression, data-mining routines
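As a rough illustration of the operations listed above, the following scalar sketch models a population count and mask-driven compress and expand on a short list of elements. The function names are hypothetical, and the choice to fill unused positions with zero is an assumption made for the example; the vector instructions operate on whole registers under a mask.

```python
def popcount(value, width):
    """Number of bits set to 1 in the low `width` bits of value."""
    mask = (1 << width) - 1
    return bin(value & mask).count("1")


def compress(elements, mask):
    """Pack the elements whose mask bit is 1 contiguously at the front.

    Assumption: the remaining positions are filled with 0.
    """
    kept = [e for i, e in enumerate(elements) if (mask >> i) & 1]
    return kept + [0] * (len(elements) - len(kept))


def expand(elements, mask):
    """Inverse of compress: scatter leading elements to positions whose mask bit is 1."""
    source = iter(elements)
    return [next(source) if (mask >> i) & 1 else 0 for i in range(len(elements))]


column = [10, 20, 30, 40]
selected = 0b1010                    # bit i selects column[i]
packed = compress(column, selected)  # rows 1 and 3 move to the front
restored = expand(packed, selected)  # back in their original positions
print(popcount(selected, 8), packed, restored)
```

In a columnar database scan, a comparison would typically produce the mask, compress would gather the matching values, and the population count of the mask would give how many were kept.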

Summary
• Sunny Cove is a wider, deeper and smarter Core
  – 5 wide rename, 10 wide dispatch/execution, 2nd Store
  – Deep out-of-order and memory buffers
  – 1.5× L1 data cache, 2× L2 cache (MLC)
  – Branch prediction and prefetch enhancements
• Sunny Cove adds new instructions and hardware for enhancing special purpose performance
  – AI/Machine Learning (VNNI)
  – Cryptography (SHA-NI, IFMA, GFNI, Vec AES, Vec CLMUL)
  – Compression/Decompression and Special SIMD/Vector Processing (BitAlg, VBMI)

Legal Disclosures
All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications, roadmaps, and related information. Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade. Intel, the Intel logo, Intel Core, Intel Optane, Intel Performance Maximizer, and Thunderbolt are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. © Intel Corporation 2019. *Other names and brands may be claimed as the property of others.