System Architecture: Big Iron (NUMA)
Joe Chang
jchang6@yahoo.com
www.qdpma.com
About Joe Chang
SQL Server Execution Plan Cost Model: true cost structure by system architecture. Decoding statblob (distribution statistics). SQL Clone – statistics-only database.
Tools: ExecStats – cross-references index use by SQL/execution plan; Performance Monitoring, Profiler/Trace aggregation.
Scaling SQL on NUMA – Topics
OLTP – Thomas Kejser session "Designing High Scale OLTP Systems"
Data Warehouse
Ongoing Database Development
Bulk Load – SQL CAT paper + TK session "The Data Loading Performance Guide"
Other sessions with common coverage:
Monitoring and Tuning Parallel Query Execution II, R Meyyappan (SQLBits 6)
Inside the SQL Server Query Optimizer, Conor Cunningham
Notes from the field: High Performance Storage, John Langford
SQL Server Storage – 1000 GB Level, Brent Ozar
Server Systems and Architecture
Symmetric Multi-Processing
(Diagram: two CPUs on a shared system bus to the MCH, with PXH and ICH.)
In SMP, processors are not dedicated to specific tasks (contrast ASMP); there is a single OS image and each processor can access all memory. SMP makes no reference to memory architecture. Not to be confused with Simultaneous Multi-Threading (SMT): Intel calls SMT Hyper-Threading (HT), which in turn is not to be confused with AMD HyperTransport (also HT).
Non-Uniform Memory Access
(Diagram: four CPUs with memory controllers and node controllers on a shared bus or crossbar.)
NUMA architecture – the path to memory is not uniform:
1) Node: processors, memory, separate or combined memory + node controllers
2) Nodes connected by shared bus, crossbar, or ring
Traditionally 8-way+ systems. Local memory latency ~150 ns, remote node memory ~300-400 ns; this can cause erratic behavior if the OS/code is not NUMA-aware.
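A minimal back-of-envelope sketch of why locality matters, using the latency figures above (~150 ns local, and 350 ns as a midpoint of the 300-400 ns remote range); real behavior also depends on coherency traffic and queuing:

```python
# Back-of-envelope model of average memory latency on a NUMA system.
# The constants are the slide's approximate figures, not measured values.

LOCAL_NS = 150    # local node memory latency (~150 ns per the slide)
REMOTE_NS = 350   # remote node latency (midpoint of the 300-400 ns range)

def avg_latency_ns(local_fraction: float) -> float:
    """Expected latency when local_fraction of accesses hit the local node."""
    return local_fraction * LOCAL_NS + (1.0 - local_fraction) * REMOTE_NS

# NUMA-aware placement (say 90% local) vs naive placement on a 4-node
# system where only 1/4 of uniformly distributed accesses are local:
print(avg_latency_ns(0.90))   # 170.0 ns
print(avg_latency_ns(0.25))   # 300.0 ns
```

The gap between those two numbers is the "erratic behavior" the slide warns about: the same query can see very different memory latency depending on where the OS happened to place its threads and pages.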
AMD Opteron
(Diagram: Opteron with HyperTransport links, HT 2100 / HT 1100.)
Technically, Opteron is NUMA, but remote node memory latency is low, with no negative impact or erratic behavior. For practical purposes it behaves like an SMP system. Local memory latency ~50 ns, 1 hop ~100 ns, two hops ~150 ns? Actual behavior is more complicated because of snooping (cache coherency traffic).
8-way Opteron System Architecture
(Diagram: CPU 0 through CPU 7 connected by HyperTransport links.)
Opteron processors (prior to Magny-Cours) have 3 HyperTransport links. Note that in the 8-way configuration, the top and bottom right processors use 2 HT links to connect to other processors and the 3rd HT link for IO; CPU 1 and CPU 7 require 3 hops to reach each other.
http://www.techpowerup.com/img/09-08-26/17d.jpg
Nehalem System Architecture
Intel Nehalem-generation processors have QuickPath Interconnect (QPI): Xeon 5500/5600 series have 2 QPI links, Xeon 7500 series have 4. Glue-less 8-way systems are possible.
NUMA Local and Remote Memory
Local memory is closer than remote memory, so the physical access time is shorter. But what is the actual access time? It must include the cache coherency requirement!
HT Assist – Probe Filter
Part of the L3 cache is used as a directory cache. (Image source: ZDNet)
Source Snoop Coherency
From the HP PREMA Architecture whitepaper: All reads result in snoops to all other caches, … Memory controller cannot return the data until it has collected all the snoop responses and is sure that no cache provided a more recent copy of the memory line.
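The consequence of that rule can be sketched in one line: since the memory controller must wait for every snoop response, a read completes at the maximum of the DRAM latency and the slowest snoop. A toy model, with purely illustrative latencies (not measured values):

```python
# Why source-snoop coherency hurts as socket count grows: the memory
# controller cannot return data until every snoop response has arrived,
# so read latency is gated by the slowest responder.

def read_latency_ns(dram_ns: float, snoop_responses_ns: list) -> float:
    """A read completes only when DRAM AND all snoop responses are in."""
    return max(dram_ns, max(snoop_responses_ns))

# 2-socket: the single remote snoop usually beats DRAM, so it is "free".
print(read_latency_ns(100, [80]))                               # 100

# 8-socket: the farthest socket's snoop now gates even local reads.
print(read_latency_ns(100, [80, 90, 120, 150, 160, 180, 200]))  # 200
```

This is the motivation for the directory-based schemes on the following slides (HT Assist, node-controller cache tags): filter the snoops so most reads need not wait on every socket.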
DL980 G7
From the HP PREMA Architecture whitepaper: Each node controller stores information about* all data in the processor caches, minimizing inter-processor coherency communication and reducing latency to local memory. (*only cache tags, not cache data)
HP ProLiant DL980 Architecture
Node controllers reduce effective memory latency.
Superdome 2 – Itanium, sx3000
Agent: Remote Ownership Tag + L4 cache tags; 64M eDRAM L4 cache data.
IBM x3850 X5 (Glue-less)
Connect two 4-socket nodes to make an 8-way system.
Fujitsu R900
4 IOH; 14 x8 PCI-E slots, 2 x4, 1 x8 internal.
OS Memory Models
SUMA: Sufficiently Uniform Memory Access – memory is interleaved across nodes, so consecutive addresses stripe round-robin over the nodes.
NUMA: memory is first interleaved within a node, then spanned across nodes, so each node owns a contiguous range.
(Diagrams: address grids showing the SUMA round-robin stripe versus per-node contiguous NUMA ranges across 4 nodes.)
Windows OS NUMA Support
Memory models:
SUMA – Sufficiently Uniform Memory Access: memory is striped across NUMA nodes.
NUMA – separate memory pools by node.
(Diagram: 32 addresses distributed over Nodes 0-3 under each model.)
Memory Model Example: 4 Nodes
SUMA memory model: memory accesses are uniformly distributed – 25% of accesses are local, 75% remote.
NUMA memory model: the goal is better than 25% local-node access. True local access time also needs to be faster; cache coherency may increase local access time.
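The two models above can be sketched as address-to-node mappings. This is an illustrative toy (32 pages, 4 nodes), not the OS's actual page-placement code; it just shows why SUMA pins locality at 1/N while NUMA can reach 100% when allocations come from the thread's own node:

```python
# Toy address-to-node mappings for a 4-node system, following the
# slides' diagrams: SUMA stripes pages round-robin across nodes,
# NUMA gives each node a contiguous range.

NODES = 4
PAGES = 32

def suma_node(page: int) -> int:
    return page % NODES                  # round-robin stripe across nodes

def numa_node(page: int) -> int:
    return page // (PAGES // NODES)      # contiguous per-node ranges

# SUMA: a thread on node 0 touching any working set uniformly
# gets only 1/N of its accesses local.
suma_local = sum(suma_node(p) == 0 for p in range(PAGES)) / PAGES
print(suma_local)   # 0.25 -> the slide's "25% local, 75% remote"

# NUMA: if the OS satisfies the thread's allocations from node 0's
# pool, its entire working set is local.
working_set = [p for p in range(PAGES) if numa_node(p) == 0]
numa_local = sum(numa_node(p) == 0 for p in working_set) / len(working_set)
print(numa_local)   # 1.0
```

The catch, per the slide, is that the NUMA model only pays off if threads actually run on the node that owns their data; otherwise locality can be worse than the uniform 25%.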
Architecting for NUMA – End-to-End Affinity

App server group   TCP port   CPU node   Memory       Table partition
North East         1440       Node 0     0-0, 0-1     NE
Mid Atlantic       1441       Node 1     1-0, 1-1     MidA
South East         1442       Node 2     2-0, 2-1     SE
Central            1443       Node 3     3-0, 3-1     Cen
Texas              1444       Node 4     4-0, 4-1     Tex
Mountain           1445       Node 5     5-0, 5-1     Mnt
California         1446       Node 6     6-0, 6-1     Cal
Pacific NW         1447       Node 7     7-0, 7-1     PNW

The web tier determines the port for each user by group (but should not group by geography!). Affinitize each port to a NUMA node so that each node accesses localized data (partition?). Note: the OS may allocate a substantial chunk of memory from Node 0.
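The end-to-end affinity scheme above reduces to two lookup tables: user group to TCP port, and port to NUMA node. A hypothetical sketch (the region-to-node assignment follows the slide's table; the function name and dictionaries are illustrative, not a SQL Server API):

```python
# Sketch of end-to-end NUMA affinity: the web tier picks a TCP port per
# user group, and each port is affinitized to one NUMA node, so the
# connection's work stays on the node holding that group's partition.

PORT_TO_NODE = {
    1440: 0,  # North East
    1441: 1,  # Mid Atlantic
    1442: 2,  # South East
    1443: 3,  # Central
    1444: 4,  # Texas
    1445: 5,  # Mountain
    1446: 6,  # California
    1447: 7,  # Pacific NW
}

REGION_TO_PORT = {
    "North East": 1440, "Mid Atlantic": 1441, "South East": 1442,
    "Central": 1443, "Texas": 1444, "Mountain": 1445,
    "California": 1446, "Pacific NW": 1447,
}

def node_for(region: str) -> int:
    """Which NUMA node (and hence data partition) serves this user group."""
    return PORT_TO_NODE[REGION_TO_PORT[region]]

print(node_for("Texas"))  # 4
```

In SQL Server the port-to-node binding itself is a server configuration (soft-NUMA / connection affinity), not application code; the point of the sketch is that every tier must agree on the same mapping for locality to hold end to end.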
HP-UX LORA
HP-UX (not Microsoft Windows) Locality-Optimized Resource Alignment: 12.5% interleaved memory, 87.5% NUMA-node local memory.
System Tech Specs

System            Cores/socket  DIMMs  PCI-E G2 slots  Total cores  Max memory  Base price
2 x Xeon X56x0    6             18     5 x8 + 1 x4     12           192 GB*     $7K
4 x Opteron 6100  12            32     5 x8 + 1 x4     48           512 GB      $14K
4 x Xeon X7560    8             64     4 x8 + 6 x4†    32           1 TB        $30K
8 x Xeon X7560    8             128    9 x8 + 5 x4‡    64           2 TB        $100K

Memory pricing: 8 GB DIMMs at $400 each (18 x 8 GB = 144 GB, $7200; 64 x 8 GB = 512 GB, ~$26K); 16 GB DIMMs at $1100 each (12 x 16 GB = 192 GB, ~$13K; 64 x 16 GB = 1 TB, ~$70K).
* Max memory for 2-way Xeon 5600 is 12 x 16 GB = 192 GB.
† Dell R910 and HP DL580 G7 have different PCI-E layouts.
‡ ProLiant DL980 G7 can have 3 IOH for additional PCI-E slots.
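The memory-cost figures above are simple per-DIMM arithmetic; a quick sketch to verify them from the slide's DIMM prices (8 GB at $400, 16 GB at $1100):

```python
# Check of the memory-cost figures in the spec table: capacity and cost
# are just DIMM count x size and DIMM count x unit price.

def mem_config(dimms: int, gb_per_dimm: int, price_each: int):
    """Return (total GB, total dollars) for a DIMM population."""
    return dimms * gb_per_dimm, dimms * price_each

print(mem_config(18, 8, 400))     # (144, 7200)   -> 144 GB, $7200
print(mem_config(12, 16, 1100))   # (192, 13200)  -> ~$13K
print(mem_config(64, 8, 400))     # (512, 25600)  -> ~$26K
print(mem_config(64, 16, 1100))   # (1024, 70400) -> 1 TB, ~$70K
```

Note the 16 GB DIMMs cost roughly 1.4x more per GB than the 8 GB parts at these prices, which is why max-memory configurations carry a disproportionate premium.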
Software Stack
Operating System
Windows Server 2003 RTM/SP1: network limitations by default impact OLTP; see the Scalable Networking Pack (KB 912222).
Windows Server 2008 R2 (64-bit only) breaks the 64-logical-processor limit. Search: MSI-X, NUMA IO enhancements?
Do not bother trying to do DW on a 32-bit OS or 32-bit SQL Server. Don't try to do DW on SQL Server 2000.
SQL Server Version
SQL Server 2000: serious disk IO limitations (1 GB/sec?), problematic parallel execution plans.
SQL Server 2005 (fixed most S2K problems): 64-bit on x64 (Opteron and Xeon); SP2 brought a performance improvement of ~10%(?).
SQL Server 2008 & R2: compression, filtered indexes, etc.; star join, parallel query to partitioned tables.
Configuration
SQL Server startup parameter: -E. Trace flags 834, 836, 2301.
Auto_Date_Correlation: from Order date < A and Ship date > A, the implied bounds Order date > A-C and Ship date < A+C are added.
Port affinity – mostly OLTP. Dedicated processor? for log writer?