Storage Performance 2013
Joe Chang, www.qdpma.com
#SQLSatRiyadh

About Joe
• SQL Server consultant since 1999
• Query Optimizer execution plan cost formulas (2002)
• True cost structure of SQL plan operations (2003?)
• Database with distribution statistics only, no data (2004)
• Decoding statblob/stats_stream – writing your own statistics
• Disk IO cost structure
• Tools for system monitoring, execution plan analysis
See ExecStats on www.qdpma.com

Storage Performance Chain
• All elements must be correct – no weak links
• Perfect on 6 out of 7 elements and 1 not correct = bad IO performance
Chain: SQL Server Engine → SQL Server Extent → SQL Server File → Direct-Attach/SAN Pool → SAS/FC → RAID Group → SAS HDD/SSD

Storage Performance Overview
• System Architecture – PCI-E, SAS, HBA/RAID controllers
• SSD, NAND, Flash Controllers, Standards – Form Factors, Endurance, ONFI, Interfaces
• SLC, MLC Performance
• Storage system architecture – Direct-attach, SAN
• Database – SQL Server Files, FileGroups

Sandy Bridge EN & EP
(Block diagrams: two-socket EN and EP systems – cores, LLC, QPI links, PCI-E lanes, DMI2 to PCH)

Xeon E5-2400, Socket B2, 1356 pins
1 QPI 8 GT/s, 3 DDR3 memory channels, 24 PCI-E 3.0 8 GT/s lanes, DMI2 (x4 @ 5 GT/s)
E5-2470 8 core, 2.3 GHz, 20M, 8.0 GT/s (3.1)
E5-2440 6 core, 2.4 GHz, 15M, 7.2 GT/s (2.9)
E5-2407 4c/4t, 2.2 GHz, 10M, 6.4 GT/s (n/a)

Xeon E5-2600, Socket R, 2011 pins
2 QPI, 4 DDR3, 40 PCI-E 3.0 8 GT/s lanes, DMI2
Model, cores, clock, LLC, QPI, (Turbo)
E5-2690 8 core, 2.9 GHz, 20M, 8.0 GT/s (3.8)*
E5-2680 8 core, 2.7 GHz, 20M, 8.0 GT/s (3.5)
E5-2670 8 core, 2.6 GHz, 20M, 8.0 GT/s (3.3)
E5-2667 6 core, 2.9 GHz, 15M, 8.0 GT/s (3.5)*
E5-2665 8 core, 2.4 GHz, 20M, 8.0 GT/s (3.1)
E5-2660 8 core, 2.2 GHz, 20M, 8.0 GT/s (3.0)
E5-2650 8 core, 2.0 GHz, 20M, 8.0 GT/s (2.8)
E5-2643 4 core, 3.3 GHz, 10M, 8.0 GT/s (3.5)*
E5-2640 6 core, 2.5 GHz, 15M, 7.2 GT/s (3.0)

80 PCI-E gen 3 lanes + 8 gen 2 possible (2-socket EP)
Dell T620: 4 x16, 2 x8, 1 x4
Dell R720: 1 x16, 6 x8
HP DL380 G8p: 2 x16, 3 x8, 1 x4
Supermicro X9DRX+F: 10 x8, 1 x4 gen 2
Disable cores in BIOS/UEFI?

Xeon E5-4600
(Block diagram: four-socket system – cores, LLC, QPI links, PCI-E lanes, DMI2)

Xeon E5-4600, Socket R, 2011 pins
2 QPI, 4 DDR3, 40 PCI-E 3.0 8 GT/s lanes, DMI2
Model, cores, clock, LLC, QPI, (Turbo)
E5-4650 8 core, 2.70 GHz, 20M, 8.0 GT/s (3.3)*
E5-4640 8 core, 2.40 GHz, 20M, 8.0 GT/s (2.8)
E5-4620 8 core, 2.20 GHz, 16M, 7.2 GT/s (2.6)
E5-4617 6c/6t, 2.90 GHz, 15M, 7.2 GT/s (3.4)
E5-4610 6 core, 2.40 GHz, 15M, 7.2 GT/s (2.9)
E5-4607 6 core, 2.20 GHz, 12M, 6.4 GT/s (n/a)
E5-4603 4 core, 2.00 GHz, 10M, 6.4 GT/s (n/a)
High-frequency 6-core gives up HT; no high-frequency 4-core

160 PCI-E gen 3 lanes + 16 gen 2 possible (4-socket)
Dell R820: 2 x16, 4 x8, 1 internal
HP DL560 G8p: 2 x16, 3 x8, 1 x4
Supermicro X9QR: 7 x16, 1 x8

2 PCI-E, SAS & RAID CONTROLLERS

PCI-E gen 1, 2 & 3
Gen     Raw bit rate  Unencoded  BW per direction  BW x8   Net BW per direction, x8
PCIe 1  2.5 GT/s      2 Gbps     ~250 MB/s         2 GB/s  1.6 GB/s
PCIe 2  5.0 GT/s      4 Gbps     ~500 MB/s         4 GB/s  3.2 GB/s
PCIe 3  8.0 GT/s      8 Gbps     ~1 GB/s           8 GB/s  6.4 GB/s?
• PCIe 1.0 & 2.0 encoding scheme: 8b/10b
• PCIe 3.0 encoding scheme: 128b/130b
• Simultaneous bi-directional transfer
• Protocol overhead – sequence/CRC, header – 22 bytes (20%?)
Adaptec Series 7: 6.6 GB/s, 450K IOPS
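A minimal sketch of how the net x8 figures in the table fall out of raw signaling rate and encoding, assuming the ~20% protocol-overhead estimate from the slide:

```python
# Rough PCIe bandwidth estimate: raw rate * encoding efficiency * lanes,
# minus an assumed ~20% packet/protocol overhead (values from the slide).
def pcie_net_bandwidth_gbs(raw_gt_per_s, enc_num, enc_den, lanes=8, overhead=0.20):
    per_lane_gbytes = raw_gt_per_s * (enc_num / enc_den) / 8  # GB/s per lane
    return per_lane_gbytes * lanes * (1 - overhead)

print(pcie_net_bandwidth_gbs(2.5, 8, 10))    # PCIe 1.0 x8 -> ~1.6 GB/s
print(pcie_net_bandwidth_gbs(5.0, 8, 10))    # PCIe 2.0 x8 -> ~3.2 GB/s
print(pcie_net_bandwidth_gbs(8.0, 128, 130)) # PCIe 3.0 x8 -> ~6.3 GB/s
```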

PCI-E Packet
Net realizable bandwidth appears to be about 20% less than nominal (1.6 GB/s out of 2.0 GB/s)

PCIe Gen 2 & SAS/SATA 6 Gbps
• SATA 6 Gbps – single lane, net BW 560 MB/s
• SAS 6 Gbps, x4 lanes, net BW 2.2 GB/s
  – Dual-port, SAS protocol only – not supported by SATA
(Diagram: PCIe gen 2 x8 HBA at 3.2 GB/s with two SAS x4 6G ports, A and B, at 2.2 GB/s each)
Some bandwidth mismatch is OK, especially on the downstream side

PCIe 3 & SAS
• 12 Gbps – coming soon? Slowly?
  – Infrastructure will take more time
(Diagrams: PCIe gen 3 x8 HBA with SAS x4 6 Gb ports to SAS expanders; PCIe gen 3 x8 HBA with SAS x4 12G ports to expanders over SAS x4 6 Gb links)
A PCIe 3.0 x8 HBA needs 2 SAS x4 12 Gbps ports, or 4 SAS x4 6 Gbps ports, if the HBA can support 6 GB/s

PCIe Gen 3 & SAS 6 Gbps

LSI 12 Gbps SAS 3008

PCIe RAID Controllers?
• 2 x4 SAS 6 Gbps ports (2.2 GB/s per x4 port)
  – 1st generation PCIe 2 – 2.8 GB/s?
  – Adaptec: PCIe gen 3 can do 4 GB/s
  – 3 x4 SAS 6 Gbps would bandwidth-match PCIe 3.0 x8
• 6 x4 SAS 6 Gbps – Adaptec Series 7, PMC
  – 1 chip: x8 PCIe gen 3 and 24 SAS 6 Gbps lanes
  – Because they could
(Diagram: PCIe gen 3 x8 HBA with two SAS x4 6G ports)

2 SSD, NAND, FLASH CONTROLLERS

SSD Evolution
• HDD replacement – using existing HDD infrastructure
  – PCI-E card form factor lacks expansion flexibility
• Storage system designed around SSD
  – PCI-E interface with HDD-like form factor?
  – Storage enclosure designed for SSD
• Rethink computer system memory & storage
  – Re-do the software stack too!

SFF-8639 & Express Bay
SCSI Express – storage over PCI-E, NVMe

New Form Factors – NGFF
Enterprise 10K/15K HDD – 15 mm
SSD storage enclosure could be 1U, 75 x 5 mm devices?

SATA Express Card (NGFF)
Crucial mSATA, M.2

SSD – NAND Flash
• NAND – SLC, MLC regular and high-endurance
  – eMLC could mean endurance or embedded – these differ
• Controller interfaces NAND to SATA or PCI-E
• Form factor
  – SATA/SAS interface in 2.5 in HDD or new form factor
  – PCI-E interface and form factor, or HDD-like form factor
  – Complete SSD storage system

NAND Endurance
Intel – High Endurance Technology MLC

NAND Endurance – Write Performance
(Chart: endurance vs. write performance for SLC, MLC-e, MLC)
Cost structure: MLC = 1, MLC-e = 1.3, SLC = 3
Process dependent: 34 nm, 25 nm, 20 nm – write perf?

NAND P/E – Micron
34 or 25 nm MLC NAND is probably good; database can support the cost structure

NAND P/E – IBM
34 or 25 nm MLC NAND is probably good; database can support the cost structure

Write Endurance
• Vendors commonly cite a single spec for a range of models: 120, 240, 480 GB
  – Should vary with raw capacity? Depends on overprovisioning?
• 3-year life is OK for MLC cost structure, maybe even 2-year
• MLC 20 TB lifetime = 10 GB/day for 2000 days (5 years+), or 20 GB/day for 3 years
• Vendors now cite 72 TB write endurance for 120-480 GB capacities?
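A quick back-of-the-envelope helper, illustrative only, for turning a lifetime-write rating into a daily write budget (the 20 TB and 72 TB figures are the slide's examples):

```python
# Convert a rated lifetime write endurance into a daily write budget
# over a planned service life (figures from the slide are examples).
def daily_write_budget_gb(lifetime_write_tb, service_years):
    return lifetime_write_tb * 1000 / (service_years * 365)

print(daily_write_budget_gb(20, 5.5))   # ~10 GB/day over ~2000 days
print(daily_write_budget_gb(20, 3))     # ~18 GB/day over 3 years
print(daily_write_budget_gb(72, 3))     # ~66 GB/day for a 72 TB-rated drive
```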

NAND
• SLC – fast writes, high endurance
• eMLC – slow writes, medium endurance
• MLC – medium writes, low endurance
• MLC cost structure of $1/GB @ 25 nm
  – eMLC 1.4X, SLC 2X?

ONFI – Open NAND Flash Interface organization
• 1.0 (2006) – 50 MB/s
• 2.0 (2008) – 133 MB/s
• 2.1 (2009) – 166 & 200 MB/s
• 3.0 (2011) – 400 MB/s
  – Micron has 200 & 333 MHz products
ONFI 1.0 – 6 channels to support 3 Gbps SATA (260 MB/s)
ONFI 2.0 – 4+ channels to support 6 Gbps SATA (560 MB/s)

NAND Write Performance
MLC: 85 MB/s per 4-die channel (128 GB)
340 MB/s over 4 channels (512 GB)?

Controller Interface: PCIe vs. SATA
(Diagram: NAND packages behind a controller with a PCIe or SATA host interface – multiple lanes?)
• Some bandwidth mismatch/overkill is OK
• ONFI 2 – 8 channels at 133 MHz to SATA 6 Gbps – 560 MB/s is a good match
• But ONFI 3.0 overwhelms SATA 6 Gbps?
• 6-8 channels at 400 MB/s to match 2.2 GB/s x4 SAS?
• 16+ channels at 400 MB/s to match 6.4 GB/s x8 PCIe 3
• CPU access efficiency and scaling – Intel & NVM Express
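A small sketch of the channel-matching argument above; the "channels needed" is just interface bandwidth divided by per-channel NAND throughput, using the slide's numbers:

```python
import math

# How many NAND channels are needed to saturate a given host interface?
def channels_needed(interface_mb_s, nand_channel_mb_s):
    return math.ceil(interface_mb_s / nand_channel_mb_s)

print(channels_needed(560, 133))    # SATA 6 Gbps vs ONFI 2 -> 5 (8 gives headroom)
print(channels_needed(2200, 400))   # x4 SAS 6 Gbps vs ONFI 3 -> 6
print(channels_needed(6400, 400))   # x8 PCIe gen 3 vs ONFI 3 -> 16
```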

Controller Interface: PCIe vs. SATA
(Diagram: controller with DRAM and NAND channels, PCIe or SATA host interface)
PCIe NAND controller vendors:
Vendor     Channels  PCIe  Gen
IDT        32        x8    Gen 3, NVMe
Micron     32        x8    Gen 2
Fusion-IO  3 x 4?    x8    Gen 2?

SATA & PCI-E SSD Capacities
• 64 Gbit MLC NAND die, 150 mm², 25 nm
• 8 x 64 Gbit die in 1 package = 64 GB
• SATA controller – 8 channels, 8 packages x 64 GB = 512 GB
• PCI-E controller – 32 channels x 64 GB = 2 TB
(Die photos: 2 x 32 Gbit 34 nm, 1 x 64 Gbit 25 nm, 1 x 64 Gbit 29 nm)
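The capacity arithmetic above as a tiny illustrative helper (die density and channel counts are the slide's examples; one package per channel is assumed):

```python
# SSD capacity from die density, dies per package, and controller channels
# (one package per channel assumed, as in the slide's example).
def ssd_capacity_gb(die_gbit, dies_per_package, channels):
    package_gb = die_gbit * dies_per_package / 8  # Gbit -> GB
    return package_gb * channels

print(ssd_capacity_gb(64, 8, 8))   # SATA controller, 8 channels   -> 512 GB
print(ssd_capacity_gb(64, 8, 32))  # PCI-E controller, 32 channels -> 2048 GB (2 TB)
```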

PCI-E vs. SATA/SAS
• SATA/SAS controllers have 8 NAND channels
  – No economic benefit in fewer channels?
  – 8 channels: good match for 50 MB/s NAND to SATA 3G
    • 3 Gbps – approx. 280 MB/s realizable BW
  – 8 channels also a good match for 100 MB/s NAND to SATA 6G
    • 6 Gbps – 560 MB/s realizable BW
  – NAND is now at 200 & 333 MB/s
• PCI-E – 32 channels practical – 1500 pins
  – 333 MHz is a good match to gen 3 x8 – 6.4 GB/s BW

Crucial/Micron P400m & P400e (preliminary – need to update)

Crucial P400m        100 GB     200 GB     400 GB
Raw                  168 GB     336 GB     672 GB
Seq Read (up to)     380 MB/s   380 MB/s   –
Seq Write (up to)    200 MB/s   310 MB/s   –
Random Read          52K        54K        60K
Random Write         21K        26K        –
Endurance            1.75 PB    3.0 PB     7.0 PB   (2M-hr MTBF)
Price                $300?      $600?      $1000?

Crucial P400e        100 GB     200 GB     400 GB
Raw                  128 GB     256 GB     512 GB
Seq Read (up to)     350 MB/s   350 MB/s   –
Seq Write (up to)    140 MB/s   –          –
Random Read          50K        50K        –
Random Write         7.5K       –          –
Endurance            175 TB     (1.2M-hr MTBF)
Price                $176       $334       $631

P410m (SAS) specs slightly different
EE MLC: higher endurance; write perf not lower than MLC?

Crucial m4 & m500 (preliminary – need to update)

Crucial m500         120 GB     240 GB     480 GB     960 GB
Raw                  128 GB     256 GB     512 GB     1024 GB
Seq Read (up to)     500 MB/s   500 MB/s   –          –
Seq Write (up to)    130 MB/s   250 MB/s   400 MB/s   –
Random Read          62K        72K        80K        –
Random Write         35K        60K        80K        –
Endurance            72 TB      (1.2M-hr MTBF)
Price                $130       $220       $400       $600

Crucial m4           128 GB     256 GB     512 GB
Raw                  128 GB     256 GB     512 GB
Seq Read (up to)     415 MB/s   415 MB/s   –
Seq Write (up to)    175 MB/s   260 MB/s   –
Random Read          40K        40K        –
Random Write         35K        50K        –
Endurance            72 TB
Price                $112       $212       $400

Micron & Intel SSD Pricing (2013-02)
(Chart: price vs. capacity – 100/128, 200/256, 400/512 GB – for the Micron/Crucial and Intel drives above)
P400m raw capacities are 168, 336 and 672 GB (pricing retracted)
Intel SSD DC S3700 pricing: $235, $470, $940 and $1880 (800 GB) respectively

4K Write IOPS (K)
(Chart: 4K write IOPS vs. capacity – 100/128, 200/256, 400/512 GB – for the same drives)
P400m raw capacities are 168, 336 and 672 GB (pricing retracted)
Intel SSD DC S3700 pricing: $235, $470, $940 and $1880 (800 GB) respectively

SSD Summary
• MLC is possible with a careful write strategy
  – Partitioning to minimize index rebuilds
  – Avoid full database restore to SSD
• Endurance (HET) MLC – write perf?
  – Standard DB practices work
  – But avoid frequent index defrags?
• SLC – only for extreme write-intensive workloads?
  – Lower-volume product – higher cost

3 DIRECT ATTACH STORAGE

Full IO Bandwidth
(Diagram: 2-socket system, 192 GB per socket, PCIe x8 slots with RAID and InfiniBand controllers to SSD + HDD arrays)
• 10 PCIe gen 3 x8 slots possible – Supermicro only
  – HP, Dell systems have 5-7 x8+ slots + 1 x4?
• 4 GB/s per slot with 2 x4 SAS, 6 GB/s with 4 x4
• Mixed SSD + HDD – reduce wear on MLC
• Misc devices on 2 x4 PCIe gen 2: internal boot disks, 1 GbE or 10 GbE, graphics

System Storage Strategy
(Diagram: 2-socket system with 4 RAID controllers on PCIe x8, plus 10 GbE and IB)
• Dell & HP only have 5-7 slots; 4 controllers @ 4 GB/s each is probably good enough?
• Few practical products can use PCIe gen 3 x16 slots
• Capable of 16 GB/s with initial capacity – 4 HBAs, 4-6 GB/s each
• With allowance for capacity growth – and mixed SSD + HDD

Clustered SAS Storage
(Diagram: Node 1 and Node 2, each 192 GB, with SAS HBAs to a Dell MD3220 – dual controllers, each with SAS host IOC, 2 GB cache, PCIe switch, SAS expander, SSD + HDD bays)
Dell MD3220 supports clustering, up to 4 nodes without an external switch (extra nodes not shown)

Alternate SSD/HDD Strategy
(Diagram: primary system with SSD RAID plus HDD, linked over IB / 10 GbE to a backup system)
• Primary system – all SSD for data & temp; logs may be on HDD
• Secondary system – HDD for backup and restore testing

System Storage – Mixed SSD + HDD
(Diagram: 2-socket system, HBAs each driving 16 SSDs and 12 HDDs, plus 10 GbE and IB)
• Each RAID group/volume should not exceed the 2 GB/s BW of x4 SAS
  – 2-4 volumes per x8 PCIe gen 3 slot
• SATA SSD: read 350-500 MB/s, write 140 MB/s+
  – 8 per volume allows for some overkill
• 16 SSD per RAID controller; 64 SATA/SAS SSDs to deliver 16-24 GB/s
• The 4-HDD-per-volume rule does not apply
• HDD for local database backup, restore tests, and DW flat files
• SSD & HDD on a shared channel – simultaneous bi-directional IO
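A rough sizing sketch for the 64-SSD configuration above; the per-device read rate of 400 MB/s is an assumed midpoint of the slide's 350-500 MB/s range, and the volume and controller caps are the limits discussed earlier:

```python
# Aggregate read bandwidth of an SSD array, capped per x4 SAS volume
# and per PCIe gen 3 x8 RAID controller (figures from the slides).
def array_read_bw_gb_s(ssds, ssd_mb_s=400, ssds_per_volume=8,
                       volume_cap_mb_s=2200, controller_cap_mb_s=6400,
                       volumes_per_controller=2):
    volumes = ssds // ssds_per_volume
    per_volume = min(ssds_per_volume * ssd_mb_s, volume_cap_mb_s)
    controllers = volumes // volumes_per_controller
    per_controller = min(volumes_per_controller * per_volume, controller_cap_mb_s)
    return controllers * per_controller / 1000

print(array_read_bw_gb_s(64))  # 64 SSDs over 4 controllers -> ~17.6 GB/s
```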

SSD/HDD System Strategy
• MLC is possible with a careful write strategy
  – Partitioning to minimize index rebuilds
  – Avoid full database restore to SSD
• Hybrid SSD + HDD system, full-duplex signalling
• Endurance (HET) MLC – write perf?
  – Standard DB practices work; avoid index defrags
• SLC – only for extreme write-intensive workloads?
  – Lower-volume product – higher cost
• HDD – for restore testing

SAS Expander
• 2 x4 to hosts
• 1 x4 for expansion
• 24 x1 for disks
(Disk enclosure expansion ports not shown)

Storage Infrastructure – Designed for HDD
(2U enclosure, 24 x 15 mm bays)
• 2 SAS expanders for dual-port support
  – 1 x4 upstream (to host), 1 x4 downstream (expansion)
  – 24 x1 for bays

Mixed HDD + SSD Enclosure
• Current: 2U, 24 x 15 mm = 360 mm + spacing
• Proposed: 16 x 15 mm = 240 mm + 16 x 7 mm = 120 mm

Enclosure: Current 24 x 15 mm and Proposed
(Diagrams: host with PCIe x8 HBA – SAS x4 6 Gbps at 2.2 GB/s to the current enclosure, SAS x4 12 Gbps at 4 GB/s to the proposed enclosure; each enclosure has dual SAS expanders)
Current 2U enclosure, 24 x 15 mm bays – HDD or SSD
• 2 SAS expanders – 32 lanes each: 4 lanes upstream to host, 4 lanes downstream for expansion, 24 lanes for bays
• 2 RAID groups for SSD, 2 for HDD
• 1 SSD volume on path A, 1 SSD volume on path B
New, SAS 12 Gbps: 16 x 15 mm + 16 x 7 mm bays
• 2 SAS expanders – 40 lanes each: 4 lanes upstream to host, 4 lanes downstream for expansion, 32 lanes for bays

Alternative Expansion
(Diagram: host with PCIe x8 HBA, SAS x4 links to Enclosures 1-4, daisy-chained through expanders)
Each SAS expander – 40 lanes: 8 lanes upstream to host with no expansion, or 4 lanes upstream and 4 lanes downstream for expansion; 32 lanes for bays

PCI-E with Expansion
(Diagram: host with PCIe x8 to a PCI-E switch, fanning out to SAS expanders and Express Bay devices; SAS x4 6 Gbps at 2.2 GB/s)
• Express Bay form factor? Few x8 ports or many x4 ports?
• PCI-E slot SSD suitable for known capacity
• 48- and 64-lane PCI-E switches available – x8 or x4 ports

Enclosure for SSD (+ HDD?)
• 2 x4 on each expander upstream – 4 GB/s
  – No downstream ports for expansion?
• 32 ports for device bays
  – 16 SSD (7 mm) + 16 HDD (15 mm)
• 40 lanes total with no expansion; 48 lanes with expansion

Large SSD Array
• Large number of devices, large capacity
  – Downstream from the CPU there is excess bandwidth
• Do not need SSD firmware peak performance
  – 1) no stoppages, 2) consistency is nice
• Mostly static data – some write-intensive
  – Careful use of partitioning to avoid index rebuilds and defragmentation
  – e.g., if 70% is static, 10% is write-intensive
• Does wear leveling work?

4 DATABASE – SQL SERVER

Database Environment
• OLTP + DW databases are very high value
  – Software license + development cost is huge
  – 1 or more full-time DBAs, several application developers, and help-desk personnel
  – Can justify any reasonable expense
• Full knowledge of the data (where the writes are)
• Full control of the data (where the writes are)
• Can adjust practices to avoid writes to SSD

Database – Storage Growth
• 10 GB per day data growth for a big company
  – 10M items at 1 KB per row (or 4 x 250-byte rows)
  – 18 TB over 5 years (1831 days)
  – Database log can stay on HDD
• Heavy system – 64-128 x 256/512 GB (raw) SSD
  – Each SSD can support 20 GB/day (36 TB lifetime?)
• With partitioning – few full index rebuilds
• Can replace MLC SSDs every 2 years if required
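The growth arithmetic above, as an illustrative check: 10 GB/day for 5 years, with the growth writes spread across an assumed 64-SSD array (index rebuilds and other maintenance would add to this):

```python
# Growth and per-device write load for the slide's example:
# 10 GB/day for 1831 days, spread across 64 SSDs (growth writes only).
daily_growth_gb = 10
days = 1831
ssd_count = 64

total_tb = daily_growth_gb * days / 1000
writes_per_ssd_gb_day = daily_growth_gb / ssd_count
lifetime_writes_per_ssd_tb = writes_per_ssd_gb_day * days / 1000

print(total_tb)                    # ~18.3 TB of data after 5 years
print(writes_per_ssd_gb_day)       # ~0.16 GB/day per SSD from growth alone
print(lifetime_writes_per_ssd_tb)  # ~0.29 TB written per SSD, well under 36 TB
```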

Extra Capacity – Maintenance
• Storage capacity will be 2-3X database size
  – It would be really stupid if you could not update the application for lack of space to modify a large table
• SAN environment
  – Only the required storage capacity is allocated
  – May not be able to perform maintenance operations if the SAN admin does not allocate extra space

SSD/HDD Component Pricing 2013
• MLC consumer        <$1.0K/TB
• MLC Micron P400e    <$1.2K/TB
• MLC endurance       <$2.0K/TB
• SLC                 $4K???
• HDD 600 GB, 10K     $400

Database Storage Cost
• 8 x 256 GB (raw) SSD per x4 SAS channel = 2 TB
• 2 x4 ports per RAID controller = 4 TB per controller
• 4 RAID controllers per 2-socket system = 16 TB
• 32 TB with 512 GB SSDs, 64 TB with 1 TB SSDs
  – 64 SSD per system at $250 (MLC)       $16K
  – 64 HDD 10K 600 GB at $400             $26K
  – Server 2 x E5, 24 x 16 GB, qty 2      $12K each
  – SQL Server 2012 EE, $6K x 16 cores    $96K
• HET MLC and even SLC premium OK
• Server/Enterprise premium – high validation effort, low volume, high support expectations
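The slide's cost roll-up, checked with a tiny script (all prices are the slide's estimates, not current quotes):

```python
# System cost roll-up using the slide's estimated prices.
ssd = 64 * 250          # MLC SSDs           -> $16,000
hdd = 64 * 400          # 10K 600 GB HDDs    -> $25,600 (~$26K)
servers = 2 * 12_000    # two 2-socket E5 servers
sql_ee = 16 * 6_000     # SQL Server 2012 EE per-core licensing, 16 cores

print(ssd, hdd, servers, sql_ee)
print(ssd + hdd + servers + sql_ee)  # ~$161,600 total; storage is the small part
```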

OLTP & DW
• OLTP – backup to local HDD
  – Superfast backup: read 10 GB/s, write 3 GB/s (RAID 5)
  – Writes to data are blocked during backup
  – Recovery requires log replay
• DW – example: 10 TB data, 16 TB SSD
  – Flat files on HDD
  – Tempdb will generate intensive writes (1 TB)
• Database (real) restore testing
  – Force tx roll forward/back, i.e., need an HDD array

SQL Server Storage Configuration
• The IO system must have massive IO bandwidth
  – IO over several channels
• The database must be able to use all channels simultaneously
  – Multiple files per filegroup
• Volumes / RAID groups on each channel
  – Each volume comprised of several devices

HDD, RAID versus SQL Server
• HDD – pure sequential IO is not practical; impossible to maintain
  – Large-block 256K is good enough; 64K is OK
• RAID controller – 64K to 256K stripe size
• SQL Server
  – Default extent allocation: 64K per file
  – With -E: 4 consecutive extents – why not 16???

File Layout – Physical View
(Diagram: 2-socket system, 4 HBAs, 10 GbE; each filegroup and tempdb has 1 data file on every data volume)
IO to any object is distributed over all paths and all disks

Filegroup & File Layout

Disk 2 (Controller 1, Port 0): FileGroup A File 1, FileGroup B File 1, Tempdb File 1
Disk 3 (Controller 1, Port 1): FileGroup A File 2, FileGroup B File 2, Tempdb File 2
Disk 4 (Controller 2, Port 0): FileGroup A File 3, FileGroup B File 3, Tempdb File 3
Disk 5 (Controller 2, Port 1): FileGroup A File 4, FileGroup B File 4, Tempdb File 4
Disk 6 (Controller 3, Port 0): FileGroup A File 5, FileGroup B File 5, Tempdb File 5
Disk 7 (Controller 3, Port 1): FileGroup A File 6, FileGroup B File 6, Tempdb File 6
Disk 8 (Controller 4, Port 0): FileGroup A File 7, FileGroup B File 7, Tempdb File 7
Disk 9 (Controller 4, Port 1): FileGroup A File 8, FileGroup B File 8, Tempdb File 8

As shown, 2 RAID groups per controller, 1 per port; can be 4 RAID groups/volumes per controller
OS and log disks not shown
• Each filegroup has 1 file on each data volume
• Each object is distributed across all data “disks”
• Tempdb data files share the same volumes
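A small generator sketch for the one-file-per-filegroup-per-volume layout shown above; the drive letters, database name, filegroup names, and sizes are hypothetical placeholders:

```python
# Generate T-SQL adding one file per filegroup on each data volume
# (volume drive letters, database and filegroup names are placeholders).
volumes = [f"D{i}" for i in range(1, 9)]   # 8 data volumes, 2 per controller
filegroups = ["FGA", "FGB"]

stmts = []
for fg in filegroups:
    stmts.append(f"ALTER DATABASE MyDB ADD FILEGROUP {fg};")
    for n, vol in enumerate(volumes, start=1):
        stmts.append(
            f"ALTER DATABASE MyDB ADD FILE (NAME = {fg}_File{n}, "
            f"FILENAME = '{vol}:\\Data\\{fg}_File{n}.ndf', SIZE = 100GB) "
            f"TO FILEGROUP {fg};"
        )

print("\n".join(stmts))
```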

RAID versus SQL Server Extents
(Diagram: 4 volumes – Controller 1 Ports 0/1, Controller 2 Ports 0/1 – with extents 1-48 allocated round-robin across the files)
• Default: allocate extent 1 from file 1, extent 2 from file 2, and so on
• Disk IO – 64K
• Only 1 disk in each RAID group is active

Consecutive Extents: -E
(Diagram: the same 4 volumes, with 4 consecutive extents allocated from each file in turn)
• With -E: allocate 4 consecutive extents from each file; the OS issues 256K disk IO
• Each HDD in the RAID group sees 64K IO
• Up to 4 disks in the RAID group get IO

Storage Summary
• OLTP – endurance MLC or consumer MLC?
• DW – MLC with higher overprovisioning
• QA – consumer MLC or endurance MLC?
• Tempdb – possibly SLC
• Single log – HDD; multiple logs – SSD?
• Backups / test restores / flat files – HDD
• No caching, no auto-tiering

SAN

Software Cache + Tier

Cache + Auto-Tier
• A good idea if you have: 1) no knowledge, 2) no control
• In the database we have: 1) full knowledge, 2) full control
  – Virtual file stats
  – Filegroups, partitioning
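A minimal sketch of pulling the per-file IO picture the slide refers to from sys.dm_io_virtual_file_stats; the server name and connection details are placeholders, and it assumes the pyodbc package and an ODBC driver are installed:

```python
import pyodbc  # assumes the pyodbc package and an ODBC driver are installed

# Per-file read/write volume from SQL Server's virtual file stats DMV.
QUERY = """
SELECT DB_NAME(vfs.database_id) AS db, mf.name AS file_name,
       vfs.num_of_reads, vfs.num_of_bytes_read,
       vfs.num_of_writes, vfs.num_of_bytes_written
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
ORDER BY vfs.num_of_bytes_written DESC;
"""

conn = pyodbc.connect("DRIVER={SQL Server};SERVER=myserver;Trusted_Connection=yes")
for row in conn.cursor().execute(QUERY):
    print(row.db, row.file_name, row.num_of_bytes_read, row.num_of_bytes_written)
```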

Common SAN Vendor Configuration
(Diagram: Node 1 and Node 2, 768 GB each, through 8 Gbps FC or 10 Gbps FCoE switches to SP A / SP B with 24 GB cache, x4 SAS at 2 GB/s to a main volume, log volume, SSD, 10K and 7.2K tiers, and hot spares)
• Multi-path IO: preferred port, alternate port
• Single large volume for data; additional volumes for log, tempdb, etc.
• All data IO on a single FC port – 700 MB/s IO bandwidth
• Path and component fault-tolerance, but poor IO performance

Multiple Paths & Volumes
(Diagram: Node 1 and Node 2, 768 GB each, with multiple quad-port FC HBAs and local SSD, through 8 Gb FC switches to SP A / SP B with 24 GB cache, x4 SAS at 2 GB/s to Data 1-16, SSD 1-4, and Log 1-4 volumes)
• Multiple local SSD for tempdb
• Multiple quad-port FC HBAs
• Many SAS ports
• Data files must also be evenly distributed
• Optional SSD volumes

8 Gbps FC Rules
• 4-5 HDD per RAID group/volume
  – SQL Server with -E only allocates 4 consecutive extents
• 2+ volumes per FC port
  – Target 700 MB/s per 8 Gbps FC port
• SSD volumes
  – Limited by 700-800 MB/s per 8 Gbps FC port
  – Too many ports required for serious BW
  – Management headache from too many volumes
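A quick illustration of why 8 Gbps FC ports pile up; the 700 MB/s per-port figure is the slide's, the target bandwidths are just examples:

```python
import math

# FC ports needed to hit a target bandwidth at ~700 MB/s per 8 Gbps port.
def fc_ports_needed(target_gb_s, per_port_mb_s=700):
    return math.ceil(target_gb_s * 1000 / per_port_mb_s)

print(fc_ports_needed(2.8))   # matching one 2 x4 SAS RAID controller -> 4 ports
print(fc_ports_needed(10))    # a 10 GB/s DW target -> 15 ports
```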

SQL Server
• A table scan on a heap generates 512K IO – easy to hit 100 MB/s per disk
• On a (clustered) index, 64K IO – 30-50 MB/s per disk likely

EMC VNX 5300 FT DW Ref Arch

iSCSI & File Structure
(Diagram: x4 10 GbE links, RJ45 and SFP+)
• Option 1: Controller 1 holds DB 1 files, Controller 2 holds DB 2 files
• Option 2: Controller 1 holds DB 1 file 1 and DB 2 file 1, Controller 2 holds DB 1 file 2 and DB 2 file 2

EMC VMAX

EMC VMAX: Original and 2nd Gen
Original: 2.3 GHz Xeon (Harpertown), 16 CPU cores, 128 GB cache memory (maximum), Dual Virtual Matrix, PCIe Gen 1
2nd gen: 2.8 GHz Xeon w/ turbo (Westmere), 24 CPU cores, 256 GB cache memory (maximum), Quad Virtual Matrix, PCIe Gen 2

EMC VMAX 10K

EMC VMAX Virtual Matrix

VMAX Director

EMC VMAX Director
(Diagram: director with IOHs, FC HBAs, SAS, and Virtual Matrix Interconnect)
VMAX 10K (new): up to 4 engines, 1 x 6-core 2.8 GHz per director, 50 GB/s VM BW?, 16 x 8 Gbps FC per engine
VMAX 20K engine: 4 QC 2.33 GHz, 128 GB, Virtual Matrix BW 24 GB/s; system – 8 engines, 1 TB, VM BW 192 GB/s, 128 FE ports
VMAX 40K engine: 4 SC 2.8 GHz, 256 GB, Virtual Matrix BW 50 GB/s; system – 8 engines, 2 TB, VM BW 400 GB/s, 128 FE ports
RapidIO IPC: 3.125 GHz, 2.5 Gb/s (8b/10b), 4 lanes per connection; 10 Gb/s = 1.25 GB/s, 2.5 GB/s full duplex; 4 connections per engine – 10 GB/s
36 PCI-E lanes per IOH, 72 combined: 8 FE, 8 BE, 16 VMI 1, 32 VMI 2

SQL Server Default Extent Allocation
(Diagram: extents 1-48 allocated round-robin across data files 1-4)
• Allocate 1 extent per file, round robin, with proportional fill
• EE/SE table scan tries to stay 1024 pages ahead?
• SQL Server can read 64 contiguous pages from 1 file
• The storage engine reads index pages serially in key order
• Partitioned table support for heap organization desired?

SAN
(Diagram: Node 1 and Node 2 with FC HBAs through an 8 Gb FC switch to SP A / SP B with 24 GB cache, x4 SAS at 2 GB/s to Data Volumes 1-16, a log volume, SSD 1-8, and a log SSD; 10K HDD tier)

Clustered SAS
(Diagram: Node 1 and Node 2, each with RAID/HBA adapters, 10 GbE, and IB, connected via SAS In/Out to dual-controller enclosures – each controller with SAS host IOC, 2 GB cache, PCIe switch, and SAS expander – holding SSD + HDD)

Fusion-IO ioScale