SQL Server 2008 R 2 Parallel Data Warehouse

  • Slides: 43
Download presentation
SQL Server 2008 R 2 Parallel Data Warehouse Steve Busby Data Warehouse Technology Specialist

SQL Server 2008 R 2 Parallel Data Warehouse Steve Busby Data Warehouse Technology Specialist Microsoft Corporation Steve. Busby@microsoft. com Facebook: msstevebus Twitter: @Bama. Steve. B

About Steve � 14 year MSFT veteran �MCS, then MTC �Appdev/SQL background (OLTP) �But

About Steve � 14 year MSFT veteran �MCS, then MTC �Appdev/SQL background (OLTP) �But Infrastructure too �Just moved back “home” �Graysville, Minor High (‘ 89) �BSEE – UAB �Former Southern Company Services

Some caveats �This presentation is about V 1 of PDW �V 2 is releasing

Some caveats �This presentation is about V 1 of PDW �V 2 is releasing December’ish �New HW and SW �will be launched at SQLPASS �Can’t talk publicly about V 2 (NDA only) �Significant performance increases and cost decreases �Incentive to buy now though (procs vs cores)

One more caveat! • I’m not new to Microsoft, but new to role (and

One more caveat! • I’m not new to Microsoft, but new to role (and I’m rusty on DW) • You guys get to be my lab rats

MANAGE ANY DATA, ANY SIZE, ANYWHERE 01010101010 010101010

MANAGE ANY DATA, ANY SIZE, ANYWHERE 01010101010 010101010

A Complete Business Analytic Platform Integration with SQL Server BI Complementary tools Non-Microsoft BI

A Complete Business Analytic Platform Integration with SQL Server BI Complementary tools Non-Microsoft BI & ETL Tools SQL Server Integration Services, Analysis Services Power. Pivot & Power View Hadoop Stream. Insight Master Data Services Informatica Business Objects Microstrategy and SAS 6

Microsoft & Data Warehouse Alerts, Notifications SQL Server Stream. Insight Big Data Sources (Raw,

Microsoft & Data Warehouse Alerts, Notifications SQL Server Stream. Insight Big Data Sources (Raw, Unstructured) Data & Compute Intensive Application Business Insights SQL Server FTDW Data Marts Sensors Load Summarize & Load Devices Fast SQL Server Reporting Services Hadoop on Windows Azure Bots Hadoop on Windows Server SQL Server Parallel Data Warehouse Historical Data (Beyond Active Window) Interactive Reports Integrate/Enrich SQL Server Analysis Server Crawlers Performance Scorecards Enterprise ETL with SSIS, DQS, MDS ERP CRM LOB Source Systems APPS

MICROSOFT DATA WAREHOUSING SOLUTIONS Reference Architectures offering best price performance Appliance for Data Warehousing

MICROSOFT DATA WAREHOUSING SOLUTIONS Reference Architectures offering best price performance Appliance for Data Warehousing requiring highest scalability, performance or complexity Ideal for data marts or small to mid-sized DWs with scan centric workloads Offers flexibility in hardware and architecture Software only Reference Architectures (Software and Hardware) DW Appliance (Fully integrated Software and Hardware) Scale-Up DW Scale-Out DW with MPP Scalable and reliable platform Ideal for data marts or small to mid-sized EDWs Tier 1 Services and Support

Some SQL Data Warehouses today Big SAN Big 64 -core Server Connected together What’s

Some SQL Data Warehouses today Big SAN Big 64 -core Server Connected together What’s wrong with this picture?

Answer: system out of balance • This server can consume 16 GB/Sec of IO,

Answer: system out of balance • This server can consume 16 GB/Sec of IO, but the SAN can only deliver 2 GB/Sec − Even when the SAN is dedicated to the SQL Data Warehouse, which it often isn’t • System is typically IO bound • Queries are slow Result: significant investment, not delivering performance

The Alternative: A Balanced System • Design a server + storage configuration that can

The Alternative: A Balanced System • Design a server + storage configuration that can deliver all the IO bandwidth that CPUs can consume when executing a SQL Relational DW workload • Avoid sharing storage devices among servers

Fast Track Data Warehouse Components Balanced across all components CPU Feed Rate B FC

Fast Track Data Warehouse Components Balanced across all components CPU Feed Rate B FC HBA A A B FC HBA A B SQL Server Read Ahead Rate B DISK HBA Port Rate A A B STORAGE CONTROLLER CACHE A FC SWITCH CPU CORES WINDOWS SQL SERVER CACHE SERVER SQL Server 2008 R 2 Potential Performance Bottlenecks A B Switch Port Rate LUN DISK B SP Port Rate DISK LUN Read Rate Disk Feed Rate

Two SQL DW Infrastructure Options: SQL Classic DW or Fast Track SQL DW SQL

Two SQL DW Infrastructure Options: SQL Classic DW or Fast Track SQL DW SQL Classic DW Architecture Leverages Shared SAN Enterprise Shared SAN Storage Shared Network Bandwidth Fast Track SQL DW Architecture modeled after DW Appliances “ Appliance Like” solutions Uses Dedicated SAN arrays and Network Dedicated Network Bandwidth Dedicated SAN SQL 2008 Data Warehouse SMP Server OLTP Applications SQL Fast Track DW supports “Scan Centric” DW workloads that are index light

Balanced System Scale UP with Fast Track v 3 Peak Scan rates 14 GBs/sec

Balanced System Scale UP with Fast Track v 3 Peak Scan rates 14 GBs/sec – Proliant DL 980 CPU Socket (8 Core) CPU Socket (8 Core) Fiber Switch Storage Processor Dual 8 GB Ports X 4 per P 2000 CPU Socket (8 Core) Storage Processor Storage Processor HBA HBA Storage Processor HBA Server Storage Processor HBA Storage Processor RAID-1 RAID-1 RAID-1 RAID-1 RAID-1 Storage Enclosure RAID-1 RAID-1 RAID-1 RAID-1 Storage Enclosure RAID-1 RAID-1 Storage Enclosure

SQL Server Parallel Data Warehouse A data warehouse appliance with massive scalability What do

SQL Server Parallel Data Warehouse A data warehouse appliance with massive scalability What do you do when the combination of your analytic workload and data volumes exceed the capabilities of SQL Server 2008 R 2 and Fast Track Reference Architecture?

PDW - Massive Scale and Performance Complete Business Analytics Platform Massive Scale & Performance

PDW - Massive Scale and Performance Complete Business Analytics Platform Massive Scale & Performance Simplicity & Value Benefits 16

Parallel Data Warehouse An appliance experience �All hardware from a single vendor �Multiple vendors

Parallel Data Warehouse An appliance experience �All hardware from a single vendor �Multiple vendors to chose from �Orderable at the rack level �Vendor will: � Assemble appliances � Image appliances with OS, SQL Server, and PDW software �Appliance installed in 1 – 2 days �Support: � Microsoft provides first call support � Hardware partner provides onsite break/fix support

Linear Scalability 1 Data Rack 4 Data Rack: • • 10 Servers 20 Procs

Linear Scalability 1 Data Rack 4 Data Rack: • • 10 Servers 20 Procs 240 Cores 1. 28 TB RAM Control Rack Data. Rack Expand to 4 data racks and quadruple your performance or capacity! 4 o Servers 80 Procs 960 Cores 5. 12 TB RAM

Dell/Microsoft PDW Architecture Scales for Resilience and High Performance, with Low Cost of Entry

Dell/Microsoft PDW Architecture Scales for Resilience and High Performance, with Low Cost of Entry Control Rack Data Racks (up to 4) Power. Vault MD 3620 f Storage Nodes Power. Edge R 610 Database Servers Control Nodes (R 710) Active / Passive Data Center Monitoring Dual Infiniband Management Servers (R 610) Dual Fibre Channel Client Drivers Landing Zone (R 510) ETL Load Interface Backup Node (R 710 and MD 3600 f w/MD 1200’s) Corporate Backup Solution Spare Database Server Corporate Network Private Network

PDW 1 Data Rack Appliance Control Rack Data Rack Compute Nodes Designed Specifically for

PDW 1 Data Rack Appliance Control Rack Data Rack Compute Nodes Designed Specifically for DW Workloads Control Nodes Storage Nodes SQL Active / Passive SQL SQL True MPP shared nothing architecture Management Servers Landing Zone Dual Infiniband Linear scalability by adding additional data racks. SQL SQL High Availability Built in Backup Node SQL Spare Compute Node Dual Fiber Channel SQL

Control Rack– Designated Server Workloads Control Rack An Appliance = 1 Control Rack and

Control Rack– Designated Server Workloads Control Rack An Appliance = 1 Control Rack and 1 to 4 Data Racks Control Nodes Active / Passive Control Node Takes submitted statement and creates D-SQL plan SQL Landing Zone Backup Node Dual Infiniband Management Servers Management Server Patch Management, Active Directory, HPC monitoring Landing Zone Parallelizes loads. Tested to 1 -2 TB an hour. LZ has 6. 5 TB storage Backup Node Has enough storage to match data configuration. Backup Speeds 4 -6 TB an hour

Power of 10 Compute Nodes Data Rack Compute Nodes Each Compute Node complete independent

Power of 10 Compute Nodes Data Rack Compute Nodes Each Compute Node complete independent SQL server. True MPP architecture. 24 cores, 128 GB Ram, per node Storage Nodes SQL SQL Data Rack: 1280 GB RAM 240 Cores Rack Capacity: up to 153 TB Dual Infiniband SQL SQL Scan Rates: up to 3. 6 TB min SQL Spare Compute Node Dual Fiber Channel SQL

PDW Benefits – Massive Parallel Processing Storage Nodes Compute Nodes Active / Passive SQL

PDW Benefits – Massive Parallel Processing Storage Nodes Compute Nodes Active / Passive SQL Management Servers Landing Zone Backup Node Dual Infiniband Query 1 ? ? ? ? ? SQL SQL Dual Fiber Channel Control Nodes Query 1 is submitted to SQL Server on Control Node SQL Spare Compute Node Query is executed on all 10 Nodes Results are sent back to client

Built In High Availability Failover Clusters Mirrored drives Dual networks Hot swap drive Dual

Built In High Availability Failover Clusters Mirrored drives Dual networks Hot swap drive Dual power supplies Dual cooling fans Price your appliance knowing you have no hidden costs for HA.

More secret sauce….

More secret sauce….

MPP Data Layout Options Replicated • A table structure that exists as A full

MPP Data Layout Options Replicated • A table structure that exists as A full copy within each discrete DBMS instance. Distributed • A table structure that is hashed on a single column and uniformly distributed across all nodes on the appliance. Each distribution is A separate physical table in the DBMS.

Basic Physical DB Design in PDW Time Dim Product Dim Date Dim ID Calendar

Basic Physical DB Design in PDW Time Dim Product Dim Date Dim ID Calendar Year Calendar Qtr Calendar Mo Calendar Day Prod Dim ID Prod Category Prod Sub Cat Prod Desc SQL Sales Facts Date Dim ID Store Dim ID Prod Dim ID Mktg Camp Id Qty Sold Dollars Sold Store Dim ID Store Name Store Mgr Store Size SQL Mktg Campaign Dim Mktg Camp ID Camp Name Camp Mgr Camp Start Camp End SQL

Basic Physical DB Design in PDW Distributed Tables Time Dim Product Dim Date Dim

Basic Physical DB Design in PDW Distributed Tables Time Dim Product Dim Date Dim ID Calendar Year Calendar Qtr Calendar Mo Calendar Day Prod Dim ID Prod Category Prod Sub Cat Prod Desc Store Dim ID Store Name Store Mgr Store Size SQL Sales Facts Date Dim ID Store Dim ID Prod Dim ID Mktg Camp Id Qty Sold Dollars Sold Larger Fact Table is Hash Distributed Across All Compute Nodes SQL Mktg Campaign Dim Mktg Camp ID Camp Name Camp Mgr Camp Start Camp End SQL SF SF -1 -1 -1 -1 SF SF -1 -1 -2 -1 SF SF -1 -1 -3 -1 SF SF -1 -1 -4 -1

Create Table Syntax CREATE TABLE Fact. Sales ( Product. Key INT NOT NULL ,

Create Table Syntax CREATE TABLE Fact. Sales ( Product. Key INT NOT NULL , Order. Date. Key INT NOT NULL , Due. Date. Key INT NOT NULL , Ship. Date. Key INT NOT NULL , Reseller. Key INT NOT NULL , Employee. Key INT NOT NULL , Promotion. Key INT NOT NULL , Currency. Key INT NOT NULL , Sales. Territory. Key INT NOT NULL , Sales. Order. Number VARCHAR(20) NOT NULL, … ) WITH ( DISTRIBUTION = HASH(Product. Key), CLUSTERED INDEX(Order. Date. Key) , PARTITION (Order. Date. Key RANGE RIGHT FOR VALUES ( 20010601, 20010901, … ) ) ); Control Node Create table metadata on Control Node Send Create Table SQL to each compute node Create Table Fact. Sales_A Create Table Fact. Sales_B Create Table Fact. Sales_C …… Create Table Fact. Sales_H Compute Node 1 Compute Node 2 … Compute Node 10 Fact. Sale Fact. Sale s_A s_A s_A s_A Fact. Sale Fact. Sale s_A s_A Fact. Sale Fact. Sale s_B s_B s_B s_B Fact. Sale Fact. Sale s_B s_B Fact. Sale Fact. Sale s_C s_C s_C s_C Fact. Sale Fact. Sale s_C s_C Fact. Sale Fact. Sale s_D s_D s_D s_D Fact. Sale Fact. Sale s_D s_D Fact. Sale Fact. Sale s_E s_E s_E s_E Fact. Sale Fact. Sale s_E s_E Fact. Sale Fact. Sale s_F s_F s_F s_F Fact. Sale Fact. Sale s_F s_F Fact. Sale Fact. Sale s_G s_G s_G s_G Fact. Sale Fact. Sale s_G s_G Fact. Sale Fact. Sale s_H s_H s_H s_H Fact. Sale Fact. Sale s_H s_H

Basic Physical DB Design in PDW Replicated. Tables Time Dim Product Dim Date Dim

Basic Physical DB Design in PDW Replicated. Tables Time Dim Product Dim Date Dim ID Calendar Year Calendar Qtr Calendar Mo Calendar Day Prod Dim ID Prod Category Prod Sub Cat Prod Desc Sales Facts Date Dim ID Store Dim ID Prod Dim ID Mktg Camp Id Qty Sold Dollars Sold Store Dim ID Store Name Store Mgr Store Size Mktg Campaign Dim Mktg Camp ID Camp Name Camp Mgr Camp Start Camp End Smaller Dimension Tables are Replicated on Every Compute Node SQL T D S D P D M D SF SF -1 -1 -2 -1 SQL T D S D P D M D SF SF -1 -1 -3 -1 SQL T D S D P D M D SF SF -1 -1 -4 -1 Result: Fact -Dimension Joins can be performed locally SF SF -1 -1 -1 -1

Storage Node – Drive Array (User Databases) User Database Small Form Factor Drives (SFF)

Storage Node – Drive Array (User Databases) User Database Small Form Factor Drives (SFF) (24) – 8 RAID 1 groups containing a single LUN Large Form Factor Drives (LFF) (11) – 4 RAID 1 groups containing 2 LUNs each LUN 1 LUN 2 LUN 3 LUN 4 LUN 5 LUN 6 LUN 7 LUN 8 LUN 9 SFF – 6 disk in a RAID 1+0 group LFF – 2 disk in a RAID 1 group Log LUN 9 SFF - 2 Hot Spares LFF – 1 Hot Spare (SFF) 32

User Database Storage Node – Physical File Layout LUN 1 LUN 2 LUN 3

User Database Storage Node – Physical File Layout LUN 1 LUN 2 LUN 3 LUN 4 LUN 5 LUN 6 LUN 7 LUN 8 FG UDB_Dist_A FG UDB_Dist_B FG UDB_Dist_C FG UDB_Dist_D FG UDB_Dist_E FG UDB_Dist_F FG UDB_Dist_G FG UDB_Dist_H UDBDist. ndf UDBRepl. ndf UDBDist. ndf UDB Replicated FG UDBRepl. ndf LUN 9 Log LUN 9 UDBRepl. ndf 8 distribution filegroups 1 replicated filegroup with 8 files 1 log filegroup 33

Parallelism with Distributed Tables 2 Data Rack system • PDW, distributes a single large

Parallelism with Distributed Tables 2 Data Rack system • PDW, distributes a single large logical table across 160 tables across the 20 servers. • The distribution is performed by selecting a column in each table and applying a hash function to it. Fact. Orders Order. ID Order. Date. Key Order. Time. Key Clustered Index = Order. Date. Key Fact. Trades Partition = Order. Date. Key Trade. Time. Key Partition = Order. Date. Key Current. Order. ID Customer. Key Total. Qty Trade. ID Clustered Index = Order. Date. Key Distribution = Order. ID Old. Order. ID … Distribution = Current. Order. ID 34

Parallelism with Distributed Tables Distribution Compatible Join Node 1 Fact Orders Fact Trades Fact

Parallelism with Distributed Tables Distribution Compatible Join Node 1 Fact Orders Fact Trades Fact Orders Order. ID Order. Date. Key Order. Time. Key Customer. Key Node 2 Fact Orders Fact Trades = …. Total. Qty = Fact. Trades Trade. ID Trade. Date. Key Trade. Time. Key Current. Order. ID Old. Order. ID Node 20 Fact Orders Fact Trades = SELECT Order. ID, COUNT(Trade. ID) FROM Fact. Orders JOIN Fact. Trades ON Fact. Orders. Order. ID = Fact. Trades. Current. Order. ID WHERE …… 35

Parallelism with Distributed Distribution Incompatible Join Shuffle Move Node 1 Fact Orders Fact Trades

Parallelism with Distributed Distribution Incompatible Join Shuffle Move Node 1 Fact Orders Fact Trades Fact Orders Order. ID Order. Date. Key Order. Time. Key Temp. DB Node 2 Fact Orders Fact Trades Customer. Key …. Total. Qty = = Fact. Trades Trade. ID Trade. Date. Key Trade. Time. Key Current. Order. ID Temp. DB Old. Order. ID Node 20 Fact Orders Fact Trades = Temp. DB SELECT Order. ID, COUNT(Trade. ID) FROM Fact. Orders JOIN Fact. Trades ON Fact. Orders. Order. ID = Fact. Trades. Old. Order. ID WHERE …… 36

? ? ? Control Rack Compute Nodes ? ? Control Nodes ( Active /

? ? ? Control Rack Compute Nodes ? ? Control Nodes ( Active / Passive ? ? ? ? Management Servers Load Data Rack Landing Zone LLLLL Backup Node ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? Storage Nodes Dual Fiber Channel ? ? Dual Infiniband ? Massive Scale and Performance with Parallel Data Warehouse Spare Database Server Multiple queries are simultaneously executed across all nodes. PDW supports querying while data is loading.

PDW Data Loader Illustrated Distributed Tables Control Rack Data Rack Control Node Active/Passive Database

PDW Data Loader Illustrated Distributed Tables Control Rack Data Rack Control Node Active/Passive Database Server Nodes Storage Nodes DWLoader Invoked from Command line DMS Reads Load File and buffers records to Sends to Compute Nodes roundrobin SQL Server PDWEngine Load Manager Creates Staging Tables Load Manager Export Manager DMS Infiniband DMS Ser er Converter Sender Receiver Writer Each row is converted for bulk insert and optionally hashed Hashed row is sent to appropriate node receiver for loading Landing Zone Load File Load Client DMS Converter Sender Receiver Writer Received row is pushed onto writer thread Row is bulk inserted into staging table

Distributed Data Warehouse Architecture Departmental Reporting Central EDW Hub Accessible from Anywhere SQL Server

Distributed Data Warehouse Architecture Departmental Reporting Central EDW Hub Accessible from Anywhere SQL Server SQL Database Landing Zone High-Performance Reporting Regional Reporting SQL Server Analysis Services SQL Server Fast. Track 3 rd Party Data Integration ETL Tools Data Files 3 rd Party RDBMS

Getting you there with DW Appliance Support • − − • − − −

Getting you there with DW Appliance Support • − − • − − − • − − − Provided by the hardware vendor support group. Included in the MCAP- PMC pricing at no cost between July 2010 and June 2011 – May be delivered by MCS or a Microsoft Partner – will also be provided to customers who do not purchase PMC 3 PMC features listed is basic PMC offering, PMC can be customized and configured to the customers environment. 1 2

PDW Momentum – Customer Wins in FY 12

PDW Momentum – Customer Wins in FY 12

DIRECTEDGE gets 142 X Performance Gain with Parallel Data Warehouse (PDW) Appliance Benefits 42

DIRECTEDGE gets 142 X Performance Gain with Parallel Data Warehouse (PDW) Appliance Benefits 42

Hy-Vee gets 100 X Performance Gain with Parallel Data Warehouse (PDW) Appliance Benefits 43

Hy-Vee gets 100 X Performance Gain with Parallel Data Warehouse (PDW) Appliance Benefits 43

© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows Vista and other product names

© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U. S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.