Smith Matt F matthew smithdell com SQL Server

  • Slides: 31
Download presentation
Smith, Matt F | matthew. smith@dell. com SQL Server Poly. Base and Dell EMC

Smith, Matt F | matthew. smith@dell. com SQL Server Poly. Base and Dell EMC Isilon storage

A little bit about me. • I work for the Dell EMC Big Data/Io.

A little bit about me. • I work for the Dell EMC Big Data/Io. T Solution Engineering Team as a Solutions Architect (consulting & delivery) • Started with SQL Server 7 back in the Dot-com days doing application development: ASP 3. 0, VB 6 and SQL Server. • Lots of work with SQL Server over the years: Dev, DBA, BI • Moved into the Big Data space a few years ago: Pivotal, Hortonworks, Cloudera.

Why I chose this topic. • • Personal interest in creating a performant entry-level

Why I chose this topic. • • Personal interest in creating a performant entry-level Big Data solution for small to mid-sized Hadoop File System data sets (1 -5 TB). Reduce the common barriers of entry to Big Data: Training and experience, financial investment, infrastructure. Introduce Big Data into organizations through known & trusted products, leveraging existing skill sets. Realize value in helping DBA’s, developers and organizations start their big data journeys through low-risk pilot projects.

SQL Server Poly. Base – a quick overview • Introduced and initially a component

SQL Server Poly. Base – a quick overview • Introduced and initially a component of SQL Server 2012 Parallel Data Warehouse (PDW), now known as Analytics Platform System (APS) • Added to the SQL Server Product Family in 2016 • Allows you to access semi-structured data (Schema on Read) located in an HDFS compliant file-storage through SQL Server • Connect to Dell EMC Isilon, Azure blob storage, Hadoop or Cloudera

What is Isilon? • Scale-out, Multi-protocol Network Attached Storage • Write with NFS, SMB

What is Isilon? • Scale-out, Multi-protocol Network Attached Storage • Write with NFS, SMB 3, FTP, HTTP and read immediately with another protocol • 7000+ customers, scales from 15 TB to 68 PB • One. FS file system – Isilon OS • Each node includes processors, ram, network and disk. Compute on every node. Network: 10 Gb. E, 40 Gb. E. • Wide selection of disk choices for nodes allow for tiering of hot data (Flash), Hybrid and Archive storage. Automate data movement across storage tiers • Active Directory & Kerberos integration

Integrated Isilon and Poly. Base NFS Web Click data Poly. Base Cluster Step 2:

Integrated Isilon and Poly. Base NFS Web Click data Poly. Base Cluster Step 2: Jobs are run Decision Support Databases name node EDW Step 1: Much or all of the Data lives on the Isilon/Hadoop Cluster Head Compute data node OLAP SMB, NFS, HTTP, FTP, HDFS node info name node Isilon

POC Architecture – MS SQL with Isilon HDFS Clickstream Isilon services as HDFS and

POC Architecture – MS SQL with Isilon HDFS Clickstream Isilon services as HDFS and Name-Nodes Enables Microsoft T-SQL queries in the HADOOP environment Parallel operation on HADOOP and MS SQL in the same database DSS Sensor MS SQL Integration 1. Direct queries from SQL to Isilon HDFS 2. External Pushdown OLAP EDW MS SQL 2016 Enterprise Edition Poly. Base Scale. Out Group Very efficient methods (SMB, FTP, NFS, HDFS) for data import by DELL EMC ISILON Multiple Hadoop applications can even simultaneously access the same dataset (Isilon) at the same time

Configuring Isilon for Poly. Base POC Steps 1. Create an Access Zone 2. Configure

Configuring Isilon for Poly. Base POC Steps 1. Create an Access Zone 2. Configure HDFS Service 3. Configure Network 4. Create pdw_user 5. Add Active Directory NS record & test

Configuring Isilon – POC • • Verify Licenses Create an Access Zone

Configuring Isilon – POC • • Verify Licenses Create an Access Zone

Configuring Isilon – POC Configure HDFS

Configuring Isilon – POC Configure HDFS

Configuring Isilon – POC Configure Smart. Connect

Configuring Isilon – POC Configure Smart. Connect

Configuring Isilon – POC Create pdw_user

Configuring Isilon – POC Create pdw_user

Configuring Isilon – POC Add DNS entry, test.

Configuring Isilon – POC Add DNS entry, test.

Configuring Isilon - Resources EMC Isilon Best Practices Guide for Hadoop Data Storage -

Configuring Isilon - Resources EMC Isilon Best Practices Guide for Hadoop Data Storage - Dell EMC https: //www. emc. com/collateral/. . . /h 12877 -wp-emcisilon-hadoop-best-practices. pdf One. FS with HDFS Reference Guide - Dell EMC https: //www. emc. com/collateral/Technical. Document/doc u 84284. pdf

Install & Enable Poly. Base on SQL Server Install the Oracle Java SE Runtime

Install & Enable Poly. Base on SQL Server Install the Oracle Java SE Runtime Environment (JRE) 7. 51 (x 64) or 8. Do not install JRE 9!

Poly. Base Data Sources • • Data sources include Isilon, Azure, Hortonworks and Cloudera

Poly. Base Data Sources • • Data sources include Isilon, Azure, Hortonworks and Cloudera Add Multiple Data Sources (pending support) • Hortonworks HDP 1. 3, 2. 0, 2. 1, 2. 2 • Cloudera CDH 4. 3, 5. 1

Poly. Base External File Format • • • File formats are required You create

Poly. Base External File Format • • • File formats are required You create and define your own File Formats Supported File Formats include: • • • Text (delimited) Hive ORC – Optimized Row Columnar Hive RCFile - Record Columnar (key-value)

Poly. Base External Tables • • External Tables are schema on read You define

Poly. Base External Tables • • External Tables are schema on read You define external table columns

Table Statistics Add statistics to optimize query performance

Table Statistics Add statistics to optimize query performance

Query Testing. . . Used Adam Machanic’s sp_Who. Is. Active so I could get

Query Testing. . . Used Adam Machanic’s sp_Who. Is. Active so I could get a better idea of what was going on. Observations… • Lots of Temp Tables! • Some data moving back and forth • Things seemed to take a long time • Data came back!

Insert into / Select from worked best. Insert into / Select From to Load

Insert into / Select from worked best. Insert into / Select From to Load the data you want to work with locally • Create indexes (Columnstore or other) and then join to local tables for best performance • Joining from local tables to external tables just didn’t work well • Nightly ETL processes to move relevant data from HDFS External Tables into SQL Server makes sense Opportunities for Improvement • Add additional Scale-out compute nodes (I plan to test SQL Express!)…also Containers & Powershell provisioning • Temp. DB on flash or SSD

DMV’s for troubleshooting and analysis Many DMV’s exist. The link at the bottom of

DMV’s for troubleshooting and analysis Many DMV’s exist. The link at the bottom of this slide includes a great process on how to use them. https: //msdn. microsoft. com/library/ce 9078 b 7 -a 750 -4 f 47 -b 23 e-90 b 83 b 783 d 80

Planning for Growth Data lakes tend to fill up.

Planning for Growth Data lakes tend to fill up.

HDFS: Standard Hadoop Cluster NFS reply Decision Support Databases HTTP CIFS FTP OLAP EDW

HDFS: Standard Hadoop Cluster NFS reply Decision Support Databases HTTP CIFS FTP OLAP EDW Compute Data Name node Node node Web Click data NFS Step 1: Data is copied into the Landing Zone file node copy 2 copy 3 info Landing Zone Servers node file info HDFS Step 2: Data is copied into the Cluster (3 times) node file info copy 3 copy 2 3 X node file info copy 2 copy 3 MAP Reduce MAP Reduce MAP Reduce Step 3: Hadoop Jobs are run

The EMC Isilon Advantage for Analytics HDFS 1 Scale-Out Storage Platform – Multiple applications

The EMC Isilon Advantage for Analytics HDFS 1 Scale-Out Storage Platform – Multiple applications & workflows 2 No Single Point of Failure – Distributed Name. Node 3 End-to-End Data Protection – Snapshot. IQ, Sync. IQ, NDMP Backup 4 Industry-Leading Storage Efficiency – >80% Storage Utilization 5 Independent Scalability – Add compute & storage separately 6 Multi-Protocol – Industry standard protocols – NFS, CIFS, FTP, HTTP, HDFS

Hadoop Architecture with Isilon R (RHIPE) PIG Compute Node Mahout Job Tracker Compute Node

Hadoop Architecture with Isilon R (RHIPE) PIG Compute Node Mahout Job Tracker Compute Node Hive HBase Name. Node Task Tracker Compute Node name node Compute Node name node data node Ethernet Data. Node

Can you really do big data with SQL Server 2016+ ? • • Yes

Can you really do big data with SQL Server 2016+ ? • • Yes you can, to a point. Poly. Base and a Hadoop Distributed File System (HDFS) allows you to get started with Big Data Many organizations use Dell EMC Isilon storage, which supports HDFS This is a reasonable, workable solution, tested with 2. 5 TB of unstructured marketing analytics (clickstream) data

Your Next Step: Big data pilot project • • Acquire some space on your

Your Next Step: Big data pilot project • • Acquire some space on your organization’s Isilon cluster. Alternatively, you could look into Azure Blob Storage and Azure Compute (SQL Server & Windows Server). Load your data sets into your newly acquired storage Install SQL Server 2016+ (Head) and one or two Compute Nodes Create your External Data Source, External File Format, External Tables and start experimenting

https: //www. emc. com/products-solutions/trial-software-download/isilon. htm

https: //www. emc. com/products-solutions/trial-software-download/isilon. htm

Special Thanks to: • • • Dell EMC Big Data team - @Dell. EMCbigdata

Special Thanks to: • • • Dell EMC Big Data team - @Dell. EMCbigdata Christian Scharrer - Dell EMC Senior Systems Engineer Rob Sonders – Dell EMC Microsoft Specialist – SQL Server: @Robert. Sonders Michael Wells - Dell EMC Senior Systems Engineer: @Sql. Tech. Mike Dell EMC - Denver, CO office - Isilon Engineering Team: @Dell. EMCStorage

Thank You Sponsors Platinum Gold Silver Bronze Swag Venue

Thank You Sponsors Platinum Gold Silver Bronze Swag Venue