Oskari Heikkinen Introduction to Azure Data Lake Sponsors

Oskari Heikkinen Introduction to Azure Data Lake

Sponsors General Information

Oskari Heikkinen
• Director, Microsoft Azure at CGI
• Microsoft P-TSP
oskari.heikkinen@cgi.com
+358 40 561 8481
Cloud Analytics
https://www.linkedin.com/in/oskariheikkinen/

Compute Storage

Azure Data Lake Background: Cosmos at Microsoft

Azure Data Lake Storage (Gen 1)
A hyperscale repository for big data analytics workloads
• Store any data in its native format
• Hadoop File System (HDFS) for the cloud
• Enterprise grade
• No limits to scale
• Optimized for analytic workload performance

Data Lake Storage (Gen 1): Basics
Unlimited Storage
• Unlimited store account size
• Individual files can be petabytes in size
Optimized for Analytics
• Built for running analytics systems that require massive throughput
• Optimized for parallel computation over petabytes of data
High Availability
• Automatically replicates your data
• Three copies within a single region
• 99.9% SLA

Data Lake Storage (Gen 1): Data Security
Encryption
• TLS for data in transit
• Transparent server-side encryption
• Service-managed keys, or Azure Key Vault with customer-managed keys
Authentication & authorization
• Azure Active Directory
• POSIX-style Access Control Lists on folders and files
Auditing
• Audit logs for all operations
• Audit logs can be analysed with U-SQL

Data Lake Storage (Gen 1)
• Files are split into extents.
• Extents can be up to 2 GB in size.
• For availability and reliability, extents are replicated (3 copies).
• Enables parallelized reads.
[Figure: a large file split into extents 1, 2, 3, 4, each extent replicated]
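The extent layout on this slide can be illustrated with a small sketch. It is not the service's implementation: the function names are hypothetical, "up to 2 GB" is read here as a fixed 2 GiB extent size, and an in-memory byte string stands in for a stored file.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: "up to 2 GB" is modelled as a fixed extent size of 2 GiB.
EXTENT_SIZE_BYTES = 2 * 1024**3


def extent_ranges(file_size, extent_size=EXTENT_SIZE_BYTES):
    """Return (start, end) byte offsets, end exclusive, one pair per extent."""
    if file_size < 0 or extent_size <= 0:
        raise ValueError("file_size must be >= 0 and extent_size must be > 0")
    return [
        (start, min(start + extent_size, file_size))
        for start in range(0, file_size, extent_size)
    ]


def parallel_read(data, extent_size):
    """Read every extent of `data` in its own task, then reassemble in order."""
    ranges = extent_ranges(len(data), extent_size)
    with ThreadPoolExecutor() as pool:
        # pool.map preserves input order, so the chunks come back in file order.
        chunks = list(pool.map(lambda r: data[r[0]:r[1]], ranges))
    return b"".join(chunks)


if __name__ == "__main__":
    # A hypothetical 5 GiB file is covered by three extents: 2 GiB, 2 GiB, 1 GiB.
    for start, end in extent_ranges(5 * 1024**3):
        print(start, end)
```

Because each extent is an independent byte range, readers can work on different extents at the same time, which is the parallelism opportunity the slide points to.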

Large files provide parallelism opportunities
[Figure: extents paired with vertices, one vertex per extent]

Parallel writing
[Figure: front-end machines for a web service, log files, simultaneous uploads, Azure Data Lake]

Azure Data Lake Storage (Gen 1) Architecture

Key takeaway?

Data Lake Storage Gen 1 | Azure Blob Storage
Scenarios: Optimized for analytics | General-purpose bulk storage
Structure: Hierarchical file system | Flat-namespace object store
Size limits: No* | ~4.77 TB per file, 500 TB per storage account
Geo-redundancy: LRS, ZRS, GRS, RA-GRS
HDFS client: Yes

Data Lake Storage Gen 1 | Azure Blob Storage
Authentication: Azure Active Directory | Access Keys / SAS Tokens
Authorization: POSIX-style ACLs | Access Keys / SAS Tokens
Data encryption: Transparent server-side encryption | Storage Service Encryption
Connection protocols: HTTPS | HTTP / HTTPS
Firewall: Yes

Data Lake Storage Gen 2

Data Lake Gen 2: Combining the best of both?

Blob Storage | Data Lake Gen 1 | Data Lake Gen 2
Authentication: Access Keys/SAS Tokens | Azure AD
Structure: Flat namespace | Hierarchical file system | Both
Size limits: ~4.77 TB per file, 500 TB per account | No* | ~4.77 TB per file
Geo-redundancy: LRS, ZRS, GRS, RA-GRS
Hot/Cold storage tiers: Yes | No | Yes
Price*: 16.6 € / TB | 32.9 € / TB | 16.6 € / TB
*Prices per month in West Europe for LRS on 24 February 2019
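The per-TB prices on this slide translate into a monthly storage bill with simple arithmetic. The following is a minimal sketch, assuming a flat per-TB rate and ignoring transaction and egress charges; the function name and the 50 TB volume are made up for illustration.

```python
# Monthly LRS prices in EUR per TB, as quoted on the slide
# (West Europe, 24 February 2019). Real pricing changes over time.
PRICE_EUR_PER_TB = {
    "Blob Storage": 16.6,
    "Data Lake Gen 1": 32.9,
    "Data Lake Gen 2": 16.6,
}


def monthly_cost_eur(service, stored_tb):
    """Storage-only cost per month: quoted rate times stored volume."""
    if stored_tb < 0:
        raise ValueError("stored_tb must be non-negative")
    return round(PRICE_EUR_PER_TB[service] * stored_tb, 2)


if __name__ == "__main__":
    stored_tb = 50  # hypothetical data volume
    for service in PRICE_EUR_PER_TB:
        print(service, monthly_cost_eur(service, stored_tb))
```

At the quoted rates, Gen 1 costs roughly twice as much as Blob Storage or Gen 2 for the same stored volume.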

Storage Best Practices
- Design folder hierarchy structure
- Split into several services
- Service level limits
- Gen 2: disaster recovery
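One way to keep a folder hierarchy consistent is to generate every path from a single function. The zone/source/dataset/year/month/day layout below is an assumption chosen for illustration, not something the slides prescribe, and all names are hypothetical.

```python
from datetime import date
from posixpath import join


def partition_path(zone, source, dataset, day):
    """Build a date-partitioned folder path such as /raw/web/clicks/2019/02/24."""
    for part in (zone, source, dataset):
        if not part or "/" in part:
            raise ValueError(f"invalid path segment: {part!r}")
    # Zero-padded year/month/day folders keep listings sorted chronologically.
    return join("/", zone, source, dataset, f"{day:%Y}", f"{day:%m}", f"{day:%d}")


if __name__ == "__main__":
    print(partition_path("raw", "web", "clicks", date(2019, 2, 24)))
```

Putting the date at the end of the path lets access control be set once per zone or source, while jobs can still select a single day or month by prefix.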

Services for processing Big Data

HDInsight

Azure HDInsight
Hadoop as a Service on Azure
• Fully-managed Hadoop and Spark for the cloud
• 100% open source Hortonworks Data Platform
• Cluster up and running in 20 minutes
• Supported by Microsoft with 99.9% SLA
• Familiar BI tools for analysis
• Open source notebooks for interactive data science
• 63% lower TCO than deploying Hadoop on-premises*
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”

History - Why do we have Big Data technologies today?
MapReduce | RDBMS
Data volume: Petabyte scale | Gigabyte scale
Access mode: Batch | Interactive, batch
Updates: Write once, read many | Write many, read many
Structure: Schema-on-read | Schema-on-write
Integrity: Low | High
Scaling: Linear | Nonlinear
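The MapReduce column can be made concrete with a toy word count. This is a single-process sketch in plain Python, not Hadoop's API; the function names are hypothetical. It shows the map, shuffle and reduce phases, and the schema-on-read idea that raw lines are only interpreted when they are read.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Schema-on-read: the raw line gets its structure (words) only here."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big lake"]))
```

Each call to map_phase and reduce_phase depends only on its own input, which is why the real framework can spread those calls over many machines.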

Apache Hive: Enterprise Data Warehousing

Execution engines and LLAP

Azure Databricks

AZURE DATABRICKS
Spark as a Service on Azure
§ Azure Databricks is a first-party service on Azure.
• Unlike with other clouds, it is not an Azure Marketplace or a third-party hosted service.
§ Azure Databricks is integrated seamlessly with Azure services:
• Azure Portal: Service can be launched directly from the Azure Portal
• Azure Storage Services: Directly access data in Azure Blob Storage and Azure Data Lake Store
• Azure Active Directory: For user authentication, eliminating the need to maintain two separate sets of users in Databricks and Azure
• Azure SQL DW and Azure Cosmos DB: Enables you to combine structured and unstructured data for analytics
• Apache Kafka for HDInsight: Enables you to use Kafka as a streaming data source or sink
• Azure Billing: You get a single bill from Azure
• Azure Power BI: For rich data visualization
§ Eliminates need to create a separate account with Databricks.

APACHE SPARK
A unified, open source, parallel data processing framework for Big Data Analytics
§ Cluster managers: YARN, Mesos, Standalone Scheduler
§ Spark Structured Streaming: stream processing
§ Spark MLlib: machine learning

GENERAL SPARK CLUSTER ARCHITECTURE
§ The ‘Driver’ runs the user’s ‘main’ function and executes the various parallel operations on the worker nodes.
§ The results of the operations are collected by the driver.
§ The worker nodes read and write data from/to data sources including HDFS.
§ Worker nodes also cache transformed data in memory as RDDs (Resilient Distributed Datasets).
§ Worker nodes and the driver node execute as VMs in public clouds (AWS, Azure).
[Figure: Driver Program (SparkContext), Cluster Manager, Worker Nodes, Data Sources (HDFS, SQL, NoSQL, …)]
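The driver/worker split described on this slide can be mimicked in a few lines. This toy sketch uses threads from the Python standard library in place of worker VMs and is not the Spark API; the function names and the sum-of-squares job are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_task(partition):
    """A 'worker' transforms its own partition and returns a partial result."""
    return sum(x * x for x in partition)


def driver(data, num_workers=4):
    """The 'driver' partitions the data, hands out tasks and collects results."""
    # Round-robin slicing gives every worker one partition of similar size.
    partitions = [data[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(worker_task, partitions))
    # Only the small partial results travel back to the driver.
    return sum(partial_results)


if __name__ == "__main__":
    print(driver(list(range(10))))
```

The point of the pattern is that the bulk of the data stays with the workers, and the driver only sees the combined partial results.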

CATALYST QUERY OPTIMIZER

DEMO: HDInsight & Databricks

External Metastore

Call to Action
- Read how these work:
  - HDFS
  - YARN
  - Spark
- Learning by doing: Start playing around with the services

Thank you!
