Realtime Structured Streaming in Azure Databricks Brian Steele

Realtime Structured Streaming in Azure Databricks
Brian Steele - Principal Consultant
bsteele@pragmaticworks.com

Your Current Situation
• You currently have high-volume data that you are processing in batch
• You are trying to get real-time insights from your data
• You have great knowledge of your data, but limited knowledge of Azure Databricks or other Spark systems

Prior Architecture

New Architecture

Why Azure Databricks?
• Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform.
• Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
• Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics service.

• For a big data pipeline, data (raw or structured) is ingested into Azure through Azure Data Factory in batches, or streamed in near real time using Kafka, Event Hubs, or IoT Hub.
• This data lands in a data lake for long-term persisted storage, in Azure Blob Storage or Azure Data Lake Storage.
• As part of your analytics workflow, use Azure Databricks to read data from multiple data sources such as Azure Blob Storage, Azure Data Lake Storage, Azure Cosmos DB, or Azure SQL Data Warehouse and turn it into breakthrough insights using Spark.
• Azure Databricks provides enterprise-grade Azure security, including Azure Active Directory integration, role-based controls, and SLAs that protect your data and your business.

Advantages of Structured Streaming
• Structured Streaming is the Apache Spark API that lets you express computation on streaming data in the same way you express a batch computation on static data.
• The Spark SQL engine performs the computation incrementally and continuously updates the result as streaming data arrives.
• Databricks maintains the current checkpoint of the data processed, making restart after failure nearly seamless.
• Can bring impactful insights to users in near real time.

Streaming Data Sources/Sinks
Sources
• Azure Event Hubs/IoT Hubs
• Databricks Delta Tables
• Azure Data Lake Gen 2 (Auto Loader)
• Apache Kafka
• Amazon Kinesis
• Amazon S3 with Amazon SQS
Sinks
• Databricks Delta Tables
• Almost any sink using foreachBatch

Structured Streaming
• Source Parameters
  • Source Format/Location
  • Batch/File Size
• Transformations
  • Streaming data can be transformed in the same ways as static data
• Output Parameters
  • Output Format/Location
  • Checkpoint Location

DEMO

Join Operations

Stream-Static Joins
• Join Types
  • Inner
  • Left
• Not Stateful by default

DEMO

Stream-Stream Joins
• Join Types
  • Inner (Watermark and Time Constraint Optional)
  • Left Outer (Watermark and Time Constraint Required)
  • Right Outer (Watermark and Time Constraint Required)
• You can also join Static Tables/Files into your Stream-Stream Join

Watermark vs. Time Constraint
• Watermark – How late a record can arrive, and after what time it can be removed from the state
• Time Constraint – How long records will be kept in state in relation to the other stream
• Only used in stateful operations
• Ignored in non-stateful streaming queries and batch queries



DEMO

foreachBatch
• Allows batch-type processing to be performed on streaming data
• Perform processes without adding to state
  • dropDuplicates
  • Aggregating data
• Perform a Merge/Upsert with existing static data
• Write data to multiple sinks/destinations
• Write data to sinks not supported in Structured Streaming

DEMO

Going to Production
• Spark Shuffle Partitions
  • Set equal to the number of cores on the cluster
• Maximum Records per Micro-Batch
  • File Source/Delta Lake – maxFilesPerTrigger, maxBytesPerTrigger
  • Event Hubs – maxEventsPerTrigger
• Limit Stateful Operations – limits state size and memory errors
  • Watermarking
  • MERGE/Join/Aggregation
  • Broadcast Joins
• Output Tables – influence downstream consumers
  • Manually re-partition
  • Delta Lake – Auto-Optimize

Conclusion

Have Any Questions?