Data Pipeline Frameworks: The Dream and the Reality

  • Slides: 43
Data Pipeline Frameworks: The Dream and the Reality. DataEngConf NYC, Nov. 9, 2018. Powering the Programmatic Cloud. Mark Weiss, Senior Software Engineer, Beeswax. mark@beeswax.com, LinkedIn: @marksweiss

Topics
● About Beeswax and Our Pipelines
● Building an Example Pipeline
● Reality Checks
● What Do Data Pipeline Frameworks Give You?
● The Dream

About Beeswax and Our Pipelines

The Beeswax Programmatic Platform. Diagram: Ad Exchanges, Bid Request, Bidder, Bid Response. Millions of requests per second; 99th percentile response time < 30 ms

Beeswax Data Pipelines: Processing at Scale. Diagram: Bid Request, Bid Response, Impression, Click/Conversion, Kinesis, ETL, Event Processing, S3, AWS Data Pipeline, Airflow, Data Warehouse (Redshift), Data Lake (EMR/Hive)

Beeswax Data Pipelines: Processing at Scale. Diagram: Bid Request, Bid Response, Impression, Click/Conversion, Kinesis, ETL, Event Processing, S3, AWS Data Pipeline, Airflow, Data Warehouse (Redshift), Data Lake (EMR/Hive). Volumes: 1% Sampled Auctions: billions / day; Bid Requests: 10's of billions / day; Impressions: 100's of millions / day

Beeswax Data Pipelines: The Scale is Growing

Building an Example Pipeline

The Example Pipeline: Building a Datalake. Diagram: S3, AWS Data Pipeline, Airflow, Data Warehouse (Redshift), Data Lake (Hive/EMR)

The Frameworks: Data Pipeline vs. Airflow vs.

The Frameworks: Data Pipeline vs. Airflow

The Example Pipeline: We Need Operators. Diagram: S3, AWS Data Pipeline, Airflow, Data Warehouse (Redshift), Data Lake (Hive/EMR)

The Example Pipeline: We Need Operators (code not preserved in this transcript)

The Example Pipeline: We Still Need Operators. Diagram: S3, AWS Data Pipeline, Airflow, Data Warehouse (Redshift), Data Lake (Hive/EMR), with several steps marked "?"

Summary: Why We Still Need Operators
● Airflow
  ○ Code does not support authentication the way we do it
  ○ Code does not have features we need and can't be parameterized to support them
● AWS
  ○ Need custom tooling to support proper versioning
  ○ Need to write the Operator code ourselves

The Example Pipeline: Can't Extend Operators AWS Data Pipeline Airflow

Example Pipeline: Can't Integrate Existing Code https://xkcd.com/353/

(First) Reality Check

Reality Check: Expect to Implement Your Own Operators. The more you look at existing operator code, the more issues you'll find.

Reality Check: Operators Require Flexible Config
● The strength of Data Pipelines is generic reusable code driven by config...
● ...but this means every behavior you want to control has to expose arguments that you can set values for in config
Config:
    db_conn = {"host": "abc", "port": "123"}
DAG:
    redshift_op = RedshiftOp(db_conn)
Code:
    class RedshiftOp(object):
        def __init__(self, db_conn):
            self.db_conn = db_conn
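The trade-off on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Beeswax's code: `RedshiftOp` follows the slide's class, while the `iam_role` and `unload_options` arguments, the `build_unload_sql` method, the role ARN, and the bucket name are all hypothetical, added only to show that config can control a behavior only when the operator exposes an argument for it.

```python
class RedshiftOp(object):
    """Toy operator: each behavior that config should control is an argument."""

    def __init__(self, db_conn, iam_role, unload_options=None):
        self.db_conn = db_conn
        # Hypothetical knobs. Without these arguments, config would have
        # no way to change authentication or output options.
        self.iam_role = iam_role
        self.unload_options = unload_options or []

    def build_unload_sql(self, query, s3_path):
        # Only renders the statement; nothing is sent to a database.
        parts = [
            "UNLOAD ('{}')".format(query),
            "TO '{}'".format(s3_path),
            "IAM_ROLE '{}'".format(self.iam_role),
        ]
        parts.extend(self.unload_options)
        return " ".join(parts)


# Config: plain data, kept separate from the operator code.
config = {
    "db_conn": {"host": "abc", "port": "123"},
    "iam_role": "arn:aws:iam::123456789012:role/example",
    "unload_options": ["GZIP", "ALLOWOVERWRITE"],
}

# DAG: the operator is constructed entirely from config values.
redshift_op = RedshiftOp(**config)
sql = redshift_op.build_unload_sql(
    "SELECT * FROM impressions", "s3://example-bucket/out/"
)
```

If an existing operator hard-codes its authentication or output options instead of accepting them as arguments, no amount of config can change them, which is the situation the talk describes.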

Reality Check: How to Implement Your Own Operators. Diagram: BeeswaxRedshiftToS3Operator, BeeswaxBaseOperator, AirflowOperator
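One way to read the three names on this slide is as an inheritance chain, with a company base class sitting between the framework's operator and each concrete operator. The sketch below assumes that reading. `AirflowOperator` here is a stub written for the example, not the real Airflow base class, and the `credentials`, `table`, and `s3_path` arguments are hypothetical.

```python
class AirflowOperator(object):
    """Stub standing in for the framework's operator base class."""

    def __init__(self, task_id):
        self.task_id = task_id

    def execute(self, context):
        raise NotImplementedError


class BeeswaxBaseOperator(AirflowOperator):
    """Shared layer: concerns common to every in-house operator live here once."""

    def __init__(self, task_id, credentials):
        super(BeeswaxBaseOperator, self).__init__(task_id)
        # Hypothetical example of a shared concern: how we authenticate.
        self.credentials = credentials


class BeeswaxRedshiftToS3Operator(BeeswaxBaseOperator):
    """Concrete operator: only the task-specific arguments and logic."""

    def __init__(self, task_id, credentials, table, s3_path):
        super(BeeswaxRedshiftToS3Operator, self).__init__(task_id, credentials)
        self.table = table
        self.s3_path = s3_path

    def execute(self, context):
        # Describes the work instead of doing it, so the sketch runs anywhere.
        return "copy {} to {} as {}".format(
            self.table, self.s3_path, self.credentials["user"]
        )


op = BeeswaxRedshiftToS3Operator(
    "unload_impressions",
    {"user": "etl"},
    "impressions",
    "s3://example-bucket/impressions/",
)
result = op.execute(context={})
```

The middle layer is the design choice worth noting: a change to a shared concern is made in one class rather than in every operator.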

Deploying an Example Pipeline

Deploying Example Pipeline. Diagram: S3, AWS Data Pipeline, Airflow, Data Warehouse (Redshift), Data Lake (Hive/EMR)

Deploying an Example Pipeline. Diagram: Config, DAG, Code

Deploying an Example Pipeline: Airflow. Diagram: Config, DAG, Code

Deploying an Example Pipeline: AWS Data Pipeline. Diagram: Config, DAG, Code, AWS Data Pipeline Task Runner

(Second) Reality Check

Reality Check: You Don't Have "No Ops". Diagram: Config, DAG, Code

What Do Data Pipelines Give You?

Why Use a Data Pipeline Framework?
● Problems are general and easy to get wrong
● You need more than just DAGs

Why Use a Data Pipeline Framework?

What Do Data Pipeline Frameworks Give You? Scheduling and DAG Definition APIs. Diagram: Start, End
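To make "DAG definition" concrete, here is a minimal sketch of the core idea using only the Python standard library (`graphlib`, available from Python 3.9). It is not the Airflow or AWS Data Pipeline API. The task names and the `run` helper are hypothetical, and it covers only dependency ordering, not time-based scheduling.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# DAG definition: map each task to the set of tasks it depends on.
dag = {
    "start": set(),
    "extract": {"start"},
    "load": {"extract"},
    "end": {"load"},
}


def run(dag, tasks):
    """Run every task after all of its dependencies have run."""
    order = TopologicalSorter(dag).static_order()
    return [tasks[name]() for name in order]


# Each task just reports that it ran; real tasks would do the work.
tasks = {name: (lambda n=name: "ran " + n) for name in dag}
log = run(dag, tasks)
```

A framework layers the rest of what the talk lists on top of this ordering step: triggering runs on a schedule, tracking state, and exposing it through APIs and a UI.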

What Do Data Pipeline Frameworks Give You? Design and APIs to configure pipelines and define a runtime environment. Diagram: Config, DAG, Code

What Do Data Pipeline Frameworks Give You?
● Handling Task Failures
● Retries
● Backfill
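The two mechanisms named on this slide are easy to get subtly wrong when written by hand, which is part of the case for a framework. The sketch below shows the bare idea of each, assuming a task is a plain callable and a pipeline runs once per day. `run_with_retries`, `backfill_dates`, and `flaky_task` are hypothetical names, and a real framework would also wait between attempts and record each failure.

```python
from datetime import date, timedelta


def run_with_retries(task, max_attempts=3):
    """Call task until it succeeds or max_attempts is used up."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task()
        except Exception as error:
            last_error = error  # remember the failure and try again
    raise last_error


def backfill_dates(start, end):
    """List every daily run date from start to end, inclusive."""
    days = (end - start).days
    return [start + timedelta(days=offset) for offset in range(days + 1)]


# A task that fails twice and then succeeds, to exercise the retry loop.
attempts = {"count": 0}


def flaky_task():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = run_with_retries(flaky_task)
dates = backfill_dates(date(2018, 11, 1), date(2018, 11, 3))
```

A backfill then amounts to running the pipeline once for each date in `dates`, with each run going through the same retry handling.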

What Do Data Pipeline Frameworks Give You? UI for Status and Management

The Data Pipeline Platform. Diagram: Start, { 'database': MY_DB, 'table': 'impressions' }, End

The Data Pipeline Dream

Questions? We are growing fast and hiring! Visit our table, find out more and score cool Bee-swag. Talk to me at office hours following the talk. Contact me: mark@beeswax.com, LinkedIn: @marksweiss