Architecting an Edge to Core to Cloud Data Pipeline

  • Slides: 19

Architecting an Edge to Core to Cloud Data Pipeline
Unify Analytics Engines with an In-Place Data Pipeline
Santosh Rao, Senior Technical Director, NetApp
Mar 7, 2018

Early Generation Big Data Analytics Platform
§ Designed to Deliver Initial Analytics Solutions
§ Scalability, Availability, Governance are afterthoughts
§ Typical Approach – Cloud or Commodity Infrastructure
§ Leading to Unpredictable ROI as Copies Manifest
§ 3-5 Replicas copied across LoB, Functions
§ Primary Considerations – Cost & Agility
Diagram label: Big Data Analytics Software
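As a rough illustration of why 3-5 replicas make storage utilization and ROI hard to predict, the following Python sketch computes the raw capacity consumed for a given amount of usable data. The function names and the 100 TB figure are hypothetical, chosen only for illustration; the model is deliberately simple and ignores compression, erasure coding, and file system overhead.

```python
def raw_capacity_tb(usable_tb: float, replicas: int) -> float:
    """Raw storage consumed when every byte is stored `replicas` times."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return usable_tb * replicas


def storage_efficiency(replicas: int) -> float:
    """Fraction of raw capacity that holds unique data."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 / replicas


if __name__ == "__main__":
    usable_tb = 100.0  # hypothetical data set size, not from the slides
    for replicas in (1, 3, 5):
        raw = raw_capacity_tb(usable_tb, replicas)
        eff = storage_efficiency(replicas)
        print(f"{replicas} replica(s): {raw:.0f} TB raw, {eff:.0%} efficient")
```

Under these assumptions, moving from 3 to 5 replicas raises the raw footprint of the same data from 3x to 5x, which is the kind of copy-driven growth the slide attributes to early platforms.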

Early Generation Analytics Platform Challenges
Diagram labels: File Storage, Replication, File Copies, Storage Demon
§ Unpredictable performance
§ Inefficient Storage utilization
§ Not enterprise ready
§ Total cost of ownership
§ Media and node failures
§ Storage and compute tied (creates imbalance)

Evolving To A Data Pipeline

Extending the Data Pipeline from Edge to Core to Cloud
Data Lifecycle Challenges
§ Edge: Initial point of data collection and aggregation
§ Core: Dedicated hardware, private cloud deployments
§ Public Cloud: Hosted as-a-service solutions; long-term data archival
Multicloud

Traditional Data Architecture
Issues with Traditional on-premises Data workflow and Commodity Infrastructure
Traditional Lambda architecture: Data processing pipeline
Diagram labels: Users, Web tier/App server, Stream analytics, Traditional data warehouse, IoT Data, Hadoop Data Lake, NoSQL
Outputs: § Analytics § Services § Reports
§ Siloed architecture
§ Inflexible | Months to upgrade systems
§ Poor utilization | High TCO | Fixed compute to storage ratio
§ Inability to access the cloud
§ Data management challenges

Next Generation Data Pipeline
§ Unified Insights across Lines of Business and Functions
  - Unified Enterprise Data Lake
  - Federate Data Sources across 2nd and 3rd Platform
  - In-Place Access to Data Pipeline (Copy Avoidance)
  - Future Proof to allow shifts in Architecture
§ Deployment – PoC at LoB, Scale for Production Use
  - Scale Edge, Core and Cloud as a Single Pipeline
  - Governance, Data Protection, Security on Data Pipeline
§ Lowest TCO over life of Solution

Meet Diverse Needs across Enterprise Functions
Data Scientists: Real-world Data for App Dev. Need Agile Infrastructure.
Data Architects: Future Proof Architecture. Seek Extensible Architecture.
Data/IT Admins: Lowest TCO in face of shrinking budgets. Balance Cost & TTM.
§ Refreshed access to Production Data
§ Architecture spans Edge, Core and Cloud
§ Reduce IT/Licensing costs
§ Enable DevOps, Data Scientist
§ Future Proof to allow shifts in deployment
§ Enable Multi Tenant Data Science
§ Tiering is the new Scaling
§ Data, AI/ML Models as Code
§ No Data and Compute Sprawl
§ Automated Data Lifecycle Management
§ Meet SLAs
§ Non-Disruptive Operations
§ Architect for overall TCO
§ API Based
Balanced Architecture to Deliver for Stakeholders

Data Pipeline Requirements
Consider needs for each stage of the typical data pipeline
Edge: Data Ingest, Collection, Transport
§ Massive data (few TB/device/day)
§ Real-time Edge Analytics/AI
§ Ultra Low Latency
§ Network Bandwidth
§ Smart Data Movement
Core: Data Lake, AI Model Training, Deployment, Serving
§ Ultra High I/O bandwidth (20-200+ GBps)
§ Ultra-low latency (micro – nanosec)
§ Linear Scale (1-128 node AI)
§ Overall TCO for 1-100+ PB
Cloud: Data Lake, Training, Deployment, GPU as a Service, Archive
§ Cloud Analytics, AI/DL/ML
§ Consume and not Operate
§ Cloud Vendor vs. On-Prem stack
§ Cost-effective Archive
§ Need to avoid Cloud Lock-in
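To put the edge figure of a few TB/device/day in network terms, here is a back-of-the-envelope sketch in Python. It assumes decimal terabytes (10^12 bytes), an even transfer rate across 24 hours, and no protocol overhead. The function name and the device counts are hypothetical and not taken from the slides.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_TB = 10**12  # assumption: decimal terabytes
BITS_PER_BYTE = 8


def sustained_gbps(tb_per_device_per_day: float, devices: int = 1) -> float:
    """Average link rate in gigabits per second needed to ship all data daily."""
    if tb_per_device_per_day < 0 or devices < 0:
        raise ValueError("inputs must be non-negative")
    total_bits = tb_per_device_per_day * devices * BYTES_PER_TB * BITS_PER_BYTE
    return total_bits / SECONDS_PER_DAY / 1e9


if __name__ == "__main__":
    # Hypothetical fleet sizes, for illustration only.
    for devices in (1, 10, 100):
        rate = sustained_gbps(2.0, devices)
        print(f"{devices:>3} device(s) at 2 TB/day: {rate:.2f} Gbit/s sustained")
```

A single device producing 1 TB/day needs roughly 0.093 Gbit/s of sustained uplink under these assumptions, so a large fleet quickly makes network bandwidth and selective data movement a design constraint.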

AI Infrastructure Guiding Principles: 5 Factors
Smoothing the flow of the Data Pipeline
1. Choice of Filesystem (Lustre, HDFS, GPFS, NFS)
2. Ability to federate diverse data sources, both structured and unstructured
3. Smart Data Movement
4. Data as a Service
5. Leading Edge Performance (NVMe, NVMeoF, NVDIMM, 3D XPoint)

Tech Topics: 10 Dimensions of AI Infrastructure
Smoothing the flow of the Data Pipeline
1. Both Random and Sequential IO
2. Ultra Low Latency for Inference
3. Ultra High Bandwidth/Parallelism for Training
4. Linear Scale Single Namespace & Metadata
5. Client side Caching for Iterative Scans
6. Copy Avoidance
7. In-Place Access to Data (HPC, Analytics, AI/DL)
8. Availability of Smart Data Movement
9. Ease of management at Scale
10. Service Levels for Model Serving
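Dimension 5 (client side caching for iterative scans) reflects that training typically re-reads the same data on every epoch. The following is a minimal Python sketch of the idea, assuming a hypothetical `fetch` callable that stands in for a remote read. It is an unbounded in-memory cache for illustration only, not NetApp's implementation or any specific client cache.

```python
from typing import Callable, Dict


class CachingReader:
    """Serves repeated reads of the same key from local memory."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch                  # hypothetical remote read function
        self._cache: Dict[str, bytes] = {}   # unbounded; real caches evict
        self.remote_reads = 0

    def read(self, key: str) -> bytes:
        if key not in self._cache:
            # Cache miss: go to remote storage once and remember the result.
            self._cache[key] = self._fetch(key)
            self.remote_reads += 1
        return self._cache[key]


def fake_remote_fetch(key: str) -> bytes:
    """Stand-in for a slow read from shared storage (assumption)."""
    return f"data-for-{key}".encode()


reader = CachingReader(fake_remote_fetch)
shards = ["shard-0", "shard-1", "shard-2"]

# Three training epochs scan the same shards; only the first epoch
# touches remote storage.
for epoch in range(3):
    for shard in shards:
        reader.read(shard)

print(f"reads issued: {3 * len(shards)}, remote reads: {reader.remote_reads}")
```

A production cache would bound memory and handle invalidation; the sketch only shows why iterative scans benefit from keeping data close to the client.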

Our Solution: NetApp Data Pipeline
Extends from Edge to Core to Cloud. Federates Data Sources, Compute Engines and Clouds.
Diagram labels: IoT Data, BD Cluster, AI/DL Cluster, HDFS-NFS, In-Place Analytics, In-Place AI/DL, HDInsight, Databricks/EMR, NetApp HDFS-NFS Connector, ExpressRoute, HDFS/NFS, Archive, Direct Connect, Data Fabric, Unified Data Lake, NetApp Cloud Volumes, Secure Data

Data Pipeline for Artificial Intelligence / Deep Learning
Diagram labels: Edge, Data Aggregation, Core, Data Lake, Cloud, High Performance DevOps, Cloud AI/DL, Data Fabric
Stages: Ingest, Data Prep, Training Cluster, Deployment
Data sets and models: Training Set 1, Training Set 2, Training Set 3, Test Set, Data Lake, IM 1, IM 2, IM 3
• Data collection
• Edge-level AI (TensorRT)
• Aggregation
• Normalization
• Raw I/O bandwidth
• Parallel streams
• Ultra-low latency
• Model Serving
Public or Private Cloud Archive

Hybrid Cloud Data Pipeline
Diagram labels: Cloud; Edge; Near the Cloud; Small Footprint; Smart Data Movement; Compute & Data Separation; Ingest; Deployment; AI / Deep learning in the cloud; Direct Connect; Express Route; Data lake; Training Set 1; Training Set 2; Training Set 3; Test Set; IM 1; IM 2; IM 3; Public or Private Cloud Archive; Cold data, Backup, Clone
Data as a Service, Near Cloud Data, Data Tiering, Cloud AI, GPU as a Service

Key Takeaways
NetApp Data Pipeline: Building Blocks for your entire data flow, from Edge to Core to Cloud
Data Fabric for AI
§ Edge: Smart Data Mover, ONTAP Select
§ Core: Accelerate DL, ONTAP 9 AFF
§ Cloud: Future Proof, ONTAP Cloud
Ultra-High Bandwidth, Ultra-Low Latency
© 2018 NetApp, Inc. All rights reserved. --- NETAPP CONFIDENTIAL ---

Thank You

Backup

Solution Architecture for the Data Pipeline
Tiers: Edge, Core, Cloud
Diagram labels: AFF, Flexpod for DGIX/Plexistor, Data Prep, On-Premises Deep Learning Pipeline, 12 3, Training Sets, ONTAP Select, Cloud-Based Deep Learning Pipeline, ONTAP Select, Training Cluster, Data Lake, Training Sets, Deployment, IM 1, IM 2, IM 3, Repo

IDC Study: NetApp vs. Commodity TCO