Architecting Virtualized Infrastructure for Big Data
Richard McDougall, @richardmcdougll
CTO, Application Infrastructure, Big Data Lead, VMware, Inc
© 2009 VMware Inc. All rights reserved
Cloud: Big Shifts in Simplification and Optimization
1. Reduce the Complexity to simplify operations and maintenance
2. Dramatically Lower Costs to redirect investment into value-add opportunities
3. Enable Flexible, Agile IT Service Delivery to meet and anticipate the needs of the business
Infrastructure, Apps and now Data…
Build, Run, Manage (Private, Public)
• Simplify Infrastructure With Cloud
• Simplify App Platform Through PaaS
• Simplify Data
Trend 1/3: New Data Growing at 60% Y/Y
Exabytes of information stored: 20 Zetta by 2015, 1 Yotta by 2030
Yes, you are part of the yotta generation…
Data sources shown: audio, digital tv, digital photos, camera phones, rfid, medical imaging, sensors, satellite images, games, scanners, twitter, cad/cam, appliances, videoconferencing, digital movies
Source: The Information Explosion, 2009
Data Growth in the Enterprise
Trend 2/3: Big Data – Driven by Real-World Benefit
Trend 3/3: Value from Data Exceeds Hardware Cost
§ Value from the intelligence of data analytics now outstrips the cost of hardware
• Hadoop enables the use of 10x lower cost hardware
• Hardware cost halving every 18 months
[Chart: Value vs. Cost. Big Iron: $40k/CPU; Commodity Cluster: $1k/CPU]
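The "halving every 18 months" claim can be read as simple exponential decay. The sketch below is only an illustration of that arithmetic; the function name `projected_cost` and the choice of a smooth exponential curve are assumptions, not something stated on the slide.

```python
def projected_cost(initial_cost: float, months: float,
                   halving_period_months: float = 18.0) -> float:
    """Cost after `months`, assuming it halves every `halving_period_months`.

    Hypothetical helper: the 18-month default comes from the slide, the
    exponential-decay model is an assumption made for illustration.
    """
    if halving_period_months <= 0:
        raise ValueError("halving period must be positive")
    return initial_cost * 0.5 ** (months / halving_period_months)


if __name__ == "__main__":
    # Start from the slide's commodity figure of $1k per CPU.
    for months in (0, 18, 36, 54):
        print(f"after {months:2d} months: ${projected_cost(1000.0, months):,.2f} per CPU")
```

Under this reading, a $1k CPU would cost about $500 after 18 months and $250 after 36 months.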
A Holistic View of a Big Data System
[Diagram. Components shown:]
• Real Time Streams
• Real-Time Processing (s4, storm)
• Analytics
• ETL
• Real Time Structured Database (hBase, Gemfire, Cassandra)
• Big SQL (Greenplum, Aster Data, etc.)
• Unstructured Data (HDFS)
• Batch Processing
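Big Data Frameworks and Characteristics
(Columns: Framework; Scale of data; Scale of Cluster; Computable Data?; Local Disks?)
• File System (Gluster, Isilon, etc.): scale of data 10s PB; scale of cluster 100s; computable data: No; local disks: Yes, for cost
• Map-reduce (Hadoop): scale of data 100s PB; scale of cluster 1,000s; local disks: Yes, for cost and bandwidth
• Big-SQL (Greenplum, Aster Data, Netezza, …): scale of data PBs; scale of cluster 100s; computable data: No; local disks: Yes, for cost and bandwidth
• No-SQL (Cassandra, hBase, …): scale of data trillions of rows; scale of cluster 100s; computable data: Future; local disks: Yes, for cost and availability
• In-Memory (Redis, Gemfire, Membase, …): scale of data billions of rows; scale of cluster 10s-100s; computable data: Hybrid; local disks: Possible, primarily memory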
The Unified Analytics Cloud Platform
[Diagram. Layers: Analytics Tools, Developer Frameworks, PaaS, Database/DataStore, Data Platform, Data PaaS, Cloud Infrastructure. Components named: Madlib, Data Meer, Karmasphere, Tableau, Hadoop, Python, Spring, Cloudfoundry, HDFS, Greenplum, Cassandra, hBase, Voldemort, Data-Director, EMC Chorus, vSphere, Private, Public]
Unifying the Big Data Platform using Virtualization
§ Goals
• Make it fast and easy to provision new data Clusters on Demand
• Allow Mixing of Workloads
• Leverage virtual machines to provide isolation (esp. for Multi-tenant)
• Optimize data performance based on virtual topologies
• Make the system reliable based on virtual topologies
§ Leveraging Virtualization
• Elastic scale
• Use high-availability to protect key services, e.g., Hadoop’s namenode/job tracker
• Resource controls and sharing: re-use underutilized memory, cpu
• Prioritize Workloads: limit or guarantee resource usage in a mixed environment
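To make "limit or guarantee resource usage in a mixed environment" concrete, here is a minimal sketch of one way such controls can interact: each workload first receives its guarantee, and leftover capacity is split in proportion to a priority weight without exceeding any cap. This is a simplified model, not vSphere's actual scheduler; the names `Workload`, `allocate`, `reservation`, `limit` and `shares`, and all numbers, are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    demand: float                 # what the workload would use if unconstrained
    reservation: float = 0.0      # guaranteed minimum
    limit: float = float("inf")   # hard cap
    shares: int = 1               # relative priority for spare capacity


def allocate(capacity: float, workloads: list[Workload]) -> dict[str, float]:
    """Hand out guarantees first, then split the remainder by shares."""
    cap = {w.name: min(w.demand, w.limit) for w in workloads}
    alloc = {w.name: min(w.reservation, cap[w.name]) for w in workloads}
    remaining = capacity - sum(alloc.values())
    if remaining < 0:
        raise ValueError("reservations exceed capacity")

    active = [w for w in workloads if cap[w.name] - alloc[w.name] > 1e-9]
    while remaining > 1e-9 and active:
        total_shares = sum(w.shares for w in active)
        granted = 0.0
        for w in active:
            fair = remaining * w.shares / total_shares
            grant = min(fair, cap[w.name] - alloc[w.name])
            alloc[w.name] += grant
            granted += grant
        remaining -= granted
        # Workloads that reached their cap drop out; their unused portion
        # is redistributed among the rest in the next round.
        active = [w for w in active if cap[w.name] - alloc[w.name] > 1e-9]
    return alloc


if __name__ == "__main__":
    mix = [
        Workload("hadoop", demand=80.0, reservation=20.0, limit=60.0, shares=1),
        Workload("database", demand=70.0, reservation=40.0, shares=3),
    ]
    print(allocate(100.0, mix))
```

In this example the database is guaranteed 40 units and holds three times the weight of the Hadoop workload, so it takes most of the spare capacity while Hadoop still keeps its guaranteed 20 units.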
A Unified Analytics Cloud Significantly Simplifies
§ Simplify
• Single Hardware Infrastructure
• Faster/Easier provisioning
§ Optimize
• Shared Resources = higher utilization
• Elastic resources = faster on-demand access
[Diagram: SQL Cluster, NoSQL Cluster, Hadoop Cluster, Decision Support Cluster; Unified Analytics Infrastructure (Big SQL, NoSQL, Hadoop; Private, Public)]
Use Local Disk where it’s Needed
• SAN Storage: $2 - $10/Gigabyte. $1M gets: 0.5 Petabytes, 200,000 IOPS, 1 Gbyte/sec
• NAS Filers: $1 - $5/Gigabyte. $1M gets: 1 Petabyte, 400,000 IOPS, 2 Gbyte/sec
• Local Storage: $0.05/Gigabyte. $1M gets: 20 Petabytes, 10,000 IOPS, 800 Gbytes/sec
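The capacity figures can be checked with simple division. The sketch below assumes decimal units (1 Petabyte = 1,000,000 Gigabytes) and uses the low end of each quoted price range, which is what reproduces the slide's numbers; the helper name `capacity_pb` is hypothetical.

```python
BUDGET_USD = 1_000_000
GB_PER_PB = 1_000_000  # decimal units, an assumption

# Low end of each price range quoted on the slide, in USD per Gigabyte.
LOW_PRICE_PER_GB = {
    "SAN Storage": 2.00,
    "NAS Filers": 1.00,
    "Local Storage": 0.05,
}


def capacity_pb(budget_usd: float, price_per_gb: float) -> float:
    """Petabytes that `budget_usd` buys at `price_per_gb`."""
    return budget_usd / price_per_gb / GB_PER_PB


if __name__ == "__main__":
    for tier, price in LOW_PRICE_PER_GB.items():
        print(f"{tier}: {capacity_pb(BUDGET_USD, price):.1f} PB for $1M")
```

At these prices, local storage buys 40 times the capacity of SAN for the same budget.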
VMware is Committed to the Best Virtual Platform for Hadoop
§ Performance Studies and Best Practices
• Studies through 2010-2011 of Hadoop 0.20 on vSphere 5
• White paper, including detailed configurations and recommendations
§ Making Hadoop run well on vSphere
• Performance optimizations in vSphere releases
• VMware engagement in Hadoop Community effort
• Supporting key partners with their distributions on vSphere
• Contributing enhancements to Hadoop
§ Hadoop Framework Integration
• Spring Hadoop: Enabling Spring to simplify Map-Reduce Programming
• Spring Batch: Sophisticated batch management (Oozie on steroids)
Extend Virtual Storage Architecture to Include Local Disk
§ Shared Storage: SAN or NAS
• Easy to provision
• Automated cluster rebalancing
§ Hybrid Storage
• SAN for boot images, VMs, other workloads
• Local disk for Hadoop & HDFS
• Scalable Bandwidth, Lower Cost/GB
[Diagram: hosts, each running Hadoop VMs and other VMs]
Performance Analysis of Big Data (Hadoop) on Virtualization
Ratio of time taken – Lower is Better
[Chart: Ratio to Native (y-axis, 0 to 1.2) for 1 VM and 2 VMs across the benchmarks Pi, TestDFSIO-write, TestDFSIO-read, TeraGen 1 TB, TeraSort 1 TB, TeraValidate 1 TB, TeraGen 3.5 TB, TeraSort 3.5 TB, TeraValidate 3.5 TB]
Tested on vSphere 5.0
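The chart's metric, "ratio of time taken", is the elapsed time of the virtualized run divided by the elapsed time of the native run, so 1.0 means parity and values below 1.0 mean the virtualized run finished sooner. A minimal sketch follows; the function name `ratio_to_native` and the timings are made-up placeholders, not measurements from the study.

```python
def ratio_to_native(virtual_seconds: float, native_seconds: float) -> float:
    """Elapsed-time ratio: below 1.0 means the virtualized run was faster."""
    if native_seconds <= 0:
        raise ValueError("native elapsed time must be positive")
    return virtual_seconds / native_seconds


if __name__ == "__main__":
    # Placeholder timings for illustration only (not data from the slide).
    hypothetical_runs = {
        "benchmark A": (110.0, 100.0),  # (virtualized seconds, native seconds)
        "benchmark B": (95.0, 100.0),
    }
    for name, (virt, native) in hypothetical_runs.items():
        print(f"{name}: ratio to native = {ratio_to_native(virt, native):.2f}")
```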
Simplify Heterogeneous Data Management via Data PaaS
[Diagram. Layers: Analytics Tools, Developer, Databases, Data Platform, Cloud Infrastructure. Databases: Filesystem, LargeScale NoSQL, InMemory, Big SQL]
Data PaaS – Common Data Management Layer: Provisioning, Multi-tenancy, Management, Data Discovery, Import/Export
vFabric Data Director Powers Database-as-a-Service
[Diagram: New Applications and Existing Applications on vFabric Data Director, on VMware vSphere. Roles: DBA, App Dev, IT Admin]
Capabilities shown: Automation, Self-Service Provisioning, Backup/Restore, Clone, One click HA, Policy Based Control, Resource Mgmt, Security Mgmt, Database Templates, Monitor
Data Systems: Databases, file systems
[Diagram. Layers: Analytics Tools, Developer, Databases, Data Platform, Cloud Infrastructure. Databases, from Unstructured to Structured: Filesystem, LargeScale NoSQL, InMemory, Big SQL]
Technology: Databases and Data Stores for Big Data
(Ordered from Unstructured to Structured)
§ Filesystem
• Types of Data: Log files, machine generated data, documents, device data, etc.
• Technologies: NAS, HDFS, Blob (S3, Atmos, etc.)
• Values: Store any data, easy to scale-out, can optimize for cost
§ LargeScale NoSQL
• Types of Data: Loosely typed device data, records, events, statistics, complex relations/graphs
• Technologies: Cassandra, hBase, Voldemort
• Values: Easy to scale-out, flexible and dynamic schemas
§ InMemory
• Types of Data: Structured, partitionable data
• Technologies: Gemfire, Redis, Membase
• Values: High throughput, low latency
§ Big SQL
• Types of Data: Structured data
• Technologies: Greenplum, Sybase IQ, Aster Data, etc.
• Values: High performance for repetitive queries. Ease of query language.
Simplified Developer Experience through PaaS
[Diagram. Layers: Analytics Tools, Developer, Databases, Data Platform, Cloud Infrastructure. Platform as a Service]
Spring Big Data Integrations
§ NoSQL Integration
• Spring Data for MongoDB, Gemfire, Riak, Neo4j, Blob, Cassandra
§ Spring Hadoop
• Announced this week at Strata!
• Provides support for developing applications based on Hadoop technologies by leveraging the capabilities of the Spring ecosystem.
§ Spring Batch
• Integration allows Hadoop jobs and HDFS operations as part of a workflow
The Unified Analytics Cloud Platform
[Diagram. Layers: Analytics Tools, Developer Frameworks, PaaS, Database/DataStore, Data Platform, Data PaaS, Cloud Infrastructure. Components named: Madlib, Data Meer, Karmasphere, Tableau, Hadoop, Python, Spring, Cloudfoundry, HDFS, Greenplum, Cassandra, hBase, Voldemort, Data-Director, EMC Chorus, vSphere, Private, Public]
Summary
§ Revolution in Big Data is under way
• Data centric applications are now critical
§ Hadoop on Virtualization
• Proven performance
• Cloud/Virtualization values apparent for Hadoop use
§ Simplify through a Unified Analytics Cloud
• One Platform for today’s and future big-data systems
• Better Utilization
• Faster deployment, elastic resources
• Secure, Isolated, Multi-tenant capability for Analytics
References
§ Twitter
• @richardmcdougll
§ My CTO Blog
• http://communities.vmware.com/community/vmtn/cto/cloud
§ Hadoop on vSphere
• Talk @ Hadoop World
• Performance Paper: http://www.vmware.com/files/.../VMW-Hadoop-Performance-vSphere5.pdf
§ Spring Hadoop
• http://blog.springsource.org/2012/02/29/introducing-spring-hadoop