Lustre, Hadoop, Accumulo

Jeremy Kepner [1,2,3], William Arcand [1], David Bestor [1], Bill Bergeron [1], Chansup Byun [1], Lauren Edwards [1], Vijay Gadepally [1,2], Matthew Hubbell [1], Peter Michaleas [1], Julie Mullen [1], Andrew Prout [1], Antonio Rosa [1], Charles Yee [1], Albert Reuther [1]

[1] MIT Lincoln Laboratory, [2] MIT Computer Science & AI Laboratory, [3] MIT Mathematics Department

September 17, 2015

This material is based upon work supported by the National Science Foundation under Grant No. DMS-1312831. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Outline
• Introduction
  – Volume, Velocity, Variety, Veracity
• Big Data Storage APIs
  – Lustre, Hadoop, Accumulo
• Modeling Storage Performance
  – Mix n’ Match Solutions
• Summary

Common Big Data Challenge
• Rapidly increasing
  – Data volume
  – Data velocity
  – Data variety
  – Data veracity (security)
[Figure: data versus users (operators, analysts, commanders) over time (2000, 2005, 2010, 2015 & beyond), with the difference labeled "Data Gap". Data sources: OSINT, weather, HUMINT, C2, ground, maritime, air, space, cyber.]

Common Big Data Architecture
[Figure: architecture diagram. Users (operators, analysts, commanders) sit above web, databases, analytics, files, scheduler, and computing layers; ingest and enrichment stages feed in data from OSINT, weather, HUMINT, C2, ground, maritime, air, space, and cyber sources.]

Common Big Data Architecture - Data Volume: Various Clouds
• Four major ecosystems
  – Compute cloud
  – Enterprise cloud
  – Big data cloud
  – Database cloud
[Figure: the common architecture diagram with the four clouds overlaid. Figure labels: MIT SuperCloud Testbed, VMware, Hadoop, MPI, SQL.]
Reference: LLSuperCloud: Sharing HPC Systems for Diverse Rapid Prototyping, Reuther et al., IEEE HPEC 2013

Common Big Data Architecture - Data Velocity: Accumulo Database
• World record holder in database performance
[Figure: the common architecture diagram with the database layer highlighted.]
Reference: Achieving 100,000,000 database inserts per second using Accumulo and D4M, IEEE HPEC 2014

Common Big Data Architecture - Data Velocity: SciDB for Dense Data
• Integrated SciDB; brings the power of databases to dense data
• Dense data is currently stored as raw, unindexed files
• SciDB dramatically reduces time to exploit dense data
[Figure: the common architecture diagram with SciDB in the database layer. Dense data types: SAR, LIDAR, SONAR, HSI, EO. Array dimensions: LAT, LON, TIME, HEIGHT, …]

Common Big Data Architecture - Data Variety: D4M Schema
• D4M demonstrated a universal approach to diverse data
• Intel reports, DNA, health records, publication citations, web logs, social media, building alarms, cyber, … all handled by a common 4 table schema
[Figure: the common architecture diagram with the four tables of the schema. Figure labels: rows, columns, raw, Σ.]
Reference: D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Database, Kepner et al., IEEE HPEC 2013
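The idea of one four-table layout for very different records can be sketched in a few lines of Python. This is an illustration only: the "field|value" column naming, the table names (edge, edge_t, degree, raw), and the sample records are assumptions made here to match the figure labels (rows, columns, Σ, raw), not details stated on the slide. The cited D4M 2.0 paper is the authoritative description.

```python
from collections import defaultdict


def explode(record, sep="|"):
    """Turn {'field': 'value'} into {'field|value': 1} so every value becomes a column."""
    return {f"{field}{sep}{value}": 1 for field, value in record.items()}


def ingest(records):
    """Build four in-memory tables from {row_id: record} (hypothetical layout)."""
    edge = {}                   # row -> {column: 1}
    edge_t = defaultdict(dict)  # column -> {row: 1}, the transpose of edge
    degree = defaultdict(int)   # column -> number of rows containing it (the sum table)
    raw = {}                    # row -> original record text
    for row, record in records.items():
        columns = explode(record)
        edge[row] = columns
        raw[row] = repr(record)
        for column in columns:
            edge_t[column][row] = 1
            degree[column] += 1
    return edge, dict(edge_t), dict(degree), raw


# Made-up records of two different kinds share the same tables.
records = {
    "log-001": {"user": "alice", "city": "Boston"},
    "cite-042": {"author": "bob", "city": "Boston"},
}
edge, edge_t, degree, raw = ingest(records)
print(degree["city|Boston"])           # rows that mention Boston
print(sorted(edge_t["city|Boston"]))   # which rows they are
```

Because every value is a column, a lookup by any field value is a single row read in the transpose table, and the degree table tells a query how many rows to expect before it fetches them.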

Graphulo.MIT.edu - Data Variety: Graph Analytics
• Open source library highlights Accumulo’s inherent graph capabilities
[Figure: the common architecture diagram with the database tables highlighted. Figure labels: rows, columns, raw, Σ.]

Common Big Data Architecture - Data Veracity: Computing on Masked Data
• Compute on encrypted data
[Figure: the common architecture diagram with a legend distinguishing encrypted data from unencrypted data; the database and analytics layers are marked "Compute on Encrypted Data".]
Reference: Computing on Masked Data: a High Performance Method for Improving Big Data Veracity, IEEE HPEC 2014

Example Big Data APIs and Megastacks
[Figure: the common architecture diagram annotated with example stacks. Figure labels: Berkeley, Cloudera, Hortonworks.]

Example Big Database APIs
[Figure: the common architecture diagram annotated with example database APIs.]

Example Supercomputing APIs
[Figure: the common architecture diagram annotated with example supercomputing APIs.]

Lustre Parallel File System
• High performance general purpose file system
• Uses standard RAID for redundancy
• Supports any parallel programming model

Hadoop Distributed File System (HDFS)
• Special purpose file system
• Uses replication for redundancy (typically 3x)
• Java map/reduce programming model

Accumulo Database
• High performance parallel database
• Uses Hadoop as its file system
[Figure: Accumulo clients connected to tablet servers, each serving tablets.]

How to Compare Storage
• Full-scale head-to-head comparison of Big Data storage systems is expensive and time-consuming
• Much can be learned from simple systems analysis
• Model system parameters
  – $n_c = 100$: number of compute nodes
  – $B_c = 1$ GB/s: compute node network link
  – $n_s = 10$: number of central servers
  – $B_s = 4$ GB/s: central server network link
  – $n_d = 1000$: number of disks in the system
  – $V_d = 6$ TB: disk capacity
  – $B_d = 0.1$ GB/s: disk I/O

Storage Capacity
• Lustre
  – 1000 x 6 TB x 0.66 (RAID) ≈ 4 PB
• Hadoop
  – 1000 x 6 TB x 0.33 (redundancy) ≈ 2 PB
• Accumulo on Hadoop
  – Same as Hadoop
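The capacity arithmetic can be checked with a short Python sketch. It uses the model parameters from the "How to Compare Storage" slide ($n_d = 1000$ disks of $V_d = 6$ TB) and the two efficiency factors quoted above; the function and variable names are illustrative.

```python
# Model parameters from the "How to Compare Storage" slide.
N_DISKS = 1000   # n_d, number of disks in the system
DISK_TB = 6      # V_d, capacity per disk in TB

# Efficiency factors quoted on the "Storage Capacity" slide.
RAID_EFFICIENCY = 0.66         # Lustre: fraction of raw space left after RAID
REPLICATION_EFFICIENCY = 0.33  # Hadoop: about 1/3 usable with 3x replication


def capacity_pb(n_disks, disk_tb, efficiency):
    """Capacity in PB (1 PB = 1000 TB) after applying an efficiency factor."""
    return n_disks * disk_tb * efficiency / 1000


raw_pb = capacity_pb(N_DISKS, DISK_TB, 1.0)
lustre_pb = capacity_pb(N_DISKS, DISK_TB, RAID_EFFICIENCY)
hadoop_pb = capacity_pb(N_DISKS, DISK_TB, REPLICATION_EFFICIENCY)

print(f"raw: {raw_pb:.0f} PB")
print(f"Lustre usable: {lustre_pb:.2f} PB")
print(f"Hadoop/Accumulo usable: {hadoop_pb:.2f} PB")
```

The results round to the 6 PB raw, 4 PB (Lustre), and 2 PB (Hadoop) figures used in the summary table.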

3 Disk Failure Data Loss Probability
• Lustre
  – $P_3 \approx (n_d P_1)^3 / 100$, where
  – $P_3$ = probability that 3 drives fail in the same OSS
  – $P_1$ = probability that a single drive fails
• Hadoop
  – $P_3 \approx (n_d P_1)^3$, where
  – $P_3$ = probability that 3 drives fail
  – $P_1$ = probability that a single drive fails
• Accumulo on Hadoop
  – Same as Hadoop
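A small Python sketch makes the two estimates easier to compare. The single-drive failure probability used below is a made-up value for illustration only; the slide does not give one. The comment about where the factor of 100 might come from is likewise an interpretation, not something the slide states.

```python
N_DISKS = 1000    # n_d from the model parameters
N_SERVERS = 10    # n_s from the model parameters


def p3_hadoop(p1, n_disks=N_DISKS):
    """Slide's estimate for Hadoop: any 3 drives failing together loses data."""
    return (n_disks * p1) ** 3


def p3_lustre(p1, n_disks=N_DISKS, same_oss_divisor=100):
    """Slide's estimate for Lustre: the 3 failed drives must sit in the same OSS.

    The divisor 100 is taken directly from the slide. It happens to equal
    n_s ** 2 for n_s = 10 servers (second and third failures each landing on
    the same server as the first), but that reading is an assumption.
    """
    return (n_disks * p1) ** 3 / same_oss_divisor


p1 = 1e-4  # hypothetical single-drive failure probability, not from the slide
print(f"Hadoop P3 ~ {p3_hadoop(p1):.1e}")
print(f"Lustre P3 ~ {p3_lustre(p1):.1e}")
print(f"ratio: {p3_hadoop(p1) / p3_lustre(p1):.0f}x")
```

Whatever value of $P_1$ is chosen, the model puts the Lustre data loss probability a factor of 100 below the Hadoop one.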

Peak Read/Write Performance
• Lustre
  – $B^{-1} = (n_c B_c)^{-1} + (B_n)^{-1} + (n_s B_s)^{-1} + (n_d B_d)^{-1}$
  – $B = 22$ GB/s
• Hadoop
  – $B_{\mathrm{write}} = \min(n_c, n_d) B_d / R = 33$ GB/s
  – $B_{\mathrm{read}} = \min(n_c, n_d) B_d / (1 + r) = 100$ GB/s (perfect load balancing)
• Accumulo on Hadoop
  – $B = (30\ \mathrm{MB/s}) \, n_c = 3$ GB/s
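The quoted bandwidths can be reproduced with a short Python sketch under three stated assumptions. First, no value is given for $B_n$, so its term is left out; the three remaining terms already give about 22 GB/s. Second, the Hadoop aggregate is taken as the smaller of $n_c B_c$ and $n_d B_d$ (100 GB/s either way), because that reading reproduces the 33 GB/s and 100 GB/s figures. Third, $R = 3$ is the typical replication factor from the HDFS slide, and $r = 0$ stands for perfect load balancing.

```python
N_C, B_C = 100, 1.0    # compute nodes, GB/s per compute-node link
N_S, B_S = 10, 4.0     # central servers, GB/s per server link
N_D, B_D = 1000, 0.1   # disks, GB/s per disk
R = 3                  # replication factor (assumed from "typically 3x")


def series_bandwidth(*stage_bandwidths):
    """Stages traversed in series: inverse bandwidths add."""
    return 1.0 / sum(1.0 / b for b in stage_bandwidths)


# Lustre: compute links, central servers, and disks in series (B_n omitted).
lustre_bw = series_bandwidth(N_C * B_C, N_S * B_S, N_D * B_D)

# Hadoop: aggregate limited by the smaller of the node links and the disks.
hadoop_aggregate = min(N_C * B_C, N_D * B_D)
hadoop_write = hadoop_aggregate / R          # every block is written R times
hadoop_read = hadoop_aggregate / (1 + 0.0)   # r = 0: perfect load balancing

# Accumulo: 30 MB/s per compute node, as quoted on the slide.
accumulo_bw = 0.030 * N_C

for name, bw in [("Lustre", lustre_bw), ("Hadoop write", hadoop_write),
                 ("Hadoop read", hadoop_read), ("Accumulo", accumulo_bw)]:
    print(f"{name}: {bw:.1f} GB/s")
```

In this model the Lustre figure is dominated by the ten central servers (40 GB/s), so adding servers or faster server links is what would move it most.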

Capability Estimate Summary

                                 Lustre                  Hadoop            Accumulo
Raw capacity                     6 PB                    6 PB              6 PB
Usable capacity                  4 PB                    2 PB              2 PB
3-drive data loss probability    $(n_d P_1)^3 / 100$     $(n_d P_1)^3$     $(n_d P_1)^3$
Peak write                       22 GB/s                 33 GB/s           3 GB/s
Peak read                        23 GB/s                 100 GB/s          3 GB/s
Find any string                  1 day                   3 hours           50 msec

• Simple model allows estimates of Lustre, Hadoop, and Accumulo capabilities for a “common” system consisting of
  – 100 compute nodes, 1 GB/s links, and 1000 6 TB disks

Mix n’ Match 1: Hadoop/Accumulo on Lustre
• Hadoop Map/Reduce is popular
  – Many applications
  – Strong commercial interest in Lustre community [1-6]
• Principal benefits
  – Better $/byte
  – Better reliability
  – Support Map/Reduce and other applications
• Principal challenge
  – Accumulo tolerance to MDS load (actively being worked on) [7]
[Figure: Lustre (metadata server (MDS) with file metadata: filename, permissions, …; object storage server (OSS) with objects), Hadoop (name node with file metadata: filename, replicas, …; data nodes with objects; Map/Reduce), and Accumulo (clients, tablet servers, tablets).]

[1] Map/Reduce on Lustre: Hadoop Performance in HPC Environments, Rutman, Xyratex, 2011
[2] Hadoop MapReduce over Lustre, Kulkarni, Lustre User’s Group, 2013
[3] Modernizing Hadoop Architecture for Superior Scalability, Efficiency & Productive Throughput, DDN, 2013
[4] System Fabric Works Lustre Solutions for Hadoop Storage, 2014
[5] Inside the Hadoop Workflow Accelerator, Seagate, 2014
[6] Lustre Hadoop Plugin, Seagate, 2015
[7] A Guide to Running Accumulo in a VM Environment, Fuchs, 2015

Mix n’ Match 2: Accumulo Checkpoint on Lustre
• Use Lustre to back up Accumulo
  – Launch on Hadoop as needed
• Principal benefits
  – Start, stop, checkpoint, clone, migrate, restart an arbitrary number of Accumulo instances
  – Can develop at scale
  – Dynamic migration
• 100 node, 100 TB Accumulo migration in ~1 hour
[Figure: Lustre (MDS, OSS), Hadoop (name node, data nodes), and Accumulo (clients, tablet servers, tablets) components, with Accumulo instances moving between nodes.]
Reference: Prout et al., “Enabling On-Demand Database Computing with MIT SuperCloud Database Management System,” IEEE HPEC 2015.

Mix n’ Match 3: Lustre Metadata in Accumulo
• Lustre systems can easily have 100M+ files
  – Metadata is difficult to analyze
• Accumulo can easily ingest this metadata
  – Ingested metadata on 50M files in ~3 hours on a single Accumulo node
  – Complex metadata queries in minutes instead of days (“show all users who created over 100 50 MB files in March”)
[Figure: Lustre (MDS, OSS), Hadoop (name node, data nodes), and Accumulo (clients, tablet servers, tablets) components.]
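To make the example query concrete, here is a minimal Python sketch of its logic over an in-memory list of file metadata records. It does not use Accumulo or any Lustre tool: the record layout (user, size in bytes, creation date), the function name, and the sample data are all hypothetical, and "50 MB files" is read here as files of at least 50 MB.

```python
from collections import Counter
from datetime import date

MB = 10**6


def heavy_creators(records, month=3, min_size=50 * MB, min_count=100):
    """Users who created more than min_count files of at least min_size in the given month."""
    counts = Counter(
        user
        for user, size_bytes, created in records
        if created.month == month and size_bytes >= min_size
    )
    return sorted(user for user, n in counts.items() if n > min_count)


# Made-up metadata records: (user, size in bytes, creation date).
records = (
    [("alice", 60 * MB, date(2015, 3, 2))] * 101    # qualifies
    + [("bob", 60 * MB, date(2015, 3, 5))] * 100    # exactly 100, not "over 100"
    + [("carol", 10 * MB, date(2015, 3, 9))] * 500  # files too small
    + [("dave", 80 * MB, date(2015, 4, 1))] * 200   # wrong month
)
print(heavy_creators(records))
```

On a file system with 100M+ files, a scan like this is exactly what becomes slow; indexing the same fields in a database is what the slide credits for turning days into minutes.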

Mix n’ Match 4: Map/Reduce on Lustre
• LLMapReduce wraps schedulers (SGE, SLURM, …) in familiar Map/Reduce syntax
• Supports all languages
• No changes to user program
• Scales to 1000s of cores
  – In production use for 3 years
• Readily accepted by Java & Python Map/Reduce users
• 3 files, 300 lines of Python, can run/install in user space
• Parallel computing in 1 line of code; no change to user program
  LLMapReduce --input --output --mapper Mapper
[Figure: Lustre metadata server (MDS) with file metadata (filename, permissions, …) and object storage server (OSS) with objects.]
Reference: Byun et al., “Portable Map-Reduce Utility for MIT SuperCloud Environment,” IEEE HPEC 2015.
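The "apply an unchanged mapper program to every input file" pattern can be illustrated with a small Python sketch. This is not LLMapReduce and does not reproduce its options or internals: `run_map`, `word_count_mapper`, and the ".out" naming are hypothetical, and local threads stand in for the cluster scheduler (SGE, SLURM, …) that the real tool wraps.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_map(input_dir, output_dir, mapper, workers=4):
    """Call mapper(in_path, out_path) once per file in input_dir; return the output paths."""
    in_dir, out_dir = Path(input_dir), Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    inputs = sorted(p for p in in_dir.iterdir() if p.is_file())
    outputs = [out_dir / (p.name + ".out") for p in inputs]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any mapper exception.
        list(pool.map(mapper, inputs, outputs))
    return outputs


def word_count_mapper(in_path, out_path):
    """Example mapper: write the number of whitespace-separated words in the input."""
    out_path.write_text(str(len(in_path.read_text().split())))


# Usage on throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "in"
    src.mkdir()
    (src / "part1.txt").write_text("alpha beta gamma")
    (src / "part2.txt").write_text("delta")
    for out in run_map(src, Path(tmp) / "out", word_count_mapper):
        print(out.name, out.read_text())
```

The mapper knows nothing about parallelism, which is the property the slide highlights: the user program stays unchanged and only the launcher decides where and how many copies run.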

Summary
• Storage systems are a critical part of Big Data systems
• Lustre, Hadoop, and Accumulo are three important storage technologies
• Full-scale head-to-head comparison of Big Data storage systems is expensive and time-consuming
• Much can be learned from simple systems analysis