Workshop on Petascale Data Analytics Challenges and Opportunities

Workshop on Petascale Data Analytics: Challenges and Opportunities, SC11. SALSA HPC Group, http://salsahpc.indiana.edu, School of Informatics and Computing, Indiana University

Twister: Bingjing Zhang, Richard Teng. Twister4Azure: Thilina Gunarathne. High-Performance Visualization Algorithms for Data-Intensive Analysis: Seung-Hee Bae and Jong Youl Choi

DryadLINQ CTP Evaluation: Hui Li, Yang Ruan, and Yuduo Zhou. Cloud Storage, FutureGrid: Xiaoming Gao, Stephen Wu. Million Sequence Challenge: Saliya Ekanayake, Adam Hughs, Yang Ruan. Cyberinfrastructure for Remote Sensing of Ice Sheets: Jerome Mitchell

Intel’s Application Stack

Applications (Support Scientific Simulations, Data Mining and Data Analysis): Kernels, Genomics, Proteomics, Information Retrieval, Polar Science, Scientific Simulation Data Analysis and Management, Dissimilarity Computation, Clustering, Multidimensional Scaling, Generative Topological Mapping
Security, Provenance, Portal: Services and Workflow
Programming Model: High Level Language
Runtime: Cross Platform Iterative MapReduce (Collectives, Fault Tolerance, Scheduling)
Storage: Distributed File Systems, Object Store, Data Parallel File System
Infrastructure: Windows Server HPC Bare-system, Linux HPC Bare-system, Virtualization, Amazon Cloud, Azure Cloud, Grid Appliance
Hardware: CPU Nodes, GPU Nodes

What are the challenges? Providing both cost effectiveness and powerful parallel programming paradigms that are capable of handling the incredible increases in dataset sizes (large-scale data analysis for data-intensive applications). These challenges must be met for both computation and storage; if computation and storage are separated, it is not possible to bring computing to the data.
Research issues:
§ data locality: its impact on performance, the factors that affect it, and the maximum degree of data locality that can be achieved;
§ portability between HPC and Cloud systems;
§ scaling performance;
§ fault tolerance.
Factors beyond data locality to improve performance: achieving the best data locality is not always the optimal scheduling decision. For instance, if the node where the input data of a task are stored is overloaded, running the task on it will result in performance degradation (a simplified sketch follows this list).
Task granularity and load balance: in MapReduce, task granularity is fixed. This mechanism has two drawbacks: (1) a limited degree of concurrency and (2) load imbalance resulting from the variation of task execution time.
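A minimal sketch of that scheduling trade-off, using hypothetical Node and Task types rather than any of the runtimes discussed here: prefer the node that stores the task's input data, but fall back to the least-loaded node when the data-local node is overloaded.

    import java.util.Comparator;
    import java.util.List;

    // Hypothetical model of the data-locality vs. load-balance trade-off described above.
    class LocalityAwareScheduler {

        static class Node {
            final String name;
            double load;                                   // e.g. fraction of busy task slots
            Node(String name, double load) { this.name = name; this.load = load; }
        }

        static class Task {
            final String inputBlock;
            Task(String inputBlock) { this.inputBlock = inputBlock; }
        }

        private final double overloadThreshold;

        LocalityAwareScheduler(double overloadThreshold) {
            this.overloadThreshold = overloadThreshold;
        }

        // Pick the node that stores the task's input unless it is overloaded;
        // otherwise fall back to the least-loaded node in the cluster.
        Node schedule(Task task, Node dataLocalNode, List<Node> allNodes) {
            if (dataLocalNode.load < overloadThreshold) {
                return dataLocalNode;                      // best data locality
            }
            return allNodes.stream()                       // locality is not the optimal choice here
                    .min(Comparator.comparingDouble(n -> n.load))
                    .orElse(dataLocalNode);
        }

        public static void main(String[] args) {
            Node a = new Node("node-a", 0.95);             // holds the input block but is overloaded
            Node b = new Node("node-b", 0.20);
            LocalityAwareScheduler s = new LocalityAwareScheduler(0.8);
            Node chosen = s.schedule(new Task("block-1"), a, List.of(a, b));
            System.out.println("Run task on " + chosen.name);   // prints node-b
        }
    }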

Programming Models and Tools: MapReduce in Heterogeneous Environment

Motivation: the Data Deluge is being experienced in many domains. MapReduce is data centered and offers QoS; classic parallel runtimes (MPI) offer efficient and proven techniques. The goal is to expand the applicability of MapReduce to more classes of applications: Map-Only, Classic MapReduce, Iterative MapReduce, and further extensions.

• Distinction between static and variable data
• Configurable long-running (cacheable) map/reduce tasks
• Pub/sub messaging based communication/data transfers
• Broker network for facilitating communication

Main program’s process space, with cacheable map/reduce tasks on the Worker Nodes (Local Disk):

    configureMaps(..)
    configureReduce(..)
    while(condition){
        runMapReduce(..)      // Map(), Reduce(), Combine() operation; may send <Key, Value> pairs directly
        updateCondition()
    } //end while
    close()

Communications/data transfers go via the pub/sub broker network and direct TCP.
• The main program may contain many MapReduce invocations or iterative MapReduce invocations.
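The while loop above is the heart of the Twister driver pattern: static data is configured and cached once, while variable data is re-sent every iteration. Below is a minimal sketch of that control flow in Java; the class and method names are illustrative stand-ins, not the actual Twister API.

    // Illustrative skeleton of the iterative driver loop shown above; hypothetical names only.
    class IterativeDriverSketch {

        interface MapReduceJob {
            double[] run(double[] variableData);     // map -> reduce -> combine over cached static data
        }

        public static void main(String[] args) {
            // configureMaps(..)/configureReduce(..): in Twister, static data is cached once per worker;
            // here the "job" simply damps its input so the loop converges.
            MapReduceJob job = variable -> {
                double[] next = new double[variable.length];
                for (int i = 0; i < variable.length; i++) next[i] = variable[i] * 0.5;
                return next;
            };

            double[] centroids = {10.0, -4.0, 7.5};  // variable data, re-broadcast each iteration
            double change = Double.MAX_VALUE;

            while (change > 0.001) {                 // updateCondition() in the slide's pseudocode
                double[] updated = job.run(centroids);   // runMapReduce(..)
                change = 0.0;
                for (int i = 0; i < centroids.length; i++) change += Math.abs(updated[i] - centroids[i]);
                centroids = updated;
            }
            System.out.println("Converged; final change = " + change);
            // close(): release cached map/reduce tasks and broker connections
        }
    }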

Twister architecture: the Master Node runs the Twister Driver and the Main Program, connected to Worker Nodes through a Pub/Sub Broker Network (one broker serves several Twister daemons). Each Worker Node runs a Twister Daemon with a Worker Pool of cacheable map/reduce tasks and a Local Disk. Scripts perform data distribution, data collection, and partition file creation.

Components of Twister

Twister4Azure: Queues for scheduling, Tables to store meta-data and monitoring data, Blobs for input/output/intermediate data storage.
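As a rough illustration of how those three services divide the work, the sketch below uses hypothetical TaskQueue, StatusTable, and BlobStore interfaces (not the real Azure SDK): a worker pulls task ids from a queue, reads input and writes intermediate data through blobs, and records its status in a table.

    import java.util.*;

    // Hypothetical interfaces standing in for Azure Queue/Table/Blob services.
    class AzureWorkerSketch {

        interface TaskQueue { Optional<String> dequeueTaskId(); }               // Queues: scheduling
        interface StatusTable { void recordStatus(String taskId, String s); }   // Tables: meta-data, monitoring
        interface BlobStore { byte[] read(String key); void write(String key, byte[] data); }  // Blobs: data

        static void workerLoop(TaskQueue queue, StatusTable table, BlobStore blobs) {
            while (true) {
                Optional<String> taskId = queue.dequeueTaskId();   // pull-based, decentralized scheduling
                if (taskId.isEmpty()) break;                       // no more work queued
                String id = taskId.get();
                table.recordStatus(id, "RUNNING");
                byte[] input = blobs.read("input/" + id);          // read input from blob storage
                byte[] output = map(input);                        // user-defined map computation (placeholder)
                blobs.write("intermediate/" + id, output);         // intermediate data back to blob storage
                table.recordStatus(id, "DONE");                    // monitoring data recorded in the table
            }
        }

        static byte[] map(byte[] input) { return input; }          // identity map, stands in for real work

        public static void main(String[] args) {
            Deque<String> ids = new ArrayDeque<>(List.of("task-1", "task-2"));
            Map<String, byte[]> store = new HashMap<>();
            store.put("input/task-1", "a".getBytes());
            store.put("input/task-2", "b".getBytes());
            workerLoop(
                    () -> Optional.ofNullable(ids.poll()),
                    (id, s) -> System.out.println(id + " -> " + s),
                    new BlobStore() {
                        public byte[] read(String key) { return store.get(key); }
                        public void write(String key, byte[] data) { store.put(key, data); }
                    });
        }
    }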

Iterative MapReduce for Azure

Twister4Azure
• Distributed, highly scalable and highly available cloud services as the building blocks.
• Utilizes eventually-consistent, high-latency cloud services effectively to deliver performance comparable to traditional MapReduce runtimes.
• Decentralized architecture with global-queue-based dynamic task scheduling.
• Minimal management and maintenance overhead.
• Supports dynamically scaling the compute resources up and down.
• MapReduce fault tolerance.

Performance Comparisons: BLAST Sequence Search, Smith-Waterman Sequence Alignment, Cap3 Sequence Assembly

Performance – Kmeans Clustering: Task Execution Time Histogram; Number of Executing Map Tasks Histogram; Strong Scaling with 128M Data Points; Weak Scaling.
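For context on what each iteration of this benchmark computes, here is a minimal single-node sketch of one K-means iteration in map/reduce form: map assigns the points of one data split to the nearest broadcast centroid and emits partial sums, and reduce combines the partial sums into new centroids. This mirrors the iterative MapReduce pattern measured above but is not the benchmarked code.

    import java.util.*;

    // Single-node sketch of one K-means iteration in map/reduce form; illustrative only.
    class KMeansIterationSketch {

        // map: for one data split, assign each point to its nearest centroid and emit partial sums
        static Map<Integer, double[]> map(List<double[]> split, double[][] centroids) {
            Map<Integer, double[]> partial = new HashMap<>();  // centroid index -> [sum of coords.., count]
            for (double[] p : split) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < centroids.length; c++) {
                    double d = 0;
                    for (int i = 0; i < p.length; i++) d += (p[i] - centroids[c][i]) * (p[i] - centroids[c][i]);
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                double[] acc = partial.computeIfAbsent(best, key -> new double[p.length + 1]);
                for (int i = 0; i < p.length; i++) acc[i] += p[i];
                acc[p.length] += 1;                            // point count for this centroid
            }
            return partial;
        }

        // reduce: combine partial sums from all map tasks into new centroids
        static double[][] reduce(List<Map<Integer, double[]>> partials, int k, int dim) {
            double[][] next = new double[k][dim];
            double[] counts = new double[k];
            for (Map<Integer, double[]> part : partials)
                for (Map.Entry<Integer, double[]> e : part.entrySet()) {
                    for (int i = 0; i < dim; i++) next[e.getKey()][i] += e.getValue()[i];
                    counts[e.getKey()] += e.getValue()[dim];
                }
            for (int c = 0; c < k; c++)
                for (int i = 0; i < dim; i++) if (counts[c] > 0) next[c][i] /= counts[c];
            return next;                                       // broadcast as variable data next iteration
        }

        public static void main(String[] args) {
            List<double[]> split = List.of(new double[]{0, 0}, new double[]{10, 10});
            double[][] centroids = {{1, 1}, {9, 9}};
            double[][] next = reduce(List.of(map(split, centroids)), 2, 2);
            System.out.println(Arrays.deepToString(next));     // [[0.0, 0.0], [10.0, 10.0]]
        }
    }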

Performance – Multidimensional Scaling: Weak Scaling; Data Size Scaling; Azure Instance Type Study; Number of Executing Map Tasks Histogram.

This demo shows real-time visualization of the progress of a multidimensional scaling (MDS) calculation. We use Twister to do the parallel calculation inside the cluster, and PlotViz to show the intermediate results on the user's client computer. The process of computation and monitoring is automated by the program.

MDS projection of 100,000 protein sequences showing a few experimentally identified clusters, in preliminary work with Seattle Children’s Research Institute

Twister-MDS monitoring setup: the Client Node (MDS Monitor, PlotViz) sends a message to start the job (I) to the Master Node (Twister Driver, Twister-MDS) via the ActiveMQ Broker, and intermediate results are sent back (II) to the client.

Twister-MDS execution: the Master Node (Twister Driver, Twister-MDS) drives Twister Daemons on the Worker Nodes through the Pub/Sub Broker Network. Each iteration runs two map/reduce pairs, calculateBC and calculateStress, over the Worker Pools; MDS output is shown on the Monitoring Interface.
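A hedged sketch of the control flow implied by this diagram: each MDS iteration runs two MapReduce jobs, calculateBC and then calculateStress, and pushes the intermediate coordinates to the monitoring interface. The Job and Monitor interfaces below are illustrative stand-ins, not the actual Twister-MDS classes.

    // Illustrative Twister-MDS-style driver loop; the interfaces are hypothetical stand-ins.
    class MdsDriverSketch {

        interface Job<I, O> { O run(I input); }                      // one MapReduce job (map -> reduce -> combine)
        interface Monitor { void publish(double[][] coordinates); }  // e.g. PlotViz client via the broker

        static double[][] run(double[][] initial,
                              Job<double[][], double[][]> calculateBC,  // first job of each iteration
                              Job<double[][], Double> calculateStress,  // second job: fit error of the layout
                              Monitor monitor,
                              double threshold, int maxIterations) {
            double[][] x = initial;
            double stress = Double.MAX_VALUE;
            for (int iter = 0; iter < maxIterations && stress > threshold; iter++) {
                x = calculateBC.run(x);                  // update coordinates
                stress = calculateStress.run(x);         // evaluate stress on the new coordinates
                monitor.publish(x);                      // send intermediate result to the client
            }
            return x;
        }

        public static void main(String[] args) {
            double[][] result = run(
                    new double[][]{{1.0, 2.0}},
                    x -> x,                                           // placeholder calculateBC job
                    x -> 0.0,                                         // placeholder calculateStress job
                    coords -> System.out.println("published " + coords.length + " points"),
                    0.001, 10);
            System.out.println("done, " + result.length + " points");
        }
    }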

Map-Collective Model: the User Program supplies the input; each iteration performs an Initial Collective Step over the network of brokers, then the map and reduce phases, then a Final Collective Step over the network of brokers.

Broker topologies (Twister Driver Node, ActiveMQ Broker Nodes, Twister Daemon Nodes, with broker-driver, broker-daemon, and broker-broker connections): A. Full Mesh Network (5 brokers and 4 computing nodes in total); B. Hierarchical Sending (7 brokers and 32 computing nodes in total); C. Streaming.

Twister-MDS Execution Time (100 iterations, 40 nodes, under different input data sizes): chart of Total Execution Time (seconds) versus Number of Data Points (38400, 51200, 76800, 102400), comparing the Original Execution Time (1 broker only) with the Current Execution Time (7 brokers, the best broker number); reported times range from about 148.8 to 1508.5 seconds.

Broadcasting time comparison for Method B versus Method C at 400 MB, 600 MB, and 800 MB of data (in Method C, centroids are split into 160 blocks and sent through 40 brokers in 4 rounds); reported broadcasting times range from about 13.1 to 93.1 seconds.

Twister New Architecture: the Master Node (Twister Driver) and Worker Nodes (Twister Daemons) are connected through a Broker. A broadcasting chain configures the mappers and adds data to the in-memory cache; cacheable Map, Reduce, and merge tasks run on the workers, and results return along a collection chain.

Chain/Ring Broadcasting (Twister Driver Node and Twister Daemon Nodes); a simplified sketch follows.
• Driver sender: send broadcasting data; get acknowledgement; send the next broadcasting data; …
• Daemon sender: receive data from the previous daemon (or the driver); cache the data on the daemon; send the data to the next daemon (waits for ACK); send an acknowledgement to the previous daemon.
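A simplified, non-pipelined simulation of the chain idea above, with direct method calls standing in for the TCP links between driver and daemons: each daemon caches a received block, forwards it to the next daemon, and the acknowledgement propagates back toward the driver, which only then sends the next block. The class names are illustrative, not Twister's, and the real protocol overlaps transfers along the chain rather than running them sequentially.

    import java.util.ArrayList;
    import java.util.List;

    // Sequential simulation of chain broadcasting; illustrative only.
    class ChainBroadcastSketch {

        static class Daemon {
            final int id;
            final List<byte[]> cache = new ArrayList<>();  // data blocks cached on this daemon
            Daemon next;                                   // next hop in the chain (null at the end)
            Daemon(int id) { this.id = id; }

            // Receive one block: cache it, forward it down the chain, then acknowledge upstream.
            boolean receive(byte[] block) {
                cache.add(block);
                boolean downstreamAck = (next == null) || next.receive(block);
                return downstreamAck;                      // ack propagates back toward the driver
            }
        }

        public static void main(String[] args) {
            // Build a chain: driver -> d0 -> d1 -> d2
            Daemon d0 = new Daemon(0), d1 = new Daemon(1), d2 = new Daemon(2);
            d0.next = d1; d1.next = d2;

            byte[][] blocks = {"block-A".getBytes(), "block-B".getBytes()};
            for (byte[] block : blocks) {                  // driver: send the next block after the ack
                boolean ack = d0.receive(block);
                System.out.println("driver got ack for a block: " + ack);
            }
            System.out.println("last daemon cached " + d2.cache.size() + " blocks");
        }
    }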

Chain Broadcasting Protocol (sequence among Driver, Daemon 0, Daemon 1, and Daemon 2): the Driver sends a block to Daemon 0, which receives and handles the data and sends it on to Daemon 1, which does the same toward Daemon 2; acknowledgements flow back up the chain, and the Driver sends the next block after getting the ack. The last daemon knows it is the end of the daemon chain, and the final ack marks the end of a cache block.

Broadcasting Time Comparison on 80 nodes, 600 MB data, 160 pieces: Broadcasting Time (seconds) over 50 executions, comparing Chain Broadcasting with All-to-All Broadcasting (40 brokers).

Applications & Different Interconnection Patterns
• Map Only: CAP3 analysis, document conversion (PDF -> HTML), brute-force searches in cryptography, parametric sweeps; CAP3 gene assembly, PolarGrid Matlab data analysis.
• Classic MapReduce: High Energy Physics (HEP) histograms, SWG gene alignment, distributed search, distributed sorting, information retrieval; information retrieval, HEP data analysis, calculation of pairwise distances for ALU sequences.
• Iterative Reductions (Twister): expectation maximization algorithms, clustering, linear algebra; Kmeans, deterministic annealing clustering, multidimensional scaling (MDS).
• Loosely Synchronous: many MPI scientific applications utilizing a wide variety of communication constructs, including local interactions; solving differential equations, particle dynamics with short-range forces.
The first three patterns are the domain of MapReduce and its iterative extensions; the last is the domain of MPI.

Scheduling vs. Computation of Dryad in a Heterogeneous Environment

Runtime Issues

• Development of a library of Collectives to use at the Reduce phase: Broadcast and Gather are needed by current applications; discover other important ones; implement efficiently on each platform, especially Azure.
• Better software message routing with broker networks, using asynchronous I/O with communication fault tolerance.
• Support nearby location of data and computing using data parallel file systems.
• Clearer application fault tolerance model based on implicit synchronization points at iteration end points.
• Later: investigate GPU support.
• Later: runtime for data parallel languages like Sawzall, Pig Latin, LINQ.

Convergence is Happening: data-intensive paradigms (a data-intensive application has three basic activities: capture, curation, and analysis/visualization), Clouds (cloud infrastructure and runtime), and Multicore (parallel threading and processes).

FutureGrid: a Grid Testbed
• IU Cray operational; IU IBM (iDataPlex) completed stability test May 6.
• UCSD IBM operational; UF IBM stability test completes ~May 12.
• Network, NID, and PU HTC system operational.
• UC IBM stability test completes ~May 27; TACC Dell awaiting delivery of components.
(Private FG network and public network; NID: Network Impairment Device)

SALSAHPC Dynamic Virtual Clusters on FutureGrid: demo at SC09 demonstrating the concept of Science on Clouds.
Dynamic cluster architecture: SW-G using Hadoop (Linux bare-system, Linux on Xen) and SW-G using DryadLINQ (Windows Server 2008 bare-system) on XCAT infrastructure over iDataPlex bare-metal nodes (32 nodes).
Monitoring infrastructure: monitoring and control infrastructure, Pub/Sub Broker Network, virtual/physical clusters, XCAT infrastructure, monitoring interface, summarizer, switcher, iDataPlex bare-metal nodes.
• Switchable clusters on the same hardware (~5 minutes between different OSes, such as Linux+Xen to Windows+HPCS)
• Support for virtual clusters
• SW-G: Smith-Waterman-Gotoh dissimilarity computation, a pleasingly parallel problem suitable for MapReduce-style applications

SALSAHPC Dynamic Virtual Clusters on FutureGrid: demo at SC09 demonstrating the concept of Science on Clouds using a FutureGrid cluster.
• Top: three clusters are switching applications on a fixed environment. Takes approximately 30 seconds.
• Bottom: a cluster is switching between environments: Linux; Linux+Xen; Windows+HPCS. Takes approximately 7 minutes.
• SALSAHPC demo at SC09, demonstrating the concept of Science on Clouds using a FutureGrid iDataPlex.

SALSA HPC Group, Indiana University, http://salsahpc.indiana.edu