NSF Dibbs Award 5 yr Datanet CIF 21
• 5 yr. Datanet: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science. IU (Fox, Qiu, Crandall, von Laszewski), Rutgers (Jha), Virginia Tech (Marathe), Kansas (Paden), Stony Brook (Wang), Arizona State (Beckstein), Utah (Cheatham)
• HPC-ABDS: Cloud-HPC interoperable software with the performance of HPC (High Performance Computing) and the rich functionality of the commodity Apache Big Data Stack.
• SPIDAL (Scalable Parallel Interoperable Data Analytics Library): Scalable Analytics for Biomolecular Simulations, Network and Computational Social Science, Epidemiology, Computer Vision, Spatial Geographical Information Systems, Remote Sensing for Polar Science and Pathology Informatics.
Project plan by component (columns in the original table: Year 1, Year 2, Years 3-5)

SPIDAL
• Year 1: Community requirement and technology evaluation
• Year 2: SPIDAL-MIDAS interface and SPIDAL V1.0
• Years 3-5: Integrated testing with algorithms & MIDAS. Extend to V2.0

MIDAS
• Year 1: (i) Architecture and design spec (ii) In-memory pilot abstraction, integrate with XSEDE
• Year 2: SPIDAL scheduling components and execution processing. MIDAS on Blue Waters. V1.0 release
• Years 3-5: Scalability testing, adaptors for new platforms, support for tools and developers, optimization, Phase II of execution-processing models, V2.0

Community: HPC Biomolecular Simulations
• Year 1: Community requirements gathering
• Year 2: CPPTRAJ to integrate with MIDAS for ensemble analysis on Blue Waters
• Years 3-5: (i) Parallel Trajectory and MDAnalysis with MR (ii) iBIOMES data mgmt. in MIDAS (iii) End-to-end integration of CPPTraj-MIDAS with SPIDAL (iv) Use SPIDAL Kmeans (v) Tutorials and outreach

Community: Network Science and Comp. Social Science
• Year 1: (i) Gather community requirement (ii) Study existing network analytic algorithms
• Year 2: (i) Giraph-based clustering and community detection problems (ii) Integration of CINET in SPIDAL
• Years 3-5: (i) Algorithm implementation for subgraph problems (ii) Develop new algorithms as necessary

Community: Computational Epidemiology
• Year 1: Community requirement gathering
• Year 2: Design (i) Wrapper for EpiSimdemics and EpiFast (ii) Giraph simulation tool
• Years 3-5: (i) Implement the wrappers (ii) Start implementing Giraph-based tool (iii) Integrate EpiSimdemics and EpiFast with SPIDAL

Community: Spatial
• Cells in source order (column boundaries unclear): (i) Community reqs; (i) spatial 2D clustering and; (i) Implementation of 3D spatial; (ii) Spatial queries library and; (ii) Geospatial & pathology; queries. (ii) Application to 3D; 2D parallel apps; pathology

Community: Pathology
• Cells in source order (column boundaries unclear): (i) Implementation of 2D image; (i) Image registration, object; (i) Continued implementation of; preproc., segment and feature; matching & feature; 3D image processing library; extraction; and tumor research; extraction (3D); (ii) Application to liver and; (ii) Integrate MIDAS; neuroblastoma

Community: Computer vision
• Year 1: Port image processing, feature extraction, image matching, pleasingly parallel ML algos
• Year 2: (i) Implement ML and optimization algorithms; (ii) large-scale image recognition
• Years 3-5: (i) Continue implementing ML and global optimization; (ii) large-scale 3D recognition in social images

Community: Radar informatics
• Cells in source order (column boundaries unclear): (i) single-echogram layer; (i) Develop and implement; finding,; continent-scale layer finding; (i) change detection and; (ii) tile matching; (ii) flow field estimation in satellite images.
Machine Learning in Network Science, Imaging in Computer Vision, Pathology, Polar Science, Biomolecular Simulations

Algorithm | Applications | Features | Status | Parallelism

Graph Analytics (features listed for this group: Graph, Non-metric, static)
Community detection | Social networks, webgraph | | P-DM | GML-GrC
Subgraph/motif finding | Webgraph, biological/social networks | | P-DM | GML-GrB
Finding diameter | Social networks, webgraph | | P-DM | GML-GrB
Clustering coefficient | Social networks | | P-DM | GML-GrC
Page rank | Webgraph | | P-DM | GML-GrC
Maximal cliques | Social networks, webgraph | | P-DM | GML-GrB
Connected component | Social networks, webgraph | | P-DM | GML-GrB
Betweenness centrality | Social networks | | P-Shm | GML-GRA
Shortest path | Social networks, webgraph | | P-Shm |

Spatial Queries and Analytics (applications: GIS/social networks/pathology informatics; features: Geometric)
Spatial relationship based queries | | | P-DM | PP
Distance based queries | | | P-DM | PP
Spatial clustering | | | Seq | GML
Spatial modeling | | | Seq | PP

Legend
• GML: Global (parallel) ML
• GrA: Static
• GrB: Runtime partitioning
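To make one row of the graph analytics table concrete, here is a minimal sequential sketch of the local clustering coefficient: for each node, the fraction of its neighbor pairs that are themselves connected. The function name `local_clustering`, the adjacency-set representation, and the toy graph are assumptions made for illustration; this is not the SPIDAL implementation, which the table lists as a distributed-memory (P-DM) code.

```python
from itertools import combinations


def local_clustering(adj):
    """Return {node: clustering coefficient} for an undirected graph.

    adj maps each node to the set of its neighbours (hypothetical layout).
    """
    coeffs = {}
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            # Convention used here: a node with fewer than two neighbours gets 0.
            coeffs[node] = 0.0
            continue
        # Count the neighbour pairs that are directly connected to each other.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        coeffs[node] = 2.0 * links / (k * (k - 1))
    return coeffs


# Toy graph: triangle a-b-c with a pendant node d attached to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
coeffs = local_clustering(graph)
```

Each node's value depends only on its own neighborhood, so the per-node loop could be split across workers once the adjacency data is partitioned; how the partitioning is done is outside the scope of this sketch.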
Some specialized data analytics in SPIDAL

Columns in the original table: Algorithm, Applications, Features, Status, Parallelism

Core Image Processing
• Algorithms: Image preprocessing; Object detection & segmentation; Image/object feature computation; 3D image registration; Object matching; 3D feature extraction
• Applications: Computer vision/pathology informatics
• Features: Metric Space Point Sets, Neighborhood sets & Image features; Geometric

Deep Learning
• Algorithms: Network, Stochastic Gradient Descent
• Applications: Image Understanding, Language Translation, Voice Recognition, Car driving
• Features: Connections in artificial neural net

Status and parallelism cells in source order (row alignment unclear): P-DM PP; Seq PP; Todo PP; P-DM GML

Legend
• PP: Pleasingly Parallel (Local ML)
• Seq: Sequential available
• GRA: Good distributed algorithm needed
• Todo: No prototype available
• P-DM: Distributed memory available
• P-Shm: Shared memory available
Some Core Machine Learning Building Blocks

Algorithm | Applications | Features | Status | //ism
DA Vector Clustering | Accurate Clusters | Vectors | P-DM | GML
DA Non metric Clustering | Accurate Clusters, Biology, Web | Non metric, O(N^2)
Kmeans; Basic, Fuzzy and Elkan | Fast Clustering | Vectors
Levenberg-Marquardt Optimization | Non-linear Gauss-Newton, use in MDS | Least Squares
SMACOF Dimension Reduction | DA-MDS with general weights | Least Squares, O(N^2)
Vector Dimension Reduction | DA-GTM and Others | Vectors
TFIDF Search | Find nearest neighbors in document corpus | Bag of "words" (image features)
All-pairs similarity search | Find pairs of documents with TFIDF distance below a threshold |
Support Vector Machine SVM | Learn and Classify | Vectors | Seq | GML
Random Forest | Learn and Classify | Vectors | P-DM | PP
Gibbs sampling (MCMC) | Solve global inference problems | Graph | Todo | GML
Latent Dirichlet Allocation LDA with Gibbs sampling or Var. Bayes | Topic models (Latent factors) | Bag of "words" | P-DM | GML
Singular Value Decomposition SVD | Dimension Reduction and PCA | Vectors | Seq | GML
Hidden Markov Models (HMM) | Global inference on sequence models | Vectors | Seq | PP & GML

Status and //ism cells for the rows from DA Non metric Clustering through All-pairs similarity search, in source order (row alignment unclear): P-DM GML; P-DM GML; P-DM PP; Todo GML
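The GML (global machine learning) label on rows such as Kmeans can be illustrated with one K-means iteration written in a map-then-reduce style: each data partition computes per-centroid sums and counts on its own, and a global sum combines them before the centroids are updated. This is a minimal single-process sketch under stated assumptions; the function names `local_stats` and `kmeans_step` and the toy data are hypothetical and are not taken from SPIDAL, and the inner accumulation loop stands in for what would be a collective (allreduce-like) operation on a real cluster.

```python
def local_stats(points, centroids):
    """Map step: per-centroid coordinate sums and counts for one partition."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        # Assign the point to its nearest centroid (squared Euclidean distance).
        best = min(
            range(len(centroids)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
        )
        counts[best] += 1
        for d in range(dim):
            sums[best][d] += p[d]
    return sums, counts


def kmeans_step(partitions, centroids):
    """One iteration: map over partitions, then a global reduction."""
    dim = len(centroids[0])
    total_sums = [[0.0] * dim for _ in centroids]
    total_counts = [0] * len(centroids)
    for part in partitions:  # each call could run on a separate worker
        sums, counts = local_stats(part, centroids)
        # Collective step: element-wise sum of the partial results.
        for j in range(len(centroids)):
            total_counts[j] += counts[j]
            for d in range(dim):
                total_sums[j][d] += sums[j][d]
    # Keep the old centroid if a cluster received no points.
    return [
        [s / total_counts[j] for s in total_sums[j]]
        if total_counts[j]
        else list(centroids[j])
        for j in range(len(centroids))
    ]


# Two partitions of 2D points and two starting centroids (toy data).
partitions = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 10.0), (10.0, 12.0)],
]
new_centroids = kmeans_step(partitions, [(1.0, 1.0), (9.0, 9.0)])
```

A pleasingly parallel (PP) algorithm would need only the map step; the global reduction on every iteration is what makes this a GML pattern and why communication performance matters for it.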
Relevant DSC and XSEDE Computing Systems
• DSC adding 128 node Haswell based (2 chips, 24 or 36 cores per node) system (Juliet)
  – 128 GB memory per node
  – Substantial conventional disk per node (8 TB) plus PCI based 400 GB SSD
  – Infiniband with SR-IOV
  – Back end Lustre
• Older or Very Old (tired) machines
  – India (128 nodes, 1024 cores), Bravo (16 nodes, 128 cores), Delta (16 nodes, 192 cores), Echo (16 nodes, 192 cores), Tempest (32 nodes, 768 cores); some with large memory, large disk and GPU
  – Cray XT5m with 672 cores
• Optimized for Cloud research and Large scale Data analytics exploring storage models, algorithms
• Bare-metal vs. OpenStack virtual clusters
• Extensively used in Education
• XSEDE
  – Wrangler and Comet likely to be especially useful
Big Data Software Model
HPC ABDS SYSTEM (Middleware): HPC ABDS Hourglass
• >~ 266 Software Projects
• System Abstraction/Standards
  – Data Format and Storage
  – HPC Yarn for Resource management
  – Horizontally scalable parallel programming model
  – Collective and Point to Point Communication
  – Support for iteration (in memory processing)
• Application Abstractions/Standards
  – Graphs, Networks, Images, Geospatial ...
  – Scalable Parallel Interoperable Data Analytics Library (SPIDAL)
  – High performance Mahout, R, Matlab ...
• High Performance Applications
Layer labels in the figure: Applications, SPIDAL, MIDAS, ABDS
• Community & Examples: Govt. Operations; Commercial Defense; Healthcare, Life Science; Deep Learning, Social Media; Research Ecosystems; Astronomy, Physics; Earth, Env., Polar Science; Energy
• (Inter)disciplinary Workflow
• Analytics Libraries: SPIDAL
• Programming & Runtime Models
  – Native ABDS: SQL-engines, Storm, Impala, Hive, Shark
  – HPC-ABDS MapReduce: Map Only, PP Many Task; Classic MapReduce; Map-Collective; Map-Point to Point, Graph
  – Native HPC: MPI
• MIddleware for Data-Intensive Analytics and Science (MIDAS) API
• MIDAS
  – Communication (MPI, RDMA, Hadoop Shuffle/Reduce, HARP Collectives, Giraph point-to-point)
  – Data Systems and Abstractions (In-Memory; HBase, Object Stores, other NoSQL stores, Spatial, SQL, Files)
  – Higher-Level Workload Management (Tez, Llama)
  – Workload Management (Pilots, Condor)
  – External Data Access (Virtual Filesystem, GridFTP, SRM, SSH)
  – Framework specific Scheduling (e.g. YARN)
• Resource Fabric
  – Cluster Resource Manager (YARN, Mesos, SLURM, Torque, SGE)
  – Compute, Storage and Data Resources (Nodes, Cores, Lustre, HDFS)