Large Scale Data Processing Techniques for Astronomical Applications

Shanjiang Tang
School of Computer Science & Technology, Tianjin University
http://cs.tju.edu.cn/faculty/tangshanjiang
Nov 30th, 2017

Astronomical Data is BIG
• Lots of data is being collected and warehoused.
  – Astronomical images
  – Cosmological simulations
  – Data volume: e.g., 100 TB, 10 PB.

Problems of Existing Applications
• Many astronomical applications are NOT suitable for big data.
  – They do not SCALE well to large data volumes (e.g., a memory-bound application: Early Black Holes in Cosmological Simulations).
  – They are not FAST enough for large data volumes (e.g., an astronomical correlation function: Baryon Acoustic Oscillations).
• How can such large-scale data applications be handled?

Popular Data Processing Frameworks
• Lots of frameworks are emerging for different applications.
  – Batch applications
  – Streaming applications
  – Interactive applications
  – Graph applications
  – Deep learning applications

Hadoop Overview
• Open-source implementation of MapReduce
  – YARN: the second generation of Hadoop
  – Scales up to 6,000-10,000 machines
  – Supports multi-tenancy
[Figure borrowed from http://blog.c2b2.co.uk/2014/05/hadoop-v2-overview-and-cluster-setup-on.html]

MapReduce is a Promising Choice for Big Data Processing
• MapReduce, proposed by Google in 2004, is inspired by the map and reduce combinators of Lisp.
• Map: (key1, val1) → [(key2, val2)]. The map function takes a <key, value> pair as input and produces a set of zero or more intermediate <key, value> pairs.
• The framework groups together all the intermediate values associated with the same intermediate key and passes them to the reducer.
• Reduce: (key2, [val2]) → [val3]. The reduce function aggregates the values of a key, for example by folding them with a binary operation such as sum.
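The map, group, and reduce phases described above can be illustrated with a minimal single-process sketch in Python. It is not Hadoop code: the helper `run_mapreduce` and the toy records (an image id paired with a string of object labels) are assumptions made purely for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_fn(key, value):
    # Map: (key1, val1) -> [(key2, val2)].
    # Here: emit (label, 1) for every object label found in one image record.
    return [(label, 1) for label in value.split()]


def reduce_fn(key, values):
    # Reduce: (key2, [val2]) -> val3.
    # Here: fold the values with a binary operation (sum).
    return reduce(lambda a, b: a + b, values)


def run_mapreduce(records, map_fn, reduce_fn):
    # Hypothetical driver standing in for the framework.
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)  # shuffle: group values by intermediate key
    return {k2: reduce_fn(k2, vals) for k2, vals in groups.items()}


# Toy input: (image id, labels of the objects detected in that image).
records = [(1, "galaxy star galaxy"), (2, "star quasar")]
counts = run_mapreduce(records, map_fn, reduce_fn)
```

In a real deployment the grouping step is a distributed shuffle across machines; the dictionary here only mimics its effect on one node.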

Hadoop-based Astronomy Applications
• Bin Fu, et al. DiscFinder: A Data-Intensive Scalable Cluster Finder for Astrophysics. In HPDC'10.
• Bin Fu, et al. Exact and Approximate Computation of a Histogram of Pairwise Distances between Astronomical Objects. In AstroHPC 2012.
• H. Willey, et al. Astronomy in the Cloud: Using MapReduce for Image Coaddition. In arXiv, 2012.
• C. C. Mi, et al. An Efficient Cross-Match Implementation Based on Directed Join Algorithm in MapReduce. In UCC'11.
• K. S. Lee, et al. An Efficient Astronomical Crossmatching Model Based on MapReduce Mechanism. In ASE BD&SI'15.

DiscFinder: A Distributed Version of the Friends-of-Friends Technique
• Friends-of-Friends algorithm for galaxy clustering
  – Two galaxies are "friends" if they are close to each other.
  – Vertices denote galaxies, and edges denote their friendships.
  – Time complexity is O((n · log n)^1.5) for the exact computation, and O(n) for an approximate algorithm.
[Figure: Galaxy clusters and space partitioning]
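The friends-of-friends idea can be sketched as connected components over a "closer than a linking length" graph. The brute-force version below checks every pair, so it is O(n^2) and does not reach the complexities quoted on the slide; the function name and the `linking_length` parameter are assumptions for illustration, not DiscFinder's API.

```python
import math


def friends_of_friends(points, linking_length):
    # Union-find over point indices; each cluster is one connected component.
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    # Two galaxies are "friends" if they lie within the linking length.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= linking_length:
                union(i, j)

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy 3-D positions: a chain of three galaxies, a pair, and an isolated one.
galaxies = [(0, 0, 0), (0.5, 0, 0), (1.0, 0, 0), (5, 5, 5), (5.4, 5, 5), (9, 9, 9)]
groups = friends_of_friends(galaxies, linking_length=0.6)
```

Note that galaxies 0 and 2 end up in the same cluster although they are 1.0 apart: friendship is transitive through galaxy 1, which is what "friends of friends" refers to.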

MapReduce-based Approach
• A MapReduce "wrapper" distributes the friends-of-friends computation across nodes.
  – Divide the space into cubes.
  – Apply a sequential friends-of-friends procedure within each cube.
  – Identify cross-cube "friendships" and merge the respective clusters.
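The three steps above (partition, per-cube clustering, cross-cube merge) can be sketched on a single machine as follows. This is only an illustration of the idea, not DiscFinder's actual implementation: all names are hypothetical, the per-cube step uses brute-force pair checks, and the sketch assumes 3-D points and `cube_size >= linking_length` so that friends can only sit in the same or an adjacent cube.

```python
import math
from collections import defaultdict
from itertools import product


class DisjointSet:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def partition(points, cube_size):
    # Step 1 (map side): assign every point index to the cube containing it.
    cubes = defaultdict(list)
    for idx, p in enumerate(points):
        cubes[tuple(math.floor(c / cube_size) for c in p)].append(idx)
    return cubes


def cluster_cube(ids, points, linking_length):
    # Step 2 (per-cube task): sequential friends-of-friends inside one cube.
    local = DisjointSet()
    for i, a in enumerate(ids):
        local.find(a)
        for b in ids[i + 1:]:
            if math.dist(points[a], points[b]) <= linking_length:
                local.union(a, b)
    return {a: local.find(a) for a in ids}  # point index -> local cluster label


def merge_clusters(cubes, labels, points, linking_length):
    # Step 3: link local clusters that share a cross-cube friendship.
    merged = DisjointSet()
    for cube, ids in cubes.items():
        for offset in product((-1, 0, 1), repeat=3):
            other = tuple(c + o for c, o in zip(cube, offset))
            if other <= cube or other not in cubes:
                continue  # skip the cube itself; visit each cube pair once
            for a in ids:
                for b in cubes[other]:
                    if math.dist(points[a], points[b]) <= linking_length:
                        merged.union(labels[a], labels[b])
    return merged


def partitioned_fof(points, cube_size, linking_length):
    cubes = partition(points, cube_size)
    labels = {}
    for ids in cubes.values():
        labels.update(cluster_cube(ids, points, linking_length))
    merged = merge_clusters(cubes, labels, points, linking_length)
    clusters = defaultdict(list)
    for idx in range(len(points)):
        clusters[merged.find(labels[idx])].append(idx)
    return sorted(clusters.values())


# Points 0 and 1 are friends but fall into different cubes.
pts = [(0.9, 0.5, 0.5), (1.1, 0.5, 0.5), (1.5, 0.5, 0.5), (3.5, 3.5, 3.5)]
result = partitioned_fof(pts, cube_size=1.0, linking_length=0.5)
```

In a distributed setting, step 2 would run independently per cube on different nodes; a practical merge step would also restrict the cross-cube comparison to points near cube boundaries rather than comparing all points of neighbouring cubes.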

MapReduce-based Approach (cont'd)
• The partitioning and clustering stages of DiscFinder are based on MapReduce.

Spark Overview
• A fast and general large-scale data processing system
  – Much faster than Hadoop, thanks to in-memory computing.
  – Supports fault tolerance.
  – Supports graph, streaming, interactive, and machine learning processing.
[Figures: Spark ecosystem; Logistic regression in Hadoop and Spark]

Spark-based Astronomy Applications
• Zhao Zhang, et al. Scientific Computing Meets Big Data Technology: An Astronomy Use Case. In IEEE Big Data, 2015.
• Zhao Zhang, et al. Kira: Processing Astronomy Imagery Using Big Data Technology. In IEEE TBD, 2016.
• Panos Labropoulos, et al. Distributed Data Processing Using Spark in Radio Astronomy. In TERATEC 2016.
• Mariem Brahem, et al. AstroSpark - Towards a Distributed Data Server for Big Data in Astronomy. In SIGSPATIAL Workshop, 2016.

Kira: An Astronomy Image Processing Toolkit Built on Spark
• A typical supernova detection pipeline
[Figure: pipeline diagram with the stages Images, Source Extraction, Catalogs, Point Spread Function Estimation, Image Reprojection, Image Coaddition, Source Extraction, Object Classification]

Source Extractor Steps
• Background Estimation
• Background Subtraction
• Object Detection through Convolution
• Object Statistics Evaluation
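The four steps listed above can be walked through on a tiny image with a pure-Python sketch. It is a deliberate simplification and not the Source Extractor or Kira code: the background is a single global median rather than a locally varying estimate, detection is a plain threshold on the smoothed image, and all function names and the threshold value are assumptions for illustration.

```python
from statistics import median


def estimate_background(image):
    # Step 1: a single global median stands in for a local background map.
    return median(v for row in image for v in row)


def subtract_background(image, background):
    # Step 2: remove the background level from every pixel.
    return [[v - background for v in row] for row in image]


def smooth3x3(image, kernel):
    # Step 3a: filter with a 3x3 kernel; pixels outside the image count as zero.
    # (For a symmetric kernel, correlation and convolution coincide.)
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        out[y][x] += image[yy][xx] * kernel[dy + 1][dx + 1]
    return out


def detect_objects(smoothed, threshold):
    # Step 3b: group 4-connected pixels above the threshold into objects.
    h, w = len(smoothed), len(smoothed[0])
    seen, objects = set(), []
    for y in range(h):
        for x in range(w):
            if smoothed[y][x] <= threshold or (y, x) in seen:
                continue
            stack, pixels = [(y, x)], []
            seen.add((y, x))
            while stack:
                cy, cx = stack.pop()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w and (ny, nx) not in seen
                            and smoothed[ny][nx] > threshold):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            objects.append(pixels)
    return objects


def object_stats(image, pixels):
    # Step 4: total flux and flux-weighted centroid (assumes non-zero flux).
    flux = sum(image[y][x] for y, x in pixels)
    cy = sum(y * image[y][x] for y, x in pixels) / flux
    cx = sum(x * image[y][x] for y, x in pixels) / flux
    return {"flux": flux, "centroid": (cy, cx), "npix": len(pixels)}


# Toy 5x5 image: flat background of 10 with one bright pixel in the centre.
image = [[10] * 5 for _ in range(5)]
image[2][2] = 110
kernel = [[k / 16 for k in row] for row in ([1, 2, 1], [2, 4, 2], [1, 2, 1])]

background = estimate_background(image)
residual = subtract_background(image, background)
objects = detect_objects(smooth3x3(residual, kernel), threshold=10)
stats = [object_stats(residual, pixels) for pixels in objects]
```

Each step works on one image independently of all others, which is the property that lets a framework such as Spark spread many images across workers.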

Kira Source Extractor Architecture

Experimental Results
• Performance on a 1 TB dataset: Kira SE vs. C

Conclusion
• The huge volume and rapid growth of datasets in scientific computing fields such as astronomy demand a fast and scalable data processing system.
• Leveraging a big data platform such as Hadoop or Spark would enable scientists to benefit from the rapid pace of innovation and the large range of systems being driven by widespread interest in big data analytics.
• Finally, in the era of big data, astronomical informatics is essential, and our team has done a lot of big data processing work in this area.

Thanks! Questions?