Spark & MongoDB for LSST

Christian Arnault (LAL), Réza Ansari (LAL), Fabrice Jammes (LPC Clermont), Osman Aidel (CC-IN2P3), César Richard (U-PSud)

June 15, 2015, LSST Workshop, CC-IN2P3
Topics…

• Spark
  – How to handle parallelism & distribution in the processing workflows
    • How to cope with intermediate data
    • Manage the steps in the workflow
    • Produce the final data (catalogues)
  – How to distribute data (data formats)
    • Avro/Parquet (converting the FITS format)
• MongoDB
  – To understand whether MongoDB might offer features similar to QServ
• Spark (again)
  – Same question, but using the Spark DataFrame technology
  – Combined with the GeoSpark module for 2D indexing
Spark: the simplified process

[Workflow diagram] Simulation / Observation → Images → Calibration (sky background) → Object detection → Objects {x, y, flux} → Astrometry, photometry, photo-Z (using reference catalogues) → Measured objects {RA, DEC, flux, magnitude, Z} → Catalogues
Typical numbers

• Camera
  – 3.2 Gpixels
  – 15 TB per night (× 10 years)
  – 189 CCDs / 6 filters
  – Image
    • Diameter: 3.5° / 64 cm → 9.6 deg² (Moon = 0.5°)
    • ~300 000 × 6 CCD images
• CCD
  – 16 Mpixels (= 1 FITS file)
  – 16 cm²
  – 3 GB/s
  – 0.05 deg² = 3′ 2.9″
• Pixels
  – 10 µm, 0.2 arcsec
  – 2 bytes
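As a rough consistency check on the figures above (a back-of-the-envelope sketch, not taken from the slides; the derived number of exposures per night is my own estimate):

n_ccds = 189
pixels_per_ccd = 16e6            # 16 Mpixels per CCD
bytes_per_pixel = 2

pixels_per_focal_plane = n_ccds * pixels_per_ccd        # ~3.0e9, consistent with "3.2 Gpixels"
bytes_per_exposure = pixels_per_focal_plane * bytes_per_pixel   # ~6 GB of raw pixels per exposure

night_volume = 15e12             # 15 TB per night
exposures_per_night = night_volume / bytes_per_exposure

print(round(pixels_per_focal_plane / 1e9, 1), "Gpixels")
print(round(bytes_per_exposure / 1e9, 1), "GB per exposure")
print(round(exposures_per_night), "exposures per night (order of magnitude)")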
Algorithms

• Simulation:
  – Apply a Gaussian pattern with a common width (i.e. we only consider atmosphere and optical aberrations) + some noise
• Detection:
  – Convolution with a Gaussian pattern for the PSF (both steps are sketched below)
  – Handle an overlap margin for objects close to the image border
• Identification:
  – Search for geo-2D coordinates in the reference catalogues
• Handling a large number of data files
  – Based on multiple indexing keys (run, filter, ra, dec, …), aka the ‘data butler’
• Studying the transfer mechanisms
  – Throughput, serialization
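A minimal numpy/scipy sketch of the simulation and detection steps described above (illustrative only; the image size, PSF width, source count and detection threshold are arbitrary assumptions, not the values used in the actual code):

import numpy as np
from scipy.ndimage import gaussian_filter, label

rng = np.random.default_rng(0)

# Simulation: point sources smeared by a common Gaussian PSF + noise
image = np.zeros((512, 512))
for _ in range(20):
    x, y = rng.integers(0, 512, size=2)
    image[x, y] = rng.uniform(100.0, 1000.0)       # point-source flux
image = gaussian_filter(image, sigma=2.0)          # atmosphere + optics
image += rng.normal(0.0, 1.0, image.shape)         # background noise

# Detection: convolve with the same Gaussian pattern (matched filter),
# then threshold and group connected pixels into objects
smoothed = gaussian_filter(image, sigma=2.0)
mask = smoothed > 5.0 * smoothed.std()
labels, n_objects = label(mask)
print(n_objects, "objects detected")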
Images creation (Spark)

Declare a schema, used for:
  - serialization of the images
  - data partitioning & indexing

def make_schema():
    schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("run", IntegerType(), True),
        StructField("ra", DoubleType(), True),
        StructField("dec", DoubleType(), True),
        StructField("image", ArrayType(DoubleType()), True)])
    return schema

def fill_image(image):
    # fill the pixel array for one image descriptor
    filled = ...
    return filled

def create_image(spark):
    runs = ...
    rows = 3; cols = 3; region_size = 4000
    images = []; image_id = 0
    # initialize the image descriptors
    for run in range(runs):
        for r in range(rows):
            for c in range(cols):
                ra = ...; dec = ...
                images.append((image_id, run, ra, dec))
                image_id += 1
    rdd = sc.parallelize(images).map(lambda x: fill_image(x))
    df = spark.createDataFrame(rdd, make_schema())
    df.write.format("com.databricks.spark.avro") \
        .mode("overwrite") \
        .save("./images")
Working on images: using RDDs

• Structured data
• Selection via map / filter operations
• The User Defined Functions (UDF) may be written in any language, e.g. in C++ and interfaced using PyBind

def analyze(x):
    return 'analyze image', x[0]

def read_images(spark):
    df = spark.read.format("com.databricks.spark.avro").load("./images")
    rdd = (df.rdd
           .filter(lambda x: x[1] == 3)      # select a data subset (run == 3)
           .map(lambda x: analyze(x)))
    result = rdd.collect()
    print(result)
Working on images: using DataFrames

• Appears like rows/columns
• Image indexing by run/patch/ra/dec/filter…

def analyze(x):
    return 'analyze image', x[0]

def read_images(spark):
    # wrap the function as a UDF with an explicit return <type>
    analyze_udf = functions.udf(lambda m: analyze(m), <type>)
    df = spark.read.load("./images")
    df = (df.filter(df.run == 3)
            .select(df.run, analyze_udf(df.image).alias('image')))
    df.show()
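For illustration, a self-contained variant of the same pattern with the return type filled in (a sketch: the StringType result, the placeholder analysis and the "analysis" column alias are assumptions, and the spark-avro package used elsewhere in these slides is assumed to be available):

from pyspark.sql import SparkSession, functions
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("ReadImages").getOrCreate()

def analyze(image):
    # placeholder analysis: just report the number of pixels
    return "analyzed {} pixels".format(len(image))

analyze_udf = functions.udf(analyze, StringType())

df = spark.read.format("com.databricks.spark.avro").load("./images")
(df.filter(df.run == 3)
   .select(df.run, analyze_udf(df.image).alias("analysis"))
   .show())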
Using MongoDB for the reference catalogue

client = pymongo.MongoClient(MONGO_URL)
lsst = client.lsst
stars = lsst.stars

# Object ingestion
for o_id in objects:
    o = objects[o_id]
    # Conversion to a BSON document
    obj = o.to_db()
    obj['center'] = {'type': 'Point', 'coordinates': [o.ra, o.dec]}
    id = stars.insert_one(obj)

# Add 2D indexing
stars.create_index([('center', '2dsphere')])

# Object finding
center = [cluster.ra(), cluster.dec()]
for o in stars.find({'center': {'$geoWithin': {'$centerSphere': [center, radius]}}},
                    {'_id': 0, 'where': 1, 'center': 1}):
    print('identified object')
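Note that MongoDB expects the $centerSphere radius in radians, so an angular radius given in degrees has to be converted first (a one-line sketch; radius_deg is an assumed input variable):

import math
radius = math.radians(radius_deg)   # $centerSphere radius must be expressed in radians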
The Spark cluster @ LAL

• Operated in the context of VirtualData and the ERM/MRM resource-sharing (mutualisation) project (Université Paris-Sud)
• This project groups several research teams at U-PSud (genomics, bio-informatics, LSST), all studying the Spark technology
• We held a Spark school (in March 2017), with the help of an expert from Databricks
[Cluster diagram] U-PSud: OpenStack, CentOS 7. LSST master node: 18 cores, 32 GB RAM. MongoDB storage: 4 TB. HDFS: 2.5 TB. Worker nodes: 18 cores, 32 GB RAM, 2 TB each. In total: 108 cores, 192 GB RAM, 12 TB. Software stack: Hadoop 2.6.5, Spark 2.1.0, Java 1.8, Python 3.5, Mongo 3.4.
MongoDB

• Several functional characteristics of the QServ system seem obtainable with the MongoDB tool, among which:
  – the ability to distribute both the database and the server through the built-in sharding mechanism
  – indexing on the 2D coordinates of the objects
  – indexing on a splitting of the sky into chunks (so as to drive the sharding)
• Thus, the purpose of the study is to evaluate whether:
  – MongoDB natively offers comparable or equivalent functionality
  – the performances are comparable
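As an illustration of how the chunk-based sky splitting could drive the sharding (a sketch using pymongo admin commands; it assumes a sharded cluster reached through mongos, and the use of chunkId as shard key on an lsst.Object collection is an assumption, not the configuration actually deployed):

import pymongo

client = pymongo.MongoClient(MONGO_URL)

# Enable sharding for the database, then shard the catalogue collection
# on the sky-chunk identifier, so that each shard holds whole sky regions
client.admin.command('enableSharding', 'lsst')
client.admin.command('shardCollection', 'lsst.Object', key={'chunkId': 1})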
MongoDB in the Galactica cluster

• One single server
  – Name: MongoServer_1
  – Flavor: C1.large
  – RAM: 4 GB
  – VCPUs: 8
  – Disk: 40 GB
• The tests are run on a dataset of 1.9 TB:
  – Object (79 226 537 documents)
  – Source (1 426 096 034 documents)
  – ForcedSource (7 151 796 541 documents)
  – ObjectFullOverlap (32 367 384 documents)
• These catalogues are organized by sky region (identified by a chunkId); 324 sky regions are available for each of the 4 catalogue types
Operations

• Ingestion:
  – Translating the SQL schema into a MongoDB schema (i.e. selecting the data types)
  – Ingesting the CSV lines
  – Automatic creation of the indexes from the SQL keys described in the SQL schema
• Testing simple queries (see the pymongo sketch below):

  select count(*) from Object                                          0.002 s
  select count(*) from ForcedSource                                    0.000 s
  SELECT ra, decl FROM Object WHERE deepSourceId = 2322374716295173;   0.014 s
  SELECT ra, decl FROM Object WHERE qserv_areaspec_box(…);             0.343 s
  select count(*) from Object where y_instFlux > 5;                    0.008 s
  select min(ra), max(ra), min(decl), max(decl) from Object;           0.432 s
  select count(*) from Source where flux_sinc between 1 and 2;         0.354 s
  select count(*) from Source where flux_sinc between 2 and 3;         0.076 s

• But… these measurements were done with indexes on the queried quantities
  – We do not want to index all of the ~300 parameters
  – Better to structure the parameter space and index over groups
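For illustration, how a couple of the SQL queries above translate into MongoDB calls (a sketch using the lsst database handle from the earlier pymongo example; count_documents requires pymongo 3.7 or later, and the field names follow the schema above):

# select count(*) from Object where y_instFlux > 5
n = lsst.Object.count_documents({'y_instFlux': {'$gt': 5}})

# select min(ra), max(ra), min(decl), max(decl) from Object
bounds = list(lsst.Object.aggregate([
    {'$group': {'_id': None,
                'ra_min': {'$min': '$ra'},     'ra_max': {'$max': '$ra'},
                'decl_min': {'$min': '$decl'}, 'decl_max': {'$max': '$decl'}}}
]))
print(n, bounds)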
Joins, aggregations

• MongoDB handles complex queries through an aggregation pipeline of map-reduce style operations (based on iterators)
• Example: finding all neighbours with distance < Dmax within a region
  – select a sky region around a reference point
  – build a self-join so as to obtain a list of object pairs
  – compute the distance between the objects of every pair
  – select all computed distances lower than the maximum value
Aggregation

result = lsst.Object.aggregate([
    # Select objects in a region
    {'$geoNear': {
        'near': [ra0, dec0],
        'query': {'loc': {'$geoWithin': {'$box': [bottomleft, topright]}}},
        'distanceField': 'dist'}},
    # Construct all pairs within the region
    {'$lookup': {'from': 'Object',
                 'localField': 'Object.loc',
                 'foreignField': 'Object.loc',
                 'as': 'neighbours'}},
    # Flatten the list
    {'$unwind': '$neighbours'},
    # Remove the duplication (an object paired with itself)
    {'$redact': {'$cond': [{'$eq': ["$_id", "$neighbours._id"]}, "$$PRUNE", "$$KEEP"]}},
    # Compute the distance between pairs
    {'$addFields': {'dist': dist}},
    # Filter
    {'$match': {'dist': {'$lt': 1}}},
    # Final projection
    {'$project': {'_id': 0, 'loc': 1, 'neighbours.loc': 1, 'dist': 1}},
])
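Note: MongoDB requires $geoNear to be the first stage of an aggregation pipeline and to have a geospatial (e.g. 2dsphere) index on the queried location field, which is consistent with the 2D indexing shown earlier.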
Spark / DataFrames

• Context
  – Same dataset, same objective
  – VirtualData cluster @ LAL
  – Ingest the dataset using the CSV connector to DataFrames
  – Operate a SQL-like API to query
  – Use GeoSpark for 2D navigation, filtering and indexing (http://geospark.datasyslab.org/)
• GeoSpark
  – Objects: Point, Rectangle, Polygon, LineString
  – Spatial indexes: R-Tree and Quad-Tree
  – Geometrical operations: Minimum Bounding Rectangle, PolygonUnion, Overlap/Inside (self-join)
  – Spatial query operations: spatial range query, spatial join query and spatial KNN query
  – Reference: Jia Yu, Jinxuan Wu, Mohamed Sarwat, in Proceedings of the IEEE International Conference on Data Engineering (ICDE 2016), Helsinki, Finland, May 2016
CSV ingestion to Spark

# Get the SQL schema & produce the Spark representation of this schema
catalog.read_schema()
set_schema_structures()

spark = SparkSession.builder.appName("StoreCatalog").getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)

# Get the CSV files from HDFS
cat = subprocess.Popen(["hadoop", "fs", "-ls", "/user/christian.arnault/swift"],
                       stdout=subprocess.PIPE, universal_newlines=True)
for line in cat.stdout:
    file_name = line.split('/')[-1].strip()
    # Get the Spark schema for this file
    schema = read_data(file_name)
    # Read the CSV file
    df = sqlContext.read.format('com.databricks.spark.csv') \
        .options(header='true', delimiter=';') \
        .load('swift/' + file_name, schema=schema.structure)
    # Append the data to the dataframe store, partitioned by chunkId
    df.write.format("com.databricks.spark.avro") \
        .mode(write_mode).partitionBy('chunkId').save("./lsstdb")
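Writing with partitionBy('chunkId') lays the Avro files out in one sub-directory per chunkId; later reads that filter on chunkId can then prune partitions and only touch the matching directories, which mirrors the chunk-based sky splitting used on the MongoDB side.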
Read the dataframe and query (Scala)

val conf = new SparkConf().setAppName("DF")
val sc = new SparkContext(conf)
val spark = SparkSession
  .builder()
  .appName("Read Dataset")
  .getOrCreate()

val sqlContext = new SQLContext(sc)

// Read the dataframe from HDFS using the Avro serializer
var df = time("Load db", sqlContext.read.format("com.databricks.spark.avro").load("./lsstdb"))

// Perform queries
val sorted = time("sort", df.select("ra", "decl", "chunkId").sort("ra"))
val seq = time("collect", sorted.rdd.take(10))
println(seq)
Conclusion

• Spark is a rich and promising eco-system
• But it requires understanding the configuration:
  – memory (RAM)
  – data partitioning (throughput)
  – building the pipeline (as a DAG of processes)
  – understanding the monitoring tools (e.g. Ganglia)
• MongoDB:
  – powerful, but based on a very different paradigm than SQL (map-reduce based)
  – I observed strange performance results that still need to be understood
• Spark for catalogues:
  – Migrating to Spark DataFrames looks really encouraging and should not show the same limitations
    • preliminary results are at least better than MongoDB (especially at the ingestion step)
    • GeoSpark is powerful and meant to support very large datasets