Using Apache Drill and Unidata TDS for NASA

Using Apache Drill and Unidata TDS* for NASA HDF-EOS on S3
ESIP 2020 Summer / HDF-EOS Workshop XXIII
H. Joe Lee
EED-2 / The HDF Group / Software Engineer
hyoklee@hdfgroup.org
*THREDDS Data Server
This work was supported by NASA/GSFC under Raytheon Technologies contract number NNG15HZ39C. This document does not contain technology or Technical Data controlled under either the U.S. International Traffic in Arms Regulations or the U.S. Export Administration Regulations.
SESIP-0720-JL

Hierarchical Data Format-Earth Observing System
• HDF4
  – HDF-EOS2
• HDF5
  – HDF-EOS5
  – netCDF-4

HDF-EOS on S3
• HDF4?
• No elegant solution other than GDAL*
• Not so elegant: h4mapwriter / s3fs
• HDF5?
• Many OK solutions exist
• HDF5 VFD** / HSDS*** / GDAL / Hyrax DMR++**** / etc.
• But “Just OK is not OK.”
*Geospatial Data Abstraction Library **Virtual File Driver ***Highly Scalable Data Service ****Dataset Metadata Response

Apache Drill
• Supports a variety of storage: Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS, and local files.
• Data agility: query the raw data in situ.
• Table: in-memory shredded columnar representation for complex data
• BI tools and REST API

Apache Drill 1.18 (beta)
• Collection of HDF5 files on S3
• ANSI SQL
• Geoprocessing?
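
To make the "ANSI SQL over HDF5 files on S3" idea concrete, here is a minimal sketch of how a query could be packaged for Drill's REST API. It only builds the request and does not send it. The Drillbit address, the storage plugin name `s3`, and the object path are assumptions for illustration, not values from the slides.

```python
import json
import urllib.request

# Assumption: a Drillbit running locally on Drill's default web port.
DRILL_URL = "http://localhost:8047/query.json"


def build_drill_request(sql, url=DRILL_URL):
    """Package a SQL string as a POST request for Drill's REST query endpoint."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical storage plugin name (`s3`) and object path.
sql = "SELECT * FROM s3.`example/granule.h5` LIMIT 10"
request = build_drill_request(sql)

# Submitting it would look like this (requires a reachable Drillbit):
# with urllib.request.urlopen(request) as response:
#     rows = json.load(response)["rows"]
```

Building the request separately from sending it keeps the sketch testable without a running cluster.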

THREDDS Data Server 5.0 (beta)
It supports S3!
• both HDF4 and HDF5
• NcML?
• Catalog for collection of files?

netCDF-Java
• This is the core library.
• THREDDS / Panoply / IDV share it.
• toolsUI is a generic GUI tool based on netCDF-Java.
• Like GDAL, if netCDF-Java works with S3, the rest is trivial.

toolsUI - HDF4 on S3

Benchmark: TerraFusion on S3
• Test file size: 24 GB
• Format: HDF5/netCDF-4 CF
• One orbit of data from 5 sensors on Terra
• S3 access from EC2 (m4.xlarge)

Apache Drill fails after 7 minutes.
read on s3a://basicterrafusion/TERRA_BF_L1B_O53557_20100112014327_F000_V001.h5: com.amazonaws.AbortedException:
org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:657)
org.apache.drill.exec.store.hdf5.HDF5BatchReader.convertInputStreamToFile(HDF5BatchReader.java:356)

TDS responds within 2 minutes.
Float32 /MOPITT/granule_20100112/Geolocation/Latitude[ntrack_1 = 46][nstare = 29][npixels = 4];
Float32 /MOPITT/granule_20100112/Geolocation/Longitude[ntrack_1 = 36][nstare = 29][npixels = 4];
Float64 /MOPITT/granule_20100112/Geolocation/Time[ntrack_1 = 436];
} s3test/TERRA_BF_L1B_O53557_20100112014327_F000_V001.h5;
real 1m47.065s

h5ls responds in 2.5 minutes.
• HDF5 Virtual File Driver (VFD)
• --enable-ros3-vfd configuration option
It takes 2X longer (5 minutes) outside AWS.
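
The timings quoted on the benchmark slides are in mixed units (a shell `real` value for TDS, minutes for h5ls). The sketch below converts them to seconds and expresses each as a multiple of the TDS time. The slides do not state that the operations timed are strictly equivalent, so the ratios are only indicative. The function name and the labels are illustrative.

```python
import re


def parse_real_time(text):
    """Convert a shell `time` value such as '1m47.065s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Figures as reported on the slides, converted to seconds.
timings = {
    "TDS": parse_real_time("1m47.065s"),
    "h5ls via HDF5 VFD, inside AWS": 2.5 * 60,
    "h5ls via HDF5 VFD, outside AWS": 5 * 60,
}

baseline = timings["TDS"]
ratios = {name: seconds / baseline for name, seconds in timings.items()}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.1f} s ({ratios[name]:.2f}x TDS)")
```

Drill is left out because the slides report a failure after 7 minutes rather than a completed read.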

Role-based Access Control (RBAC)
Drill    THREDDS    H5 VFD
Always   Yes        No
• RBAC eliminates access key and token.
• Access with s3://bucket/key.h5 (no https://)
• S3 buckets and objects can be private.
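
As a small illustration of the `s3://bucket/key.h5` addressing form mentioned above, this sketch splits such a URI into a bucket name and an object key using only the standard library. The function name is hypothetical, and the sketch performs no S3 access or credential handling.

```python
from urllib.parse import urlsplit


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parts = urlsplit(uri)
    key = parts.path.lstrip("/")
    if parts.scheme != "s3" or not parts.netloc or not key:
        raise ValueError(f"not an s3://bucket/key URI: {uri!r}")
    return parts.netloc, key


bucket, key = split_s3_uri("s3://bucket/key.h5")
print(bucket, key)
```

An `https://` URL is rejected on purpose, mirroring the "no https://" note on the slide.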

THREDDS 5.0 is a Clear Winner Based on our Benchmark Results.
• Performance is good.
• It supports HDF4.
• RBAC is supported.
• Existing netCDF-Java / OPeNDAP based software works seamlessly.

However, Use Case Still Matters
There are many (read-only) solutions for HDF-EOS on S3:
• SQL user? Try Drill after sanitization.
• Good for a collection of HDF5 files with 2D Grid.
• Use AWS Lambda (w/ CUMULUS) for sanitization.
• Java user? Try netCDF-Java.
• Python user? Try the GDAL /vsis3/ driver for HDF5 and /vsicurl/ for HDF4.
• OPeNDAP user? Try THREDDS 5.0 beta.
• HDF5 C/Fortran user? Try HDF5 VFD.
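
For the Python/GDAL route, the file name passed to GDAL carries a virtual file system prefix. The sketch below only builds such path strings. The bucket, key, and URL are hypothetical, and actually opening them assumes GDAL Python bindings built with the relevant HDF4/HDF5 drivers and suitable AWS credentials.

```python
def vsis3_path(bucket, key):
    """Build a GDAL /vsis3/ path for an object in an S3 bucket."""
    return f"/vsis3/{bucket}/{key.lstrip('/')}"


def vsicurl_path(url):
    """Build a GDAL /vsicurl/ path for a file served over HTTP(S)."""
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"expected an http(s) URL, got {url!r}")
    return f"/vsicurl/{url}"


# Hypothetical object locations, for illustration only.
hdf5_path = vsis3_path("example-bucket", "granule.h5")
hdf4_path = vsicurl_path("https://example-bucket.s3.amazonaws.com/granule.hdf")
print(hdf5_path)
print(hdf4_path)

# With GDAL installed, the paths would be passed to gdal.Open:
# from osgeo import gdal
# dataset = gdal.Open(hdf5_path)
```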

This work was supported by NASA/GSFC under Raytheon Technologies contract number NNG15HZ39C.