Comparing NetCDF and a multidimensional array database

Comparing NetCDF and a multidimensional array database on managing and querying large hydrologic datasets: a case study of SciDB
P5
Haicheng Liu
16-10-2021
Delft University of Technology
Challenge the future

Outline
§ Background
§ Query design
§ Selection of multidimensional (MD) array database
§ Test environment setup
§ Benchmark test and analysis
§ Conclusions
Geomatics for the Built Environment

Background

NetCDF
§ A concept that can refer to a data model, a format or an API
§ Data model
  § Dimension: physical dimension or index, such as time step
  § Variable: core data stored, e.g. precipitation
  § Attribute: metadata of variables or of the file
§ Format
  § Classic and 64-bit offset formats, consisting of a header and a data array stored contiguously
  § NetCDF-4 and NetCDF-4 classic model formats, with support for dynamic schema and chunked storage
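The data model on this slide can be pictured with a few plain Python classes. This is only an illustrative sketch: the class and method names (`Dataset`, `Variable`, `add_dimension`, `add_variable`, `shape`) are hypothetical and are not the NetCDF API.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    """Core data stored in the file, e.g. precipitation."""
    name: str
    dimensions: tuple  # dimension names, slowest-varying first
    attributes: dict = field(default_factory=dict)  # variable-level metadata


@dataclass
class Dataset:
    """Toy container for dimensions, variables and file-level attributes."""
    dimensions: dict = field(default_factory=dict)  # name -> size
    variables: dict = field(default_factory=dict)   # name -> Variable
    attributes: dict = field(default_factory=dict)  # file-level metadata

    def add_dimension(self, name, size):
        self.dimensions[name] = size

    def add_variable(self, name, dimensions, **attributes):
        # A variable may only use dimensions that are already defined.
        missing = [d for d in dimensions if d not in self.dimensions]
        if missing:
            raise KeyError(f"undefined dimensions: {missing}")
        self.variables[name] = Variable(name, tuple(dimensions), attributes)

    def shape(self, variable_name):
        variable = self.variables[variable_name]
        return tuple(self.dimensions[d] for d in variable.dimensions)


ds = Dataset(attributes={"title": "toy example"})
ds.add_dimension("time", 4)
ds.add_dimension("y", 3)
ds.add_dimension("x", 2)
ds.add_variable("precipitation", ("time", "y", "x"), units="mm/h")
print(ds.shape("precipitation"))  # (4, 3, 2)
```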

Problem for query
§ Contiguous storage structure adopted by the classic and 64-bit offset formats
[Figure: the cell values of Grid 1, Grid 2, Grid 3, … stored one after another in a one-dimensional array]
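Why contiguous storage hurts some queries can be shown with a little index arithmetic. The sketch below assumes a row-major layout ordered as [time][y][x], with time varying slowest; the function names and the toy grid size are hypothetical.

```python
def linear_offset(t, y, x, ny, nx):
    """Offset (in cells) of cell (t, y, x) in a row-major [time][y][x] array."""
    return (t * ny + y) * nx + x


def time_series_offsets(y, x, nt, ny, nx):
    """Offsets that must be read to extract the time series of one cell."""
    return [linear_offset(t, y, x, ny, nx) for t in range(nt)]


# Toy grid: 4 time steps, 3 rows, 5 columns.
offsets = time_series_offsets(y=1, x=2, nt=4, ny=3, nx=5)
print(offsets)  # [7, 22, 37, 52]
```

Consecutive values of the time series are a whole grid (ny * nx cells) apart, so every time step costs a separate seek, while a sub grid within one time step stays close together on disk.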

MD array database
§ A database whose abstract model for data management and query is the multidimensional array, consisting of dimensions and attributes
§ Many solutions
  § Open source: Rasdaman, SciDB, MonetDB, etc.
  § Commercial: Essbase, Caché, Oracle Spatial, etc.
§ Most use a chunked storage structure

Possible solution
§ Chunked storage structure of the NetCDF-4 format and of multidimensional (MD) array databases
[Figure: an index referencing MD chunks]
§ An MD array database also has a smarter caching strategy
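The benefit of chunked storage depends on how many chunks a query touches and how large they are. The following is a minimal sketch under stated assumptions: chunks are regular and axis-aligned, a chunk is always read in full, and the array shape, the query cell and the function names are made up for illustration.

```python
from math import prod


def chunks_touched(query_start, query_stop, chunk_shape):
    """Number of chunks intersecting the half-open box [start, stop)."""
    counts = []
    for start, stop, size in zip(query_start, query_stop, chunk_shape):
        first = start // size
        last = (stop - 1) // size
        counts.append(last - first + 1)
    return prod(counts)


def cells_read(query_start, query_stop, chunk_shape):
    """Cells loaded when every touched chunk is read in full."""
    return chunks_touched(query_start, query_stop, chunk_shape) * prod(chunk_shape)


# Time series of one cell over 4 time steps, dimensions ordered (time, y, x).
start, stop = (0, 1500, 2500), (4, 1501, 2501)
print(cells_read(start, stop, (1, 4000, 4000)))  # 64000000
print(cells_read(start, stop, (4, 100, 100)))    # 40000
```

For this query, small chunks that are deep in time read far fewer cells than whole-grid chunks; a sub grid query for a single time step would favour the opposite shape.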

Research question
§ Can an MD array database process frequently used queries faster than NetCDF solutions for large hydrological datasets?

Roadmap
§ Query design (dataset selection)
§ Selection of MD array database
§ Test environment setup (HydroNET-4)
  § NetCDF connector
  § MD array database connector
§ Benchmark
  § 64-bit offset storage
  § MD array database (normal chunk)
  § NetCDF-4 (compressed chunk)
  § MD array database (compressed chunk)

Query design

Query and dataset collection
§ 6 experts interviewed in total
§ 19 conceptual queries categorized into 5 classes
  § Selection based on dimension value
  § Selection based on variable value
  § Masking query, e.g. data quality check
  § Statistical operation, e.g. Sum, Avg and Max
  § Spatial operation, e.g. intersection
§ Datasets include 1D time series records, 2D satellite images, 5D forecast datasets, etc.

Datasets
§ MPE (Multi-Sensor Precipitation Estimate): rainfall rate from satellite data product
  § Information stored: rainfall rate; availability; quality
  § Dimension count: 3
  § Dimensions: x, y, time
  § Span (single file): (4000, 4)
  § Temporal resolution: 15 minutes
  § Spatial resolution and coverage: 0.03 degree (3.3 km), 1/3 world
  § Single file size: 250 MB
  § Data format: 64-bit offset
§ GEFS (Global Ensemble Forecast System): weather forecast data
  § Information stored: temperature 2 m above ground; maximum temperature 2 m above ground; minimum temperature 2 m above ground; relative humidity 2 m above ground; total precipitation; total cloud cover; U-component of wind 10 m above ground; V-component of wind 10 m above ground; data status
  § Dimension count: 5
  § Dimensions: longitude, latitude, forecast, ensemble, model run
  § Span (single file): (360, 181, 40, 20, 1)
  § Temporal resolution: 6 hours
  § Spatial resolution and coverage: 1 degree (111 km), global
  § Single file size: 1.55 GB
  § Data format: 64-bit offset

MPE & GEFS
[Figure: the 3D MPE array (longitude, latitude, time) and the 5D GEFS array (longitude, latitude, forecast, ensemble, model run)]

Queries designed
§ MPE dataset
  § Sub grid selection (Delft and the northern part of the Netherlands)
  § Time series extraction (a spot location in the Indian Ocean)
  § Pyramid query (the Netherlands)
  § Average calculation (the Netherlands)
  § Maximum calculation (the Netherlands)
§ GEFS dataset
  § Time series extraction (Delft, one cell in GEFS)
  § Percentile calculation (Delft, one cell)
  § Ensemble mean calculation (the Netherlands and Europe)
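What several of these queries compute can be sketched on a toy array held in nested Python lists. This is only an illustration of the semantics, not of how NetCDF or SciDB execute them; the data, the function names and the nearest-rank percentile definition are assumptions.

```python
from math import ceil
from statistics import mean

# Toy MPE-like cube indexed as cube[time][y][x].
cube = [
    [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]],
    [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]],
]


def sub_grid(cube, t, y0, y1, x0, x1):
    """Sub grid selection: a spatial window at one time step."""
    return [row[x0:x1] for row in cube[t][y0:y1]]


def time_series(cube, y, x):
    """Time series extraction: one cell across all time steps."""
    return [step[y][x] for step in cube]


def region_values(cube, y0, y1, x0, x1):
    """All values inside a spatial window, over every time step."""
    return [v for step in cube for row in step[y0:y1] for v in row[x0:x1]]


def ensemble_mean(members):
    """Mean across ensemble members for each forecast step."""
    return [mean(values) for values in zip(*members)]


def percentile(values, p):
    """Nearest-rank percentile (one of several possible definitions)."""
    ordered = sorted(values)
    rank = max(1, ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


print(sub_grid(cube, 0, 0, 2, 1, 3))         # [[1.0, 2.0], [4.0, 5.0]]
print(time_series(cube, 1, 2))               # [5.0, 2.0]
print(mean(region_values(cube, 0, 2, 0, 3)))  # 2.0
print(max(region_values(cube, 0, 2, 0, 3)))   # 5.0
```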

Selection of MD array database

MD array database selection
§ The comparison focuses on Rasdaman and SciDB
§ 9 criteria in total; different approaches are used to assess each criterion, e.g.
  § Implementation of MD data storage structure: paper study, official documentation, forums, source code and discussion with developers
§ No practical tests are performed

MD array database selection

Criterion                                    | Rasdaman | SciDB
---------------------------------------------|----------|------
License (i.e. commercial or open-source)     | 1        | 1
Implementation of MD data storage structure  | 1        | 1
Lossless compression support                 | 0        | 1
Parallelization                              | 1        | 1
.NET API                                     | 0.5      | 0
Query language                               | 1        | 1
Spatial calculating capability               | 0        | 0
NetCDF importer                              | 1        | 0.5
Maintenance                                  | 0.5      | 1
Overall grade                                | 6        | 6.5

The final grade shows that SciDB scores higher

Test environment setup

Benchmark architecture

Benchmark test and analysis

64-bit offset NetCDF files
§ MPE dataset
  § One file contains 4 time steps, 250 MB
  § A folder contains 1722 files
§ GEFS dataset
  § Only one file stored, containing 1 model run, 20 ensembles, 40 forecast steps, 181 latitudes and 360 longitudes, 1.55 GB

NetCDF-4 files
§ MPE dataset

Data store name             | Chunk size (X x Y x Time) | Single file size
----------------------------|---------------------------|-----------------
NetCDF4_C2                  | 4000 x 1                  | 250 MB
NetCDF4_C2_C (compression)  | 4000 x 1                  | 3 MB

  § One file contains 4 time steps
  § Two folders created for the two data stores, each with 720 files
§ GEFS dataset (1 file for one data store)

Data store name                  | Chunk size (X x Y x Forecast x Ensemble x Modelrun) | Single file size
---------------------------------|-----------------------------------------------------|-----------------
NetCDF4_GEFS_S3                  | 360 x 181 x 1 x 20 x 1                              | 1.55 GB
NetCDF4_GEFS_S3_C (compression)  | 360 x 181 x 1 x 20 x 1                              | 654 MB
NetCDF4_GEFS_S5                  | 360 x 181 x 1 x 1 x 1                               | 1.55 GB
NetCDF4_GEFS_S5_C (compression)  | 360 x 181 x 1 x 1 x 1                               | 561 MB

SciDB arrays
§ MPE dataset

Array level | MPE data stored                         | Time step count | SciDB array size | Original size of files in 64-bit offset format
------------|-----------------------------------------|-----------------|------------------|-----------------------------------------------
Tiny        | First 2 hours of 1st September, 2013    | 8               | 37 MB            | 488 MB
Small       | First 6 hours of 1st September, 2013    | 24              | 112 MB           | 1.3 GB
Medium      | 1st September, 2013                     | 96              | 448 MB           | 5.7 GB
Large       | 7 days from 1st to 7th September, 2013  | 672             | 3 GB             | 40 GB
Very large  | 30 days of September, 2013              | 2880            | 13 GB            | 171.6 GB

  § Diverse chunk sizes and compression settings
§ GEFS dataset
  § 4 data schemas for storage, which differ in the order of dimensions

6 chunk sizes for MPE arrays
§ C1: 4 x 4000
§ C2: 1 x 4000
§ C3: 4 x 800
§ C4: 1 x 800
§ C5: 4 x 100
§ C6: 1 x 100
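How quickly the number of chunks grows as chunks shrink can be estimated with a short calculation. This sketch rests on two assumptions that the slides do not state explicitly: a chunk size "T x S" is read as T time steps by S x S cells, and the MPE grid is taken to be 4000 x 4000 cells. The constant and function names are hypothetical.

```python
from math import ceil

GRID_SIDE = 4000  # assumed number of cells along x and along y

# Scheme name -> (time steps per chunk, cells per chunk side), as read above.
SCHEMES = {
    "C1": (4, 4000), "C2": (1, 4000),
    "C3": (4, 800), "C4": (1, 800),
    "C5": (4, 100), "C6": (1, 100),
}


def chunk_count(time_steps, scheme, grid_side=GRID_SIDE):
    """Number of chunks needed to store `time_steps` grids under a scheme."""
    t_chunk, side_chunk = SCHEMES[scheme]
    per_axis = ceil(grid_side / side_chunk)
    return ceil(time_steps / t_chunk) * per_axis * per_axis


# "Very large" array level: 2880 time steps.
for name in SCHEMES:
    print(name, chunk_count(2880, name))
```

Under these assumptions C1 needs 720 chunks for the very large array while C6 needs 4,608,000, which gives a feel for how much per-chunk metadata a small chunk size implies.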

GEFS: effect of dimension order
[Figure: layouts of the GEFS array under different dimension orders; axes X, Y, F (forecast), E (ensemble) and M (model run)]
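The effect of dimension order can be illustrated by counting how many contiguous runs of cells a query must read. The sketch models a plain, unchunked row-major layout in which the first letter of the order varies slowest; this is an assumption for illustration and not SciDB's actual on-disk format. The query cell and function names are hypothetical.

```python
from itertools import product

# GEFS dimension sizes: model run, ensemble, forecast, latitude, longitude.
SIZES = {"M": 1, "E": 20, "F": 40, "Y": 181, "X": 360}


def strides(order):
    """Row-major strides in cells; the first letter varies slowest."""
    result = {}
    step = 1
    for dim in reversed(order):
        result[dim] = step
        step *= SIZES[dim]
    return result


def contiguous_runs(order, fixed):
    """Contiguous runs needed to read every cell matching `fixed`."""
    s = strides(order)
    free = [d for d in order if d not in fixed]
    base = sum(index * s[d] for d, index in fixed.items())
    offsets = sorted(
        base + sum(i * s[d] for i, d in zip(idx, free))
        for idx in product(*(range(SIZES[d]) for d in free))
    )
    runs = 1
    for previous, current in zip(offsets, offsets[1:]):
        if current != previous + 1:
            runs += 1
    return runs


# Forecast time series of every ensemble member at one (hypothetical) cell.
cell = {"X": 4, "Y": 52}
for order in ("MEFYX", "MFYXE", "XYFEM"):
    print(order, contiguous_runs(order, cell))
```

In this model the same 800 values are scattered over 800 separate locations with order MEFYX, over 40 runs with MFYXE, and sit in a single run with XYFEM.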

Benchmark test
§ Two systems (NetCDF and SciDB) are benchmarked
§ Each specific query is run 20 times and the average of the middle 12 records is used as the query response time
§ For SciDB, network delay and query parsing add an extra cost of between 0.05 s and 0.2 s
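The timing rule above can be sketched as follows. It assumes that "the middle 12 records" means the 12 central values after sorting the 20 timings, so that the 4 fastest and 4 slowest runs are discarded; the function names are hypothetical.

```python
import time


def middle_average(timings, keep=12):
    """Average of the `keep` central values after sorting the timings."""
    if len(timings) < keep:
        raise ValueError("not enough timings")
    ordered = sorted(timings)
    drop = (len(ordered) - keep) // 2
    middle = ordered[drop:drop + keep]
    return sum(middle) / len(middle)


def benchmark(query, runs=20, keep=12):
    """Run a zero-argument callable `runs` times and report the middle average in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        query()
        timings.append(time.perf_counter() - start)
    return middle_average(timings, keep)


print(middle_average(list(range(1, 21))))  # 10.5
```

Dropping both tails keeps a single cold-cache run or a lucky cached run from dominating the reported response time.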

MPE sub grid selection
[Figure: a sub grid selected from the MPE array; axes X, Y and Time]

MPE sub grid selection

Scheme    | Chunk size
----------|-----------
C1, C1_C  | 4 x 4000
C2, C2_C  | 1 x 4000
C3, C3_C  | 4 x 800
C4, C4_C  | 1 x 800
C5, C5_C  | 4 x 100
C6, C6_C  | 1 x 100

Selecting the grid covering the northern part of the Netherlands

GEFS forecast time series extraction
[Figure: forecast time series extraction within 1 model run; axes X, Y, Forecast and Ensemble]

GEFS forecast time series extraction

Scheme    | Dimensions order | Chunk size
----------|------------------|--------------------
S1, S1_C  | MEFYX            | 1 x 20 x 181 x 360
S2, S2_C  | MFYXE            | 1 x 181 x 360 x 20
S3, S3_C  | XYFEM            | 360 x 181 x 20 x 1
S5, S5_C  | XYFEM            | 360 x 181 x 1 x 1

[Figure: bar chart of average query response time (s) per data store (64-bit offset, NetCDF-4 and SciDB GEFS stores, with and without compression). Bar labels, in extraction order: 2.702, 2.281, 1.430, 1.142, 1.061, 0.910, 0.109, 48.031, 23.112]
Extracting the precipitation forecast time series from Delft, a spot location

Overall evaluation

Data solution                                         | 64-bit offset | NetCDF-4 | NetCDF-4, DEFLATE compression | SciDB array | SciDB array, DEFLATE compression
------------------------------------------------------|---------------|----------|-------------------------------|-------------|---------------------------------
Data loading                                          | 5             | 4        | 3                             | 1           | 1
Storage                                               | 1             | 1        | 5                             | 3           | 4
Scheme transformation                                 | 1             | 1        | 1                             | 4           | 4
Management overall score                              | 7             | 6        | 9                             | 8           | 9
MPE sub grid selection                                | 4             | 5        | 1                             | 3           | 2
MPE time series extraction                            | 2             | 5        | 1                             | 4           | 3
MPE average calculation                               | 4             | 5        | 1                             | 3           | 3
MPE maximum calculation                               | 4             | 5        | 1                             | 3           | 3
GEFS forecast time series extraction                  | 5             | 4        | 2                             | 3           | 1
GEFS percentile calculation                           | 5             | 4        | 3                             | 2           | 1
GEFS ensemble mean calculation                        | 5             | 4        | 3                             | 2           | 1
Query overall score                                   | 29            | 32       | 12                            | 20          | 14
Compound score (management * 0.33 + query * 0.14)     | 6.48          | 6.57     | 4.71                          | 5.52        | 5.00

NetCDF-4 ranks first, followed by 64-bit offset; the SciDB solutions come after
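The compound scores can be recomputed from the two overall scores. The weights 0.33 and 0.14 appear to be 1/3 and 1/7 rounded, i.e. the mean score over the three management criteria plus the mean score over the seven queries; with that reading the listed values are reproduced exactly, whereas the rounded weights give slightly different numbers. The dictionary keys and function name below are hypothetical labels.

```python
# Overall scores taken from the evaluation slide.
management = {
    "64-bit offset": 7, "NetCDF-4": 6, "NetCDF-4 (DEFLATE)": 9,
    "SciDB": 8, "SciDB (DEFLATE)": 9,
}
query = {
    "64-bit offset": 29, "NetCDF-4": 32, "NetCDF-4 (DEFLATE)": 12,
    "SciDB": 20, "SciDB (DEFLATE)": 14,
}

N_MANAGEMENT = 3  # data loading, storage, scheme transformation
N_QUERY = 7       # four MPE queries and three GEFS queries


def compound(name):
    """Mean management score plus mean query score (assumed reading of the weights)."""
    return management[name] / N_MANAGEMENT + query[name] / N_QUERY


ranking = sorted(management, key=compound, reverse=True)
for name in ranking:
    print(f"{name}: {compound(name):.2f}")
```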

Conclusions and future work

Summary
§ Within the scope of this research, NetCDF-4 without compression is the best solution for managing and querying large hydrologic datasets
§ For SciDB, a small chunk size is preferable, but the resulting large volume of in-memory chunk metadata (i.e. <InstanceID, ArrayID, ChunkID, VersionID>) is a problem
§ DEFLATE compression of SciDB arrays has either a negative effect or no effect on query performance

Summary
§ A correlation between SciDB DEFLATE compression and chunk size is observed in time series extraction
§ With hypercubic and modest chunk sizes, the internal data structure of chunks in SciDB has an insignificant influence on query performance
§ Masking queries, e.g. data quality checks, as well as spatial operations should be included in a comprehensive benchmark

Future work
§ A generic chunk model to determine the best chunk size for querying
§ A more realistic benchmark test, e.g. analyzing the Hydrologic Research query log and simulating scenarios
§ Tests with less memory capacity, with a focus on NetCDF
§ Parallel query processing and parallel loading for SciDB

Reflection
§ Knowledge gained from the Geo-database and Geoweb courses is used, e.g. blocks to store images, HTTP communication
§ The research makes use of geomatics techniques to solve water problems
§ The research fulfills organizations' needs (Hydrologic, Deltares, etc.) and contributes water services to the public

Questions?