CSIRO Jupyter notebook experience CEOS WGISS51 2021 04
CSIRO Jupyter notebook experience
CEOS WGISS-51, 2021-04-21
Matt Paget, Rob Woodcock, Alex Hunt
Australia’s National Science Agency
Key points
Jupyter Notebooks and scalable data analytics
1. Details
• Cloud-native platforms and EO archives
• Advanced Python tools designed for cloud solutions (Dask, Holoviz)
2. Challenges
• Working with cutting-edge software tools
• Learning new patterns for lazy data analytics
3. Emerging outcomes
• Data analytics patterns for ODC, Xarray and Dask, and interactive visualisation with Holoviz
• Positive experiences for scientists using cloud data and platforms
WGISS-51 Tech Expo CSIRO Jupyter notebook experience
Demonstration
ODC, Xarray and Dask time series processing on 20 GB of Landsat 5, 7 and 8 data. Uses ODC GridWorkflow and dask futures to partition the dask tasks.
Video: https://ceos.org/meetings/wgiss-51
Notebook: https://dev.azure.com/csiro-easi/easi-hub-public/_git/WGISS-51
Analysis: Monthly NDMI time series fitted with a cubic least-squares polynomial per pixel.
Combined array size: (time: 800, y: 1976, x: 2420)
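The per-pixel analysis described in the demonstration can be sketched in plain NumPy. This is a minimal illustration, not the notebook's code: the notebook presumably works on lazy Xarray/Dask arrays, whereas this version uses a small in-memory synthetic array. The function names, the synthetic reflectance values and the array size are assumptions made for the example. NDMI is computed as (NIR - SWIR1) / (NIR + SWIR1), and `np.polyfit` fits one cubic per pixel by treating each pixel as a column.

```python
import numpy as np


def ndmi(nir, swir1):
    """Normalised Difference Moisture Index: (NIR - SWIR1) / (NIR + SWIR1)."""
    nir = np.asarray(nir, dtype="float64")
    swir1 = np.asarray(swir1, dtype="float64")
    return (nir - swir1) / (nir + swir1)


def fit_cubic_per_pixel(cube, t):
    """Fit a cubic least-squares polynomial along time for every pixel.

    cube: array shaped (time, y, x); t: 1-D array of time coordinates.
    Returns coefficients shaped (4, y, x), highest power first.
    """
    n_time, ny, nx = cube.shape
    flat = cube.reshape(n_time, ny * nx)   # one column per pixel
    coeffs = np.polyfit(t, flat, deg=3)    # fits every column in one call
    return coeffs.reshape(4, ny, nx)


# Synthetic stand-in for a small monthly stack (time: 24, y: 3, x: 4).
rng = np.random.default_rng(0)
t = np.arange(24, dtype="float64")
nir = rng.uniform(0.2, 0.6, size=(24, 3, 4))
swir1 = rng.uniform(0.1, 0.4, size=(24, 3, 4))

index = ndmi(nir, swir1)
coeffs = fit_cubic_per_pixel(index, t)
print(coeffs.shape)  # (4, 3, 4)
```

The fit needs the whole time axis of a pixel at once, which is why the chunking of the time dimension matters when the same calculation is run lazily at full scale.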
Cloud-native platforms and EO
Cloud-native platforms
• Key features: scalable, fault tolerant
• Infrastructure as code (Terraform, Kubernetes)
• Example: CEOS Earth Analytics Interoperability Lab
• Includes Open Data Cube, Python data analytics libraries (Pangeo stack, machine learning, visualisation)
EO archives
• Directly index and read AWS(/cloud) Landsat C2 and Sentinel-2
• Thank you to USGS, Copernicus, Element-84, GA and others for making these archives available!
Advanced Python tools
Key libraries relevant to this talk
• Xarray – Named multidimensional arrays with coordinates. http://xarray.pydata.org
• Dask – Distributes processing tasks to worker nodes. https://dask.org
• Holoviz – Interactive visualisation supporting dask and xarray (also Pandas, Matplotlib and Bokeh). https://holoviz.org
Open-source libraries are in active development
- New features and improvements added regularly
- Interfaces and functions may change with new major versions
Data analytics patterns
• Dask scheduler determines and manages the tasks across all workers
• Dask scheduler is limited by RAM on the node it runs on
• ODC GridWorkflow can partition an AOI, reduce dask tasks per partition, combine the results
• A workflow orchestration tool (Argo, Airflow) can provide resilience and auto-recovery

Pattern | Limits | Use types
<= 10 GB data | Local RAM | Small AOI, trial dataset
10s-100 GB data; Dask scheduler | < 500,000 tasks | Medium AOI (1000s x 1000s pixels x 10s time)
> 100s GB data; partitioned dask scheduler | Susceptible to partition or scheduler failure | Large AOI or intensive calculations
Workflow with auto-recovery | Cloud resources. Error tracking and handling | Continental-scale or operational processing
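The partition, process and combine pattern can be sketched with the standard library alone. This is an illustration under stated assumptions: `concurrent.futures` stands in for dask futures (which offer a similar submit/result pattern), the hypothetical `tile_slices` helper stands in for the partitioning that ODC GridWorkflow provides, and a temporal mean stands in for the real per-partition analysis.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def tile_slices(ny, nx, tile):
    """Yield (y_slice, x_slice) pairs that together cover an (ny, nx) area."""
    for y0 in range(0, ny, tile):
        for x0 in range(0, nx, tile):
            yield slice(y0, min(y0 + tile, ny)), slice(x0, min(x0 + tile, nx))


def process_tile(cube, ys, xs):
    """Per-tile work; a temporal mean stands in for the real analysis."""
    return ys, xs, cube[:, ys, xs].mean(axis=0)


def process_partitioned(cube, tile, max_workers=4):
    """Submit one future per tile, then mosaic the per-tile results."""
    _, ny, nx = cube.shape
    out = np.empty((ny, nx), dtype="float64")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(process_tile, cube, ys, xs)
            for ys, xs in tile_slices(ny, nx, tile)
        ]
        for future in futures:
            ys, xs, result = future.result()
            out[ys, xs] = result  # combine: write each tile into place
    return out


# Small synthetic cube shaped (time, y, x); sizes are arbitrary.
cube = np.arange(5 * 7 * 9, dtype="float64").reshape(5, 7, 9)
mosaic = process_partitioned(cube, tile=4)
print(mosaic.shape)
```

Each tile is independent, so a failed tile could be resubmitted without repeating the others. That independence is also what lets a workflow orchestration tool add auto-recovery on top.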
CSIRO EASI user experiences
Open Data Cube, Xarray and Dask allow datasets to be loaded and processed “lazily”, on demand. Analytics and algorithms need to be considered in terms of:
• Dask “chunks” – where the work occurs
• Dask scheduler – how much work overall
• Data interdependence – limit data moving between workers
“The task was completed in the time it took for me to have lunch!”
“Understanding dask (online course) led me to reimagine my processing and will save me weeks of future work.”
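A rough way to reason about chunks and scheduler load before running anything is to count chunks and estimate their size. The sketch below uses the array dimensions from the demonstration. The two chunking schemes and the 4 bytes per value are assumptions chosen for illustration, and the chunk count is only a simplified proxy for the number of dask tasks, since a real task graph contains more tasks than chunks.

```python
import math


def chunk_grid(shape, chunks):
    """Number of chunks along each dimension."""
    return {dim: math.ceil(shape[dim] / chunks[dim]) for dim in shape}


def total_chunks(shape, chunks):
    """Total number of chunks, a rough proxy for tasks per operation."""
    return math.prod(chunk_grid(shape, chunks).values())


def chunk_megabytes(chunks, bytes_per_value=4):
    """Size of one full chunk in MB, assuming 4-byte values by default."""
    return math.prod(chunks.values()) * bytes_per_value / 1e6


# Array dimensions quoted in the demonstration.
shape = {"time": 800, "y": 1976, "x": 2420}

# Hypothetical chunking schemes, for comparison only.
per_time_step = {"time": 1, "y": 1976, "x": 2420}   # one chunk per time step
per_spatial_tile = {"time": 800, "y": 256, "x": 256}  # whole time series per tile

for name, chunks in [("per_time_step", per_time_step),
                     ("per_spatial_tile", per_spatial_tile)]:
    print(name, total_chunks(shape, chunks), round(chunk_megabytes(chunks), 1))
```

For a per-pixel time series fit, the spatial-tile scheme keeps each pixel's full time series inside one chunk, so no data needs to move between workers, at the cost of larger chunks.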