Project Athena: Technical Issues
Larry Marx and the Project Athena Team

Outline
- Project Athena Resources
- Models and Machine Usage
- Experiments
- Running Models
- Initial and Boundary Data Preparation
- Post Processing, Data Selection and Compression
- Data Management

Project Athena Resources
- Athena: 4,512 nodes @ 4 cores, 2 GB memory; dedicated, Oct '09 – Mar '10; 79 million core-hours (sanity-checked below)
- Kraken: 8,256 nodes @ 12 cores, 16 GB memory; shared, Oct '09 – Mar '10; 5 million core-hours
- Verne: 5 nodes @ 32 cores, 128 GB memory; dedicated, Oct '09 – Mar '10; post-processing
- Storage: read-only scratch, 78 TB (Lustre); nakji, 360 TB (Lustre); homes, 8 TB (NFS); 800+ TB HPSS tape archive
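The dedicated Athena allocation is consistent with the machine size: 4,512 nodes × 4 cores is 18,048 cores, and roughly six months of wall time at full occupancy gives about 79 million core-hours. A minimal arithmetic sketch in Python (the 182-day length assumed for the Oct '09 – Mar '10 period is only an approximation used for the check):

```python
# Sanity check of the ~79 million dedicated core-hours on Athena.
nodes, cores_per_node = 4512, 4
cores = nodes * cores_per_node       # 18,048 cores
days = 182                           # approx. length of Oct '09 - Mar '10
core_hours = cores * days * 24
print(f"{core_hours:,} core-hours")  # about 78.8 million
```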

Models and Machine Usage
- NICAM was initially the primary focus of implementation:
  - Limited flexibility in scaling, due to the icosahedral grid
  - Limited testing on multicore/cache processor architectures; production primarily on the vector-parallel (NEC SX) Earth Simulator
- Step 1: Port a low-resolution version with simple physics to Athena.
- Step 2: Determine the highest resolution possible on Athena and the minimum and maximum number of cores to be used. Unique solution: g-level = 10, i.e. 10,485,762 cells (7-km spacing), using exactly 2,560 cores (worked out in the sketch below).
- Step 3: Initially, NICAM jobs failed frequently due to improper namelist settings. During a visit by U. Tokyo and JAMSTEC scientists to COLA, new settings were determined that generally ran with little trouble; however, the 2003 case could never be stabilized and was abandoned.
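The cell count quoted above is consistent with the icosahedral grid construction, in which g-level g yields 10·4^g + 2 cells; the ~7-km spacing then follows from dividing the Earth's surface area by the cell count. A minimal Python sketch (the square-root spacing estimate is an approximation for illustration, not NICAM's own definition):

```python
import math

def nicam_cells(glevel: int) -> int:
    """Cell count of an icosahedral grid refined to the given g-level."""
    return 10 * 4 ** glevel + 2

EARTH_RADIUS_KM = 6371.0
g = 10
n = nicam_cells(g)  # 10,485,762 cells
spacing_km = math.sqrt(4 * math.pi * EARTH_RADIUS_KM ** 2 / n)
print(f"glevel {g}: {n:,} cells, ~{spacing_km:.1f} km spacing")  # ~7 km
```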

Models and Machine Usage (cont'd)
- IFS's flexible scalability sustains good performance for the higher-resolution configurations (T1279 and T2047) using 2,560 processor cores.
- We defined one "slot" as 2,560 cores and managed a mix of NICAM and IFS jobs at 1 job per slot for maximally efficient use of the resource. Having equal-size slots for both models permits either model to be queued and run in the event of a job failure. Selected jobs were given higher priority so that they continued to run ahead of others.
- Machine partition: 7 slots of 2,560 cores = 17,920 cores out of 18,048, or 99% machine utilization (see the sketch below); 128 processors reserved for pre- and post-processing and as spares (to postpone reboots).
- Lower-resolution IFS experiments (T159 and T511) were run on Kraken.
- IFS runs were initially made by COLA. Once the ECMWF SMS model management system was installed, runs could be made by either COLA or ECMWF.
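The partitioning arithmetic is easy to verify. A minimal Python sketch of the slot layout and the utilization figure quoted above:

```python
# Fixed-size "slots" of 2,560 cores shared by NICAM and IFS jobs.
total_cores = 18_048
slot_size = 2_560
slots = total_cores // slot_size   # 7 slots
used = slots * slot_size           # 17,920 cores
spares = total_cores - used        # 128 cores for pre/post-processing and spares
print(f"{slots} slots, {used:,} cores in use, {spares} spare cores, "
      f"{used / total_cores:.1%} utilization")  # ~99.3%
```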

Project Athena Experiments

Initial and Boundary Data Preparation
- IFS: Most input data prepared by ECMWF; large files shipped by removable disk. Time Slice experiment input data prepared by COLA.
- NICAM: Initial data from GDAS 1° files, available for all dates. Boundary files other than SST are included with NICAM. SST from the ¼° NCDC daily OI analysis (version 2); data starting 1 June 2002 include in situ, AVHRR (IR), and AMSR-E (microwave) observations, while earlier data do not include AMSR-E. All data interpolated to the icosahedral grid (see the regridding sketch below).
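A minimal sketch of the kind of regridding described in the last bullet: remapping a regular ¼° lat-lon SST field onto icosahedral cell-centre coordinates by nearest neighbour. This is an illustration only; the use of scipy's KD-tree and nearest-neighbour lookup (rather than NICAM's own remapping tools), and all array names, are assumptions:

```python
import numpy as np
from scipy.spatial import cKDTree

def lonlat_to_xyz(lon_deg, lat_deg):
    """Unit-sphere Cartesian coordinates from longitude/latitude in degrees."""
    lon, lat = np.radians(lon_deg), np.radians(lat_deg)
    return np.column_stack([np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)])

def regrid_to_icosahedral(sst, src_lon, src_lat, ico_lon, ico_lat):
    """Nearest-neighbour remap of a (nlat, nlon) SST field to icosahedral points."""
    lon2d, lat2d = np.meshgrid(src_lon, src_lat)   # both (nlat, nlon)
    tree = cKDTree(lonlat_to_xyz(lon2d.ravel(), lat2d.ravel()))
    _, nearest = tree.query(lonlat_to_xyz(np.asarray(ico_lon), np.asarray(ico_lat)))
    return np.asarray(sst).ravel()[nearest]
```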

Post Processing, Data Selection and Compression
- All IFS (GRIB-1) data interpolated (coarsened) to the N80 reduced grid for common comparison among the resolutions and with the ERA-40 data.
- All IFS spectral data truncated to T159 coefficients and transformed to the N80 full grid.
- Key fields at full model resolution were processed, including transforming spectral coefficients to grids and compression to NetCDF-4 via GrADS (see the sketch below).
- Processing was done on Kraken, because Athena lacks sufficient memory and computing power on each node.
- All the common-comparison and selected high-resolution data were electronically transferred to COLA via bbcp (up to 40 MB/s sustained).
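A minimal sketch of writing a compressed NetCDF-4 field of the kind described above. It uses the netCDF4-python library as a stand-in for the GrADS-based pipeline actually used; the variable names, the deflate level, and the single 2-D field are assumptions for illustration:

```python
import numpy as np
from netCDF4 import Dataset

def write_compressed(path, field, lats, lons):
    """Write one 2-D field (lat x lon) to NetCDF-4 with lossless zlib deflation."""
    with Dataset(path, "w", format="NETCDF4") as nc:
        nc.createDimension("lat", lats.size)
        nc.createDimension("lon", lons.size)
        nc.createVariable("lat", "f4", ("lat",))[:] = lats
        nc.createVariable("lon", "f4", ("lon",))[:] = lons
        var = nc.createVariable("field", "f4", ("lat", "lon"),
                                zlib=True, complevel=4)
        var[:] = field
```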

Post Processing, Data Selection and Compression (cont'd)
- Nearly all (91) NICAM diagnostic variables were saved. Each variable was saved in 2,560 separate files (one per model domain), resulting in over 230,000 files, which quickly saturated the Lustre file system.
- The original program to interpolate data to a regular lat-lon grid had to be revised to use less I/O and to multithread, thereby eliminating a processing backlog.
- Selected 3-D fields were interpolated from z-coordinate to p-coordinate levels (see the sketch below).
- Selected 2-D and 3-D fields were compressed (NetCDF-4) and electronically transferred to COLA. All selected fields were coarsened to the N80 full grid.
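A minimal sketch of the z-to-p interpolation step: column-by-column linear interpolation in log pressure. The revised production program was multithreaded and I/O-optimized; this illustration shows only the vertical-interpolation idea, and the array layout and use of numpy.interp are assumptions:

```python
import numpy as np

def z_to_p_levels(field, p_model, p_target):
    """
    Interpolate a 3-D field from model (z) levels to target pressure levels.
    field:    (nlev, npts) values on model levels, ordered surface to top
    p_model:  (nlev, npts) pressure on model levels, same ordering
    p_target: (nplev,)     output pressure levels, same units as p_model
    """
    out = np.empty((len(p_target), field.shape[1]))
    log_target = np.log(p_target)
    for i in range(field.shape[1]):
        # np.interp needs increasing x, so reverse each column (top -> surface).
        log_p = np.log(p_model[::-1, i])
        out[:, i] = np.interp(log_target, log_p, field[::-1, i])
    return out
```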

Data Management: NICS
- All data archived to HPSS, approaching 1 PB.
- The workflow required complex data movement:
  - All high-resolution model runs were done on Athena.
  - Model output was stored on scratch or nakji, and all of it was copied to tape on HPSS.
  - IFS data interpolation/truncation was done directly from retrieved HPSS files.
  - NICAM data were processed using Verne and nakji (more capable CPUs and larger memory).

Data Management: COLA
- Project Athena data was allocated 50 TB (26%) of COLA's disk servers.
- Considerable discussion and judgment were required to down-select variables from IFS and NICAM, based on factors including scientific use and data compressibility (see the sketch below).
- A large directory structure was needed to organize the data, particularly for IFS, with its many resolutions, sub-resolutions, data forms and ensemble members.
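A minimal sketch of one way to quantify the "data compressibility" factor mentioned above, assuming zlib deflation comparable to what NetCDF-4 applies; the test fields and the comparison shown are purely illustrative, not the selection procedure actually used:

```python
import zlib
import numpy as np

def compression_ratio(field: np.ndarray, level: int = 4) -> float:
    """Raw-to-deflated size ratio for one array (higher = more compressible)."""
    raw = np.ascontiguousarray(field, dtype=np.float32).tobytes()
    return len(raw) / len(zlib.compress(raw, level))

# A smoothly varying field stored at fixed precision compresses far better
# than one dominated by small-scale noise.
smooth = np.round(np.sin(np.linspace(0, 10, 1_000_000)), 2).astype(np.float32)
noisy = np.random.default_rng(0).standard_normal(1_000_000).astype(np.float32)
print(f"smooth: {compression_ratio(smooth):.1f}x, noisy: {compression_ratio(noisy):.1f}x")
```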

Data Management: Future
- New machines at COLA and NICS will permit further analysis not currently possible due to lack of memory and compute power.
- Some or all of the data will eventually be made publicly available once its long-term disposition is determined. TeraGrid Science Portal? Earth System Grid?

Summary
- A large, international team of climate and computer scientists, using dedicated and shared resources, introduces many challenges for production computing, data analysis and data management.
- The sheer volume and complexity of the data "breaks" everything:
  - Disk capacity
  - File name space
  - Bandwidth connecting systems within NICS
  - HPSS tape capacity
  - Bandwidth to remote sites for collaborating groups
  - Software for analysis and display of results (GrADS modifications)
- COLA overcame these difficulties as they were encountered, in 24×7 production mode, to prevent the dedicated computer from sitting idle.