Scaling Tightly Coupled Algorithms on AWS

Dr. Scott Eberhardt
Principal Solutions Architect, HPC, AWS
Visiting Reader, Imperial College
Research Computing @ AWS
Worldwide Research & Technical Computing

IT’S ABOUT SCIENCE, NOT SERVERS.
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
aws.amazon.com/rcp #AWSresearchcloud

Why AWS for HPC?
• Unlimited infrastructure
• Efficient clusters
• Low cost with flexible pricing
• Faster time to results
• Increased collaboration
• Concurrent clusters on-demand

• Run many jobs in parallel
• Eliminate HPC resource contention
• Eliminate queue wait
• Use it when you need it
• Right-size clusters and resources
• Optimize each workload for performance
• Pay for only what you use

On Premises: Capital Expense Model
• High upfront capital cost
• High cost of ongoing support
Amazon Web Services: Pay As You Go Model
• Use only what you need
• Multiple pricing models

• Genomics Processing
• Modeling and Simulation
• Government and Educational Research
• Monte Carlo Simulations
• Transcoding and Encoding
• Computational Chemistry
• … and many more

Axes: Clustered (tightly coupled) to Distributed / Grid (loosely coupled); Data light (minimal requirements for high performance storage) to Data heavy (benefits from access to high performance storage)
• Fluid dynamics, Weather forecasting, Materials simulations, Crash simulations
• Risk modeling, Molecular modeling, Contextual search, Logistics simulations
• Seismic processing, Metagenomics, Astrophysics, Deep learning
• Animation and VFX, Semiconductor verification, Image processing/GIS, Genomics

Global Infrastructure

• Compute performance: CPUs, GPUs, FPGAs
• Memory performance: high RAM requirements in many applications
• Network performance: throughput, latency, and consistency
• Storage performance: including shared filesystems
• Automation and cluster/job management
• Remote graphics for interactive applications
• ISV support: including license management
…and SCALE

Computing
Credit: Aristotle

[Figure: two plots of cores versus time (days), with the data centre capacity limit marked]

Running at the same time, and tuned for each workload

100 Gbps

Example in aerospace
• Running parallel CFD studies using Siemens STAR-CCM+
• Goal: shorten the time between Design Requirements and Configuration, and Flight Testing
• 1000+ cores per CFD study, multiple studies required for each workflow iteration
• Job-level optimizations:
  – Enhanced Networking, Placement Groups
  – Amazon Linux, Hyper-threading disabled
• Workflow optimizations:
  – Spot instances, multiple clusters
  – Multiple parallel studies for faster throughput
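The workflow optimization in this example (multiple parallel studies on multiple clusters) can be illustrated with simple arithmetic. The following is a minimal sketch, not taken from the slides: the study count and per-study runtime are hypothetical, the function names are invented for illustration, and it assumes every study takes the same time on its own cluster regardless of how many other clusters are running.

```python
import math


def campaign_wall_clock_hours(num_studies, hours_per_study, concurrent_clusters):
    """Wall-clock hours to finish every study when each cluster runs one study at a time."""
    if num_studies < 0 or concurrent_clusters < 1:
        raise ValueError("need num_studies >= 0 and concurrent_clusters >= 1")
    rounds = math.ceil(num_studies / concurrent_clusters)  # batches run back to back
    return rounds * hours_per_study


def campaign_core_hours(num_studies, hours_per_study, cores_per_study):
    """Total core-hours consumed; independent of how many clusters run at once."""
    return num_studies * hours_per_study * cores_per_study


if __name__ == "__main__":
    # Hypothetical workflow iteration: 8 studies, 6 hours each, 1000 cores per study.
    studies, hours, cores = 8, 6.0, 1000
    for clusters in (1, 4, 8):
        wall = campaign_wall_clock_hours(studies, hours, clusters)
        print(f"{clusters} cluster(s): {wall:.0f} h wall clock")
    print(f"core-hours in every case: {campaign_core_hours(studies, hours, cores):.0f}")
```

Under these assumptions the core-hours consumed are the same in every case, while the wall-clock time for the iteration falls as more clusters run concurrently.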

For tightly-coupled cluster workloads

Test using real-world examples
• Use large cases for testing: do not benchmark scalability using only small examples
MPI libraries
• Test with Intel MPI and Open MPI 4.0, and make use of available tunings
Domain decomposition
• Choose number of cells per core for either per-core efficiency or faster results
Network
• Use a placement group
• Enable enhanced networking
Processor states
• Use P-states to reduce processor variability
Hyper-threading and affinity
• Test with Hyper-threading (HT) on and off; usually off is best, but not always
• Use CPU affinity to pin threads to CPU cores when HT is off
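As one possible way to apply the placement group and Hyper-threading advice, the sketch below builds the request parameters for the boto3 EC2 calls `create_placement_group` and `run_instances`. It is an illustration under stated assumptions, not the presenter's setup: the AMI ID and group name are placeholders, the `c5n.18xlarge` size and its 36 physical cores are assumptions of this example, and the helper names are hypothetical. Setting `ThreadsPerCore` to 1 in `CpuOptions` launches the instance with Hyper-threading off.

```python
def placement_group_request(group_name):
    """Keyword arguments for EC2 create_placement_group using the cluster strategy."""
    return {"GroupName": group_name, "Strategy": "cluster"}


def run_instances_request(ami_id, instance_type, count, group_name, physical_cores):
    """Keyword arguments for EC2 run_instances: one placement group, HT disabled."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": count,  # MinCount == MaxCount: launch all nodes or none
        "MaxCount": count,
        "Placement": {"GroupName": group_name},
        # One thread per physical core, i.e. Hyper-threading off.
        "CpuOptions": {"CoreCount": physical_cores, "ThreadsPerCore": 1},
    }


def launch(ami_id, instance_type, count, group_name, physical_cores):
    """Create the placement group and launch the nodes (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the helpers above work without boto3 installed

    ec2 = boto3.client("ec2")
    ec2.create_placement_group(**placement_group_request(group_name))
    return ec2.run_instances(
        **run_instances_request(ami_id, instance_type, count, group_name, physical_cores)
    )


if __name__ == "__main__":
    # Placeholder values for illustration only; nothing is launched here.
    request = run_instances_request("ami-EXAMPLE", "c5n.18xlarge", 4, "cfd-cluster-pg", 36)
    print(request)
```

Keeping the request construction separate from `launch` lets the parameters be reviewed or logged before any instances are started.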

Scaling

• C4.8xlarge instance type
• 140M cell model
• F1 car CFD benchmark
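For a model of this size, the cells-per-core figure used in domain decomposition is the mesh size divided by the number of cores. The sketch below is only an illustration: the 140M cell count comes from the slide, but the core counts and the target of 100,000 cells per core are hypothetical, and it assumes one MPI rank per core with an evenly partitioned mesh.

```python
import math


def cells_per_core(total_cells, cores):
    """Average number of mesh cells handled by each core."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return total_cells / cores


def cores_for_target(total_cells, target_cells_per_core):
    """Smallest core count that keeps the load at or below the target cells per core."""
    if target_cells_per_core <= 0:
        raise ValueError("target_cells_per_core must be positive")
    return math.ceil(total_cells / target_cells_per_core)


if __name__ == "__main__":
    total_cells = 140_000_000  # 140M cell model from the slide
    for cores in (500, 1000, 2000):  # hypothetical core counts
        print(f"{cores} cores: {cells_per_core(total_cells, cores):,.0f} cells/core")
    print("cores for 100,000 cells/core:", cores_for_target(total_cells, 100_000))
```

Fewer cells per core generally means a shorter time per step but more communication per unit of work, which is the trade-off between faster results and per-core efficiency.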

[Figure: sec/time-step/cell versus cores, for z1d, c5n and Archer on medium and fine meshes]

[Figure: sec/time-step versus cores, for z1d, c5n and Archer on medium and fine meshes]

[Figure: sec/time-step versus cells/core, for z1d, c5n and Archer on medium and fine meshes]

Scaling

[Figure: sec/time-step/cell versus cells, for z1d (480, 960, 1920 cores), c5n (480, 960, 1920 cores) and Archer (960, 1920 cores)]

[Figure: sec/time-step versus cells, for z1d (480, 960, 1920 cores), c5n (480, 960, 1920 cores) and Archer (960, 1920 cores)]

[Figure: sec/time-step/cell versus 1/cores, for z1d, c5n and Archer on medium and fine meshes]

[Figure: cells/sec/core (workload/sec) versus cells/core (workload), for z1d, c5n and Archer on medium and fine meshes]
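The quantities on the axes of these plots are normalizations of the same measurement, the wall-clock time per time-step. The sketch below shows how they relate and adds the usual strong-scaling efficiency. It is a minimal illustration: the function names are hypothetical, and the timings and mesh size in the example are made-up values, not data read from the plots.

```python
def sec_per_step_per_cell(sec_per_step, cells):
    """Time per time-step normalized by mesh size (sec/time-step/cell)."""
    return sec_per_step / cells


def cells_per_sec_per_core(sec_per_step, cells, cores):
    """Cells advanced one time-step per second by each core (cells/sec/core)."""
    return cells / (sec_per_step * cores)


def strong_scaling_efficiency(base_cores, base_sec, cores, sec):
    """Speedup over the base run divided by the factor of extra cores (1.0 is ideal)."""
    return (base_sec / sec) * (base_cores / cores)


if __name__ == "__main__":
    cells = 200_000_000  # hypothetical mesh size
    runs = [(480, 16.0), (960, 8.4), (1920, 4.8)]  # hypothetical (cores, sec/time-step)
    base_cores, base_sec = runs[0]
    for cores, sec in runs:
        print(
            f"{cores} cores: "
            f"{sec_per_step_per_cell(sec, cells):.2e} sec/time-step/cell, "
            f"{cells_per_sec_per_core(sec, cells, cores):.0f} cells/sec/core, "
            f"efficiency {strong_scaling_efficiency(base_cores, base_sec, cores, sec):.2f}"
        )
```

Note that cells/sec/core is the reciprocal of sec/time-step/cell multiplied by the core count, so a flat cells/sec/core curve corresponds to ideal scaling.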

[Figure: iteration time (s) versus cells/core, for z1d, m5 and c5n]

missing manual
Written by Amazon’s Research Computing community for scientists.
• Explains foundational concepts about how AWS can accelerate time-to-science in the cloud.
• Step-by-step best practices for securing your environment to ensure your research data is safe and your privacy is protected.
• Tools for budget management that will help you control your spending and limit costs (and prevent any over-runs).
• Catalogue of scientific solutions from partners chosen for their outstanding work with scientists.
aws.amazon.com/rcp
