COMP 7330/7336 Advanced Parallel and Distributed Computing
Dr. Xiao Qin, Auburn University
http://www.eng.auburn.edu/~xqin
xqin@auburn.edu
Your Background Not-A-Quiz
• Have you taken the parallel and distributed computing (PDC) class?
• What lab assignments have you completed in the PDC class?
• What is your ongoing dissertation or thesis research project?
• Is your current research project related to parallel and distributed computing?
Today’s Goal
• Course objectives
• Course content & grading
• Answer your questions about COMP 7330/7336
• Give you a sense of the trends that shape the parallel and distributed computing field
COMP 7330/7336: Semester Calendar
See the class webpage for the most up-to-date version!
http://www.eng.auburn.edu/~xqin/courses/comp7330
Will it be worthwhile?
Textbook
Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar, “Introduction to Parallel Computing”, Second Edition, Pearson, ISBN 0-201-64865-2
Topic Coverage
• Handouts, book chapters, and papers will be used as supplementary course material. The course material will be posted online.
• Coverage (these topics may change):
– Advances in technologies and system architectures for parallel and distributed computing
– Parallel computing algorithms
– Parallel programming models
– Message Passing Interface (a minimal example appears below)
– Convergence of parallel, distributed, and cloud computing
– Analytical modeling and system evaluation
– Cloud computing and big data
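To give a flavor of the Message Passing Interface listed above, here is a minimal "hello world" sketch in C. It is illustrative only and not taken from the course assignments; the compile and run commands assume a typical MPICH or Open MPI installation.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);               /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* id of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes */
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();                       /* shut down the MPI runtime */
    return 0;
}

Compile with "mpicc hello.c -o hello" and launch with, e.g., "mpirun -np 4 ./hello"; each of the four processes prints its own rank.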
Course Syllabus
• Prerequisites:
– COMP 4300, Computer Architecture
– COMP 4320, Introduction to Computer Networks
• Midterm exam and final exam
• Grading:
– Class participation: 10%
– Midterm: 20%
– Final: 20%
– Homework assignments: 20%
– Research projects: 30%
Course Syllabus (cont.)
• Scale
– Letter grades will be awarded based on the following scale. This scale may be adjusted upward if necessary, based on the final grades.
– A ≥ 90, B ≥ 80, C ≥ 70, D ≥ 60, F < 60
Office Hours and Exams
Office hours: MWF 10:55 am - 11:55 am
Mid-term exam: Wednesday, 10/9/2013
Am I going to read papers to you?
• NO!
• Papers provide a framework and complete background, so lectures can be more interactive.
– You do the reading
– We’ll discuss it
• Projects will go “beyond”
Questions? Please ask at any time!
The History
1969: CDC 6600, 1st system for scientific computing
1975: CDC 7600, 1st supercomputer
1985: Cray X-MP/48, 1st vector supercomputer
1989: Cray Y-MP/4-64
1993: Cray C-90/2-128
1994: Cray T3D 64, 1st parallel supercomputer
1995: Cray T3D 128
1998: Cray T3E 256, 1st MPP supercomputer
2002: IBM SP4 512, 1 Teraflops
2005: IBM SP5 512
2006: IBM BCX, 10 Teraflops
2009: IBM SP6, 100 Teraflops
2012: IBM BG/Q, > 1 Petaflops
Source: scc.acad.bg/ncsa/documentation/Presentations/Gi.Erbacci_slides.ppt
HPC Infrastructure for Scientific Computing - which machine is the most energy efficient?
SP6 (Sep 2009): IBM P575, SMP architecture, IBM Power6 4.7 GHz, 5376 cores, 168 nodes, 12 racks, 20 TB RAM, Qlogic Infiniband DDR 4x interconnect, AIX, ~800 kW total power, > 101 TFlops peak performance
BGP (Jan 2010): IBM BG/P, MPP architecture, IBM PowerPC 0.85 GHz, 4096 cores, 32 nodes, 1 rack, 2 TB RAM, IBM 3D Torus interconnect, Suse Linux, ~80 kW total power, ~14 TFlops peak performance
PLX (2011): IBM iDataPlex, Linux cluster, Intel Westmere EC 2.4 GHz, 3288 cores + 548 GPGPUs (Nvidia Fermi M2070), 274 nodes, 10 racks, ~13 TB RAM, Qlogic QDR 4x interconnect, RedHat Linux, ~200 kW total power, ~300 TFlops peak performance
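A rough answer to the energy-efficiency question above, using peak performance per watt computed from the figures in this slide (a back-of-the-envelope estimate; sustained workloads would give different numbers):
– SP6: ~101 TFlops / ~800 kW ≈ 0.13 GFlops/W
– BGP: ~14 TFlops / ~80 kW ≈ 0.18 GFlops/W
– PLX: ~300 TFlops / ~200 kW ≈ 1.5 GFlops/W
By this measure, the GPU-accelerated PLX cluster is the most energy efficient of the three.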
SP Power6 @ CINECA
- 168 compute nodes IBM p575 Power6 (4.7 GHz)
- 5376 compute cores (32 cores/node)
- 128 GB RAM/node (21 TB RAM total)
- IB x4 DDR (double data rate) interconnect
Peak performance: 101 TFlops
Rmax: 76.41 TFlops; efficiency (workload): 75.83%
No. 116 in the Top500 (June 2011)
- 2 login nodes IBM p560
- 21 I/O + service nodes IBM p520
- 1.2 PB of storage: 500 TB high-performance working area, 700 TB data repository
BGP @ CINECA
Model: IBM BlueGene/P
Architecture: MPP
Processor type: IBM PowerPC 0.85 GHz
Compute nodes: 1024 (quad core, 4096 cores total)
RAM: 4 GB/compute node (4096 GB total)
Internal network: IBM 3D Torus
OS: Linux (login nodes), CNK (compute nodes)
Peak performance: 14.0 TFlops
PLX @ CINECA
Compute node: IBM Server dx360 M3
- 2 x Intel Westmere 6-core X5645 processors, 2.40 GHz, 12 MB cache, DDR3 1333 MHz, 80 W
- 48 GB RAM on 12 x 4 GB DDR3 1333 MHz DIMMs
- 1 x 250 GB SATA HDD
- 1 x QDR Infiniband card, 40 Gb/s
- 2 x NVIDIA M2070 (M2070Q on 10 nodes)
Peak performance: 32 TFlops (3288 cores at 2.40 GHz)
Peak performance: 565 TFlops single precision or 283 TFlops double precision (548 Nvidia M2070)
No. 54 in the Top500 (June 2011)
Visualisation System
• Visualisation and computer graphics
• Virtual theater
• 6 BARCO SIM5 video projectors
• Surround audio system
• Cylindrical screen, 9.4 x 2.7 m, 120° angle
• Workstations with NVIDIA cards
• RVN nodes on the PLX system
Storage Infrastructure
(System: available bandwidth, space, connection technology, disk technology)
- 2 x S2A9500: 3.2 GB/s, 140 TB, FCP 4 Gb/s, FC
- 4 x S2A9500: 3.2 GB/s, 140 TB, FCP 4 Gb/s, FC
- 6 x DCS9900: 5.0 GB/s, 540 TB, FCP 8 Gb/s, SATA
- 4 x DCS9900: 5.0 GB/s, 720 TB, FCP 4 Gb/s, SATA
- 3 x DCS9900: 5.0 GB/s, 1500 TB, FCP 4 Gb/s, SATA
- Hitachi DS: 3.2 GB/s, 360 TB, FCP 4 Gb/s, SATA
- 3 x SFA1000: 10.0 GB/s, 2200 TB, QDR, SATA
- 1 x IBM 5100: 3.2 GB/s, 66 TB, FCP 8 Gb/s, FC
Total capacity: > 5.6 PB
HPC Evolution
Moore’s law still holds for the number of transistors:
– Transistors on an ASIC are still doubling every 18 months at constant cost
– But 15 years of exponential clock-rate growth has ended
Moore’s law reinterpreted:
– Performance improvements now come from increasing the number of cores on a processor (ASIC)
– The number of cores per chip doubles every 18 months, instead of the clock rate
– 64-512 threads per node will appear soon
The Real HPC Crisis is with ____?
Supercomputer applications and software are usually much longer-lived than the hardware:
- Hardware lifetime is typically four to five years at most
- Fortran and C are still the main programming models
Programming is stuck:
- Arguably it hasn’t changed much since the 70s
Software is a major cost component of modern technologies:
- The tradition in HPC system procurement is to assume that the software is free
It’s time for a change
• Complexity is rising dramatically
• Applications face new challenges on Petaflop systems
• Using O(100K) cores implies a dramatic optimization effort
• A new paradigm: supporting a hundred threads in one node requires new parallelization strategies
It’s time for a change (cont.)
• Improving existing codes will become complex and in part impossible
• Implementing new parallel programming methods in existing large applications does not always have a promising outlook
• There is a need for new community codes
Roadmap to Exascale (architectural trends)