Introducción a la computación de altas prestaciones


Introducción a la computación de altas prestaciones
Francisco Almeida and Francisco de Sande
Departamento de Estadística, I. O. y Computación, Universidad de La Laguna
La Laguna, February 12, 2004

Questions
• Why Parallel Computers?
• How Can the Quality of the Algorithms be Analyzed?
• How Should Parallel Computers Be Programmed?
• Why the Message Passing Programming Paradigm?
• Why the Shared Memory Programming Paradigm?

OUTLINE
• Introduction to Parallel Computing
• Performance Metrics
• Models of Parallel Computers
• The MPI Message Passing Library
  – Examples
• The OpenMP Shared Memory Library
  – Examples
• Improvements in black hole detection using parallelism

Why Parallel Computers?
• Applications Demanding more Computational Power:
  – Artificial Intelligence
  – Weather Prediction
  – Biosphere Modeling
  – Processing of Large Amounts of Data (from sources such as satellites)
  – Combinatorial Optimization
  – Image Processing
  – Neural Networks
  – Speech Recognition
  – Natural Language Understanding
  – etc.
[Figure: supercomputer performance vs. cost, 1960s-1990s]

Top 500
• www.top500.org

Performance Metrics

Speed-up
• Ts = Sequential Run Time: the time elapsed between the beginning and the end of the program's execution on a sequential computer.
• Tp = Parallel Run Time: the time that elapses from the moment a parallel computation starts to the moment the last processor finishes its execution.
• Speed-up: S = T*s / Tp ≤ p
• T*s = the running time of the best sequential algorithm that solves the problem.

Speed-up
[Figure slides: speed-up curves]

Speed-up: Optimal Number of Processors
[Figure]

Efficiency
• In practice, the ideal behavior of a speed-up equal to p is not achieved, because while executing a parallel algorithm the processing elements cannot devote 100% of their time to the computations of the algorithm.
• Efficiency: a measure of the fraction of time for which a processing element is usefully employed.
• E = (Speed-up / p) × 100%

Efficiency
[Figure: efficiency curves]

Amdahl's Law
• Amdahl's law attempts to give a maximum bound for the speed-up, from the nature of the algorithm chosen for the parallel implementation.
• Seq = proportion of time the algorithm needs to spend in purely sequential parts.
• Par = proportion of time that might be done in parallel.
• Seq + Par = 1 (where 1 is for algebraic simplicity).
• Maximum Speed-up = (Seq + Par) / (Seq + Par / p) = 1 / (Seq + Par / p)

For p = 1000:

% Seq   Seq     Par     Maximum Speed-up
0.1     0.001   0.999   500.25
0.5     0.005   0.995   166.81
1       0.01    0.99     90.99
10      0.1     0.9       9.91

Example
• A problem to be solved many times over several different inputs:
  – Evaluate F(x, y, z), with x in {1, ..., 20}; y in {1, ..., 10}; z in {1, ..., 3}.
  – The total number of evaluations is 20 × 10 × 3 = 600.
  – The cost of evaluating F at one point (x, y, z) is t.
  – The total running time is t × 600.
  – If t is equal to 3 hours, the total running time for the 600 evaluations is 1800 hours ≈ 75 days.

Speed-up
[Figure]

Models

The Sequential Model: RAM
• The RAM model expresses computations on von Neumann architectures.
• The von Neumann architecture is universally accepted for sequential computations.
[Figure: the RAM model and the von Neumann architecture]

The Parallel Model
• Computational Models: PRAM, BSP, LogP
• Programming Models: PVM, MPI, HPF, Threads, OpenMP
• Architectural Models: Parallel Architectures

Address-Space Organization
[Figure]

Digital AlphaServer 8400 Hardware
• Shared Memory
• Bus Topology
• C4-CEPBA
• 10 Alpha 21164 processors
• 2 GB memory
• 8.8 Gflop/s

SGI Origin 2000 Hardware
• Shared Distributed Memory
• Hypercube Topology
• C4-CEPBA
• 64 R10000 processors
• 8 GB memory
• 32 Gflop/s

The SGI Origin 3000 Architecture (1/2)
jen50.ciemat.es
• 160 MIPS R14000 / 600 MHz processors
• On 40 nodes with 4 processors each
• Data and instruction cache on-chip
• Irix Operating System
• Hypercubic Network

The SGI Origin 3000 Architecture (2/2)
• cc-NUMA memory architecture
• 1 Gflops peak speed
• 8 MB external cache
• 1 GB main memory per processor
• 1 TB hard disk

Beowulf Computers
• COTS: Commercial Off-The-Shelf computers
• Distributed Memory

Towards Grid Computing…
Source: www.globus.org (updated)

The Parallel Model
• Computational Models: PRAM, BSP, LogP
• Programming Models: PVM, MPI, HPF, Threads, OpenMP
• Architectural Models: Parallel Architectures

Drawbacks that arise when solving Problems using Parallelism
• Parallel programming is more complex than sequential programming.
• Results may vary as a consequence of intrinsic non-determinism.
• New problems: deadlocks, starvation...
• It is more difficult to debug parallel programs.
• Parallel programs are less portable.

The Message Passing Model

The Message Passing Model
[Figure: processors connected by an interconnection network, communicating via Send(parameters) and Recv(parameters)]


MPI
[Figure: earlier systems, PVM, CMMD, Express, Zipcode, p4, PARMACS and EUI, converging into MPI; parallel libraries, parallel applications, parallel languages]

MPI
• What Is MPI?
  – Message Passing Interface standard
  – The first standard and portable message passing library with good performance
  – "Standard" by consensus of MPI Forum participants from over 40 organizations
  – Finished and published in May 1994, updated in June 1995
• What does MPI offer?
  – Standardization - on many levels
  – Portability - to existing and new systems
  – Performance - comparable to vendors' proprietary libraries
  – Richness - extensive functionality, many quality implementations

A Simple MPI Program: hello.c

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int name, p;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &name);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    printf("Hello from processor %d of %d\n", name, p);
    MPI_Finalize();
    return 0;
}

$> mpicc -o hello hello.c
$> mpirun -np 4 hello
Hello from processor 2 of 4
Hello from processor 3 of 4
Hello from processor 1 of 4
Hello from processor 0 of 4

A Simple MPI Program: helloms.c

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int name, p, source, dest, tag = 0;
    char message[100];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &name);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    if (name != 0) {
        printf("Processor %d of %d\n", name, p);
        sprintf(message, "greetings from process %d!", name);
        dest = 0;
        MPI_Send(message, strlen(message) + 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    } else {
        printf("processor 0, p = %d\n", p);
        for (source = 1; source < p; source++) {
            MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
            printf("%s\n", message);
        }
    }
    MPI_Finalize();
    return 0;
}

$> mpirun -np 4 helloms
Processor 2 of 4
Processor 3 of 4
Processor 1 of 4
processor 0, p = 4
greetings from process 1!
greetings from process 2!
greetings from process 3!

Linear Model to Predict Communication Performance

Time to send N bytes = α + βN

(α is the start-up latency of a message, β the transfer time per byte.)

Performance Prediction: Fast Ethernet, Gigabit Ethernet, Myrinet
[Figure]

Basic Communication Operations

One-to-all Broadcast / Single-node Accumulation
[Figure: one-to-all broadcast sends message M from node 0 to nodes 1…p; single-node accumulation combines one message per node back into node 0, in p steps]

Broadcast on Hypercubes
[Figure: broadcast on a 3-dimensional hypercube (nodes 0-7) in three steps; in each step the informed nodes forward the message along one further dimension, doubling the number of nodes holding it]

MPI Broadcast

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
              int root, MPI_Comm comm);

• Broadcasts a message from the process with rank "root" to all other processes of the group.

Reduction on Hypercubes
• @ is a commutative and associative operator.
• Ai is stored in processor i.
• Every processor has to obtain A0 @ A1 @ ... @ Ap-1.
[Figure: reduction on a 3-dimensional hypercube; in each step, processors exchange partial results with their neighbor along one dimension and combine them]

Reductions with MPI

int MPI_Reduce(void *sendbuf, void *recvbuf, int count,
               MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm);

• Reduces values on all processes to a single value on process "root".

int MPI_Allreduce(void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);

• Combines values from all processes and distributes the result back to all processes.

All-to-All Broadcast / Multinode Accumulation
[Figure: all-to-all broadcast leaves every node with all messages M0…Mp; multinode accumulation is its dual, used for reductions and prefix sums]

MPI Collective Operations

MPI Operator   Operation
------------   ----------------------
MPI_MAX        maximum
MPI_MIN        minimum
MPI_SUM        sum
MPI_PROD       product
MPI_LAND       logical and
MPI_BAND       bitwise and
MPI_LOR        logical or
MPI_BOR        bitwise or
MPI_LXOR       logical exclusive or
MPI_BXOR       bitwise exclusive or
MPI_MAXLOC     max value and location
MPI_MINLOC     min value and location

Computing π: Sequential

π = ∫₀¹ 4 / (1 + x²) dx

double x, pi = 0.0, h;
long i, n = ...;
...
h = 1.0 / (double) n;
for (i = 0; i < n; i++) {
    x = (i + 0.5) * h;
    pi += f(x);          /* f(x) = 4.0 / (1.0 + x*x) */
}
pi *= h;

Computing π: Parallel

π = ∫₀¹ 4 / (1 + x²) dx

MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
h = 1.0 / (double) n;
mypi = 0.0;
for (i = name; i < n; i += numprocs) {
    x = (i + 0.5) * h;
    mypi += f(x);
}
mypi *= h;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

The Master Slave Paradigm
[Figure: a master process coordinating a set of slave processes]

Condor
University of Wisconsin-Madison. www.cs.wisc.edu/condor
• A problem to be solved many times over several different inputs.
• The problem to be solved is computationally expensive.

Condor
• Condor is a specialized workload management system for compute-intensive jobs.
• Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management.
• Users submit their serial or parallel jobs to Condor; Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.
• Condor can be used to manage a cluster of dedicated compute nodes (such as a "Beowulf" cluster). In addition, unique mechanisms enable Condor to effectively harness wasted CPU power from otherwise idle desktop workstations.
• In many circumstances Condor is able to transparently produce a checkpoint and migrate a job to a different machine which would otherwise be idle.
• As a result, Condor can be used to seamlessly combine all of an organization's computational power into one resource.

What is OpenMP?
• Application Program Interface (API) for shared memory parallel programming
  – What the application programmer inserts into code to make it run in parallel
  – Addresses only shared memory multiprocessors
• Directive based approach with library support
• Concept of base language and extensions to base language
• OpenMP is available for Fortran 90 / 77 and C / C++

OpenMP is not...
• Not automatic parallelization
  – User explicitly specifies parallel execution
  – Compiler does not ignore user directives even if wrong
• Not just loop level parallelism
  – Functionality to enable coarse grained parallelism
• Not a research project
  – Only practical constructs that can be implemented with high performance in commercial compilers
  – Goal of parallel programming: application speedup
• Simple/minimal, with opportunities for extensibility

Why OpenMP?
• Parallel programming landscape before OpenMP:
  – Standard way to program distributed memory computers (MPI and PVM)
  – No standard API for shared memory programming
• Several vendors had directive based APIs for shared memory programming:
  – Silicon Graphics, Cray Research, Kuck & Associates, Digital Equipment...
  – All different, vendor proprietary; sometimes similar but with different spellings
  – Most were targeted at loop level parallelism: limited functionality, mostly for parallelizing loops

Why OpenMP? (cont.)
• Commercial users and high end software vendors have a big investment in existing code:
  – Not very eager to rewrite their code in new languages
  – Performance concerns of new languages
• End result: users who want portability are forced to program shared memory machines using MPI:
  – Library based, good performance and scalability
  – But sacrifices the built-in shared memory advantages of the hardware
• Both require a major investment in time and money:
  – Major effort: the entire program needs to be rewritten
  – New features need to be curtailed during conversion

The OpenMP API
• Multi-platform shared-memory parallel programming
  – OpenMP is portable: supported by Compaq, HP, IBM, Intel, SGI, Sun and others on Unix and NT
• Multi-language: C, C++, F77, F90
• Scalable
• Loop level parallel control
• Coarse grained parallel control

The OpenMP API
• Single source parallel/serial programming:
  – OpenMP is not intrusive (to original serial code). Instructions appear in comment statements for Fortran and through pragmas for C/C++.

!$omp parallel do
do i = 1, n
  ...
end do

#pragma omp parallel for
for (i = 0; i < n; i++) {
  ...
}

• Incremental implementation:
  – OpenMP programs can be implemented incrementally, one subroutine (function) or even one do (for) loop at a time.

Threads
• Multithreading: sharing a single CPU between multiple tasks (or "threads") in a way designed to minimise the time required to switch threads.
• This is accomplished by sharing as much as possible of the program execution environment between the different threads, so that very little state needs to be saved and restored when changing threads.
• Threads share more of their environment with each other than do tasks under multitasking.
• Threads may be distinguished only by the value of their program counters and stack pointers, while sharing a single address space and set of global variables.

OpenMP Overview: How do threads interact?
• OpenMP is a shared memory model:
  – Threads communicate by sharing variables.
• Unintended sharing of data causes race conditions:
  – Race condition: when the program's outcome changes as the threads are scheduled differently.
• To control race conditions:
  – Use synchronization to protect data conflicts.
• Synchronization is expensive, so:
  – Change how data is accessed to minimize the need for synchronization.

OpenMP Parallel Computing Solution Stack
• User layer: End User, Application
• Programming layer (OpenMP API): Directives, OpenMP Library, Environment Variables
• System layer: Runtime Library, OS/system support for shared memory

Reasoning about programming
• Programming is a process of successive refinement of a solution relative to a hierarchy of models.
• The models represent the problem at different levels of abstraction:
  – The top levels express the problem in the original problem domain.
  – The lower levels represent the problem in the computer's domain.
• The models are informal, but detailed enough to support simulation.

Source: J.-M. Hoc, T. R. G. Green, R. Samurcay and D. J. Gilmore (eds.), Psychology of Programming, Academic Press Ltd., 1990

Layers of abstraction in Programming

Domain        Model (bridges between domains)
Problem       Specification
Algorithm     Programming
Source Code   Computational
Computation   Cost
Hardware

OpenMP only defines the Programming and Computational models!

The OpenMP Computational Model
OpenMP was created with a particular abstract machine or computational model in mind:
• Multiple processing elements
• A shared address space with "equal-time" access for each processor
• Multiple lightweight processes (threads) managed outside of OpenMP (by the OS or some other "third party")
[Figure: Proc 1, Proc 2, …, Proc N connected to a Shared Address Space]

The OpenMP programming model
• Fork-join parallelism:
  – The master thread spawns a team of threads as needed.
  – Parallelism is added incrementally: the sequential program evolves into a parallel program.
[Figure: a master thread forking parallel regions, including a nested parallel region]

So, how good is OpenMP?
• A high quality programming environment supports transparent mapping between models.
• OpenMP does this quite well for the models it defines:
  – Programming model: threads forked by OpenMP map onto threads in modern OSs.
  – Computational model: multiprocessor systems with cache coherency map onto the OpenMP shared address space.

What about the cost model?
• OpenMP doesn't say much about the cost model: programmers are left to their own devices.
• Real systems have memory hierarchies; OpenMP's assumed machine model doesn't:
  – Caches mean some data is closer to some processors.
  – Scalable multiprocessor systems organize their RAM into modules - another source of NUMA.
• OpenMP programmers must deal with these issues as they:
  – Optimize performance on each platform.
  – Scale to run on larger NUMA machines.

What about the specification model?
• Programmers reason in terms of a specification model as they design parallel algorithms.
• Some parallel algorithms are natural in OpenMP:
  – Specification models implied by loop-splitting and SPMD algorithms map well onto OpenMP's programming model.
• Some parallel algorithms are hard for OpenMP:
  – Recursive problems and list processing are challenging for OpenMP's models.

Is OpenMP a "good" API?

Model          Score
Specification  Fair (5)
Programming    Good (8)
Computational  Good (9)
Cost           Poor (3)

Overall score: OpenMP is "OK" (6), but it sure could be better!

OpenMP today

Hardware Vendors          3rd Party Software Vendors
Compaq/Digital (DEC)      Absoft
Hewlett-Packard (HP)      Edinburgh Portable Compilers (EPC)
IBM                       Kuck & Associates (KAI)
Intel                     Myrias
Silicon Graphics          Numerical Algorithms Group (NAG)
Sun Microsystems          Portland Group (PGI)

The OpenMP components
• Directives
• Environment Variables
• Shared / Private Variables
• Runtime Library
• OS 'Threads'

The OpenMP directives
• Directives are special comments in the language:
  – Fortran fixed form: !$OMP, C$OMP, *$OMP
  – Fortran free form: !$OMP
• Special comments are interpreted by OpenMP compilers.

A comment in Fortran, but interpreted by OpenMP compilers (don't worry about the details now!):

      w = 1.0/n
      sum = 0.0
!$OMP PARALLEL DO PRIVATE(x) REDUCTION(+:sum)
      do i = 1, n
        x = w*(i-0.5)
        sum = sum + f(x)
      end do
      pi = w*sum
      print *, pi
      end

The OpenMP directives
• Look like a comment (sentinel / pragma syntax):
  – F77/F90: !$OMP directive_name [clauses]
  – C/C++: #pragma omp pragma_name [clauses]
• Declare the start and end of multithreaded execution
• Control work distribution
• Control how data is brought into and taken out of parallel sections
• Control how data is written/read inside sections

The OpenMP environment variables
• OMP_NUM_THREADS - number of threads to run in a parallel section
• MPSTKZ - size of stack to provide for each thread
• OMP_SCHEDULE - control the schedule of loop executions:
  – setenv OMP_SCHEDULE "STATIC,5"
  – setenv OMP_SCHEDULE "GUIDED,8"
  – setenv OMP_SCHEDULE "DYNAMIC"
• OMP_DYNAMIC
• OMP_NESTED

Shared / Private variables
• Shared variables can be accessed by all of the threads.
• Private variables are local to each thread.
• In a 'typical' parallel loop, the loop index is private, while the data being indexed is shared:

!$omp parallel do
!$omp& shared(X, Y, Z), private(I)
do I = 1, 100
  Z(I) = X(I) + Y(I)
end do
!$omp end parallel do

OpenMP runtime routines

- Writing a parallel section of code is a matter of asking two questions:
  - How many threads are working in this section?
  - Which thread am I?
- Other things you may wish to know:
  - How many processors are there?
  - Am I in a parallel section?
  - How do I control the number of threads?
- Control the execution by setting directive state

OS threads

- In the case of Linux, the system needs to be installed with an SMP kernel
- It is not a good idea to assign more threads than the CPUs available:
  omp_set_num_threads(omp_get_num_procs())

A simple example: computing π

Computing π

      double t, pi = 0.0, w;
      long i, n = 10000;
      ...
      w = 1.0 / n;
      for (i = 0; i < n; i++) {
        t = (i + 0.5) * w;
        pi += 4.0 / (1.0 + t*t);
      }
      pi *= w;
      ...

Computing π

- #pragma omp directives in C
  - Ignored by non-OpenMP compilers

      double t, pi = 0.0, w;
      long i, n = 10000;
      ...
      w = 1.0 / n;
      #pragma omp parallel for reduction(+:pi) private(i, t)
      for (i = 0; i < n; i++) {
        t = (i + 0.5) * w;
        pi += 4.0 / (1.0 + t*t);
      }
      pi *= w;
      ...

Computing π on a SunFire 6800

Compiling OpenMP programs

- OpenMP directives are ignored by default
  - Example, SGI Irix platforms:
    f90 -O3 foo.f
    cc -O3 foo.c
- OpenMP directives are enabled with "-mp"
  - Example, SGI Irix platforms:
    f90 -O3 -mp foo.f
    cc -O3 -mp foo.c

Fortran example

      program f77_parallel
      implicit none
      integer n, m, i, j
      parameter (n=10, m=20)
      integer a(n,m)
c$omp parallel default(none)
c$omp& private(i,j) shared(a)
      do j=1,m
        do i=1,n
          a(i,j) = i + (j-1)*n
        enddo
      enddo
c$omp end parallel
      end

OpenMP directives used:
  c$omp parallel [clauses]
  c$omp end parallel
Parallel clauses include:
  default(none|private|shared)
  private(...)
  shared(...)

Fortran example (continued)

The same f77_parallel program, now executed by a team of threads:
- Each arrow denotes one thread
- All threads perform the identical task

Fortran example (continued)

The same f77_parallel program, with default scheduling:
- Thread a works on j = 1:5
- Thread b on j = 6:10
- Thread c on j = 11:15
- Thread d on j = 16:20

The Foundations of Open. MP: a parallel programming API Parallelism Working with concurrency Layers

The Foundations of Open. MP: a parallel programming API Parallelism Working with concurrency Layers of abstractions or models used to understand use Open. MP Introducción a la Computación de Altas Prestaciones. La Laguna, 12 de febrero de 2004

Summary of OpenMP basics

- Parallel region (to create threads):
  C$omp parallel        #pragma omp parallel
- Worksharing (to split up work between threads):
  C$omp do              #pragma omp for
  C$omp sections        #pragma omp sections
  C$omp single          #pragma omp single
  C$omp workshare
- Data environment (to manage data sharing):
  directive: threadprivate
  clauses: shared, private, lastprivate, reduction, copyin, copyprivate
- Synchronization:
  directives: critical, barrier, atomic, flush, ordered, master
- Runtime functions / environment variables

Improvements in black hole detection using parallelism

Francisco Almeida (1), Evencio Mediavilla (2), Álex Oscoz (2) and Francisco de Sande (1)

(1) Departamento de Estadística, I.O. y Computación, Universidad de La Laguna
(2) Instituto de Astrofísica de Canarias
Tenerife, Canary Islands, SPAIN

Dresden, September 3, 2003

Introduction (1/3)

- Very frequently there is a divorce between computer scientists and researchers in other scientific disciplines
- This work collects the experiences of a collaboration between researchers coming from two different fields: astrophysics and parallel computing
- We present different approaches to the parallelization of a scientific code that solves an important problem in astrophysics: the detection of supermassive black holes

Introduction (2/3)

- The IAC co-authors initially developed a Fortran 77 code solving the problem
- The execution time of this original code was not acceptable, which motivated them to contact researchers with expertise in the parallel computing field
- We knew in advance that these scientific programmers deal with compute-intensive sequential codes that are not difficult to tackle using HPC techniques
- Researchers with a purely scientific background are interested in these techniques, but they are not willing to spend time learning them

Introduction (3/3)

- One of our constraints was to introduce the minimum amount of changes in the original code
- Even with the knowledge that some optimizations could be done in the sequential code
- Preserving the use of the NAG functions was another restriction in our development

Outline

- The problem
  - Black holes and quasars
  - The method: gravitational lensing
  - Fluctuations in quasar light curves
  - Mathematical formulation of the problem
- Sequential code
- Parallelizations: MPI, OpenMP, mixed MPI/OpenMP
- Computational results
- Conclusions

Black holes

- Supermassive black holes (SMBHs) are believed to exist in the nuclei of many, if not all, galaxies
- Some of these objects are surrounded by a disk of material continuously spiraling towards the deep gravitational potential pit of the SMBH, releasing huge quantities of energy and giving rise to the phenomena known as quasars (QSOs)

The accretion disk

Quasars (Quasi-Stellar Objects, QSOs)

- QSOs are currently believed to be the most luminous and distant objects in the universe
- QSOs are the cores of massive galaxies with supergiant black holes that devour stars at a rapid rate, enough to produce the amount of energy observed by a telescope

The method

- We are interested in objects of dimensions comparable to the Solar System, in galaxies very far away from the Milky Way
- Objects of this size cannot be directly imaged, so alternative observational methods are used to study their structure
- The method we use is the observation of QSO images affected by a microlensing event, to study the structure of the accretion disk

Gravitational microlensing

- Gravitational lensing (the attraction of light by matter) was predicted by General Relativity and observationally confirmed in 1919
- If light from a QSO passes through a galaxy located between the QSO and the observer, it is possible that a star in the intervening galaxy crosses the QSO light beam
- The gravitational field of the star amplifies the light emission coming from the accretion disk (gravitational microlensing)

Microlensing of a quasar by a star (diagram: quasar, macroimages, microimages, observer)

Microlensing

- The phenomenon is more complex because the magnification is not due to a single isolated microlens; rather, it is a collective effect of many stars
- As the stars are moving with respect to the QSO light beam, the amplification varies during the crossing

Microlensing

Double microlens

Multiple microlens

- Magnification pattern in the source plane, produced by a dense field of stars in the lensing galaxy
- The color reflects the magnification as a function of the quasar position: the sequence blue-green-red-yellow indicates increasing magnification

Q2237+0305 (1/2)

- So far the best example of a microlensed quasar is the quadruple quasar Q2237+0305

Q2237+0305 (2/2)

Fluctuations in Q2237+0305 light curves

- Light curves of the four images of Q2237+0305 over a period of almost ten years
- The changes in relative brightness are very obvious

Q2237+0305

- In Q2237+0305, thanks to the unusually small distance between the observer and the lensing galaxy, the microlensing events have a timescale of the order of months
- We observed Q2237+0305 from October 1999 for approximately 4 months at the Roque de los Muchachos Observatory

Fluctuations in light curves

- The curve representing the change in luminosity of the QSO with time depends on the position of the star and also on the structure of the accretion disk
- Microlens-induced fluctuations in the observed brightness of the quasar contain information about the light-emitting source (size of the continuum region or broad-line region of the quasar, its brightness profile, etc.)
- Hence, from a comparison between observed and simulated quasar microlensing we can draw conclusions about the accretion disk

Q2237+0305 light curves (2/2)

- Our goal is to model light curves of QSO images affected by a microlensing event, to study the unresolved structure of the accretion disk

Mathematical formulation (1/5)

- Leaving aside the physical meaning of the different variables, the function modeling the dependence of the observed flux on time t can be written as:

  [equation lost in transcription]

- One of its parameters is the ratio between the outer and inner radii of the accretion disk, for which a fixed value is adopted

Mathematical formulation (2/5)

- And G is the function:

  [equation lost in transcription]

- To speed up the computation, G has been approximated using MATHEMATICA

Mathematical formulation (3/5)

- Our goal is to estimate the values of the parameters of the observed flux (among them B, C, and t0) by fitting the model to the observational data

Mathematical formulation (4/5)

- Specifically, to find the values of the 5 parameters that minimize the error between the theoretical model and the observational data according to a chi-square criterion:

  chi^2 = sum_{i=1..N} [ (y_i - F(t_i))^2 / sigma_i^2 ]

  where:
  - N is the number of data points, corresponding to times t_i (i = 1, 2, ..., N)
  - F(t_i) is the theoretical function evaluated at time t_i
  - sigma_i is the observational error associated with each data value y_i

Mathematical formulation (5/5)

- To minimize chi-square we use the NAG routine e04ccf, which only requires evaluation of the function and not of its derivatives
- The determination of the minimum in the 5-parameter space depends on the initial conditions, so:
  - we consider a 5-dimensional grid of starting points, with m sampling intervals in each variable
  - for each one of the points of this grid we compute a local minimum
  - finally, we select the absolute minimum among them

The sequential code

      program seq_black_hole
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, length
      double precision t, fit(5)
      common/var/t, fit
c     Data input
c     Initialize best solution
      do k1=1,m
       do k2=1,m
        do k3=1,m
         do k4=1,m
          do k5=1,m
c           Initialize starting point x(1),...,x(5)
            call jic2(nfit, x, fx)
            call e04ccf(nfit, x, ftol, niw, w1, w2, w3, w4, w5, w6, jic2, ...)
            if (fx improves best fx) then
              update(best(x, fx))
            endif
          enddo
         enddo
        enddo
       enddo
      enddo
      end


Sequential times

- On a Sun Blade 100 workstation running Solaris 5.0, using the native Fortran 77 Sun compiler (v5.0) with full optimizations, this code takes:
  - 5.89 hours for sampling intervals of size m=4
  - 12.45 hours for sampling intervals of size m=5
- On an SGI Origin 3000, using the native MIPSpro F77 compiler (v7.4) with full optimizations:
  - 0.91 hours for m=4
  - 2.74 hours for m=5

Loop transformation

      program seq_black_hole2
      implicit none
      parameter(m = 4)
      ...
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, longitud
      double precision t, fit(5)
      integer k1, k2, k3, k4, k5
      common/var/t, fit
c     Data input
c     Initialize best solution
      do k = 1, m^5                      ! index transformation
c       Initialize starting point x(1),...,x(5)
        call jic2(nfit, x, fx)
        call e04ccf(nfit, x, ftol, niw, w1, w2, w3, w4, w5, w6, jic2, ...)
        if (fx improves best fx) then
          update(best(x, fx))
        endif
      enddo
      end

MPI & OpenMP (1/2)

- In the last years, OpenMP and MPI have been universally accepted as the standard tools to develop parallel applications
- OpenMP:
  - is a standard for shared memory programming
  - uses a fork-join model
  - is mainly based on compiler directives added to the code that indicate to the compiler regions of code to be executed in parallel
- MPI:
  - uses an SPMD model
  - processes can read and write only to their respective local memory
  - data are copied across local memories using subroutine calls
  - the MPI standard defines the set of functions and procedures available to the programmer

MPI & OpenMP (2/2)

- Each of these two alternatives has both advantages and disadvantages
- Very frequently it is not obvious which one should be selected for a specific code:
  - MPI programs run on both distributed and shared memory architectures, while OpenMP runs only on shared memory
  - The abstraction level is higher in OpenMP
  - MPI is particularly adaptable to coarse-grain parallelism; OpenMP is suitable for both coarse- and fine-grain parallelism
  - While it is easy to obtain a parallel version of a sequential code in OpenMP, MPI usually requires a higher level of expertise. A parallel code can be written in OpenMP just by inserting proper directives in a sequential code, preserving its semantics, while in MPI major changes are usually needed
  - Portability is higher in MPI

MPI-OpenMP hybridization (1/2)

- Hybrid codes match the architecture of SMP clusters
- SMP clusters are an increasingly popular platform
- MPI may suffer from efficiency problems in shared memory architectures
- MPI codes may need too much memory
- Some vendors have attempted to extend MPI in shared memory, but the result is not as efficient as OpenMP
- OpenMP is easy to use, but it is limited to the shared memory architecture

MPI-OpenMP hybridization (2/2)

- A hybrid code may provide better scalability
- Or simply enable a problem to exploit more processors
- It is not necessarily faster than pure MPI or pure OpenMP
- It depends on the code, the architecture, and how the programming models interact

MPI code

      program black_hole_mpi
      include 'mpif.h'
      implicit none
      parameter(m = 4)
      ...
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, longitud
      double precision t, fit(5)
      common/var/t, fit
c
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
c     Data input
c     Initialize best solution
      do k = myid, m^5 - 1, numprocs
c       Initialize starting point x(1),...,x(5)
        call jic2(nfit, x, fx)
        call e04ccf(nfit, x, ftol, niw, w1, w2, w3, w4, w5, w6, jic2, ...)
        if (fx improves best fx) then
          update(best(x, fx))
        endif
      enddo
      call MPI_FINALIZE(ierr)
      end


OpenMP code

      program black_hole_omp
      implicit none
      parameter(m = 4)
      ...
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, longitud
      double precision t, fit(5)
      common/var/t, fit
!$OMP THREADPRIVATE(/var/, /data/)
c     Data input
c     Initialize best solution
!$OMP PARALLEL DO DEFAULT(SHARED)
!$OMP& PRIVATE(tid, k, maxcal, ftol, ifail, w1, w2, w3, w4, w5, w6, x)
!$OMP& COPYIN(/data/) LASTPRIVATE(fx)
      do k = 0, m^5 - 1
c       Initialize starting point x(1),...,x(5)
        call jic2(nfit, x, fx)
        call e04ccf(nfit, x, ftol, niw, w1, w2, w3, w4, w5, w6, jic2, ...)
        if (fx improves best fx) then
          update(best(x, fx))
        endif
      enddo
!$OMP END PARALLEL DO
      end



Hybrid MPI-OpenMP code

      program black_hole_mpi_omp
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, length
      double precision t, fit(5)
      common/var/t, fit
!$OMP THREADPRIVATE(/var/, /data/)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, mpi_numprocs, ierr)
c     Data input
c     Initialize best solution
!$OMP PARALLEL DO DEFAULT(SHARED)
!$OMP& PRIVATE(tid, k, maxcal, ftol, ifail, w1, w2, w3, w4, w5, w6, x)
!$OMP& COPYIN(/data/) LASTPRIVATE(fx)
      do k = myid, m^5 - 1, mpi_numprocs
c       Initialize starting point x(1),...,x(5)
        call jic2(nfit, x, fx)
        call e04ccf(nfit, x, ftol, niw, w1, w2, w3, w4, w5, w6, jic2, ...)
        if (fx improves best fx) then
          update(best(x, fx))
        endif
      enddo
c     Reduce the OpenMP best solution
!$OMP END PARALLEL DO
c     Reduce the MPI best solution
      call MPI_FINALIZE(ierr)
      end


The SGI Origin 3000 architecture (1/2)

jen50.ciemat.es
- 160 MIPS R14000 processors at 600 MHz
- On 40 nodes with 4 processors each
- Data and instruction cache on-chip
- Irix operating system
- Hypercubic network

The SGI Origin 3000 architecture (2/2)

- cc-NUMA memory architecture
- 1 Gflops peak speed
- 8 MB external cache
- 1 GB main memory per processor
- 1 TB hard disk

Computational results (m=4)

- As we do not have exclusive-mode access to the architecture, the times correspond to the minimum time from 5 different executions
- The figures corresponding to the mixed-mode code (labelled MPI-OpenMP) correspond to the minimum times obtained for different combinations of MPI processes / OpenMP threads

Parallel execution time (m=4)

Speedup (m=4)

Results for mixed MPI-OpenMP (m=4) (chart legend: 32, 16, 8, 4, 2 processors)

Computational results (m=5)

Parallel execution time (m=5)

Speedup (m=5)

Results for mixed MPI-OpenMP (m=5) (chart legend: 32, 16, 8, 4, 2 processors)

Conclusions (1/3)

- The computational results obtained from all the parallel versions confirm the robustness of the method
- For the case of non-expert users and the kind of codes we have been dealing with, we believe that the MPI parallel versions are easier and safer
- In the case of OpenMP, the proper usage of data scope attributes for the variables involved may be a handicap for users without parallel programming expertise
- The currently higher portability of the MPI version is another factor to be considered

Conclusions (2/3)

- The mixed MPI/OpenMP parallel version is the most expertise-demanding
- Nevertheless, as has been stated by several authors, and even for the case of a hybrid architecture, this version does not offer any clear advantage, and it has the disadvantage that the combination of processes/threads has to be tuned


Conclusions (3/3) m This work concludes a first step of cooperation in applying HPC techniques to improve the performance of astrophysics codes m The scientific aim of applying HPC to computationally intensive codes in astrophysics has been successfully achieved


Conclusions and Future Work m The relevance of our results does not come directly from the particular application chosen, but from showing that parallel computing techniques are the key to tackling large-scale real problems in this scientific field m From now on we plan to continue this fruitful collaboration by applying parallel computing techniques to other challenging astrophysical problems


Supercomputing Centers m http://www.ciemat.es/ m http://www.cepba.upc.es/ m http://www.epcc.ed.ac.uk/


MPI links m Message Passing Interface Forum: http://www.mpi-forum.org/ m MPI: The Complete Reference: http://www.netlib.org/utk/papers/mpibook/mpi-book.html m Parallel Programming with MPI, by Peter Pacheco: http://www.cs.usfca.edu/mpi/


OpenMP links m http://www.openmp.org/ m http://www.compunity.org/


¡Gracias! Introducción a la computación de altas prestaciones Francisco Almeida y Francisco de Sande (falmeida, fsande)@ull.es Departamento de Estadística, I. O. y Computación Universidad de La Laguna, 12 de febrero de 2004