Parallel Programming in Matlab - Tutorial
Jeremy Kepner, Albert Reuther and Hahn Kim, MIT Lincoln Laboratory

This work is sponsored by the Defense Advanced Research Projects Agency under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.
Outline
• Introduction
– Tutorial Goals
– What is pMatlab
– When should it be used
• ZoomImage Quickstart (MPI)
• ZoomImage App Walkthrough (MPI)
• ZoomImage Quickstart (pMatlab)
• ZoomImage App Walkthrough (pMatlab)
• Beamformer Quickstart (pMatlab)
• Beamformer App Walkthrough (pMatlab)
Tutorial Goals
• Overall Goals
– Show how to use pMatlab Distributed MATrices (DMAT) to write parallel programs
– Present the simplest known process for going from serial Matlab to parallel Matlab that provides good speedup
• Section Goals
– Quickstart (for the really impatient): how to get up and running fast
– Application Walkthrough (for the somewhat impatient): effective programming using pMatlab constructs; four distinct phases of debugging a parallel program
– Advanced Topics (for the patient): parallel performance analysis; alternate programming styles; exploiting different types of parallelism
– Example Programs (for those really into this stuff): descriptions of other pMatlab examples
pMatlab Description
• Provides high-level parallel data structures and functions
• Parallel functionality can be added to existing serial programs with minor modifications
• Distributed matrices/vectors are created by using “maps” that describe data distribution
• “Automatic” parallel computation and data distribution is achieved via operator overloading (similar to MATLAB*P)
• “Pure” Matlab implementation
• Uses MatlabMPI to perform message passing
– Offers subset of MPI functions using standard Matlab file I/O
– Publicly available: http://www.ll.mit.edu/MatlabMPI
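As a minimal sketch of this programming model, built only from constructs that appear later in this tutorial (the sizes and the column-wise map are illustrative assumptions), a serial computation becomes parallel by adding a map to the array constructors:

    % Assumed sizes; in a real pMatlab program Ncpus comes from pMatlab_Init.
    n = 1024; m = 1024; Ncpus = 4;

    % Serial version.
    X = rand(n, m);
    Y = fft(X);

    % Parallel version: only the array constructors change.
    Xmap = map([1 Ncpus], {}, 0:Ncpus-1); % block-distribute columns over Ncpus processors
    Xp = rand(n, m, Xmap);                % overloaded rand returns a distributed matrix
    Yp = zeros(n, m, Xmap);               % overloaded zeros returns a distributed matrix
    Yp(:,:) = fft(Xp);                    % overloaded fft operates on the distributed array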
pMatlab Maps and Distributed Matrices
• Map Example

    mapA = map([1 2], ... % Specifies that cols be dist. over 2 procs
               {}, ...    % Specifies distribution: defaults to block
               [0:1]);    % Specifies processors for distribution
    mapB = map([1 2], {}, [2:3]);

    A = rand(m, n, mapA);   % Create random distributed matrix
    B = zeros(m, n, mapB);  % Create empty distributed matrix
    B(:,:) = A;             % Copy and redistribute data from A to B

• Grid and Resulting Distribution
[Figure: A is distributed by columns over processors 0 and 1; B is distributed by columns over processors 2 and 3]
MatlabMPI & pMatlab Software Layers
[Figure: layered architecture — the application (input, computation, analysis, output) sits on a library layer (vector/matrix, task, and conduit operations provided by pMatlab) over a kernel layer (messaging from MatlabMPI, math from Matlab), which interfaces to the parallel hardware]
• Can build a parallel library with a few messaging primitives
• MatlabMPI provides this messaging capability:

    MPI_Send(dest, tag, comm, X);
    X = MPI_Recv(source, tag, comm);

• Can build an application with a few parallel structures and functions
• pMatlab provides parallel arrays and functions:

    X = ones(n, mapX);
    Y = zeros(n, mapY);
    Y(:,:) = fft(X);
MatlabMPI: Point-to-point Communication
• Any messaging system can be implemented using file I/O
• File I/O provided by Matlab via load and save functions
– Takes care of the complicated buffer packing/unpacking problem
– Allows basic functions to be implemented in ~250 lines of Matlab code

    MPI_Send(dest, tag, comm, variable);
    variable = MPI_Recv(source, tag, comm);

[Figure: sender saves the variable to a data file on the shared file system, then creates a lock file; receiver detects the lock file, then loads the data file]
• Sender saves variable in Data file, then creates Lock file
• Receiver detects Lock file, then loads Data file
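A minimal sketch of this save/load protocol; the directory layout, file-name scheme, and function names below are illustrative assumptions, not MatlabMPI's actual internals:

    function my_send(comm_dir, my_rank, dest, tag, variable)
    % Sketch of a file-based send: write the data file first, then
    % create an empty lock file to signal that the data is complete.
      base = sprintf('%s/msg_%d_to_%d_tag_%d', comm_dir, my_rank, dest, tag);
      save([base '.buf.mat'], 'variable');   % write the data file
      fclose(fopen([base '.lock'], 'w'));    % create the lock file
    end

    function variable = my_recv(comm_dir, source, my_rank, tag)
    % Sketch of a file-based receive: poll for the lock file, then load.
      base = sprintf('%s/msg_%d_to_%d_tag_%d', comm_dir, source, my_rank, tag);
      while ~exist([base '.lock'], 'file')
        pause(0.01);                         % spin-wait until the sender signals
      end
      s = load([base '.buf.mat']);           % load the data file
      variable = s.variable;
    end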
When to use? (Performance 101)
• Why parallel? Only 2 good reasons:
– Run faster (currently program takes hours). Diagnostic: tic, toc
– Not enough memory (GBytes). Diagnostic: whos or top
• When to use
– Best case: entire program is trivially parallel (look for this)
– Worst case: no parallelism or lots of communication required (don’t bother)
– Not sure: find an expert and ask; this is the best time to get help!
• Measuring success
– Goal is linear speedup: Speedup = Time(1 CPU) / Time(N CPUs)
– (Will create a 1, 2, 4 CPU speedup curve using the example)
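A minimal sketch of the two diagnostics named above; the workload is an assumed stand-in:

    % Diagnose "too slow": time the dominant computation.
    tic;
    X = rand(2000);          % assumed stand-in workload
    Y = fft(X);
    T1 = toc                 % elapsed seconds on 1 CPU

    % Diagnose "not enough memory": inspect variable sizes.
    whos X Y                 % reports bytes used by each variable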
Parallel Speedup
• Ratio of the time on 1 CPU divided by the time on N CPUs
• Speedup typically plotted vs. number of processors
– Linear (ideal)
– Superlinear (achievable in some circumstances)
– Sublinear (acceptable in most circumstances)
– Saturated (usually due to communication)
• If no communication is required, then speedup scales linearly with N
• If communication is required, then the non-communicating part should scale linearly with N
[Figure: speedup vs. number of processors, showing linear, superlinear, sublinear, and saturated curves]
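The saturation behavior can be made concrete with the standard Amdahl's-law model (an added illustration, not from the slide): if a fraction f of the run cannot be parallelized (e.g. communication), speedup is bounded by 1/f.

    % Amdahl's-law model of the speedup curves (f is an assumed value).
    N = [1 2 4 8 16 32 64];            % processor counts
    f = 0.05;                          % non-parallelizable fraction
    speedup = 1 ./ (f + (1 - f)./N);   % saturates near 1/f = 20
    plot(N, speedup, 'o-', N, N, '--');
    xlabel('Number of Processors'); ylabel('Speedup');
    legend('Amdahl model, f = 0.05', 'Linear (ideal)');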
Speedup for Fixed and Scaled Problems
[Figure: two parallel-performance plots — speedup vs. number of processors for a fixed problem size, and gigaflops vs. number of processors for a scaled problem size]
• Achieved “classic” superlinear speedup on fixed problem
• Achieved speedup of ~300 on 304 processors on scaled problem
Outline
• Introduction
• ZoomImage Quickstart (MPI)
– Installation
– Running
– Timing
• ZoomImage App Walkthrough (MPI)
• ZoomImage Quickstart (pMatlab)
• ZoomImage App Walkthrough (pMatlab)
• Beamformer Quickstart (pMatlab)
• Beamformer App Walkthrough (pMatlab)
QuickStart - Installation [All users]
• Download pMatlab & MatlabMPI & pMatlab Tutorial
– http://www.ll.mit.edu/MatlabMPI
– Unpack tarball in home directory and add paths to ~/matlab/startup.m:

    addpath ~/pMatlab/MatlabMPI/src
    addpath ~/pMatlab/src

[Note: home directory must be visible to all processors]
• Validate installation and help
– Start MATLAB
– cd pMatlabTutorial
– Type “help pMatlab” and “help MatlabMPI”
QuickStart - Installation [LLGrid users]
• Copy tutorial
– Copy z:\tools\tutorials to z:\
• Validate installation and help
– Start MATLAB
– cd z:\tutorials\pMatlabTutorial
– Type “help pMatlab” and “help MatlabMPI”
QuickStart - Running
• Run mpiZoomImage
– Edit RUN.m and set:

    m_file = 'mpiZoomImage';
    Ncpus = 1;
    cpus = {};

– Type “RUN”; record processing_time
– Repeat with Ncpus = 2; record time
– Repeat with cpus = {'machine1' 'machine2'}; OR cpus = 'grid'; [LLGrid users]; record time
– Repeat with Ncpus = 4; record time
• [All users] Type “!type MatMPI\*.out” or “!more MatMPI/*.out”; examine processing_time
• Congratulations! You have just completed the 4-step process
QuickStart - Timing
• Enter your data into mpiZoomImage_times.m:

    T1  = 15.9;  % MPI_Run('mpiZoomImage', 1, {})
    T2a = 9.22;  % MPI_Run('mpiZoomImage', 2, cpus)
    T2b = 8.08;  % MPI_Run('mpiZoomImage', 2, cpus)
    T4  = 4.31;  % MPI_Run('mpiZoomImage', 4, cpus)

• Run mpiZoomImage_times
• Divide T(1 CPU) by T(2 CPUs) and T(4 CPUs):

    speedup = 1.0000 2.0297 3.8051

– Goal is linear speedup
Outline
• Introduction
• ZoomImage Quickstart (MPI)
• ZoomImage App Walkthrough (MPI)
– Description
– Setup
– Scatter Indices
– Zoom and Gather
– Display Results
• ZoomImage Quickstart (pMatlab)
• ZoomImage App Walkthrough (pMatlab)
• Beamformer Quickstart (pMatlab)
• Beamformer App Walkthrough (pMatlab)
Application Description
• Parallel image generation
0. Create reference image
1. Compute zoom factors
2. Zoom images
3. Display
• 2 core dimensions
– n_image, numFrames
– Choose to parallelize along frames (embarrassingly parallel)
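For reference, a serial sketch of this computation assembled from the tutorial functions used in the walkthrough below (referenceFrame, zoomFrames, from the tutorial package); the parameter values are assumed for illustration:

    % Serial ZoomImage sketch (assumed parameter values).
    n_image = 256; numFrames = 32;
    startScale = 0.1; endScale = 1.0; blurSigma = 2;         % assumed values

    scaleFactor = linspace(startScale, endScale, numFrames); % one zoom factor per frame
    refFrame = referenceFrame(n_image, 0.1, 0.8);            % create reference image
    zoomedFrames = zoomFrames(refFrame, scaleFactor, blurSigma); % zoom every frame

    % The parallel versions below split scaleFactor across processors,
    % since each frame is computed independently (embarrassingly parallel).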
Application Output
[Figure: sequence of zoomed image frames evolving over time]
Setup Code

    % Setup the MPI world.
    MPI_Init;                        % Initialize MPI.
    comm = MPI_COMM_WORLD;           % Create communicator.
    % Get size and rank.
    Ncpus = MPI_Comm_size(comm);
    my_rank = MPI_Comm_rank(comm);
    leader = 0;                      % Set who is the leader.
    % Create base message tags.
    input_tag = 20000;
    output_tag = 30000;
    disp(['my_rank: ', num2str(my_rank)]);  % Print rank.

Comments
• MPI_COMM_WORLD stores info necessary to communicate
• MPI_Comm_size() provides number of processors
• MPI_Comm_rank() is the ID of the current processor
• Tags are used to differentiate messages being sent between the same processors. Must be unique!
Things to try

    >> Ncpus
    Ncpus = 4

– Ncpus is the number of Matlab sessions that were launched

    >> my_rank
    my_rank = 0

– Interactive Matlab session is always rank 0
Scatter Index Code

    scaleFactor = linspace(startScale, endScale, numFrames); % Compute scale factors.
    frameIndex = 1:numFrames;              % Compute indices for each image.
    frameRank = mod(frameIndex, Ncpus);    % Deal out indices to each processor.

    if (my_rank == leader)                 % Leader does sends.
      for dest_rank = 0:Ncpus-1            % Loop over all processors.
        dest_data = find(frameRank == dest_rank);  % Find indices to send.
        if (dest_rank == leader)           % Copy or send.
          my_frameIndex = dest_data;
        else
          MPI_Send(dest_rank, input_tag, comm, dest_data);
        end
      end
    end
    if (my_rank ~= leader)                 % Everyone but leader receives the data.
      my_frameIndex = MPI_Recv(leader, input_tag, comm);  % Receive data.
    end

Comments
• if (my_rank ...) is used to differentiate processors
• Frames are distributed in a cyclic manner
• Leader distributes work to self via a simple copy
• MPI_Send and MPI_Recv send and receive the indices
Things to try

    >> my_frameIndex
    my_frameIndex = 4 8 12 16 20 24 28 32

    >> frameRank
    frameRank = 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0

– my_frameIndex is different on each processor
– frameRank is the same on each processor
Zoom Image and Gather Results

    % Create reference frame and zoom images.
    refFrame = referenceFrame(n_image, 0.1, 0.8);
    my_zoomedFrames = zoomFrames(refFrame, scaleFactor(my_frameIndex), blurSigma);

    if (my_rank ~= leader)   % Everyone but the leader sends the data back.
      MPI_Send(leader, output_tag, comm, my_zoomedFrames);  % Send images back.
    end
    if (my_rank == leader)   % Leader receives data.
      zoomedFrames = zeros(n_image, n_image, numFrames);    % Allocate array.
      for send_rank = 0:Ncpus-1                             % Loop over all processors.
        send_frameIndex = find(frameRank == send_rank);     % Find frames from this rank.
        if (send_rank == leader)                            % Copy or receive.
          zoomedFrames(:,:,send_frameIndex) = my_zoomedFrames;
        else
          zoomedFrames(:,:,send_frameIndex) = MPI_Recv(send_rank, output_tag, comm);
        end
      end
    end

Comments
• zoomFrames computed for different scale factors on each processor
• Everyone sends their images back to the leader
Things to try

    >> whos refFrame my_zoomedFrames zoomedFrames
    Name              Size        Bytes     Class
    my_zoomedFrames   256x256x8   4194304   double array
    refFrame          256x256     524288    double array
    zoomedFrames      256x256x32  16777216  double array

– my_zoomedFrames holds only the frames assigned to this processor (numFrames/Ncpus of them)
– zoomedFrames exists only on the leader and holds all the frames
Finalize and Display Results

    % Shut down everyone but leader.
    MPI_Finalize;
    if (my_rank ~= leader)
      exit;
    end

    % Display simulated frames.
    figure(1); clf;
    set(gcf, 'Name', 'Simulated Frames', 'DoubleBuffer', 'on', 'NumberTitle', 'off');
    for frameIndex = 1:numFrames
      imagesc(squeeze(zoomedFrames(:,:,frameIndex)));
      drawnow;
    end

Comments
• MPI_Finalize exits everyone but the leader
• Can now do operations that make sense only on the leader
– Display output
Outline
• Introduction
• ZoomImage Quickstart (MPI)
• ZoomImage App Walkthrough (MPI)
• ZoomImage Quickstart (pMatlab)
– Running
– Timing
• ZoomImage App Walkthrough (pMatlab)
• Beamformer Quickstart (pMatlab)
• Beamformer App Walkthrough (pMatlab)
QuickStart - Running
• Run pZoomImage
– Edit pZoomImage.m and set “PARALLEL = 0;”
– Edit RUN.m and set:

    m_file = 'pZoomImage';
    Ncpus = 1;
    cpus = {};

– Type “RUN”; record processing_time
– Repeat with PARALLEL = 1; record time
– Repeat with Ncpus = 2; record time
– Repeat with cpus = {'machine1' 'machine2'}; OR cpus = 'grid'; [LLGrid users]; record time
– Repeat with Ncpus = 4; record time
• [All users] Type “!type MatMPI\*.out” or “!more MatMPI/*.out”; examine processing_time
• Congratulations! You have just completed the 4-step process
QuickStart - Timing
• Enter your data into pZoomImage_times.m:

    T1a = 16.4;  % PARALLEL = 0, MPI_Run('pZoomImage', 1, {})
    T1b = 15.9;  % PARALLEL = 1, MPI_Run('pZoomImage', 1, {})
    T2a = 9.22;  % PARALLEL = 1, MPI_Run('pZoomImage', 2, cpus)
    T2b = 8.08;  % PARALLEL = 1, MPI_Run('pZoomImage', 2, cpus)
    T4  = 4.31;  % PARALLEL = 1, MPI_Run('pZoomImage', 4, cpus)

• Run pZoomImage_times
• 1st comparison, PARALLEL=0 vs PARALLEL=1: T1a/T1b = 1.03
– Overhead of using pMatlab; keep this small (a few %) or we have already lost
• Divide T(1 CPU) by T(2 CPUs) and T(4 CPUs):

    speedup = 1.0000 2.0297 3.8051

– Goal is linear speedup
Outline
• Introduction
• ZoomImage Quickstart (MPI)
• ZoomImage App Walkthrough (MPI)
• ZoomImage Quickstart (pMatlab)
• ZoomImage App Walkthrough (pMatlab)
– Description
– Setup
– Scatter Indices
– Zoom and Gather
– Display Results
– Debugging
• Beamformer Quickstart (pMatlab)
• Beamformer App Walkthrough (pMatlab)
Setup Code

    PARALLEL = 1;                % Turn pMatlab on or off. Can be 1 or 0.
    pMatlab_Init;                % Initialize pMatlab.
    Ncpus = pMATLAB.comm_size;   % Get number of cpus.
    my_rank = pMATLAB.my_rank;   % Get my rank.

    Zmap = 1;                    % Initialize maps to 1 (i.e. no map).
    if (PARALLEL)
      % Create map that breaks up array along 3rd dimension.
      Zmap = map([1 1 Ncpus], {}, 0:Ncpus-1);
    end

Comments
• PARALLEL=1 flag allows library to be turned on and off
• Setting Zmap=1 will create regular Matlab arrays
• Anatomy of Zmap = map([1 1 Ncpus], {}, 0:Ncpus-1):
– [1 1 Ncpus] is the processor grid (chops the 3rd dimension into Ncpus block-distributed pieces)
– {} uses the default (block) distribution
– 0:Ncpus-1 is the processor list (begins at 0!)
Things to try

    >> Ncpus
    Ncpus = 4

– Ncpus is the number of Matlab sessions that were launched

    >> my_rank
    my_rank = 0

– Interactive Matlab session is always my_rank = 0

    >> Zmap
    Map object
      Dimension: 3
      Grid: (:,:,1) = 0  (:,:,2) = 1  (:,:,3) = 2  (:,:,4) = 3
      Overlap:
      Distribution: Dim1: b  Dim2: b  Dim3: b

– Map object contains number of dimensions, grid of processors, and distribution in each dimension (b=block, c=cyclic, bc=block-cyclic)
Scatter Index Code

    % Allocate distributed array to hold images.
    zoomedFrames = zeros(n_image, n_image, numFrames, Zmap);

    % Compute which frames are local along 3rd dimension.
    my_frameIndex = global_ind(zoomedFrames, 3);

Comments
• zeros() is overloaded and returns a DMAT
– Matlab knows to call a pMatlab function
– Most functions aren’t overloaded
• global_ind() returns those indices that are local to the processor
– Use these indices to select which indices to process locally
Things to try

    >> whos zoomedFrames
    Name          Size        Bytes    Class
    zoomedFrames  256x256x32  4200104  dmat object
    Grand total is 524416 elements using 4200104 bytes

    >> z0 = local(zoomedFrames);
    >> whos z0
    Name  Size       Bytes    Class
    z0    256x256x8  4194304  double array
    Grand total is 524288 elements using 4194304 bytes

    >> my_frameIndex
    my_frameIndex = 1 2 3 4 5 6 7 8

– zoomedFrames is a dmat object
– Size of the local part of zoomedFrames is the 3rd dimension divided by Ncpus
– Local part of zoomedFrames is a regular double array
– my_frameIndex is a block of indices
Zoom Image and Gather Results

    % Compute scale factors.
    scaleFactor = linspace(startScale, endScale, numFrames);

    % Create reference frame and zoom images.
    refFrame = referenceFrame(n_image, 0.1, 0.8);
    my_zoomedFrames = zoomFrames(refFrame, scaleFactor(my_frameIndex), blurSigma);

    % Copy back into global array.
    zoomedFrames = put_local(zoomedFrames, my_zoomedFrames);

    % Aggregate on leader.
    aggFrames = agg(zoomedFrames);

Comments
• zoomFrames computed for different scale factors on each processor
• Everyone sends their images back to the leader
• agg() collects a DMAT onto the leader (rank 0)
– Returns a regular Matlab array
– Remember: only exists on the leader
Finalize and Display Results

    % Exit on all but the leader.
    pMatlab_Finalize;

    % Display simulated frames.
    figure(1); clf;
    set(gcf, 'Name', 'Simulated Frames', 'DoubleBuffer', 'on', 'NumberTitle', 'off');
    for frameIndex = 1:numFrames
      imagesc(squeeze(aggFrames(:,:,frameIndex)));
      drawnow;
    end

Comments
• pMatlab_Finalize exits everyone but the leader
• Can now do operations that make sense only on the leader
– Display output
Outline
• Introduction
• ZoomImage Quickstart (MPI)
• ZoomImage App Walkthrough (MPI)
• ZoomImage Quickstart (pMatlab)
• ZoomImage App Walkthrough (pMatlab)
• Beamformer Quickstart (pMatlab)
– Running
– Timing
• Beamformer App Walkthrough (pMatlab)
QuickStart - Running
• Run pBeamformer
– Edit pBeamformer.m and set “PARALLEL = 0;”
– Edit RUN.m and set:

    m_file = 'pBeamformer';
    Ncpus = 1;
    cpus = {};

– Type “RUN”; record processing_time
– Repeat with PARALLEL = 1; record time
– Repeat with Ncpus = 2; record time
– Repeat with cpus = {'machine1' 'machine2'}; OR cpus = 'grid'; [LLGrid users]; record time
– Repeat with Ncpus = 4; record time
• [All users] Type “!type MatMPI\*.out” or “!more MatMPI/*.out”; examine processing_time
• Congratulations! You have just completed the 4-step process
QuickStart - Timing
• Enter your data into pBeamformer_times.m:

    T1a = 16.4;  % PARALLEL = 0, MPI_Run('pBeamformer', 1, {})
    T1b = 15.9;  % PARALLEL = 1, MPI_Run('pBeamformer', 1, {})
    T2a = 9.22;  % PARALLEL = 1, MPI_Run('pBeamformer', 2, cpus)
    T2b = 8.08;  % PARALLEL = 1, MPI_Run('pBeamformer', 2, cpus)
    T4  = 4.31;  % PARALLEL = 1, MPI_Run('pBeamformer', 4, cpus)

• 1st comparison, PARALLEL=0 vs PARALLEL=1: T1a/T1b = 1.03
– Overhead of using pMatlab; keep this small (a few %) or we have already lost
• Divide T(1 CPU) by T(2 CPUs) and T(4 CPUs):

    speedup = 1.0000 2.0297 3.8051

– Goal is linear speedup
Outline
• Introduction
• ZoomImage Quickstart (MPI)
• ZoomImage App Walkthrough (MPI)
• ZoomImage Quickstart (pMatlab)
• ZoomImage App Walkthrough (pMatlab)
• Beamformer Quickstart (pMatlab)
• Beamformer App Walkthrough (pMatlab)
– Goals and Description
– Setup
– Allocate DMATs
– Create steering vectors
– Create targets
– Create sensor input
– Form Beams
– Sum Frequencies
– Display results
– Debugging
Application Description
• Parallel beamformer for a uniform linear array
0. Create targets
1. Create synthetic sensor returns
2. Form beams and save results
3. Display time/beam plot
[Figure: linear array receiving signals from Source 1 and Source 2]
• 4 core dimensions
– Nsensors, Nsnapshots, Nfrequencies, Nbeams
– Choose to parallelize along frequency (embarrassingly parallel)
Application Output
[Figure: four panels — input targets, synthetic sensor response, beamformed output, summed output]
Setup Code

    % pMATLAB SETUP ----------
    tic;                         % Start timer.
    PARALLEL = 1;                % Turn pMatlab on or off. Can be 1 or 0.
    pMatlab_Init;                % Initialize pMatlab.
    Ncpus = pMATLAB.comm_size;   % Get number of cpus.
    my_rank = pMATLAB.my_rank;   % Get my rank.

    Xmap = 1;                    % Initialize maps to 1 (i.e. no map).
    if (PARALLEL)
      % Create map that breaks up array along 2nd dimension.
      Xmap = map([1 Ncpus 1], {}, 0:Ncpus-1);
    end

Comments
• PARALLEL=1 flag allows library to be turned on and off
• Setting Xmap=1 will create regular Matlab arrays
• Anatomy of Xmap = map([1 Ncpus 1], {}, 0:Ncpus-1):
– [1 Ncpus 1] is the processor grid (chops the 2nd dimension into Ncpus block-distributed pieces)
– {} uses the default (block) distribution
– 0:Ncpus-1 is the processor list (begins at 0!)
Things to try

    >> Ncpus
    Ncpus = 4

– Ncpus is the number of Matlab sessions that were launched

    >> my_rank
    my_rank = 0

– Interactive Matlab session is always rank 0

    >> Xmap
    Map object
      Dimension: 3
      Grid: 0 1 2 3
      Overlap:
      Distribution: Dim1: b  Dim2: b  Dim3: b

– Map object contains number of dimensions, grid of processors, and distribution in each dimension (b=block, c=cyclic, bc=block-cyclic)
Allocate Distributed Arrays (DMATs)

    % ALLOCATE PARALLEL DATA STRUCTURES ----------
    % Set array dimensions (always test on small problems first).
    Nsensors = 90; Nfreqs = 50; Nsnapshots = 100; Nbeams = 80;

    % Initial array of sources.
    X0 = zeros(Nsnapshots, Nfreqs, Nbeams, Xmap);
    % Synthetic sensor input data.
    X1 = complex(zeros(Nsnapshots, Nfreqs, Nsensors, Xmap));
    % Beamformed output data.
    X2 = zeros(Nsnapshots, Nfreqs, Nbeams, Xmap);
    % Intermediate summed image.
    X3 = zeros(Nsnapshots, Ncpus, Nbeams, Xmap);

Comments
• Write parameterized code, and test on small problems first
• Can reuse Xmap on all arrays because
– All arrays are 3D
– Want to break along the 2nd dimension
• zeros() and complex() are overloaded and return DMATs
– Matlab knows to call a pMatlab function
– Most functions aren’t overloaded
Things to try

    >> whos X0 X1 X2 X3
    Name  Size        Bytes    Class
    X0    100x200x80  3206136  dmat object
    X1    100x200x90  7206136  dmat object
    X2    100x200x80  3206136  dmat object
    X3    100x4x80    69744    dmat object

    >> x0 = local(X0);
    >> whos x0
    Name  Size       Bytes    Class
    x0    100x50x80  3200000  double array

    >> x1 = local(X1);
    >> whos x1
    Name  Size       Bytes    Class
    x1    100x50x90  7200000  double array (complex)

– Size of X3 is Ncpus in the 2nd dimension
– Size of the local part of X0 is the 2nd dimension divided by Ncpus
– Local part of X1 is a regular complex matrix
Create Steering Vectors

    % CREATE STEERING VECTORS ----------
    % Pick an arbitrary set of frequencies.
    freq0 = 10;
    frequencies = freq0 + (0:Nfreqs-1);

    % Get frequencies local to this processor.
    [myI_snapshot myI_freq myI_sensor] = global_ind(X1);
    myFreqs = frequencies(myI_freq);

    % Create local steering vectors by passing local frequencies.
    myV = squeeze(pBeamformer_vectors(Nsensors, Nbeams, myFreqs));

Comments
• global_ind() returns those indices that are local to the processor
– Use these indices to select which values to use from a larger table
• User function written to return an array based on the size of the input
– Result is consistent with the local part of the DMATs
– Be careful of the squeeze function; it can eliminate needed dimensions
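The squeeze caution above can be illustrated with plain Matlab (an assumed example, not from the tutorial):

    % squeeze removes ALL singleton dimensions, which can bite when a
    % dimension you still need happens to have size 1.
    A = ones(1, 1, 5);
    size(squeeze(A))        % returns [5 1]: both leading dims are gone

    B = ones(4, 1, 5);
    size(squeeze(B))        % returns [4 5]: the middle dim vanished, as intended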
Things to try

    >> whos myI_snapshot myI_freq myI_sensor
    Name          Size   Bytes  Class
    myI_freq      1x50   400    double array
    myI_sensor    1x90   720    double array
    myI_snapshot  1x100  800    double array

    >> myI_freq
    myI_freq =
      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
     26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

    >> whos myV
    Name  Size      Bytes    Class
    myV   90x80x50  5760000  double array (complex)

– Sizes of the global indices match the dimensions of the local part
– Global indices show those indices of the DMAT that are local
– User function returns arrays consistent with the local part of the DMAT
Create Targets

    % STEP 0: Insert targets ----------
    % Get local data.
    X0_local = local(X0);
    % Insert two targets at different angles.
    X0_local(:, :, round(0.25*Nbeams)) = 1;
    X0_local(:, :, round(0.5*Nbeams)) = 1;

Comments
• local() returns the piece of the DMAT stored locally
• Always try to work on the local part of the data
– Regular Matlab arrays; all Matlab functions work
– Performance guaranteed to be the same as Matlab
– Impossible to do accidental communication
• If you can’t work locally, you can do some things directly on the DMAT, e.g.

    X0(i, j, k) = 1;
Create Sensor Input

    % STEP 1: CREATE SYNTHETIC DATA ----------
    % Get the local arrays.
    X1_local = local(X1);
    % Loop over snapshots, then the local frequencies.
    for i_snapshot = 1:Nsnapshots
      for i_freq = 1:length(myI_freq)
        % Convert from beams to sensors.
        X1_local(i_snapshot, i_freq, :) = ...
          squeeze(myV(:, :, i_freq)) * squeeze(X0_local(i_snapshot, i_freq, :));
      end
    end
    % Put local array back.
    X1 = put_local(X1, X1_local);
    % Add some noise.
    X1 = X1 + complex(rand(Nsnapshots, Nfreqs, Nsensors, Xmap), ...
                      rand(Nsnapshots, Nfreqs, Nsensors, Xmap));

Comments
• Looping only done over the length of the global indices that are local
• put_local() replaces the local part of a DMAT with its argument (no checking!)
• plus(), complex(), and rand() are all overloaded to work with DMATs
– rand may produce values in a different order
Beamform and Save Data

    % STEP 2: BEAMFORM AND SAVE DATA ----------
    X1_local = local(X1);  % Get the local arrays.
    X2_local = local(X2);
    % Loop over snapshots, then the local frequencies.
    for i_snapshot = 1:Nsnapshots
      for i_freq = 1:length(myI_freq)
        % Convert from sensors to beams.
        X2_local(i_snapshot, i_freq, :) = abs(squeeze(myV(:, :, i_freq))' * ...
          squeeze(X1_local(i_snapshot, i_freq, :))).^2;
      end
    end
    processing_time = toc

    % Save data (1 file per freq).
    for i_freq = 1:length(myI_freq)
      X_i_freq = squeeze(X2_local(:, i_freq, :));  % Get the beamformed data.
      i_global_freq = myI_freq(i_freq);            % Get the global index of this frequency.
      filename = ['dat/pBeamformer_freq.' num2str(i_global_freq) '.mat'];
      save(filename, 'X_i_freq');                  % Save to a file.
    end

Comments
• Similar to the previous step
• Save files based on physical dimensions (not my_rank)
– Independent of how many processors are used
Sum Frequencies

    % STEP 3: SUM ACROSS FREQUENCY ----------
    % Sum local part across frequency.
    X2_local_sum = sum(X2_local, 2);
    % Put into global array.
    X3 = put_local(X3, X2_local_sum);
    % Aggregate X3 back to the leader for display.
    x3 = agg(X3);

Comments
• sum() is not supported on DMATs, so the sum is done in steps:
– Sum the local part
– Put it into a global array
• agg() collects a DMAT onto the leader (rank 0)
– Returns a regular Matlab array
– Remember: only exists on the leader
Finalize and Display Results

    % STEP 4: Finalize and display ----------
    disp('SUCCESS');      % Print success.
    % Exit on all but the leader.
    pMatlab_Finalize;
    % Complete local sum.
    x3_sum = squeeze(sum(x3, 2));
    % Display results.
    imagesc(abs(squeeze(X0_local(:, 1, :)))); pause(1.0);
    imagesc(abs(squeeze(X1_local(:, 1, :)))); pause(1.0);
    imagesc(abs(squeeze(X2_local(:, 1, :)))); pause(1.0);
    imagesc(x3_sum);

Comments
• pMatlab_Finalize exits everyone but the leader
• Can now do operations that make sense only on the leader
– Final sum of the aggregated array
– Display output
Application Debugging
• Simple four-step process for debugging a parallel program:

    Step 1: Serial Matlab  → (add DMATs) → Serial pMatlab:    functional correctness
    Step 2: Serial pMatlab → (add maps)  → Mapped pMatlab:    pMatlab correctness
    Step 3: Mapped pMatlab → (add ranks) → Parallel pMatlab:  parallel correctness
    Step 4: Parallel pMatlab → (add CPUs) → Optimized pMatlab: performance

• Step 1: Add distributed matrices without maps, verify functional correctness

    PARALLEL=0; eval( MPI_Run('pZoomImage', 1, {}) );

• Step 2: Add maps, run on 1 CPU, verify pMatlab correctness, compare performance with Step 1

    PARALLEL=1; eval( MPI_Run('pZoomImage', 1, {}) );

• Step 3: Run with more processes (ranks), verify parallel correctness

    PARALLEL=1; eval( MPI_Run('pZoomImage', 2, {}) );

• Step 4: Run with more CPUs, compare performance with Step 2

    PARALLEL=1; eval( MPI_Run('pZoomImage', 4, cpus) );

• Always debug at the lowest numbered step possible
Different Access Styles
• Implicit global access

    Y(:,:) = X;
    Y(i,j) = X(k,l);

– Most elegant; performance issues; accidental communication
• Explicit local access

    x = local(X);
    x(i,j) = 1;
    X = put_local(X, x);

– A little clumsy; guaranteed performance; controlled communication
• Implicit local access

    [I J] = global_ind(X);
    for i = 1:length(I)
      for j = 1:length(J)
        X_ij = X(I(i), J(j));
      end
    end
Summary
• Tutorial has introduced
– Using MatlabMPI
– Using pMatlab Distributed MATrices (DMAT)
– Four-step process for writing a parallel Matlab program:

    Step 1: Serial Matlab  → (add DMATs) → Serial pMatlab:    functional correctness
    Step 2: Serial pMatlab → (add maps)  → Mapped pMatlab:    pMatlab correctness
    Step 3: Mapped pMatlab → (add ranks) → Parallel pMatlab:  parallel correctness
    Step 4: Parallel pMatlab → (add CPUs) → Optimized pMatlab: performance
    (Steps 1–3: get it right; Step 4: make it fast)

• Provided hands-on experience with
– Running MatlabMPI and pMatlab
– Using distributed matrices
– Using the four-step process
– Measuring and evaluating performance
Advanced Examples
Clutter Simulation Example (see pMatlab/examples/ClutterSim.m)
[Figure: speedup vs. number of processors for a fixed problem size on a Linux cluster]

    PARALLEL = 1;          % Initialize.
    mapX = 1; mapY = 1;
    % Map X to first half of the processors and Y to the second half.
    if (PARALLEL)
      pMatlab_Init;
      Ncpus = comm_vars.comm_size;
      mapX = map([1 Ncpus/2], {}, [1:Ncpus/2]);
      mapY = map([Ncpus/2 1], {}, [Ncpus/2+1:Ncpus]);
    end
    % Create arrays.
    X = complex(rand(N, M, mapX), rand(N, M, mapX));
    Y = complex(zeros(N, M, mapY));
    % Initialize coefficients.
    coefs = ...
    weights = ...
    % Parallel filter + corner turn.
    Y(:,:) = conv2(coefs, X);
    % Parallel matrix multiply.
    Y(:,:) = weights*Y;
    % Finalize pMATLAB and exit.
    if (PARALLEL)
      pMatlab_Finalize;
    end

• Achieved “classic” superlinear speedup on fixed problem
• Serial and parallel code “identical”
Eight Stage Simulator Pipeline (see pMatlab/examples/GeneratorProcessor.m)
• Parallel Data Generator: initialize, inject targets, convolve with pulse, channel response
• Parallel Signal Processor: beamform, pulse compress, detect targets
• Goal: create simulated data and use it to test signal processing; parallelize all stages; requires 3 “corner turns”
• pMatlab allows serial and parallel code to be nearly identical
• Easy to change parallel mapping; set map=1 to get serial code
• Matlab map code example (processor distribution: 0,1 / 2,3 / 4,5 / 6,7 / all):

    map3 = map([2 1], {}, 0:1);
    map2 = map([1 2], {}, 2:3);
    map1 = map([2 1], {}, 4:5);
    map0 = map([1 2], {}, 6:7);
pMatlab Code (see pMatlab/examples/GeneratorProcessor.m)

    pMATLAB_Init; SetParameters; SetMaps;  % Initialize.
    Xrand = 0.01*squeeze(complex(rand(Ns, Nb, map0), rand(Ns, Nb, map0)));
    X0 = squeeze(complex(zeros(Ns, Nb, map0)));
    X1 = squeeze(complex(zeros(Ns, Nb, map1)));
    X2 = squeeze(complex(zeros(Ns, Nc, map2)));
    X3 = squeeze(complex(zeros(Ns, Nc, map3)));
    X4 = squeeze(complex(zeros(Ns, Nb, map3)));
    ...
    for i_time = 1:NUM_TIME                         % Loop over time steps.
      X0(:,:) = Xrand;                              % Initialize data.
      for i_target = 1:NUM_TARGETS                  % Insert targets.
        [i_s i_c] = targets(i_time, i_target, :);
        X0(i_s, i_c) = 1;
      end
      X1(:,:) = conv2(X0, pulse_shape, 'same');     % Convolve and corner turn.
      X2(:,:) = X1*steering_vectors;                % Channelize and corner turn.
      X3(:,:) = conv2(X2, kernel, 'same');          % Pulse compress and corner turn.
      X4(:,:) = X3*steering_vectors';               % Beamform.
      [i_range, i_beam] = find(abs(X4) > DET);      % Detect targets.
    end
    pMATLAB_Finalize;                               % Finalize.
Parallel Image Processing (see pMatlab/examples/pBlurimage.m)

    mapX = map([Ncpus/2 2], {}, [0:Ncpus-1], [N_k M_k]);  % Create map with overlap.
    X = zeros(N, M, mapX);        % Create starting images.
    [myI myJ] = global_ind(X);    % Get local indices.

    % Assign values to image.
    X = put_local(X, ...
        (myI.' * ones(1, length(myJ))) + (ones(1, length(myI)).' * myJ));
    X_local = local(X);           % Get local data.

    % Perform convolution.
    X_local(1:end-N_k+1, 1:end-M_k+1) = conv2(X_local, kernel, 'valid');

    X = put_local(X, X_local);    % Put local back in global.
    X = synch(X);                 % Copy overlap.