Tuning DiFX 2 for performance, Adam Deller

Tuning DiFX 2 for performance
Adam Deller, ASTRON
6th DiFX workshop, CSIRO ATNF, Sydney AUS

Outline
• I/O bottlenecks and solutions
  • Communication with the real world (reading raw data, writing visibilities)
  • Interprocess communication
• Keeping out of memory trouble
• Minimizing CPU load in various corners of parameter space
For more information and pictures: http://cira.ivec.org/dokuwiki/doku.php/difx/mpifxcorr/

Getting data into DiFX
[Diagram: DiFX data flow. DataStream 1…N each read source data into a baseband data buffer (a large, segmented ring buffer); Core 1…M each hold a processing buffer; the Master Node holds the visibility buffer. Labelled links: "Timerange, destination" and "Visibilities".]

Getting data into DiFX
• How to test? neutered_difx, with a small number of channels
• Fundamental limit: native transfer speed (disk read, network pipe)
  • If this is the problem, buy a RAID or get infiniband or …
• Potential troublemaker: CPU utilisation on datastream node (competition)
  • Can come from tsys estimation
• Tweaking: datastream databuffer

Datastream databuffer / “Subint”
Key parameters:
• dataBufferFactor
• nDataSegments
• subintNS
Only real potential problem I/O-wise: buffer too short (dataBufferFactor)

Getting visibilities out of DiFX
[Diagram: same DiFX data flow as on the "Getting data into DiFX" slide, now with the Master Node visibility buffer going to disk.]

Getting visibilities out of DiFX
• FxManager writes the visibilities to disk
• This is very rarely a problem unless you have a dying disk or very large and/or frequent visibility dumps
• Testing: neutered_difx + fake data source (ensures good input speeds)
• Tweaking: none
  • If you want to write out visibilities faster, put a fast disk (probably RAID) on the manager node!

Interprocess @ the Datastream
[Diagram: same DiFX data flow (DataStreams, Cores, Master Node) as on the "Getting data into DiFX" slide.]

Interprocess @ the Datastream
• Generally not a problem
• Tweaking: dataBufferFactor, ensure reasonable size (avoids latency issues)
• Default (32) generally ok but could usually be bigger w/o problems (increase nSegments also)

Interprocess @ the Core
[Diagram: same DiFX data flow (DataStreams, Cores, Master Node) as on the "Getting data into DiFX" slide.]
Tweaking:
• subintNS
• Output visibility size (nChan / nBaselines)

Interprocess @ the Core
• In terms of reducing data transmission, increasing subintNS is the only real knob to turn
  • Unimportant for continuum, single phase centre: it’s only very high spectral resolution and/or multiphase centre where this is relevant
  • In those cases, bigger is better; but be careful about memory (later)

Interprocess @ the FxManager
[Diagram: same DiFX data flow (DataStreams, Cores, Master Node) as on the "Getting data into DiFX" slide.]
The most common trouble point! Must aggregate data from all Core nodes, can lead to high data rates

Interprocess @ the FxManager
• To calculate the rate into FxManager, work out the rate for one Core node and scale
• Tweaking: maximise subintNS!
  • Or (although this is usually not possible) reduce visibility size (via nChan or the number of phase centers)
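
The "work out the rate for one Core node and scale" recipe can be sketched roughly as below. This is only an illustrative estimate, not code from DiFX: the function names, the 8 bytes per complex visibility value, and the idea that each Core returns one visibility set per subint it processes are all assumptions, and autocorrelations and headers are ignored.

```python
def visibility_bytes(n_baselines, n_chan, n_products, n_phase_centres=1,
                     bytes_per_value=8):
    """Approximate size in bytes of the visibilities for one subint.

    n_products is the number of band/polarisation products per baseline.
    bytes_per_value=8 assumes single-precision complex values (an assumption).
    """
    return n_baselines * n_chan * n_products * n_phase_centres * bytes_per_value


def rate_into_manager(vis_bytes, subint_ns, n_cores, speedup=1.0):
    """Return (per-core rate, total rate) in bytes per wall-clock second.

    speedup is the correlation speed relative to real time (1.0 = real time).
    """
    subint_s = subint_ns * 1e-9
    # With n_cores sharing the work, each Core finishes one subint every
    # n_cores * subint_s / speedup seconds of wall-clock time.
    wall_seconds_per_result = n_cores * subint_s / speedup
    per_core = vis_bytes / wall_seconds_per_result
    # Scale the single-Core rate up to all Cores.
    return per_core, per_core * n_cores


if __name__ == "__main__":
    size = visibility_bytes(n_baselines=45, n_chan=1024, n_products=32)
    per_core, total = rate_into_manager(size, subint_ns=100_000_000, n_cores=10)
    print(f"{size} bytes per result, {total / 1e6:.1f} MB/s into FxManager")
```

In this model the total rate does not depend on the number of Cores, only on the visibility size, subintNS and the speedup, which is why maximising subintNS is the main knob.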

Memory @ the Datastream
• Just don’t make the combination of dataBufferFactor and subintNS too big (can also control via “sendSize”)
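
A rough way to see how dataBufferFactor and subintNS combine is sketched below. It assumes the Datastream buffer holds dataBufferFactor subints of baseband data, split evenly into nDataSegments segments; that sizing rule and the function name are assumptions for illustration, not taken from the DiFX source.

```python
def datastream_buffer(data_rate_bps, subint_ns, data_buffer_factor,
                      n_data_segments):
    """Estimate Datastream buffer size and duration.

    Assumes (hypothetically) buffer = data_buffer_factor subints of baseband.
    data_rate_bps is the recorded data rate in bits per second.
    """
    # Integer arithmetic: bits/s * ns / (8 bits per byte * 1e9 ns per s).
    bytes_per_subint = data_rate_bps * subint_ns // (8 * 10**9)
    total_bytes = data_buffer_factor * bytes_per_subint
    return {
        "bytes_per_subint": bytes_per_subint,
        "total_bytes": total_bytes,
        "segment_bytes": total_bytes // n_data_segments,
        "buffer_seconds": data_buffer_factor * subint_ns / 10**9,
    }


if __name__ == "__main__":
    est = datastream_buffer(data_rate_bps=1024 * 10**6, subint_ns=20_000_000,
                            data_buffer_factor=32, n_data_segments=8)
    print(est)
```

Under this assumption the same product controls both sides of the trade-off: too small and the buffer is too short for I/O, too large and the Datastream node runs out of memory.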

Memory @ the Core
• Usually the biggest problem, memorywise
• Never used to be a problem, but multi-field centre jobs hit hard
• Bigger subint means more memory (storing datastream baseband)
• More threads means more memory, at the pre-average spectral resolution
• Buffering more FFTs costs more (x the number of threads, too!)

Memory @ the Core
• Tweaking:
  • subintNS
  • nThreads (threads file)
  • numBufferedFFTs
• And be aware of:
  • nFFTChan (for multiphase centre/high spectral resolution)
  • Number of phase centres
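
To get a feel for how these knobs scale Core memory, here is a deliberately crude model. It is not DiFX's actual allocation logic: the three terms, the 8 bytes per complex value, and all names are assumptions chosen only to mirror the scaling described above (baseband per subint, per-thread accumulation at nFFTChan resolution, and buffered FFTs per thread). The per-phase-centre main buffer is left out.

```python
def core_memory_terms(n_datastreams, baseband_bytes_per_subint, n_threads,
                      n_baselines, n_products, n_station_bands, n_fft_chan,
                      num_buffered_ffts, bytes_per_value=8):
    """Return a dict of rough per-term memory estimates in bytes (assumed model)."""
    # Baseband for one subint from every Datastream.
    baseband = n_datastreams * baseband_bytes_per_subint
    # One accumulation buffer per thread, at pre-average spectral resolution.
    thread_accumulators = (n_threads * n_baselines * n_products
                           * n_fft_chan * bytes_per_value)
    # Buffered FFT outputs: per thread, per station, per band/polarisation.
    fft_buffers = (n_threads * num_buffered_ffts * n_datastreams
                   * n_station_bands * n_fft_chan * bytes_per_value)
    return {
        "baseband": baseband,
        "thread_accumulators": thread_accumulators,
        "fft_buffers": fft_buffers,
        "total": baseband + thread_accumulators + fft_buffers,
    }


if __name__ == "__main__":
    terms = core_memory_terms(n_datastreams=10,
                              baseband_bytes_per_subint=2_560_000,
                              n_threads=7, n_baselines=45, n_products=32,
                              n_station_bands=16, n_fft_chan=1024,
                              num_buffered_ffts=10)
    for name, value in terms.items():
        print(f"{name}: {value / 1e6:.1f} MB")
```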

Memory @ the FxManager
• Tweaking: visBufferLength
  • Multiplies the size of a single visibility (nChan, nBaselines, nPhaseCentres)
• Generally not a problem
• Note: visBufferLength should not be too short, especially if you have many (esp. heterogeneous) Core nodes, as the subints can come in out of order

CPU @ the Datastream
• Loading of Datastream is usually pretty light
  • But, Datastream often runs on old hardware (e.g. Mk5 units) with limited CPU capacity
• A couple of options can cause problematically high loads:
  • Tsys extraction (.v2d: tcalFreq = xx)
  • Interlaced VDIF formats (used with multithread VDIF data, e.g. phased EVLA)
  • More efficient implementations coming; for now, buy faster CPU if needed!

CPU @ the Core
• Many considerations here, including parameters usually fixed by the science
  • Number of phase centres
  • Spectral resolution (nChan/nFFTChan)
• Plus several on array management
  • strideLength
  • numBufferedFFTs
  • xmacLength
• And then a few others as well:
  • nThreads
  • fringe rotation order

CPU @ the Core
• Number of phase centres
  • For each phase centre, phase rotation and separate accumulation from thread to main buffer
  • That costs CPU (proportional to number of baselines and number of phase centres), but also ensures that results don’t fit in cache (more later)

CPU @ the Core
• Spectral resolution
  • More channels means a bigger FFT, and that costs CPU
  • Doesn’t typically follow a log N law like it should: bigger gets worse fast beyond ~1024 due to cache performance
  • Really big (>=8192 channels/subband) gets very expensive
  • Worst thing: typically comes in combination with multiple phase centres! (required to avoid bandwidth smearing)
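
For reference, the ideal "log N law" can be written down as below. This is a sketch assuming real-sampled data (so nFFTChan channels need a 2 x nFFTChan point FFT) and an N log2 N operation count; the function name is made up for illustration. The point is how small the ideal penalty is compared with what cache effects cause in practice.

```python
import math


def ideal_fft_cost_ratio(n_chan_small, n_chan_large):
    """Ideal per-sample FFT cost of n_chan_large relative to n_chan_small.

    An N-point FFT costs ~N*log2(N) operations and consumes N samples,
    so the cost per sample scales as log2(N). Assumes N = 2 * n_chan.
    """
    return math.log2(2 * n_chan_large) / math.log2(2 * n_chan_small)


if __name__ == "__main__":
    ratio = ideal_fft_cost_ratio(1024, 8192)
    print(f"8192 vs 1024 channels, ideal cost ratio: {ratio:.2f}")
```

Going from 1024 to 8192 channels should ideally cost only 14/11 of the CPU per sample; anything much worse than that points at cache behaviour rather than the FFT arithmetic.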

CPU @ the Core
• Array management #1: strideLength (auto setting usually best)
  • sin/cos the first “strideLength” samples, and every “strideLength”’th after that
[Figure: fringe rotation phase wrapping between -180° and 180° across one FFT of data.]
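
The strided sin/cos idea can be illustrated with a small sketch. This is an assumed reading of the technique, not the mpifxcorr implementation: for a phase that is linear across the FFT, the rotator at sample k*stride + j is the product of a "coarse" rotator (every stride'th sample) and a "fine" rotator (the first stride samples), so only stride + N/stride trig evaluations are needed instead of N. Function and variable names are hypothetical.

```python
import cmath


def fringe_rotation_direct(phase0, dphase, n_samples):
    """One sin/cos (complex exponential) per sample: n_samples trig calls."""
    return [cmath.exp(1j * (phase0 + dphase * n)) for n in range(n_samples)]


def fringe_rotation_strided(phase0, dphase, n_samples, stride):
    """Same rotators using stride + n_samples/stride trig calls."""
    if n_samples % stride != 0:
        raise ValueError("n_samples must be a multiple of stride")
    # Fine terms: phase offsets of the first `stride` samples.
    fine = [cmath.exp(1j * dphase * j) for j in range(stride)]
    # Coarse terms: full phase at every `stride`'th sample.
    coarse = [cmath.exp(1j * (phase0 + dphase * stride * k))
              for k in range(n_samples // stride)]
    # Sample k*stride + j is coarse[k] * fine[j] (a complex multiply, no trig).
    return [c * f for c in coarse for f in fine]


if __name__ == "__main__":
    direct = fringe_rotation_direct(0.3, 0.01, 256)
    strided = fringe_rotation_strided(0.3, 0.01, 256, 16)
    worst = max(abs(a - b) for a, b in zip(direct, strided))
    print(f"max difference: {worst:.2e}, trig calls: {16 + 256 // 16} vs 256")
```

The trig count stride + N/stride is smallest near stride = sqrt(N), which may be part of why an automatic setting works well.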

CPU @ the Core
• Array management #2: numBufferedFFTs (auto=10 usually ok)
  • Mitigates the cache miss problem by x10
  • Precompute numBufferedFFTs FFT results, one station at a time
[Figure: Mode 1, Mode 2, Mode 3 … Mode N FFT results feeding the visibility buffer. The visibility buffer is too big for cache, but one slot fits in cache!]

CPU @ the Core
• Array management #3: xmacLength (auto setting of 128 usually fine; further subdivides XMAC step)
[Figure: Mode 1, Mode 2, Mode 3 … Mode N FFT results feeding the visibility buffer. The visibility buffer is too big for cache, but one slot fits in cache!]

CPU @ the Core
• nThreads
  • Usually, set nThreads = n(CPU cores) - 1
  • Occasionally, can be advantageous to use fewer threads (avoiding swap memory / cache contention)
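
The rule of thumb above is easy to script when preparing a job. A minimal sketch, assuming it is run on the Core node itself; the helper name is hypothetical and nothing here writes a DiFX threads file.

```python
import os


def suggested_n_threads(n_cpu_cores=None):
    """Return n(CPU cores) - 1, but never less than 1."""
    if n_cpu_cores is None:
        # os.cpu_count() can return None if the count is undetermined.
        n_cpu_cores = os.cpu_count() or 1
    return max(1, n_cpu_cores - 1)


if __name__ == "__main__":
    print(f"suggested nThreads on this node: {suggested_n_threads()}")
```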

CPU @ the Core
• Fringe Rotation Order
  • Default is 1, and this is almost always fine
  • 2nd order only ever needed for very high fringe rates with very long FFTs (space VLBI of masers?)
  • BUT: 0th order could often be used, and almost never is: it can be about 25% faster
[Figure: fringe rotation phase versus time across the 1st and 2nd FFT. Here, fringe rate is too high for 0th order.]

CPU @ the Core
• Fringe Rotation Order
[Figure: fringe rotation phase versus time across the 1st and 2nd FFT. But at low fringe rate, 0th order approximation can be acceptable.]

CPU @ the Core
• Fringe Rotation Order
  • .v2d: fringeRotOrder = [0, 1, 2]
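
Whether 0th order is acceptable comes down to how far the fringe phase moves during one FFT. A rough check is sketched below, assuming real-sampled (Nyquist) data so that one FFT of nFFTChan channels spans nFFTChan / bandwidth seconds. The function names and the drift threshold are hypothetical; the slides do not give a specific limit.

```python
def phase_drift_per_fft_deg(fringe_rate_hz, n_fft_chan, bandwidth_hz):
    """Fringe phase change across one FFT, in degrees."""
    # 2 * n_fft_chan real samples at a sample rate of 2 * bandwidth.
    fft_duration_s = n_fft_chan / bandwidth_hz
    return 360.0 * fringe_rate_hz * fft_duration_s


def zeroth_order_ok(fringe_rate_hz, n_fft_chan, bandwidth_hz, max_drift_deg):
    """True if the drift per FFT stays within a user-chosen limit."""
    drift = phase_drift_per_fft_deg(fringe_rate_hz, n_fft_chan, bandwidth_hz)
    return abs(drift) <= max_drift_deg


if __name__ == "__main__":
    for n_fft_chan in (128, 8192):
        drift = phase_drift_per_fft_deg(1000.0, n_fft_chan, 16e6)
        print(f"nFFTChan={n_fft_chan}: {drift:.2f} deg per FFT")
```

With a 1 kHz fringe rate and a 16 MHz band, 128 channels drifts by only a few degrees per FFT, while 8192 channels drifts by about half a turn, which is the regime where 0th order breaks down.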

CPU @ the FxManager
• CPU load at the FxManager is typically light: it only does low-cadence accumulation and scaling of visibilities
• Very short subintNS can potentially lead to problems (although network issues are more likely)

Questions?