Tuning DiFX 2 for performance, Adam Deller
- Slides: 35
Tuning DiFX 2 for performance
Adam Deller, ASTRON
6th DiFX workshop, CSIRO ATNF, Sydney AUS
Outline
- I/O bottlenecks and solutions
  - Communication with the real world (reading raw data, writing visibilities)
  - Interprocess communication
- Keeping out of memory trouble
- Minimizing CPU load in various corners of parameter space
For more information and pictures: http://cira.ivec.org/dokuwiki/doku.php/difx/mpifxcorr/
Getting data into DiFX
[Diagram: DiFX data flow. Labels: Source data; DataStream 1, 2, … N (large, segmented ring buffer); Baseband data; Core 1, 2, … M (processing buffer); Visibilities; Master Node (visibility buffer); Timerange, destination]
Getting data into DiFX
- How to test? neutered_difx, with a small number of channels
- Fundamental limit: native transfer speed (disk read, network pipe)
  - If this is the problem, buy a RAID or get InfiniBand or …
- Potential troublemaker: CPU utilisation on the datastream node (competition)
  - Can come from Tsys estimation
- Tweaking: datastream databuffer
Datastream databuffer / "Subint"
- Key parameters: dataBufferFactor, nDataSegments, subintNS
- Only real potential problem I/O-wise: buffer too short (dataBufferFactor)
Getting visibilities out of DiFX
[Diagram: DiFX data flow between DataStream 1…N, Core 1…M and the Master Node, with the Master Node's visibility buffer labelled "To disk"]
Getting visibilities out of DiFX
- FxManager writes the visibilities to disk
- This is very rarely a problem unless you have a dying disk or very large and/or frequent visibility dumps
- Testing: neutered_difx + fake data source (ensures good input speeds)
- Tweaking: none
  - If you want to write out visibilities faster, put a fast disk (probably RAID) on the manager node!
Interprocess @ the Datastream
[Diagram: DiFX data flow between DataStream 1…N, Core 1…M and the Master Node]
Interprocess @ the Datastream
- Generally not a problem
- Tweaking: dataBufferFactor, ensure reasonable size (avoids latency issues)
- Default (32) is generally OK but could usually be bigger without problems (increase nDataSegments also)
Interprocess @ the Core
[Diagram: DiFX data flow between DataStream 1…N, Core 1…M and the Master Node]
Tweaking:
- subintNS
- Output visibility size (nChan / nBaselines)
Interprocess @ the Core
- In terms of reducing data transmission, increasing subintNS is the only real knob to turn
  - Unimportant for continuum, single phase centre: it's only very high spectral resolution and/or multiple phase centres where this is relevant
  - In those cases, bigger is better; but be careful about memory (later)
Interprocess @ the FxManager
[Diagram: DiFX data flow between DataStream 1…N, Core 1…M and the Master Node]
The most common trouble point! The FxManager must aggregate data from all Core nodes, which can lead to high data rates.
Interprocess @ the FxManager
- To calculate the rate into FxManager, work out the rate for one Core node and scale
- Tweaking: maximise subintNS! Or (although this is usually not possible) reduce visibility size (via nChan or the number of phase centres)
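The "one Core node, then scale" estimate can be sketched as below. This is a rough illustration, not DiFX code: the function names, the 8 bytes per complex value, the four polarisation products and the example job are all assumptions, and autocorrelations and headers are ignored.

```python
def visibility_bytes(n_baselines, n_chan, n_pol_products=4,
                     n_subbands=1, n_phase_centres=1, bytes_per_value=8):
    """Rough size of one visibility result in bytes (assumed layout:
    one single-precision complex value per channel per product)."""
    return (n_baselines * n_subbands * n_pol_products
            * n_phase_centres * n_chan * bytes_per_value)


def manager_input_rate(vis_bytes, core_seconds_per_subint, n_cores):
    """Bytes per second arriving at FxManager: the rate from one Core
    node, scaled by the number of Core nodes."""
    per_core_rate = vis_bytes / core_seconds_per_subint
    return per_core_rate * n_cores


# Hypothetical job: 10 stations (45 baselines), 1024 channels, 10 Core nodes,
# each Core taking 0.5 s of wall-clock time to process one subint.
size = visibility_bytes(n_baselines=45, n_chan=1024)
rate = manager_input_rate(size, core_seconds_per_subint=0.5, n_cores=10)
print(f"{size} bytes per result, {rate / 1e6:.1f} MB/s into FxManager")
```

If processing time per subint grows roughly in proportion to subintNS, doubling subintNS halves this rate, while doubling nChan or the number of phase centres doubles it.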
Memory @ the Datastream
- Just don't make the combination of dataBufferFactor and subintNS too big (can also control via "sendSize")
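A back-of-envelope check of that combination might look like the sketch below. It assumes the buffer holds roughly dataBufferFactor sends and that each send is one subint of baseband data; that sizing rule, the function name and the example numbers are assumptions for illustration, not taken from the DiFX source.

```python
def datastream_buffer_bytes(data_rate_mbps, subint_ns, data_buffer_factor):
    """Approximate datastream buffer size in bytes, ASSUMING the buffer
    holds data_buffer_factor sends of one subint each."""
    bytes_per_second = data_rate_mbps * 1e6 / 8
    subint_seconds = subint_ns / 1e9
    return data_buffer_factor * bytes_per_second * subint_seconds


# Hypothetical station: 2048 Mbps recording, 100 ms subint, factor of 32.
size = datastream_buffer_bytes(2048, 100_000_000, 32)
print(f"about {size / 1e6:.0f} MB of buffer per datastream")
```

Under this assumption the buffer grows linearly with both parameters, so raising one usually means checking the other.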
Memory @ the Core
- Usually the biggest problem, memory-wise
- Never used to be a problem, but jobs with multiple field centres hit hard
- Bigger subint means more memory (storing datastream baseband)
- More threads means more memory, at the pre-average spectral resolution
- Buffering more FFTs costs more (times the number of threads, too!)
Memory @ the Core
- Tweaking:
  - subintNS
  - nThreads (threads file)
  - numBufferedFFTs
- And be aware of:
  - nFFTChan (for multiple phase centres/high spectral resolution)
  - Number of phase centres
Memory @ the FxManager
- Tweaking: visBufferLength
  - Multiplies the size of a single visibility (nChan, nBaselines, nPhaseCentres)
- Generally not a problem
- Note: visBufferLength should not be too short, especially if you have many (especially heterogeneous) Core nodes, as the subints can arrive out of order
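Since visBufferLength simply multiplies the size of a single visibility, the memory cost is easy to tabulate. The sketch below uses a hypothetical single-visibility size of 1,474,560 bytes (45 baselines × 4 polarisation products × 1024 channels × 8 bytes); the function name and the candidate buffer lengths are illustrative only.

```python
def vis_buffer_bytes(vis_buffer_length, single_visibility_bytes):
    """Memory held by the FxManager visibility buffer, in bytes."""
    return vis_buffer_length * single_visibility_bytes


SINGLE_VIS_BYTES = 1_474_560  # hypothetical: 45 baselines x 4 pols x 1024 chans x 8 B

for length in (8, 32, 128):  # illustrative visBufferLength values
    mib = vis_buffer_bytes(length, SINGLE_VIS_BYTES) / 2**20
    print(f"visBufferLength={length:4d}: {mib:8.1f} MiB")
```

For a job of this size even a long buffer is small compared with typical node memory, which fits the "generally not a problem" remark; many phase centres or very high spectral resolution change the picture.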
CPU @ the Datastream
- Loading of Datastream is usually pretty light
  - But Datastream often runs on old hardware (e.g. Mk5 units) with limited CPU capacity
- A couple of options can cause problematically high loads:
  - Tsys extraction (.v2d: tcalFreq = xx)
  - Interlaced VDIF formats (used with multithread VDIF data, e.g. phased EVLA)
- More efficient implementations are coming; for now, buy a faster CPU if needed!
CPU @ the Core
- Many considerations here, including parameters usually fixed by the science:
  - Number of phase centres
  - Spectral resolution (nChan/nFFTChan)
- Plus several on array management:
  - strideLength
  - numBufferedFFTs
  - xmacLength
- And then a few others as well:
  - nThreads
  - Fringe rotation order
CPU @ the Core
- Number of phase centres
  - For each phase centre, phase rotation and separate accumulation from thread to main buffer
  - That costs CPU (proportional to number of baselines and number of phase centres), but also ensures that results don't fit in cache (more later)
CPU @ the Core
- Spectral resolution
  - More channels means a bigger FFT, and that costs CPU
  - Doesn't typically follow a log N law like it should: bigger gets worse fast beyond ~1024 due to cache performance
  - Really big (>= 8192 channels/subband) gets very expensive
  - Worst thing: typically comes in combination with multiple phase centres! (required to avoid bandwidth smearing)
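For reference, the ideal scaling is gentle. An N-point FFT costs roughly N log2 N operations and consumes N samples, so the cost per sample grows only as log2 N. The sketch below assumes real-sampled data (a 2 × nChan point FFT) and an arbitrary 128-channel reference; both are illustrative choices.

```python
import math


def ideal_fft_cost_per_sample(n_chan, reference_n_chan=128):
    """Ideal FFT cost per input sample relative to a reference channel
    count, assuming cost ~ N log2 N with N = 2 * n_chan samples per FFT."""
    return math.log2(2 * n_chan) / math.log2(2 * reference_n_chan)


for n_chan in (128, 1024, 8192, 65536):
    ratio = ideal_fft_cost_per_sample(n_chan)
    print(f"{n_chan:6d} channels: {ratio:.2f}x the 128-channel cost per sample")
```

Going from 128 to 8192 channels should cost well under a factor of two per sample on this model, so a much larger slowdown in practice points at cache behaviour rather than arithmetic.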
CPU @ the Core
- Array management #1: strideLength (auto setting usually best)
  - sin/cos the first "strideLength" samples, and every "strideLength"'th after that
[Figure: phase (-180° to 180°) across one FFT of data]
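The idea behind the stride can be sketched in a few lines. For a phase that is linear across the FFT, the phasor at sample k·S + j is the product of a "fine" phasor (one of the first S samples) and a "coarse" phasor (one of every S-th sample), so only S + N/S sin/cos evaluations are needed instead of N. This is an illustration of the general technique with made-up function names, not the DiFX implementation.

```python
import cmath


def phasors_direct(phase0, dphase, n):
    """One sin/cos evaluation per sample (the expensive way)."""
    return [cmath.exp(1j * (phase0 + dphase * i)) for i in range(n)]


def phasors_strided(phase0, dphase, n, stride):
    """Same phasors from stride + n/stride sin/cos evaluations."""
    if n % stride != 0:
        raise ValueError("n must be a multiple of stride")
    # Fine terms: the first `stride` samples, evaluated directly.
    fine = [cmath.exp(1j * (phase0 + dphase * j)) for j in range(stride)]
    # Coarse terms: the extra rotation at every `stride`-th sample.
    coarse = [cmath.exp(1j * dphase * stride * k) for k in range(n // stride)]
    # Sample k*stride + j is coarse[k] * fine[j] (one complex multiply).
    return [c * f for c in coarse for f in fine]


direct = phasors_direct(0.3, 0.01, 1024)
strided = phasors_strided(0.3, 0.01, 1024, stride=32)
print(max(abs(a - b) for a, b in zip(direct, strided)))
```

With N = 1024 and a stride of 32 this needs 64 sin/cos evaluations rather than 1024; the count S + N/S is smallest when S is near the square root of N.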
CPU @ the Core
- Array management #2: numBufferedFFTs (auto = 10 usually OK)
  - Mitigates the cache miss problem by a factor of 10
[Figure: Mode 1, Mode 2, Mode 3, … Mode N. Precompute numBufferedFFTs FFT results, one station at a time. The visibility buffer is too big for cache, but one slot fits in cache!]
CPU @ the Core
- Array management #3: xmacLength (auto setting of 128 usually fine; further subdivides the XMAC step)
[Figure: Mode 1, Mode 2, Mode 3, … Mode N. Precompute numBufferedFFTs FFT results, one station at a time. The visibility buffer is too big for cache, but one slot fits in cache!]
CPU @ the Core
- nThreads
  - Usually, set nThreads = n(CPU cores) - 1
  - Occasionally, it can be advantageous to use fewer threads (avoiding swap memory / cache contention)
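The rule of thumb is easy to script when generating per-node thread counts. The helper below is a hypothetical convenience function, not part of DiFX; it only computes the number and does not write a threads file.

```python
import os


def suggested_n_threads(n_cpu_cores=None):
    """Rule of thumb from the slide: one thread fewer than the number of
    CPU cores, but never fewer than one."""
    if n_cpu_cores is None:
        # os.cpu_count() can return None on unusual platforms.
        n_cpu_cores = os.cpu_count() or 1
    return max(1, n_cpu_cores - 1)


print("this machine:", suggested_n_threads())
print("8-core node: ", suggested_n_threads(8))
```

Treat the result as a starting point: on memory-limited nodes a smaller value may run faster, for the swap and cache reasons given above.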
CPU @ the Core
- Fringe rotation order
  - Default is 1, and this is almost always fine
  - 2nd order is only ever needed for very high fringe rates with very long FFTs (space VLBI of masers?)
  - BUT: 0th order could often be used, and almost never is: it can be about 25% faster
  - .v2d: fringeRotOrder = [0, 1, 2]
[Figure: fringe rotation phase against time, across the 1st and 2nd FFT. First panel: here, the fringe rate is too high for 0th order. Second panel: but at low fringe rate, the 0th order approximation can be acceptable.]
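One way to judge whether 0th order is acceptable is to estimate how far the fringe phase moves during a single FFT. The sketch below assumes real (Nyquist) sampled data, so one FFT spans nChan / bandwidth seconds, and assumes the 0th-order correction holds the phase fixed at its mid-FFT value; the amplitude loss is then 1 - sin(x)/x with x equal to half the phase change. The function name and example numbers are illustrative, and the model is not taken from the DiFX source.

```python
import math


def zeroth_order_loss(fringe_rate_hz, n_chan, bandwidth_hz):
    """Fractional amplitude loss from holding the fringe phase constant
    over one FFT (assumed model, see the lead-in)."""
    fft_seconds = n_chan / bandwidth_hz  # 2*n_chan samples at 2*bandwidth
    half_phase_change = math.pi * fringe_rate_hz * fft_seconds
    if half_phase_change == 0:
        return 0.0
    return 1.0 - math.sin(half_phase_change) / half_phase_change


# Hypothetical cases with 16 MHz subbands.
print(zeroth_order_loss(100.0, 128, 16e6))      # low fringe rate, short FFT
print(zeroth_order_loss(10_000.0, 8192, 16e6))  # high fringe rate, long FFT
```

A loss far below the noise level suggests fringeRotOrder = 0 is worth testing; a loss approaching unity means the phase wraps within one FFT and a higher order is required.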
CPU @ the FxManager
- CPU load at the FxManager is typically light: it only does low-cadence accumulation and scaling of visibilities
- Very short subintNS can potentially lead to problems (although network issues are more likely)
Questions?