IBM Research
Accelerating PFA FFT: Performance Comparison
Michael Perrone, Acie Nobles, Jizhu Lu
2007.06
PFA FFT on Cell - M. Perrone, mpp@us.ibm.com
Outline
§ PFA FFT Overview & Experimental Results
§ Implementation
§ Vectorization
© 2007 IBM Corporation
PFA FFT Algorithm Specifics
§ Prime-factor FFT algorithm (PFA)
§ 2D FFT
§ Single precision
§ Complex-to-complex
§ Nominal size: 1K rows, 1600 points per row
§ Factors implemented: 2, 3, 4, 5, 7, 8, 9, 11, 13, 16
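As a reminder of what the prime-factor algorithm does, here is a minimal pure-Python sketch of the Good-Thomas index mapping for a length N = n1 * n2 transform with coprime n1 and n2. It is not the Cell implementation from these slides; the function names `dft` and `pfa_fft` are hypothetical, and a naive DFT stands in for the hand-written small-factor kernels.

```python
import cmath
from math import gcd

def dft(x):
    """Naive O(n^2) DFT, standing in for a small-factor kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def pfa_fft(x, n1, n2):
    """Length n1*n2 DFT via the prime-factor (Good-Thomas) mapping."""
    n = n1 * n2
    assert len(x) == n and gcd(n1, n2) == 1, "factors must be coprime"
    # Input index map: n = (n2*i1 + n1*i2) mod N.
    grid = [[x[(n2 * i1 + n1 * i2) % n] for i2 in range(n2)] for i1 in range(n1)]
    # Length-n2 DFTs along rows, then length-n1 DFTs along columns.
    # No twiddle factors are needed between the two stages.
    grid = [dft(row) for row in grid]
    cols = [dft([grid[i1][k2] for i1 in range(n1)]) for k2 in range(n2)]
    # Output index map (Chinese remainder theorem): k = k1 mod n1, k = k2 mod n2.
    inv_n2 = pow(n2, -1, n1)  # modular inverse, Python 3.8+
    inv_n1 = pow(n1, -1, n2)
    out = [0j] * n
    for k1 in range(n1):
        for k2 in range(n2):
            k = (k1 * n2 * inv_n2 + k2 * n1 * inv_n1) % n
            out[k] = cols[k2][k1]
    return out
```

Because the factors must be mutually coprime, a row length such as 240 would be split as 16 x 3 x 5 using the factor list above; the sketch handles two factors and could be applied recursively for more.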
Performance Comparison
Charts: Cell vs Woodcrest; Cell vs Opteron
Execution Time Performance Comparison (40960 2D images, in seconds)
The extracted header names only "Intel", "AMD", "3 SWO" and "2 SWO" for six data columns, so the last four columns are labeled by position.

Matrix Size    Intel     AMD       Col 3    Col 4    Col 5    Col 6
364 x 240      16.47     38.8      6.63     4.74     5.56     5.31
616 x 308      45.92     135.59    11.86    8.21     9.5      9.05
840 x 462      146.22    246.09    24.3     16.96    18.71    17.83
1008 x 616     218.24    393.27    34.72    23.07    27.58    26.29
1260 x 840     416.56    559.05    59.71    39.94    50.84    48.38
1540 x 1008    687.79    995.49    86.16    57.65    79.1     75.66
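The speedups implied by these timings can be computed directly. The sketch below copies the numbers from the slide and assumes that the four columns after Intel and AMD are Cell timings for different algorithm variants; that reading is an assumption, since the slide text does not label them fully. The names `TIMINGS` and `speedups` are illustrative.

```python
# Timings in seconds copied from the slide (40960 2D images per matrix size).
# Tuple order: Intel, AMD, then four further columns. Treating those four as
# Cell timings is an assumption; the slide header labels them only partially.
TIMINGS = {
    "364 x 240":   (16.47, 38.8, 6.63, 4.74, 5.56, 5.31),
    "616 x 308":   (45.92, 135.59, 11.86, 8.21, 9.5, 9.05),
    "840 x 462":   (146.22, 246.09, 24.3, 16.96, 18.71, 17.83),
    "1008 x 616":  (218.24, 393.27, 34.72, 23.07, 27.58, 26.29),
    "1260 x 840":  (416.56, 559.05, 59.71, 39.94, 50.84, 48.38),
    "1540 x 1008": (687.79, 995.49, 86.16, 57.65, 79.1, 75.66),
}

def speedups(row):
    """Return (Intel time, AMD time) divided by the fastest remaining column."""
    intel, amd, *cell = row
    best = min(cell)
    return intel / best, amd / best

for size, row in TIMINGS.items():
    vs_intel, vs_amd = speedups(row)
    print(f"{size:>12}: {vs_intel:5.2f}x vs Intel, {vs_amd:5.2f}x vs AMD")
```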
Performance: All PFA Sizes – 3-Step & 2-Step Algorithms
[Figure: performance across all PFA sizes for the 3-step and 2-step algorithms]
Lessons Learned
§ NUMA utility
  – "numactl -m 0 -c 0"
  – Binds jobs to BEs
  – Binds memory to BEs
§ 2 runs instead of 1
§ Changed buffer size
  – 4096 → 4104 elements
  – Added one data envelope (128 B)
  – Better memory access pattern
§ Declare temporary variables locally
§ Combining 2nd and 3rd steps
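The slide does not say why the padded buffer gives a better access pattern. One common reason is that power-of-two strides map every buffer onto the same memory bank, while a small pad spreads them out. The sketch below illustrates that idea only; the 16 banks and the 128-byte interleave are assumptions, not figures from the slides. The 16-byte element size follows from the slide itself (8 extra elements add up to 128 B).

```python
# 4104 - 4096 = 8 extra elements occupy 128 B, so each element is 16 B.
ELEMENT_BYTES = 128 // (4104 - 4096)
LINE_BYTES = 128   # assumed interleave granularity between banks
NUM_BANKS = 16     # assumed number of memory banks (hypothetical)

def banks_hit(elements_per_buffer, num_buffers=16):
    """Bank index of the first byte of each of num_buffers consecutive buffers."""
    stride = elements_per_buffer * ELEMENT_BYTES
    return [(i * stride // LINE_BYTES) % NUM_BANKS for i in range(num_buffers)]

print("unpadded:", sorted(set(banks_hit(4096))))
print("padded:  ", sorted(set(banks_hit(4104))))
```

Under these assumptions the unpadded buffers all start in one bank, while the padded buffers start in sixteen different banks.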
Implementation Overview
§ FFT distributed across SPEs
§ Data vectorized
§ DMAs double buffered
§ Pass 1: For each buffer
  – DMA Get buffer
  – Transform signals to SIMD format
  – Do four 1D FFTs in SIMD
  – Tiles transposed
  – DMA Put buffer
§ Pass 2: For each buffer
  – DMA Get buffer
  – Do four 1D FFTs in SIMD
  – Tiles transposed
  – DMA Put buffer
§ Pass 3: For each buffer
  – DMA Get buffer
  – Transform SIMD format to original data format
  – DMA Put buffer
[Diagram labels: Input Image, Buffer, Tile; Transposed Image, Transposed Buffer, Transposed Tile]
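The get, compute, put loop with double buffering can be sketched as follows. This is a sequential Python illustration of the ordering only: on Cell the "get" of the next buffer would be an asynchronous DMA that overlaps with the compute on the current one. The function name `run_double_buffered` and the event log are hypothetical.

```python
def run_double_buffered(chunks, compute):
    """Process chunks with two buffer slots, prefetching the next chunk
    before computing on the current one. Returns (results, event log)."""
    results, log = [], []
    if not chunks:
        return results, log
    slots = [None, None]
    slots[0] = chunks[0]                    # initial "DMA get"
    log.append(("get", 0))
    for i in range(len(chunks)):
        cur = i % 2
        if i + 1 < len(chunks):
            slots[1 - cur] = chunks[i + 1]  # prefetch into the other slot
            log.append(("get", i + 1))
        results.append(compute(slots[cur]))
        log.append(("compute", i))
        log.append(("put", i))              # "DMA put" of the finished buffer
    return results, log

results, log = run_double_buffered([[1, 2], [3, 4], [5, 6]],
                                   lambda buf: [2 * v for v in buf])
print(results)
print(log)
```

The log shows each "get" of buffer i+1 issued before the "compute" on buffer i, which is the overlap that double buffering is meant to provide.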
Two-Step PFA FFT Algorithm
§ 1st Step
  – Get input data from main RAM by using DMA
  – Vectorization
  – Vectorized PFA FFT for 1st dimension
  – Transpose and write back to main memory
§ 2nd Step
  – Get input data from main RAM by using DMA
  – Vectorized PFA FFT for 2nd dimension
  – Combined Transpose & Un-vectorization
  – Write back to main memory
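The row-transform, transpose, row-transform, transpose structure of the two steps can be checked on a tiny image. The sketch below is plain Python with a naive DFT in place of the vectorized PFA kernels and without any DMA or SIMD layout changes; `fft2_two_step` and `dft2_direct` are hypothetical names.

```python
import cmath

def dft(x):
    """Naive 1D DFT, standing in for the vectorized PFA FFT."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def fft2_two_step(image):
    # 1st step: transform along the 1st dimension, then transpose.
    step1 = transpose([dft(row) for row in image])
    # 2nd step: transform along the 2nd dimension, then transpose back.
    return transpose([dft(row) for row in step1])

def dft2_direct(image):
    """Reference 2D DFT straight from the definition."""
    rows, cols = len(image), len(image[0])
    return [[sum(image[r][c] * cmath.exp(-2j * cmath.pi * (u * r / rows + v * c / cols))
                 for r in range(rows) for c in range(cols))
             for v in range(cols)]
            for u in range(rows)]
```

Both transposes turn column accesses into contiguous row accesses, which is why each step can work on whole rows fetched by DMA.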
2nd Step Details
§ Load buffer 1
§ Load buffer 2
§ PFAFFT on buffer 1
§ PFAFFT on buffer 2
§ Do combined transpose and unvectorization on buffer 1 & buffer 2
§ DMA back to main RAM in right places
Time Distribution on 2nd Step
[Figure: timeline from the beginning to the end of the loop showing Load, Comp, Trans and Unload phases. Loads and computes rotate over buf[0], buf[1] and buf[2], and the combined transpose and unload (T & UNLD) handles two buffers at a time.]
Efficiency = 6/13 ≈ 46% (~50%)
Data Layout Change in 2-Step PFAFFT
Original Input Data (each trace 4 complex numbers x 16 traces); within a trace, values alternate real, imaginary
1st buffer
  a1 a2 a3 a4 a5 a6 a7 a8
  b1 b2 b3 b4 b5 b6 b7 b8
  c1 c2 c3 c4 c5 c6 c7 c8
  d1 d2 d3 d4 d5 d6 d7 d8
2nd buffer
  e1 e2 e3 e4 e5 e6 e7 e8
  f1 f2 f3 f4 f5 f6 f7 f8
  g1 g2 g3 g4 g5 g6 g7 g8
  h1 h2 h3 h4 h5 h6 h7 h8
3rd buffer
  i1 i2 i3 i4 i5 i6 i7 i8
  j1 j2 j3 j4 j5 j6 j7 j8
  k1 k2 k3 k4 k5 k6 k7 k8
  l1 l2 l3 l4 l5 l6 l7 l8
4th buffer
  m1 m2 m3 m4 m5 m6 m7 m8
  n1 n2 n3 n4 n5 n6 n7 n8
  o1 o2 o3 o4 o5 o6 o7 o8
  p1 p2 p3 p4 p5 p6 p7 p8
Vectorization Shuffle Operation in 1st Step
Input (four traces, real and imaginary values interleaved):
  a1 a2 a3 a4 a5 a6 a7 a8
  b1 b2 b3 b4 b5 b6 b7 b8
  c1 c2 c3 c4 c5 c6 c7 c8
  d1 d2 d3 d4 d5 d6 d7 d8
Output (one vector per position, taken across the four traces):
  a1 b1 c1 d1   real
  a2 b2 c2 d2   imaginary
  a3 b3 c3 d3   real
  a4 b4 c4 d4   imaginary
  a5 b5 c5 d5   real
  a6 b6 c6 d6   imaginary
  a7 b7 c7 d7   real
  a8 b8 c8 d8   imaginary
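In index terms, the vectorization shuffle is a 4 x 8 to 8 x 4 transpose: element k of each of four traces is gathered into one 4-wide vector, so a single SIMD operation advances four 1D FFTs at once. A minimal Python sketch using the slide's labels (the function name `vectorize` is hypothetical, and strings stand in for floats):

```python
def vectorize(traces):
    """Gather element k of each trace into vector k (4 traces -> 4-wide vectors)."""
    return [list(v) for v in zip(*traces)]

# Four traces a..d, each with 8 values (4 complex numbers, real/imag interleaved).
traces = [[f"{name}{i}" for i in range(1, 9)] for name in "abcd"]
vectors = vectorize(traces)
for v in vectors:
    print(" ".join(v))
```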
After Vectorization in 1st Step
  1st buffer     2nd buffer     3rd buffer     4th buffer
  a1 b1 c1 d1    e1 f1 g1 h1    i1 j1 k1 l1    m1 n1 o1 p1    real
  a2 b2 c2 d2    e2 f2 g2 h2    i2 j2 k2 l2    m2 n2 o2 p2    imaginary
  a3 b3 c3 d3    e3 f3 g3 h3    i3 j3 k3 l3    m3 n3 o3 p3    real
  a4 b4 c4 d4    e4 f4 g4 h4    i4 j4 k4 l4    m4 n4 o4 p4    imaginary
  a5 b5 c5 d5    e5 f5 g5 h5    i5 j5 k5 l5    m5 n5 o5 p5    real
  a6 b6 c6 d6    e6 f6 g6 h6    i6 j6 k6 l6    m6 n6 o6 p6    imaginary
  a7 b7 c7 d7    e7 f7 g7 h7    i7 j7 k7 l7    m7 n7 o7 p7    real
  a8 b8 c8 d8    e8 f8 g8 h8    i8 j8 k8 l8    m8 n8 o8 p8    imaginary
After PFA FFT for 1st Dimension
  1st buffer     2nd buffer     3rd buffer     4th buffer
  a1 b1 c1 d1    e1 f1 g1 h1    i1 j1 k1 l1    m1 n1 o1 p1    real
  a2 b2 c2 d2    e2 f2 g2 h2    i2 j2 k2 l2    m2 n2 o2 p2    imaginary
  a3 b3 c3 d3    e3 f3 g3 h3    i3 j3 k3 l3    m3 n3 o3 p3    real
  a4 b4 c4 d4    e4 f4 g4 h4    i4 j4 k4 l4    m4 n4 o4 p4    imaginary
  a5 b5 c5 d5    e5 f5 g5 h5    i5 j5 k5 l5    m5 n5 o5 p5    real
  a6 b6 c6 d6    e6 f6 g6 h6    i6 j6 k6 l6    m6 n6 o6 p6    imaginary
  a7 b7 c7 d7    e7 f7 g7 h7    i7 j7 k7 l7    m7 n7 o7 p7    real
  a8 b8 c8 d8    e8 f8 g8 h8    i8 j8 k8 l8    m8 n8 o8 p8    imaginary
Transposition Shuffle Operation in 1st Step
Input:
  a1 b1 c1 d1   real
  a2 b2 c2 d2   imaginary
  a3 b3 c3 d3   real
  a4 b4 c4 d4   imaginary
  a5 b5 c5 d5   real
  a6 b6 c6 d6   imaginary
  a7 b7 c7 d7   real
  a8 b8 c8 d8   imaginary
Output:
  a1 a3 a5 a7   real
  a2 a4 a6 a8   imaginary
  b1 b3 b5 b7   real
  b2 b4 b6 b8   imaginary
  c1 c3 c5 c7   real
  c2 c4 c6 c8   imaginary
  d1 d3 d5 d7   real
  d2 d4 d6 d8   imaginary
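The transposition shuffle regroups the across-trace vectors so that each output vector holds either the four real parts or the four imaginary parts of a single trace. A minimal Python sketch of that index mapping, using the slide's labels (the function name `transpose_to_split` is hypothetical; the real code would use SIMD shuffles rather than list slicing):

```python
def transpose_to_split(vectors):
    """From vectors [a_k, b_k, c_k, d_k] (k = 1..8) build, per trace,
    one vector of real parts (odd k) and one of imaginary parts (even k)."""
    out = []
    for trace in zip(*vectors):        # trace = (t1, t2, ..., t8)
        out.append(list(trace[0::2]))  # t1 t3 t5 t7 (real)
        out.append(list(trace[1::2]))  # t2 t4 t6 t8 (imaginary)
    return out

vectors = [[f"{name}{k}" for name in "abcd"] for k in range(1, 9)]
for v in transpose_to_split(vectors):
    print(" ".join(v))
```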
After Transposition, DMA back to main RAM in 1st Step
  1st buffer     2nd buffer     3rd buffer     4th buffer
  a1 a3 a5 a7    e1 e3 e5 e7    i1 i3 i5 i7    m1 m3 m5 m7    real
  a2 a4 a6 a8    e2 e4 e6 e8    i2 i4 i6 i8    m2 m4 m6 m8    imaginary
  b1 b3 b5 b7    f1 f3 f5 f7    j1 j3 j5 j7    n1 n3 n5 n7    real
  b2 b4 b6 b8    f2 f4 f6 f8    j2 j4 j6 j8    n2 n4 n6 n8    imaginary
  c1 c3 c5 c7    g1 g3 g5 g7    k1 k3 k5 k7    o1 o3 o5 o7    real
  c2 c4 c6 c8    g2 g4 g6 g8    k2 k4 k6 k8    o2 o4 o6 o8    imaginary
  d1 d3 d5 d7    h1 h3 h5 h7    l1 l3 l5 l7    p1 p3 p5 p7    real
  d2 d4 d6 d8    h2 h4 h6 h8    l2 l4 l6 l8    p2 p4 p6 p8    imaginary
After DMA Load in 2nd Step from main RAM (all in 1 buffer)
  a1 a3 a5 a7    e1 e3 e5 e7    i1 i3 i5 i7    m1 m3 m5 m7    real
  a2 a4 a6 a8    e2 e4 e6 e8    i2 i4 i6 i8    m2 m4 m6 m8    imaginary
  b1 b3 b5 b7    f1 f3 f5 f7    j1 j3 j5 j7    n1 n3 n5 n7    real
  b2 b4 b6 b8    f2 f4 f6 f8    j2 j4 j6 j8    n2 n4 n6 n8    imaginary
  c1 c3 c5 c7    g1 g3 g5 g7    k1 k3 k5 k7    o1 o3 o5 o7    real
  c2 c4 c6 c8    g2 g4 g6 g8    k2 k4 k6 k8    o2 o4 o6 o8    imaginary
  d1 d3 d5 d7    h1 h3 h5 h7    l1 l3 l5 l7    p1 p3 p5 p7    real
  d2 d4 d6 d8    h2 h4 h6 h8    l2 l4 l6 l8    p2 p4 p6 p8    imaginary
After PFA FFT for 2nd Dimension (just 1 buffer)
  a1 a3 a5 a7    e1 e3 e5 e7    i1 i3 i5 i7    m1 m3 m5 m7    real
  a2 a4 a6 a8    e2 e4 e6 e8    i2 i4 i6 i8    m2 m4 m6 m8    imaginary
  b1 b3 b5 b7    f1 f3 f5 f7    j1 j3 j5 j7    n1 n3 n5 n7    real
  b2 b4 b6 b8    f2 f4 f6 f8    j2 j4 j6 j8    n2 n4 n6 n8    imaginary
  c1 c3 c5 c7    g1 g3 g5 g7    k1 k3 k5 k7    o1 o3 o5 o7    real
  c2 c4 c6 c8    g2 g4 g6 g8    k2 k4 k6 k8    o2 o4 o6 o8    imaginary
  d1 d3 d5 d7    h1 h3 h5 h7    l1 l3 l5 l7    p1 p3 p5 p7    real
  d2 d4 d6 d8    h2 h4 h6 h8    l2 l4 l6 l8    p2 p4 p6 p8    imaginary
Transposition & Un-Vectorization Shuffle Operation in 2nd Step
Input:
  a1 a3 a5 a7   real
  a2 a4 a6 a8   imaginary
  b1 b3 b5 b7   real
  b2 b4 b6 b8   imaginary
  c1 c3 c5 c7   real
  c2 c4 c6 c8   imaginary
  d1 d3 d5 d7   real
  d2 d4 d6 d8   imaginary
Output (real and imaginary values interleaved again):
  a1 a2 a3 a4    a5 a6 a7 a8
  b1 b2 b3 b4    b5 b6 b7 b8
  c1 c2 c3 c4    c5 c6 c7 c8
  d1 d2 d3 d4    d5 d6 d7 d8
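Per trace, the combined shuffle takes one vector of real parts and one vector of imaginary parts and re-interleaves them into the original real, imaginary order. A minimal Python sketch of the mapping for one trace (the function name `unvectorize` is hypothetical; the slides call this a shuffle operation, and list operations stand in for it here):

```python
def unvectorize(reals, imags):
    """Interleave [t1 t3 t5 t7] and [t2 t4 t6 t8] back into
    [t1 t2 t3 t4] and [t5 t6 t7 t8]."""
    interleaved = [v for pair in zip(reals, imags) for v in pair]
    return interleaved[:4], interleaved[4:]

low, high = unvectorize(["a1", "a3", "a5", "a7"], ["a2", "a4", "a6", "a8"])
print(low, high)
```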
After Combined Transposition and Un-vectorization Shuffle, DMA back to main RAM
  a1 a2 a3 a4    e1 e2 e3 e4    i1 i2 i3 i4    m1 m2 m3 m4
  a5 a6 a7 a8    e5 e6 e7 e8    i5 i6 i7 i8    m5 m6 m7 m8
  b1 b2 b3 b4    f1 f2 f3 f4    j1 j2 j3 j4    n1 n2 n3 n4
  b5 b6 b7 b8    f5 f6 f7 f8    j5 j6 j7 j8    n5 n6 n7 n8
  c1 c2 c3 c4    g1 g2 g3 g4    k1 k2 k3 k4    o1 o2 o3 o4
  c5 c6 c7 c8    g5 g6 g7 g8    k5 k6 k7 k8    o5 o6 o7 o8
  d1 d2 d3 d4    h1 h2 h3 h4    l1 l2 l3 l4    p1 p2 p3 p4
  d5 d6 d7 d8    h5 h6 h7 h8    l5 l6 l7 l8    p5 p6 p7 p8
Values within each trace alternate real, imaginary.