SIMD Processor Extensions
Houffaneh Osman
halio029@uottawa.ca

SIMD (1)
Single Instruction, Multiple Data
Part of Flynn's taxonomy of computer architectures
Multiple processors
  ◦ Different data streams
Same instruction executed on every data stream

SIMD (2)
Operates on multiple data items at the same time
Computation: aims for the shortest possible execution time
  ◦ Vectors
  ◦ Matrices
Better speedup than sequential execution

SIMD Architecture
Two types of processors
  ◦ True SIMD
  ◦ Pipelined SIMD
Divide an instruction into smaller functions
Execute the smaller functions in parallel on different data

True SIMD - Distributed Memory
Single control unit
M processing elements act as arithmetic units
N data elements (N may be larger than M)
Processing elements receive instructions from the control unit
If a processing element needs information from another processing element
  ◦ It sends a request to the control unit, which manages the memory exchange

True SIMD - Shared Memory
Single control unit
M processing elements act as arithmetic units
N data elements (N may be larger than M)
Processing elements receive instructions from the control unit
Processing elements can share their memory without going through the control unit

[Figures: True SIMD with distributed memory; True SIMD with shared memory]

Programming the SIMD architecture (1)
Cell used: IBM Cell BE
The Cell Broadband Engine (CBE)
  ◦ Single-chip multiprocessor with 9 processor elements
  ◦ All processor elements share the same main storage
Processor elements are specialized into 2 types
  ◦ PowerPC Processor Element (PPE)
  ◦ Synergistic Processor Element (SPE)

Programming the SIMD architecture (2)
VMX: Vector Multimedia eXtension to the PowerPC architecture
  ◦ Uses data parallelism for faster performance
SIMD in VMX and SPE (Reference: IBM Cell Programming)
  ◦ 128-bit-wide datapath
  ◦ 128-bit-wide registers
  ◦ 4-wide fullwords, 8-wide halfwords, 16-wide bytes
  ◦ SPE includes support for 2-wide doublewords
Vector Programming

Programming the SIMD architecture (3)
Each of the 4 elements in VA is added to the corresponding element in VB, and each sum is placed in the corresponding element of VC
VC = vec_add(VA, VB)
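`vec_add` is an AltiVec/VMX intrinsic and needs a PowerPC toolchain to compile. As a portable stand-in, the sketch below assumes GCC or Clang and uses their `vector_size` extension to express the same four-lane addition. The names `v4si` and `add4` are illustrative and are not part of the Cell SDK.

```c
#include <stdint.h>

/* Four 32-bit integers packed into one 128-bit value
   (GCC/Clang vector extension, assumed here in place of AltiVec types). */
typedef int32_t v4si __attribute__((vector_size(16)));

/* Lane-wise addition, the counterpart of VC = vec_add(VA, VB):
   lane i of the result is va[i] + vb[i] for i = 0..3. */
v4si add4(v4si va, v4si vb)
{
    return va + vb; /* one expression, four independent additions */
}
```

With `va = {1, 2, 3, 4}` and `vb = {10, 20, 30, 40}`, each lane of the result holds the sum of the matching lanes. On targets with 128-bit SIMD registers the compiler can typically lower this to a single vector add.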

Programming the SIMD architecture (4)
SIMD Unprocessable Patterns
  ◦ Cases where the instruction differs for each processing element
SIMD Processable Patterns
  ◦ Cases where the instruction is the same for each processing element

Programming the SIMD architecture (5)
Register view of the add instruction from the previous slide
VC = vec_add(VA, VB)

Programming the SIMD architecture (6)
Permute method (shuffling)
  ◦ Selects elements from two source vectors (VA and VB)
  ◦ A third vector (VC) is used as the control vector
VT = vec_perm(VA, VB, VC)
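The permute can be hard to picture from the call alone. The sketch below is a plain C model of the byte-wise behaviour of `vec_perm`, not the intrinsic itself: the low 5 bits of each control byte pick one of the 32 bytes formed by VA followed by VB. The `vec16` struct and the name `perm_model` are assumptions made for illustration, and array index 0 stands for vector element 0.

```c
#include <stdint.h>

/* A 128-bit vector viewed as 16 bytes (illustrative type, not an SDK type). */
typedef struct { uint8_t b[16]; } vec16;

/* Scalar model of VT = vec_perm(VA, VB, VC).
   For each result byte i, the low 5 bits of vc.b[i] select a source byte:
   0..15 index into va, 16..31 index into vb. */
vec16 perm_model(vec16 va, vec16 vb, vec16 vc)
{
    vec16 vt;
    for (int i = 0; i < 16; i++) {
        uint8_t sel = vc.b[i] & 0x1F;            /* only 5 bits are used */
        vt.b[i] = (sel < 16) ? va.b[sel] : vb.b[sel - 16];
    }
    return vt;
}
```

A control vector of 0, 16, 1, 17, 2, 18, ... interleaves the first eight bytes of VA and VB, which is a common way to merge two data streams. The hardware instruction does all 16 selections at once rather than in a loop.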

Intel SSE (1)
SSE: Streaming SIMD Extensions
  ◦ Instruction set extension to the x86 architecture
  ◦ 128-bit SIMD extension
Introduced in 1999 in the Pentium III
  ◦ Latest version: SSE5 (proposed by AMD, later revised)
Future extension from Intel
  ◦ AVX: Advanced Vector Extensions
  ◦ 256-bit instructions
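As a minimal sketch of what 128-bit SSE code looks like, the function below adds two float arrays four elements at a time using the standard `<xmmintrin.h>` intrinsics. The function name `add_floats` is illustrative. The `__SSE__` guard is an assumption added so the file still builds on non-x86 targets, where only the scalar loop runs.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* out[i] = a[i] + b[i] for i in [0, n). */
void add_floats(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Process four floats (one 128-bit register) per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar loop handles the leftover elements (or everything without SSE). */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

The unaligned load and store variants are used so callers need not align their buffers. With 16-byte-aligned data, `_mm_load_ps` and `_mm_store_ps` could be substituted.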

Intel SSE (2)
Image Processing
Digital Signal Processing
Encoding
Streaming load

Intel SSE (3)
Streaming load instruction
  ◦ Enables faster reads
  ◦ Improves performance of applications that use both the GPU and the CPU
SIMD improves encoding speed
  ◦ The required arithmetic is performed on pixels
Pixels in a video -> a high level of parallelism is required
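To make the per-pixel arithmetic concrete, here is a hedged sketch that brightens 8-bit pixels 16 at a time with the SSE2 saturating add `_mm_adds_epu8`. It illustrates pixel-level data parallelism only and does not use the streaming load instruction. The function name `brighten` and the `__SSE2__` fallback guard are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#endif

/* Add delta to every 8-bit pixel, clamping at 255 instead of wrapping. */
void brighten(uint8_t *pixels, size_t n, uint8_t delta)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i vdelta = _mm_set1_epi8((char)delta); /* delta in all 16 lanes */
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(pixels + i));
        v = _mm_adds_epu8(v, vdelta);      /* unsigned saturating add, 16 pixels */
        _mm_storeu_si128((__m128i *)(pixels + i), v);
    }
#endif
    /* Scalar loop for the remaining pixels (or all of them without SSE2). */
    for (; i < n; i++) {
        unsigned sum = (unsigned)pixels[i] + delta;
        pixels[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

Every pixel receives the same operation, which is the processable pattern described earlier: one instruction, many data items.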

Data Parallelism

Matrix Multiplication (NI)
Matrix multiplication without data parallelism
Matrix multiplication with data parallelism
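The slide's figures are not preserved, so the following is only one possible illustration of exposing data parallelism in matrix multiplication, not the scheme shown by NI. It assumes square row-major matrices and orders the loops i, k, j so that the innermost loop walks contiguous memory and applies the same multiply-add to every element of a row, a shape that SIMD units and vectorizing compilers handle well. The name `matmul_ikj` is hypothetical.

```c
#include <stddef.h>

/* C = A * B for n x n matrices stored row-major in flat arrays. */
void matmul_ikj(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0f;

    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            const float aik = a[i * n + k];
            /* Same operation on every element of row i of C:
               independent iterations over contiguous data. */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}
```

The outer loop over rows is also independent per row, so rows could additionally be divided among cores, which is the thread-level form of data parallelism.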

Image Processing

Processing

Implementation of SIMD
Native vs. traditional programming
Auto-vectorization
  ◦ Detects low-level operations
  ◦ Converts sequential code to process from 2 up to 16 elements in one operation
Auto-parallelization
  ◦ Turns sequential code into multithreaded code

Auto-Parallelization
Intel C++ Compiler
  ◦ Serial sections of the input program -> multithreaded code
  ◦ The compiler also limits the overhead of creating threads
Intel® Architecture Code Analyzer
PGI CDK Cluster Development Kit
  ◦ AMD Opteron
  ◦ Intel Core 2

Auto-Vectorization
GNU Compiler for C and C++
  ◦ Nested loops and conditions
  ◦ Multidimensional arrays
PGI CDK Cluster Development Kit
  ◦ SSE vectorization
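A minimal sketch of a loop that an auto-vectorizer can usually handle: iterations are independent, memory access is contiguous, and `restrict` tells the compiler the arrays do not overlap. The function name `saxpy` is illustrative. With GCC, building at `-O3` enables the vectorizer and `-fopt-info-vec` reports which loops were vectorized; whether this particular loop is vectorized depends on the compiler version and target.

```c
#include <stddef.h>

/* y[i] = alpha * x[i] + y[i] for i in [0, n).
   restrict promises that x and y do not alias, which removes a
   dependence the vectorizer would otherwise have to guard against. */
void saxpy(size_t n, float alpha, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```

The source stays ordinary sequential C. The compiler, not the programmer, chooses how many elements to process per instruction.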

Intel Array Building Blocks (1)
Developed to utilize
  ◦ Multi-core processors
  ◦ Graphics processing units
Takes advantage of the SIMD units and the cores' processing elements
Portions of C/C++ code that contain parallelism can be used in conjunction with ArBB

Intel Array Building Blocks (2)
Data objects are isolated from the rest of the code
  ◦ Intel notes that this imposes restrictions
  ◦ These restrictions eliminate locks and data races
Threading by itself
  ◦ Does not provide access to per-core vector parallelism
The ArBB API provides software-level programming models for developers

References (1)
Intel Press, "Multi-Core Programming: Increasing Performance through Software Multithreading," pp. 2-6, 11-13, Apr. 2006.
Intel Corp., "Intel C++ Compiler 8.1 for Linux," Internet: ftp://download.intel.com/support/performancetools/c/linux/sb/clin81_relnotes.pdf, 2004, pp. 1-9 [Oct. 24, 2010].
Linux Kernel Organization, "Cell Programming Primer: Basics of SIMD Programming," Documents of PS3 Linux Distributor's Starter Kit. Internet: http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-docs/ps3-linux-docs-08.06.09/CellProgrammingTutorial/BasicsOfSIMDProgramming.html, 2006, 2007, 2008 [Oct. 24, 2010].
C. Chen, R. Raghavan, J. Dale, E. Iwata, "Cell Broadband Engine Architecture and its first implementation." Internet: http://www.ibm.com/developerworks/power/library/pa-cellperf/, Oct. 2005 [Oct. 24, 2010].
H. Chang, C. Cho, S. Wonyong, "Performance Evaluation of an SIMD Architecture with a Multi-bank Vector Memory Unit," IEEE Workshop on Signal Processing Systems Design and Implementation (SIPS '06), Oct. 2006, pp. 1520-6130.

References (2)
GCC GNU Project, "Auto-vectorization in GCC." Internet: http://gcc.gnu.org/projects/tree-ssa/vectorization.html, Aug. 2010 [Oct. 24, 2010].
Intel Software Network, "Performance Tools for Software Developers - Auto parallelization and /Qpar-threshold." Internet: http://software.intel.com/en-us/articles/performance-tools-for-software-developers-auto-parallelization-and-qpar-threshold/, Jul. 2009 [Oct. 24, 2010].
National Instruments, "Programming Strategies for Multicore Processing: Data Parallelism." Internet: http://zone.ni.com/devzone/cda/tut/p/id/6421, Nov. 2008 [Oct. 24, 2010].
A. Lanterman, "Multicore and GPU Programming for Video Games: Developing Code for Cell - SIMD." Internet: http://users.ece.gatech.edu/~lanterma/mpg09/, Fall 2010 [Oct. 24, 2010].
R. Michael Hord, "Parallel Supercomputing in SIMD Architectures," Boca Raton, FL: CRC Press, 1990.

References (3)
IBM Corp. and Sony Computer Entertainment, "Software Development Kit for Multicore Acceleration Version 3.0: Data Parallelism." Internet: http://users.ece.gatech.edu/~lanterma/mpg09/CBE_Programming_Tutorial_v3.0.pdf, Nov. 2008 [Oct. 24, 2010].
IBM Corp. and Sony Computer Entertainment (2006, 2007), "Software Development Kit for Multicore Acceleration (Version 3)." [On-line]. Internet: http://users.ece.gatech.edu/~lanterma/mpg09/CBE_Programming_Tutorial_v3.0.pdf [Oct. 24, 2010].
J. Demmel, "A closer look at parallel architectures: Lecture 9." Internet: http://www.eecs.berkeley.edu/~demmel/cs267/lecture09.html, Feb. 1996 [Oct. 24, 2010].
S. Morse, "Practical Parallel Computing," Boston: AP Professional, 1994.
C. Leopold, "Parallel and Distributed Computing: A Survey of Models, Paradigms and Approaches," New York: Wiley, 2001.

References (4)
L. Dong-hwan, S. Wonyong, "Importance of SIMD computation reconsidered," International Parallel and Distributed Processing Symposium, Proceedings, Apr. 2003, pp. 8.
W. C. Meilander, J. W. Baker, M. Jin, "Performance Evaluation of an SIMD Architecture with a Multi-bank Vector Memory Unit," IEEE Workshop on Signal Processing Systems Design and Implementation (SIPS '06), Oct. 2006, pp. 1520-6130.
http://www.gamasutra.com/view/feature/4248/designing_fast_crossplatform_simd_.php
http://domino.watson.ibm.com/comm/research.nsf/pages/r.arch.simd.html
Intel Array Building Blocks: http://software.intel.com/en-us/articles/intel-array-building-blocks/
http://www.wolfire.com/