A study on SIMD architecture
CDA 5106 - Group Project Presentation - Spring 2013
Mohammad Ahmadian, Gurkan Solmaz, Rouhollah Rahmatizadeh

Outline
• Introduction to SIMD & brief history
• SIMD architecture
• Matrix multiplication
• AES
• Implementation & results
• Conclusion

Introduction to SIMD
• Single instruction, multiple data

Introduction to SIMD
• SIMD processors:
  o have a single control unit (CU)
  o the CU reads instructions, decodes them, and sends control signals to the processing elements (PEs)
  o data are supplied to the PEs by a memory
  o # of data paths = # of PEs
• An interconnection network provides flexibility for moving data to and from the PEs
• The I/O system converts the data format
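The control-unit/PE arrangement above can be pictured with a small scalar model in plain C. This is only an illustrative sketch: the names `enum op`, `simd_step`, and `num_pes` are hypothetical and not from the slides, and the loop stands in for PEs that real hardware would run in lockstep.

```c
#include <stddef.h>

/* Hypothetical opcodes that the control unit can broadcast. */
enum op { OP_ADD, OP_MUL };

/* One control step: the single decoded operation is applied by every
   processing element (PE) to its own pair of operands. The loop models
   PEs that would work simultaneously in hardware. */
void simd_step(enum op op, const int *x, const int *y, int *out, size_t num_pes) {
    for (size_t pe = 0; pe < num_pes; pe++) {
        switch (op) {
        case OP_ADD: out[pe] = x[pe] + y[pe]; break;
        case OP_MUL: out[pe] = x[pe] * y[pe]; break;
        }
    }
}
```

Calling `simd_step(OP_ADD, x, y, out, 4)` corresponds to one instruction issued once and carried out on four data paths.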

Introduction to SIMD

Brief history
• First use in vector supercomputers, early 1970s (e.g. CDC STAR-100, TI ASC)
• Vector processing was popularized by Cray (1970s, 1980s)
• The first modern SIMD machines: massively parallel processing-style supercomputers such as the Thinking Machines CM-1 and CM-2
• The current era of SIMD is in desktop computers rather than supercomputers
• Desktop processors are powerful enough to support real-time gaming and video processing

Brief history
• Sun Microsystems introduced SIMD integer instructions in the VIS (Visual Instruction Set) extensions of the UltraSPARC I microprocessor (1995)
• MIPS introduced MDMX (MIPS Digital Media eXtension)
• Intel's MMX extensions to the x86 architecture (1996)
• AltiVec in Motorola's PowerPC and IBM's POWER systems
• Intel's SSE (Streaming SIMD Extensions)
• Sony's PlayStation 2 and Motorola's MPC7400

SIMD architecture
• Exploits a property of the data stream called "data parallelism"
• SIMD computing is also known as vector processing
• Programs are written for SISD machines and include SIMD instructions
• Vector length: the number of elements of a given data type that fit in one vector (e.g. a 128-bit vector holds four single-precision floating-point values, giving four-way operation)

SIMD architecture
• One obvious type of operation is intra-element: arithmetic (e.g. addition) and non-arithmetic (e.g. AND, XOR) operations applied to corresponding elements of the source vectors
• The other type is inter-element: operations between the elements of a single vector (e.g. vector permutes, logical shifts)
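The two kinds of operation can be contrasted in a short sketch using SSE intrinsics on a 128-bit vector of four floats. The function names `add_lanes` and `reverse_lanes` are made up for illustration; the code assumes an x86 target with SSE2.

```c
#include <emmintrin.h>  /* SSE2 header; also provides the SSE __m128 intrinsics */

/* Intra-element: lane i of the result depends only on lane i of each input. */
void add_lanes(const float a[4], const float b[4], float out[4]) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

/* Inter-element: the lanes of a single vector are rearranged (here, reversed). */
void reverse_lanes(const float a[4], float out[4]) {
    __m128 va = _mm_loadu_ps(a);
    /* _MM_SHUFFLE(0, 1, 2, 3): result lane 0 takes lane 3, lane 1 takes lane 2,
       lane 2 takes lane 1, and lane 3 takes lane 0. */
    _mm_storeu_ps(out, _mm_shuffle_ps(va, va, _MM_SHUFFLE(0, 1, 2, 3)));
}
```

In `add_lanes` no data crosses between lanes, while `reverse_lanes` does nothing except move data between lanes.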

SIMD architecture - AltiVec
• An AltiVec instruction can specify up to four distinct registers: two sources to hold operands, one filter/modifier, and one destination to hold the result
• Sources: VA, VB; filter/modifier: VC; destination: VT
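To make the role of the filter/modifier register concrete, here is a plain C model of a four-operand select, loosely modelled on AltiVec's vector select operation. It is not AltiVec code: the `vec128` type and `select_bits` function are hypothetical stand-ins, and the loop emulates what the hardware would do across the whole register at once.

```c
#include <stdint.h>

/* Hypothetical stand-in for one 128-bit vector register: four 32-bit lanes. */
typedef struct { uint32_t lane[4]; } vec128;

/* Four-operand select: for each bit, the result takes the VA bit where the
   VC bit is 0 and the VB bit where the VC bit is 1. */
vec128 select_bits(vec128 va, vec128 vb, vec128 vc) {
    vec128 vt;
    for (int i = 0; i < 4; i++) {
        vt.lane[i] = (va.lane[i] & ~vc.lane[i]) | (vb.lane[i] & vc.lane[i]);
    }
    return vt;
}
```

Here VA and VB supply the data, VC decides which source each bit comes from, and VT receives the result, so no source register is overwritten.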

SIMD architecture - Intel MMX/SSE
• Intel added eight new 128-bit registers for SSE
• The Pentium III (PIII) can dispatch a 64-bit add and a 64-bit multiply at the same time

Main idea
Use the SIMD instruction set to improve program performance.
Problem: there is no benchmark application implemented with SIMD for assessing performance.
Solution: create your own tools first.

Benchmarks
• Matrix Multiplication
  o It is the best candidate for a SIMD implementation because it operates on two-dimensional arrays of data
• AES Encryption Algorithm
  o It is the other candidate for a SIMD implementation; it has a vast range of applications, from mobile devices to distributed data centers, so improving AES also benefits several other fields

SIMD operations

Matrix Multiplication
Matrix multiplication is one of the most common numerical operations, especially in the area of dense linear algebra, where it forms the core of many important algorithms, including solvers of linear systems of equations, least-squares problems, and singular value and eigenvalue computations.

Matrix Multiplication (cont.)

Matrix Multiplication (cont.)
With SIMD instructions we could do it with 3 vector instructions

Matrix Multiplication (cont.)
We just apply 4 mulps instructions, then we have
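One way to read the "4 mulps" remark is the row-broadcast scheme sketched below for 4x4 single-precision matrices: each output row is built from four multiplies (each compiles to a `mulps`) and three adds. This is an assumption about the general approach, not the authors' actual code; `matmul4x4` is a hypothetical name and the matrices are assumed to be row-major.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* C = A * B for row-major 4x4 float matrices. C should not overlap A or B. */
void matmul4x4(const float *A, const float *B, float *C) {
    /* Load the four rows of B once. */
    __m128 b0 = _mm_loadu_ps(B + 0);
    __m128 b1 = _mm_loadu_ps(B + 4);
    __m128 b2 = _mm_loadu_ps(B + 8);
    __m128 b3 = _mm_loadu_ps(B + 12);

    for (int i = 0; i < 4; i++) {
        /* Row i of C = sum over k of A[i][k] * (row k of B):
           four multiplies and three adds per output row. */
        __m128 row = _mm_mul_ps(_mm_set1_ps(A[4 * i + 0]), b0);
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(A[4 * i + 1]), b1));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(A[4 * i + 2]), b2));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(A[4 * i + 3]), b3));
        _mm_storeu_ps(C + 4 * i, row);
    }
}
```

A scalar version needs 16 multiplies per output row, so each `mulps` replaces four scalar multiplies.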

AES Encryption
The Advanced Encryption Standard (AES) is a specification for the encryption of electronic data established by the U.S. National Institute of Standards and Technology (NIST) in 2001. A fast encryption algorithm is desirable in several fields, such as mobile devices, e-commerce, and cloud networks.

AES (Advanced Encryption Standard)
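The slides do not say which AES steps were vectorized. As one illustration of why AES suits 128-bit SIMD, the AddRoundKey step XORs the 16-byte state with a 16-byte round key, which fits exactly in one SSE2 register. The function name `add_round_key` is hypothetical, and this sketch covers only that single step, not a full cipher.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>

/* AddRoundKey: XOR the whole 16-byte state with the round key using one
   128-bit XOR instead of sixteen byte-wise XORs. Updates state in place. */
void add_round_key(uint8_t state[16], const uint8_t round_key[16]) {
    __m128i s = _mm_loadu_si128((const __m128i *)state);
    __m128i k = _mm_loadu_si128((const __m128i *)round_key);
    _mm_storeu_si128((__m128i *)state, _mm_xor_si128(s, k));
}
```

Because XOR is its own inverse, applying the same round key twice restores the original state, which gives a quick sanity check.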

SIMD instructions
Memory and initialization
• Load: __m128 _mm_loadu_ps(float *p);
• Set: __m128 _mm_set_ps(float z, float y, float x, float w);
• Store: void _mm_store_ps(float *p, __m128 a);
Integer/floating-point intrinsics
• Arithmetic: __m128 _mm_add_ps(__m128 a, __m128 b);
• Logical: __m128 _mm_and_ps(__m128 a, __m128 b);
• Shift: __m128i _mm_slli_si128(__m128i a, int imm);
• Conversion: int _mm_cvtsi128_si32(__m128i a);
• Comparison: __m128i _mm_cmpeq_epi8(__m128i a, __m128i b);
• Miscellaneous: int _mm_extract_epi16(__m128i a, int imm);
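A few of the intrinsics listed above can be exercised in a short sketch. The wrapper names (`set_and_store`, `shift_and_extract`, `equal_byte_mask`) are hypothetical. Two intrinsics that are not on the slide, `_mm_cvtsi32_si128` and `_mm_movemask_epi8`, are used as helpers; both are standard SSE2 intrinsics. The unaligned store `_mm_storeu_ps` is used so the output array needs no 16-byte alignment.

```c
#include <emmintrin.h>  /* SSE2 header; also provides the SSE __m128 intrinsics */

/* Set + Store: _mm_set_ps lists lanes from the highest down, so the last
   argument ends up in out[0]. */
void set_and_store(float out[4]) {
    __m128 v = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    _mm_storeu_ps(out, v);             /* out = {1, 2, 3, 4} */
}

/* Shift + Miscellaneous: move a 32-bit value up by 4 bytes, then read
   16-bit lane 2, which is the low half of 32-bit lane 1. */
int shift_and_extract(int x) {
    __m128i v = _mm_cvtsi32_si128(x);  /* x in the low 32 bits, rest zero */
    v = _mm_slli_si128(v, 4);          /* byte shift: x now sits in 32-bit lane 1 */
    return _mm_extract_epi16(v, 2);    /* zero-extended low 16 bits of x */
}

/* Comparison: bit i of the returned mask is 1 where a[i] == b[i]. */
int equal_byte_mask(const unsigned char a[16], const unsigned char b[16]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    return _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));
}
```

Note that `_mm_slli_si128` shifts by whole bytes, not bits, and that the shift count and lane index must be compile-time constants.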

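Array addition implementation

Original code:

    for (size_t i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }

SIMD optimized:

    size_t i = 0;
    for (; i + 4 <= N; i += 4) {            /* four floats per iteration */
        __m128 a = _mm_loadu_ps(A + i);
        __m128 b = _mm_loadu_ps(B + i);
        __m128 c = _mm_add_ps(a, b);
        _mm_storeu_ps(C + i, c);
    }
    for (; i < N; i++) {                    /* scalar tail when N is not a multiple of 4 */
        C[i] = A[i] + B[i];
    }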

Implementation challenges
• Compiler automatic vectorization
  o Microsoft Visual Studio does not support it
  o Intel compilers support it
• Manual optimization gives the best result.
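For comparison with manual intrinsics, the sketch below shows a loop written so that a vectorizing compiler has a fair chance of generating SIMD code by itself. Whether it is actually vectorized depends on the compiler, its version, and the optimization flags, so treat this as an assumption to verify with the compiler's own vectorization report. The function name `add_arrays` is hypothetical.

```c
#include <stddef.h>

/* The restrict qualifiers promise that c, a and b do not overlap, which
   removes an aliasing concern that can otherwise block vectorization. */
void add_arrays(float *restrict c, const float *restrict a,
                const float *restrict b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];   /* no cross-iteration dependency */
    }
}
```

The code stays portable C99 with no intrinsics, at the cost of leaving the final instruction selection to the compiler.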

The results
• Matrix multiplication: 2 times faster
• AES encryption: 23% faster

Conclusion
• In this project we studied SIMD architecture.
• SIMD is very useful in some applications.
• Not all algorithms can be vectorized.
• It needs human labor.
• Modern processors will reshape the way we think about programming.

References
• David A. Patterson and John L. Hennessy, "Computer Organization and Design: The Hardware/Software Interface", 1998, p. 751.
• Bertil Svensson, "SIMD Processor Array Architectures".
• Jon Stokes, "SIMD Architectures", Ars Technica, 2000. Web. 15 Apr. 2013.
• "SIMD", Wikipedia, Wikimedia Foundation, 15 Apr. 2013. Web.
• Intel® Advanced Vector Extensions Programming Reference, 2009.
• AP-930 Streaming SIMD Extensions - Matrix Multiplication, Intel Co., 1999.
• The animation of AES is borrowed from Enrique Zabala.

Question & answer
Q: Why is SIMD faster than naïve implementations of vector operations?
A: Because in vector processing a single instruction needs to be executed on multiple data elements, and SIMD does this with one instruction instead of one instruction per element.