SIMD: Single Instruction Multiple Data (Creative Commons License)
Curt Hill
SIMD
• Only successful when the data is highly parallel
• Where a very large amount of time is spent on array processing
• The array element processing is somewhat independent
  – Such as adding corresponding elements of two arrays
• There are plenty of applications, but they are specialized, usually scientific
Data level parallelism
• Suppose we have two arrays of 32 floating point operands and we want to add them
• A single processor will go down the line, summing one pair at a time
  – If it is superscalar with two FPUs it can do this in slightly more than 16 units of time; otherwise 32
• Not bad, but generally outperformed by array and vector processors
Array Processor
• Single control unit that drives multiple ALUs
  – The ALUs usually have individual memories
• In the previous case it took 16-32 units
• Here, if there are 32 floating point units and the vector register contains 32 slots, it takes one unit
• When adding two scalar variables the two would be the same speed, but when adding two array variables (length <= 32) the array processor would be 32 times faster
Why
• In most applications such parallelism would be a waste, but in many scientific applications an array of size 32 is pretty small, and substantial use can be made of this parallelism
• An array processor is a large number of identical processors that perform the same instruction on different pieces of data
  – Single control unit for the many processors
  – Parallel memories for the parallel processors
Examples
• ILLIAC IV was the first, in the late 1960s
  – Largely used by NASA for fluid dynamics calculations
  – Very large amount of parallelism in this application
• Thinking Machines Connection Machine 1 and 2
• Goodyear Massively Parallel Processor
• MasPar MP-1 and MP-2
Disadvantages
• Hardware heavy, hence expensive
  – Never mass produced since they fit a niche market
  – Register/memory configuration is unusual
• Difficult to program
  – Most languages have no support
  – High Performance Fortran is the usual choice
• Exceptional performance, but only on truly parallel computations
Vector processor
• Essentially a normal processor, usually superscalar and heavily pipelined
• What makes it different are its vector registers
  – A normal register contains a single value, either integer or floating point of some size
  – A vector register contains an array of these items that can be combined with array arithmetic
Crays
• Most of the early Cray supercomputers were vector processors
  – Cray also makes MIMDs
• Programmed more like a regular processor
  – There was usually a vector load/store instruction
• The number of vector registers was modest: the Cray-1 had eight, each holding 64 values
  – This made the cost more reasonable than a full array processor
  – The performance was not so lopsided for vector operations
Commercially
• The market for these sorts of array and vector processors is very limited
• Few organizations will always be able to utilize them
  – Usually national laboratories and sophisticated engineering companies
• In general it is a niche market
• However, there are some common examples as well
Intel MMX instructions
• The Pentium should not be considered a vector processor
• Yet it has vector operations in the MMX subset
  – The SSE sets extend these
• These allow one 64-bit MMX register to be treated as eight 8-bit values or four 16-bit values
• This allows array processing of 8-bit pixels or 16-bit sound samples
GPU
• The graphics processing unit is the most common vector processor
• The pixel manipulation present in a GPU is an ideal SIMD environment
• Shading, for example, can easily be done in parallel
• Let's consider one GPU: the ATI Radeon HD 4870
  – This is now several years old
  – Current GPUs are faster
Radeon HD 4870
• There are 10 cores
  – Each is a SIMD core
• Each core has 256 registers
• Each of these registers is actually a vector register of size 64
• Each slot holds a four-component vector of 4-byte floats, 16 bytes in all
• Multiply this out (10 × 256 × 64 × 16 bytes) and it is 2.5 MB of register storage
Exploiting the GPU
• There is substantial power sitting in the GPU
• Unless it is driving 3D moving displays (such as games) or playing video, most of this power sits idle
• A number of options are now available to use it for scientific computing
• GPGPU: General Purpose computing on Graphics Processing Units
Super Computers
• A number of groups have organized clusters of GPUs into supercomputers
• Example: the Chinese Mole-8.5 (2011)
  – 2200 NVIDIA Tesla GPUs
  – Used to simulate an H1N1 influenza virus
Finally
• The scientific big computers are a niche market
• Supercomputers have been fabricated using clusters of GPUs
  – This is likely the future of SIMD