Cell Processor (Cell Broadband Engine Architecture)
Mark Budensiek

Background
• Joint collaboration of IBM, Sony, and Toshiba (STI)
  - First implementation of the architecture in 2005
• Goal: develop a next-generation processor
  - Initially for the PlayStation 3
  - Other multimedia applications (Blu-ray, HDTV)
  - Server systems
  - Supercomputers

Synergistic Processing Element

Power Processor Element (PPE)
• The PPE is a 64-bit "Power Architecture" core
  - Capable of running POWER or PowerPC binaries
  - Acts as the controller for the 8 SPEs

Element Interconnect Bus (EIB)
• Connects the various on-chip elements
  - PPE, 8 SPEs, memory controller (MIC), and off-chip I/O interfaces
• Data-ring structure with bus-like control
  - 4 unidirectional rings; 2 of them run in the opposite direction to the other 2
  - Worst-case latency is therefore only half the distance around the ring
• Each ring is 16 bytes wide and runs at half the core clock frequency (core clock ~3.2 GHz)
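The ring arithmetic on this slide can be worked through in a few lines. This is a minimal Python sketch, not a model of the real EIB arbiter. The element count of 12 (PPE, 8 SPEs, MIC, and two I/O interfaces) is an assumption, since the slide does not say how many I/O interfaces sit on the ring, and the bandwidth figure is a per-ring peak only. All function names are illustrative.

```python
CORE_CLOCK_HZ = 3_200_000_000  # core clock quoted on the slide (~3.2 GHz)
RING_WIDTH_BYTES = 16          # each ring is 16 bytes wide
NUM_ELEMENTS = 12              # assumption: PPE + 8 SPEs + MIC + 2 I/O interfaces


def ring_bandwidth_bytes_per_s(core_clock_hz: int = CORE_CLOCK_HZ) -> int:
    """Peak bytes per second for one ring running at half the core clock."""
    return RING_WIDTH_BYTES * (core_clock_hz // 2)


def hops(src: int, dst: int, num_elements: int = NUM_ELEMENTS) -> int:
    """Shortest hop count between two ring stops, picking the better direction."""
    clockwise = (dst - src) % num_elements
    return min(clockwise, num_elements - clockwise)


def max_hops(num_elements: int = NUM_ELEMENTS) -> int:
    """Worst-case hop count when rings exist in both directions."""
    return max(hops(0, dst, num_elements) for dst in range(num_elements))
```

With a 3.2 GHz core clock each ring peaks at 16 bytes x 1.6 GHz = 25.6 GB/s, and with 12 ring stops no transfer needs more than 6 hops when the shorter direction is chosen.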

Synergistic Processing Elements
• An SPE is composed of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC).
  - The SPU is a SIMD, RISC-based processor (3.2 GHz)
  - The SPU’s ISA is a cross between VMX and the PS2’s Emotion Engine
• Single Instruction, Multiple Data (SIMD) organization
  - Multiple processing elements perform the same operation on multiple data items simultaneously
• Statically scheduled (the compiler plays a big role)
  - No dynamic branch prediction hardware (relies on compiler-generated hints)
• Each SPE consists of:
  - 128 x 128-bit register file
  - Local Store (SRAM)
  - DMA unit
  - FP, LD/ST, Permute, and Branch units (each pipelined)

SPE Architecture
[figure] Copyright: IBM

SPU Architecture Overview
• 128 general-purpose registers (GPRs), each 128 bits wide
• Support for 16-bit (halfword) and 32-bit (word) signed integers and 8-bit unsigned integers
• Support for single-precision (32-bit) and double-precision (64-bit) floating-point data
• No condition register
• Local storage: SPU loads and stores transfer quadwords between GPRs and local storage. The storage size can vary, but the address space is limited to 4 GB.
• Channel interface to external devices: data moves between GPRs and the channel interface
  - Up to 128 channels
• Supports up to 128 special-purpose registers (SPRs)
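To make the data types concrete, the sketch below views one 128-bit register as each of the element types listed on this slide. It is an illustration only: a register is modelled as a 16-byte Python `bytes` object, big-endian byte order is assumed, and the helper names are hypothetical rather than part of any SPU toolchain.

```python
import struct

REGISTER_BYTES = 16   # one 128-bit general-purpose register
NUM_REGISTERS = 128   # register count from the slide


def as_bytes(reg: bytes) -> tuple:
    """Sixteen unsigned 8-bit integers."""
    return tuple(reg)


def as_halfwords(reg: bytes) -> tuple:
    """Eight signed 16-bit integers."""
    return struct.unpack(">8h", reg)


def as_words(reg: bytes) -> tuple:
    """Four signed 32-bit integers."""
    return struct.unpack(">4i", reg)


def as_floats(reg: bytes) -> tuple:
    """Four single-precision floating-point values."""
    return struct.unpack(">4f", reg)


def as_doubles(reg: bytes) -> tuple:
    """Two double-precision floating-point values."""
    return struct.unpack(">2d", reg)


def register_file_bytes() -> int:
    """Total register file capacity: 128 registers of 16 bytes each."""
    return NUM_REGISTERS * REGISTER_BYTES
```

The same 16 bytes therefore hold 16, 8, 4, or 2 elements depending on the instruction that consumes them, and the whole register file comes to 2 KB.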

Data Layout in Registers
The leftmost word (bytes 0, 1, 2, and 3) of a register is called the preferred slot. When instructions use or produce scalar operands or addresses, the values are in the preferred slot. A set of store-assist instructions is available to help store bytes, halfwords, words, and doublewords.
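A small sketch of the preferred-slot convention, assuming a register is modelled as 16 big-endian bytes. The function names are hypothetical, and zeroing the remaining slots in `scalar_to_register` is a simplification for illustration, not something the slide specifies.

```python
def preferred_slot(reg: bytes) -> int:
    """Return the unsigned 32-bit value held in bytes 0-3 of a 16-byte register."""
    if len(reg) != 16:
        raise ValueError("a register is exactly 16 bytes")
    return int.from_bytes(reg[0:4], "big")


def scalar_to_register(value: int) -> bytes:
    """Place a 32-bit scalar in the preferred slot.

    The other three word slots are zeroed here for simplicity.
    """
    return (value & 0xFFFFFFFF).to_bytes(4, "big") + bytes(12)
```

A scalar address or loop counter thus occupies only the leftmost quarter of a register; the other 12 bytes are ignored by instructions that expect a scalar.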

SPE Local Store
• Each SPE has local on-chip memory, a.k.a. the Local Store (LS)
  - Holds both instructions and data
  - Visible to the PPE and can be addressed directly
  - Does not operate like a cache
• Data and instructions are transferred between the LS and system memory, or another SPE’s LS, using the DMA unit
  - 128 bytes at a time (transfer rate of 0.5 terabytes/sec)
  - DMA transactions are coherent
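Because the Local Store is not a cache, data only arrives in it when a transfer is requested explicitly. The sketch below imitates that with two Python buffers and a copy loop that moves 128 bytes per step. It is a toy model under stated assumptions: the 256 KB Local Store size is not given on the slide, `dma_get` is a hypothetical name, and none of the real MFC's alignment rules, tags, or queues are modelled.

```python
LOCAL_STORE_BYTES = 256 * 1024  # assumption: the slide does not state the LS size
DMA_BLOCK_BYTES = 128           # transfer granularity quoted on the slide


def dma_get(local_store: bytearray, ls_addr: int,
            main_memory: bytes, ea: int, size: int) -> int:
    """Copy `size` bytes from main memory into the local store, block by block.

    Returns the number of blocks that were moved.
    """
    if ls_addr < 0 or ea < 0 or size < 0:
        raise ValueError("addresses and size must be non-negative")
    if ls_addr + size > len(local_store) or ea + size > len(main_memory):
        raise ValueError("transfer out of range")
    blocks = 0
    for offset in range(0, size, DMA_BLOCK_BYTES):
        chunk = min(DMA_BLOCK_BYTES, size - offset)
        src = main_memory[ea + offset:ea + offset + chunk]
        local_store[ls_addr + offset:ls_addr + offset + chunk] = src
        blocks += 1
    return blocks
```

A program running on an SPE would issue such a transfer, wait for it to finish, and only then load the data from the Local Store into registers.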

SPU ISA Instructions
• 32 bits in length
• 6 basic instruction formats
RR Instruction Format: [figure]
RI7 Instruction Format: [figure]

SPU ISA Instructions (cont.)
RI10 Instruction Format: [figure]
RI16 Instruction Format: [figure]
RI18 Instruction Format: [figure]
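As an illustration of how a fixed 32-bit format packs its fields, the sketch below encodes and decodes an RR-style instruction. The field layout used here (an 11-bit opcode followed by three 7-bit register fields in the order RB, RA, RT) is my reading of the SPU ISA and should be treated as an assumption, as should the placeholder opcode value in the usage note. The function names are hypothetical.

```python
OPCODE_BITS = 11    # assumed width of the RR opcode field
REGISTER_BITS = 7   # 7 bits select one of 128 registers


def encode_rr(opcode: int, rb: int, ra: int, rt: int) -> int:
    """Pack an RR-format instruction into a 32-bit word."""
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode does not fit in 11 bits")
    for reg in (rb, ra, rt):
        if not 0 <= reg < (1 << REGISTER_BITS):
            raise ValueError("register number must be 0..127")
    return (opcode << 21) | (rb << 14) | (ra << 7) | rt


def decode_rr(word: int) -> tuple:
    """Unpack a 32-bit word into (opcode, rb, ra, rt)."""
    return ((word >> 21) & 0x7FF,
            (word >> 14) & 0x7F,
            (word >> 7) & 0x7F,
            word & 0x7F)
```

For example, `encode_rr(0x0C0, 5, 4, 3)` produces a single 32-bit word, and `decode_rr` recovers the four fields from it. The 7-bit register fields are what allow every instruction to name any of the 128 registers.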

Types of Instructions
• Memory – Load/Store
• Constant-Formation
• Integer and Logical
• Shift and Rotate
• Compare, Branch, Halt
• Hint for Branch
• Floating Point
• Control
• Channel

Memory – Load/Store Instructions
• The local storage address space is (up to) 2^32 bytes = 4 GB
• Local storage is byte-addressed
• Load/store instructions combine operands from one or two registers and/or an immediate value to form the effective address of the memory operand.
• Only aligned 16-byte quadwords can be loaded and stored. Therefore, the rightmost 4 bits of an effective address are always ignored and are assumed to be zero.
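The address rule can be sketched as follows: sum the operands, keep 32 bits, and clear the low 4 bits. This is an illustrative Python model with hypothetical function names. Wrapping the address to the size of the modelled local store is an assumption made so the example runs on a small buffer.

```python
ADDRESS_MASK = 0xFFFFFFFF  # the address space is at most 2**32 bytes
QUADWORD_BYTES = 16


def effective_address(base: int, index: int = 0, immediate: int = 0) -> int:
    """Form a quadword-aligned effective address from the operands."""
    return (base + index + immediate) & ADDRESS_MASK & ~0xF


def load_quadword(local_store: bytes, base: int, index: int = 0) -> bytes:
    """Load the aligned 16-byte quadword that contains the effective address.

    Assumption: addresses wrap at the size of the modelled local store.
    """
    ea = effective_address(base, index) % len(local_store)
    return bytes(local_store[ea:ea + QUADWORD_BYTES])
```

A load from address 0x13 therefore returns the quadword at 0x10 through 0x1F; selecting a particular byte or word inside it is left to later instructions.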

Memory – Load/Store Instructions
Example: Load Quadword (RR format) [figure]

Constant-Formation Instruction
• Loads immediate values into the target register
Example: Immediate Load Word [figure]
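A sketch of what an immediate load of a word could look like in software, assuming (as I read the SPU ISA) that a 16-bit immediate is sign-extended to 32 bits and copied into all four word slots of the target register. The register is modelled as a tuple of four unsigned 32-bit words, and the function name is hypothetical.

```python
def immediate_load_word(i16: int) -> tuple:
    """Sign-extend a 16-bit immediate and replicate it into four word slots."""
    i16 &= 0xFFFF
    value = i16 - 0x10000 if i16 & 0x8000 else i16  # sign extension
    return (value & 0xFFFFFFFF,) * 4
```

Loading the immediate 5 yields four copies of 5, while loading 0xFFFF (that is, -1) yields four words of all ones.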

Integer and Logical Instructions
• Full complement of arithmetic functions, e.g. Add, Subtract, Multiply, Generate Carry, Generate Borrow, Average, Sum, …
• Logical functions: And, Or, XOR, Nand, Nor, Equivalent, …
• Both register and immediate instruction formats
Examples: Add Word [figure]

Integer and Logical Instructions (cont.)
Example: And [figure]
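The two examples, Add Word and And, both operate on every slot of a register at once. The sketch below models a register as a tuple of four unsigned 32-bit words and applies the operation slot by slot. The function names are illustrative, and wrap-around on overflow is assumed for the addition.

```python
MASK32 = 0xFFFFFFFF


def add_word(ra: tuple, rb: tuple) -> tuple:
    """Add corresponding 32-bit word slots, wrapping on overflow."""
    return tuple((a + b) & MASK32 for a, b in zip(ra, rb))


def and_quadword(ra: tuple, rb: tuple) -> tuple:
    """Bitwise AND of corresponding word slots (the whole 128 bits)."""
    return tuple(a & b for a, b in zip(ra, rb))
```

One instruction thus performs four independent additions, which is the SIMD behaviour described earlier in the deck.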

Shift and Rotate Instructions
Example: Shift Left Halfword [figure]

Shift and Rotate Instructions
Example: Rotate Halfword [figure]
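The two halfword examples can be sketched per slot, with a register modelled as a tuple of eight unsigned 16-bit values. The count handling below follows my reading of the SPU ISA and is an assumption: the shift uses the low 5 bits of the matching halfword in the second operand and produces zero for counts above 15, while the rotate uses the low 4 bits. The function names are hypothetical.

```python
def shift_left_halfword(ra: tuple, rb: tuple) -> tuple:
    """Shift each 16-bit slot of ra left by the count in the matching slot of rb."""
    out = []
    for value, count in zip(ra, rb):
        count &= 0x1F  # assumed: 5-bit shift count
        out.append((value << count) & 0xFFFF if count < 16 else 0)
    return tuple(out)


def rotate_halfword(ra: tuple, rb: tuple) -> tuple:
    """Rotate each 16-bit slot of ra left by the count in the matching slot of rb."""
    out = []
    for value, count in zip(ra, rb):
        count &= 0xF  # assumed: 4-bit rotate count
        out.append(((value << count) | (value >> (16 - count))) & 0xFFFF)
    return tuple(out)
```

Each slot carries its own count, so a single instruction can shift eight halfwords by eight different amounts.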

Compare, Branch, and Halt Instructions
• Conditional branch
  - No condition code register
  - Uses a GPR value, usually set by a compare instruction
  - The register value is set to all 1’s or all 0’s based on the compare result
  - Logical compare instructions treat the operands as unsigned integers
• Halt instructions
  - Stop execution when the tested condition is met
  - The stop is not precise. As a result, execution cannot generally be restarted.

Compare, Branch, and Halt Instructions (cont.)
Examples: Compare Equal Word [figure], Branch If Not Zero Word [figure]
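Since there is no condition register, a compare writes its result into an ordinary register as a mask, and a branch then tests that register. The sketch below pairs the two examples under simplifying assumptions: registers are tuples of four unsigned 32-bit words, the branch looks only at the preferred slot (word 0), and the branch target is passed in directly rather than decoded from an instruction. The function names are hypothetical.

```python
ALL_ONES = 0xFFFFFFFF


def compare_equal_word(ra: tuple, rb: tuple) -> tuple:
    """Per word slot: all ones if the operands are equal, otherwise zero."""
    return tuple(ALL_ONES if a == b else 0 for a, b in zip(ra, rb))


def branch_if_not_zero_word(rt: tuple, target: int, next_pc: int) -> int:
    """Return the next program counter: the target if word 0 of rt is non-zero."""
    return target if rt[0] != 0 else next_pc
```

Because the mask is all ones or all zeros per slot, the same compare result can also feed logical instructions to select between values without branching at all.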

Compare, Branch, and Halt Instructions (cont.)
Example: Halt If Greater Than [figure]
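The difference between a signed comparison and a logical (unsigned) one matters for halt conditions. In this sketch the halt is modelled as a Python exception, registers are tuples of four unsigned 32-bit words, and only the preferred slot is compared. Those modelling choices and all names are assumptions for illustration.

```python
class SpuHalt(Exception):
    """Raised to model the SPU stopping execution."""


def to_signed32(value: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def halt_if_greater_than(ra: tuple, rb: tuple) -> None:
    """Halt when word 0 of ra is greater than word 0 of rb, compared as signed."""
    if to_signed32(ra[0]) > to_signed32(rb[0]):
        raise SpuHalt("signed comparison triggered a halt")


def halt_if_logically_greater_than(ra: tuple, rb: tuple) -> None:
    """Halt when word 0 of ra is greater than word 0 of rb, compared as unsigned."""
    if (ra[0] & 0xFFFFFFFF) > (rb[0] & 0xFFFFFFFF):
        raise SpuHalt("unsigned comparison triggered a halt")
```

With 0xFFFFFFFF in the first operand and 1 in the second, the signed form sees -1 > 1 as false and continues, while the logical form sees 4294967295 > 1 as true and halts.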

SPU ISA
The purpose is to achieve high performance on critical workloads for game, media, and broadband systems.
Key SPU workloads:
• Graphics pipeline, which includes subdivision and rendering
• Stream processing, which includes encoding, decoding, encryption, and decryption
• Modeling, which includes game physics
Implementations of the SPU ISA achieve better performance-to-cost ratios than general-purpose processors because they require half the power and half the chip area for equivalent performance.

SPU ISA and the 4 Principles
1. Simplicity favors regularity
  - All instructions are the same length.
  - All immediate instructions follow a similar format (fields in a common location).
  - Register-type instructions can vary in format depending on the number of registers used.
  - Register block is 128 x 128 (bit)
2. Smaller is faster
  - Large number of GPRs and SPRs
  - 32-bit instructions
3. Make the common case fast
  - Single-precision floating-point calculations
4. Good design demands good compromises
  - Large register size facilitates SIMD computations

Summary (of Cell)
• The Cell processor architecture is optimized for digital media and entertainment
• Facilitates convergence between supercomputing and entertainment, driven by the desire for realism
• Enables new classes of applications

Programming the Cell Is Challenging
Issues:
• Dividing a program among the different cores
• Generating instructions for the 8 SPEs in a different language (instruction set) than for the PowerPC core
• Need to think in terms of the SIMD nature of the dataflow to get maximum performance from the SPUs
• An SPU must use coherent DMA transfers through its local store to access system memory

Compiling and Binding of a Program on Cell
[figure] Copyright: IBM

Questions?