INF5062: Programming asymmetric multi-core processors
Introduction
29/8 - 2008

Disclaimer
§ Asymmetric and heterogeneous multi-core processors
  − This is a developing terminology
  − Asymmetric
    • Multi-core chips
    • Entirely identical instruction set
    • Asymmetry in speeds, frequencies, power consumption, etc.
  − Heterogeneous
    • Multi-core chips
    • Heterogeneous instruction sets
"Heterogeneous" is nowadays the better term; the course name will change next year
University of Oslo INF5062, Pål Halvorsen and Carsten Griwodz

Overview
§ Course topic and scope
§ Background for the use of, and parallel processing with, heterogeneous multi-core processors
§ Examples of heterogeneous architectures

INF5062: The Course

People
§ Håvard Espeland (TA)
  email: haavares @ ifi
§ Pål Halvorsen
  email: paalh @ ifi
§ Carsten Griwodz
  email: griff @ ifi

About INF5062: Topic & Scope
§ Content: The course gives …
  − … an overview of heterogeneous multi-core processors in general and three variants in particular (architectures and use)
  − … an introduction to working with heterogeneous multi-core processors
    • Intel IXP2400 network processor card
    • nVIDIA's G80 family of GPUs and the CUDA programming framework
    • The Cell Broadband Engine Architecture
  − … some ideas of how to use/program heterogeneous multi-core processors (regular and guest lectures)

About INF5062: Topic & Scope
§ Tasks: An important part of the course is the lab assignments, in which the students program each of the three example heterogeneous multi-core processors
  1. On the Intel IXP
    • Protocol statistics: download and run wwpingbump, then extend it to give processor, interface and protocol statistics
  2. On the Cell processor
    • Video encoding: download Motion JPEG compression software and improve its performance by using the Cell processor's SPEs
  3. On the nVIDIA graphics cards
    • Video encoding: the same goal as above, but exploit the parallelism of the G80 architecture

About INF5062: Exam
§ Prerequisite – mandatory assignment:
  − Lab assignment 2: solve task 2 or 3
  − Present it to the class
§ 2 graded assignments (counting 33% each):
  − Lab assignment 1: solve task 1
    • Deliver code
    • Make a demonstration to the class
    • Explain your design and code
  − Lab assignment 3: solve task 2 and 3
    • Deliver code
    • Demonstrate to the class the lab assignment that was not shown in the mandatory assignment
    • Explain your design and code
§ Final oral exam (counting 33%): early December 2008
  − Content of the lectures
  − Content of lab assignments
  − The experimental platforms used
  − Your own code

Available Resources
§ Resources will be placed at
  − http://www.ifi.uio.no/~griff/INF5062
  − Login: inf5062
  − Password: ixp
  − Manuals, papers, code examples, …

Background and Motivation: Moore’s Law

Motivation: Intel View
§ Soon more than a billion transistors integrated
§ Clock frequency can still increase
§ Future applications will demand TIPS (tera-instructions per second)
§ Power? Heat?

Motivation
“Future applications will demand TIPS”
“Think platform beyond a single processor”
“Exploit concurrency at multiple levels”
“Power will be the limiter due to complexity and leakage”
⇒ Distribute workload on multiple cores

Background and Motivation: Symmetric multi-processing

Symmetric Multi-Core Processors: Intel Dual-Core Xeon

Symmetric Multi-Core Processors: Phenom X4

Symmetric Multi-Core Processors: UltraSPARC

Intel Multi-Core Processors
§ Symmetric multi-processors allow multi-threaded applications to achieve higher performance with less die area and power consumption than single-core processors

Symmetric Multi-Core Processors
§ Good
  − Growing computational power
§ Problematic
  − Growing die sizes
  − Some cores used much more than others
  − Individual cores frequently unused
  − Many core parts frequently unused
§ Why not spread the load better?
  − Functions exist only once per core
  − Parallel programming is hard
⇒ Asymmetric multi-core processors

Asymmetric Multi-Core Processors
§ Asymmetric multi-processors consume power and provide increased computational power only on demand
(Figure labels: Sequential, Moderately parallel, Highly parallel)

Homogeneous Multi-Core Processors
§ Operating systems scale only to a limited number of threads
§ Where does the increase in core numbers stop?
  − How many applications need more than 64 threads?
  − Performance? Software limits are the issue
§ Application-specific engines?
  − Intel Academic Forum 2006: YES
  − But: Programming model???
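The slide does not name it, but one common way to reason about "where does the increase in core numbers stop?" is Amdahl's law: the sequential part of a program bounds the speedup that more identical cores can give. The sketch below is an illustration under that assumption; the function name and the example fractions are hypothetical and not taken from the course material.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup on `cores` identical cores when only
    `parallel_fraction` of the run time can be parallelized."""
    sequential_fraction = 1.0 - parallel_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workloads: sequential-heavy, moderately and highly parallel.
    for fraction in (0.5, 0.9, 0.99):
        for cores in (4, 64):
            print(f"parallel={fraction:.2f} cores={cores:2d} "
                  f"speedup={amdahl_speedup(fraction, cores):.2f}")
```

Under this model, a program that is 90% parallel gains less than a factor of 9 from 64 cores, which is one way to read the slide's question about how many applications need more than 64 threads.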

Motivation
“Future applications will demand TIPS”
“Think platform beyond a single processor”
“Exploit concurrency at multiple levels”
“Power will be the limiter due to complexity and leakage”
⇒ Distributed workload on multiple cores
  + simple processors are easier to program
  + consume less energy
⇒ heterogeneous multi-core processors

Background and Motivation: History of heterogeneous multi-processing

Co-Processors
§ The original IBM PC included a socket for an Intel 8087 floating point co-processor (FPU)
  − 50-fold speed up of floating point operations
§ Intel kept the co-processor up to i486
  − 486DX contained an optimized i487 block
  − Still separate pipeline (pipeline flush when starting and ending use)
  − Communication over an internal bus
§ Commodore Amiga was one of the earlier machines that used multiple processors
  − Motorola 680x0 main processor
  − Blitter (block image transferrer: moving data, fill operations, line drawing, performing boolean operations)
  − Copper (Co-Processor: change address for video RAM on the fly)
  − And finally: IBM PowerPC, relegating the 680x0 to a co-processor job

Graphics Processing Units (GPUs)
GPU: a dedicated graphics rendering device
First GPUs:
  ✓ 80s: for early 2D operations
    Amiga and Atari used a blitter; the Amiga also had the Copper
  ✓ 90s: 3D hardware for game consoles like PS and N64
    3dfx Voodoo: 3D add-on card for PCs
(Figure labels: bus connector & memory hub, 2D, 3D)

Graphics Processing Units (GPUs)
New powerful GPUs, e.g.:
  ✓ Nvidia GeForce GTX 280
  ✓ 30 400 MHz core
  ✓ 1 GB memory
  ✓ memory BW: 141+ GBps
  ✓ 1.296 GHz
  ✓ similar to other manufacturers …
(Figure labels: bus connector & memory hub)

General Purpose Computing on GPU
§ The
  − high arithmetic precision
  − extreme parallel nature
  − optimized, special-purpose instructions
  − available resources
  − …
  … of the GPU allow general, non-graphics related operations to be performed on the GPU
§ Generic computing workload is off-loaded from the CPU to the GPU
⇒ More generically: Heterogeneous multi-core processing

Background and Motivation: Heterogeneous multi-processing
nVIDIA GPUs

nVIDIA G92
§ nVIDIA GTX 280 (latest and greatest)
  − 1.4 billion transistors
  − 240 shaders
  − 512-bit memory bus (GDDR3)
  − 141.7 GB/s memory bandwidth
  − 933 Gflops
  − PCI Express 2.0
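The headline numbers on this slide can be cross-checked with a little arithmetic. The sketch below is a back-of-the-envelope check, not vendor code. It takes the 240 shaders and the 512-bit bus from the slide and the 1.296 GHz shader clock from the earlier GPU slide. Two inputs are assumptions that do not appear in the slides: 3 floating-point operations per shader per cycle (a multiply-add plus a multiply), and a 1107 MHz GDDR3 memory clock transferring data twice per cycle.

```python
def peak_gflops(shaders: int, shader_clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak: every shader issues `flops_per_cycle` each cycle."""
    return shaders * shader_clock_ghz * flops_per_cycle


def memory_bandwidth_gb_s(bus_width_bits: int, transfers_per_second: float) -> float:
    """Peak bandwidth: bytes moved per transfer times transfers per second."""
    return bus_width_bits / 8 * transfers_per_second / 1e9


if __name__ == "__main__":
    # Assumption: 3 flops per shader per cycle (multiply-add + multiply).
    print(f"peak compute: {peak_gflops(240, 1.296, 3):.1f} Gflops")
    # Assumption: 1107 MHz memory clock, double data rate (2 transfers/cycle).
    print(f"peak memory bandwidth: {memory_bandwidth_gb_s(512, 2 * 1107e6):.1f} GB/s")
```

With these assumed inputs the results agree with the 933 Gflops and 141.7 GB/s quoted on the slide. Both are theoretical peaks; real kernels usually achieve less.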

nVIDIA G92
§ Stream Multiprocessor
  − Per TPC (3 clusters)
    • 16 kB cache (in TEX)
  − Per SM
    • 16 kB level 1 cache
    • 64 kB shared memory
  − Global
    • 256 kB level 2 cache
§ Number of stream multiprocessors
  − 1 - Quadro NVS 130M
  − 16 - GeForce 8800 Ultra / GTX
  − 30 - GeForce GTX 280
  − 4 x 30 - Tesla S1070
(Figure: Streaming Multiprocessor (SM) block diagram. Labels: Instruction Fetch, Instruction L1 Cache, L1 Fill, Thread / Instruction Dispatch, Work, Control, Shared Memory, SP0–SP7 with register files (RF), SFU, Results, Constant L1 Cache, Load Texture, Load from Memory, Store to Memory)
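The SM counts above relate directly to the "240 shaders" quoted on the previous slide. The sketch below assumes 8 stream processors per SM, as suggested by the SP0 to SP7 labels in the SM diagram; the helper name and the dictionary are hypothetical.

```python
# Assumption: 8 stream processors (SP0..SP7) per streaming multiprocessor.
SPS_PER_SM = 8


def stream_processors(num_sms: int, sps_per_sm: int = SPS_PER_SM) -> int:
    """Total number of stream processors (shaders) on a device."""
    return num_sms * sps_per_sm


# SM counts as listed on the slide.
SM_COUNTS = {
    "GeForce 8800 Ultra / GTX": 16,
    "GeForce GTX 280": 30,
    "Tesla S1070": 4 * 30,
}

if __name__ == "__main__":
    for device, sms in SM_COUNTS.items():
        print(f"{device}: {sms} SMs -> {stream_processors(sms)} SPs")
```

For the GTX 280 this gives 30 x 8 = 240, matching the shader count on the previous slide.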

Memory Bandwidth for CPU and GPU
(Figure annotation: Marketed as GPGPUs)

Background and Motivation: Heterogeneous multi-processing
The Cell Broadband Engine

STI (Sony, Toshiba, IBM) Cell
§ Motivation for the Cell
  − Cheap processor
  − Energy efficient
  − For games and media processing
  − Short time-to-market
§ Conclusion
  − Use a multi-core chip
  − Design around an existing, power-efficient design
  − Add simple cores specific for game and media processing requirements

STI (Sony, Toshiba, IBM) Cell
§ Cell is a 9-core processor
  − combining a light-weight general-purpose processor with multiple co-processors into a coordinated whole
  − Power Processing Element (PPE)
    • conventional Power processor
    • not supposed to perform all operations itself, acting like a controller
    • running conventional OSes
    • 16 KB instruction/data level 1 cache
    • 512 KB level 2 cache

STI (Sony, Toshiba, IBM) Cell
  − Synergistic Processing Elements (SPE)
    • specialized co-processors for specific types of code, i.e., very high performance vector processors
    • local stores
    • can do general purpose operations
    • the PPE can start, stop, interrupt and schedule processes running on an SPE
  − Element Interconnect Bus (EIB)
    • internal communication bus
    • connects on-chip system elements:
      § PPE & SPEs
      § the memory controller (MIC)
      § two off-chip I/O interfaces
    • 25.6 GBps each way

STI (Sony, Toshiba, IBM) Cell
  − memory controller
    • Rambus XDRAM interface to Rambus XDR memory
    • dual channels at 12.8 GBps each, 25.6 GBps in total
  − I/O controller
    • Rambus FlexIO interface which can be clocked independently
    • dual configurable channels
    • maximum ~76.8 GBps
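To put the memory figure in perspective, the sketch below adds up the two channels and relates the total to the processor clock. The 3.2 GHz clock is the one these slides quote for the PlayStation 3 and the IBM blades; the function names are hypothetical, and this is a rough peak-rate estimate rather than a model of the real memory system.

```python
def aggregate_bandwidth_gbps(channels: int, per_channel_gbps: float) -> float:
    """Total peak bandwidth of several identical channels, in GB/s."""
    return channels * per_channel_gbps


def bytes_per_clock(bandwidth_gbps: float, clock_ghz: float) -> float:
    """Bytes the interface can deliver per processor clock cycle."""
    return bandwidth_gbps / clock_ghz


if __name__ == "__main__":
    memory = aggregate_bandwidth_gbps(2, 12.8)  # dual XDR channels
    print(f"memory bandwidth: {memory:.1f} GBps")
    print(f"bytes per 3.2 GHz clock: {bytes_per_clock(memory, 3.2):.1f}")
```

At peak, all nine cores together can pull about 8 bytes from main memory per clock cycle, which hints at why the SPEs work out of their local stores.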

STI (Sony, Toshiba, IBM) Cell
  − Cell has in essence traded running everything at moderate speed for the ability to run certain types of code at high speed
  − used for example in
    • Sony PlayStation 3:
      § 3.2 GHz clock
      § 7 SPEs for general operations
      § 1 SPE for security for the OS
    • Toshiba home cinema:
      § decoding of 48 HDTV MPEG streams: dozens of thumbnail videos simultaneously on screen
    • IBM blade centers:
      § 3.2 GHz clock
      § Linux ≥ 2.6.11

Background and Motivation: Heterogeneous multi-processing
IXP Network Processor

Review of General Data Path on Conventional Computer Hardware Architectures
(Figure: data paths for sending, receiving and forwarding. Labels: application; user space, kernel space; communication system: transport (TCP/UDP), network (IP), link. Overheads marked: copying, checksumming, fragmentation, interrupts)
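Checksumming is one of the per-packet costs marked on this data path. As a rough illustration of that work, the sketch below computes the 16-bit ones' complement Internet checksum described in RFC 1071, the algorithm used by IP, UDP and TCP. It is not taken from the course material, and the sample header bytes are only an example.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over `data` (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


if __name__ == "__main__":
    # Example IPv4 header with its checksum field (bytes 10-11) set to zero.
    header = bytes.fromhex("4500 0073 0000 4000 4011 0000 c0a8 0001 c0a8 00c7")
    print(f"checksum: {internet_checksum(header):#06x}")
```

A receiver can verify a header by running the same function over it with the checksum field filled in; a valid header yields 0. Every byte of the header has to be touched, which is why offloading such work to dedicated engines is attractive.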

IXA: Internet Exchange Architecture
§ IXA
  − a broad term to describe the Intel network architecture
  − HW & SW, control- & data plane
§ IXP: Internet Exchange Processor
  − processor that implements IXA
  − IXP1200 is the first IXP chip (4 versions)
  − IXP2xxx has now replaced the first version

IXA: Internet Exchange Architecture
§ IXP1200 basic features
  − 1 embedded 232 MHz StrongARM
  − 6 packet 232 MHz µengines
  − onboard memory
  − 4 x 100 Mbps Ethernet ports
  − multiple, independent buses
  − low-speed serial interfaces for external memory and I/O buses
  − …

IXA: Internet Exchange Architecture
§ IXP2400 basic features
  − 1 embedded 600 MHz XScale
  − 8 packet 600 MHz µengines
  − onboard memory
  − 3 x 1 Gbps Ethernet ports
  − multiple, independent buses
  − low-speed serial interfaces for external memory and I/O buses
  − …

IXP1200 Architecture
RISC processor:
  - StrongARM running Linux
  - control, higher layer protocols and exceptions
  - 232 MHz
Access units:
  - coordinate access to external units
Scratchpad:
  - on-chip memory
  - used for IPC and synchronization
Microengines:
  - low-level devices with limited set of instructions
  - transfers between memory devices
  - packet processing
  - 232 MHz

IXP1200
(Figure: IXP1200 block diagram. Labels: embedded RISC CPU (StrongARM); microengines 1–6; SCRATCH memory; SRAM access (SRAM, FLASH, memory-mapped I/O); SDRAM access (DRAM, DRAM bus); PCI access (PCI bus); IX access (IX bus); multiple independent internal buses)

IXP2400 Architecture
RISC processor:
  - StrongARM → XScale
  - 233 MHz → 600 MHz
Coprocessors:
  - hash unit
  - 4 timers
  - general purpose I/O pins
  - external JTAG connections (in-circuit tests)
  - several bulk cyphers (IXP2850 only)
  - checksum (IXP2850 only)
  - …
Slowport:
  - shared interface to external units
  - used for FlashRom during bootstrap
Media Switch Fabric (MSF):
  - forms fast path for transfers
  - interconnect for several IXP2xxx
Microengines:
  - 6 → 8
  - 233 MHz → 600 MHz
Receive/transmit buses:
  - shared bus → separate buses
(Figure: IXP2400 block diagram. Labels: embedded RISC CPU (XScale); microengines 1–8; coprocessor; SCRATCH memory; SRAM access (SRAM bus, SRAM); slowport access (FLASH); SDRAM access (DRAM bus, DRAM); PCI access (PCI bus); MSF access (receive bus, transmit bus); multiple independent internal buses)

Background and Motivation: Heterogeneous multi-processing
Summary

The End: Summary
§ Heterogeneous multi-core processors are already everywhere
⇒ Challenge: programming
  − Need to know the capabilities of the system
  − Different abilities in different cores
  − Memory bandwidth
  − Memory sharing efficiency
  − Need new methods to program the different components
§ Next time: how to start programming the Intel IXP