A many-core GPU architecture. Larrabee

GPUs vs CPUs Price, performance, and evolution.

Definitions
• CPU (Central Processing Unit): a general-purpose processor able to execute computer programs.
• GPU (Graphics Processing Unit): a dedicated graphics rendering device.

Price and Performance
• The NVIDIA GeForce 6800 Ultra can reach 40 GFLOPS, whereas a 3 GHz Intel Pentium 4 reaches only 6 GFLOPS. [1]
• More impressively, current cards such as the ATI HD 5870, AMD FireStream 9250, and NVIDIA GeForce 9800 run between 1 and 3 TFLOPS.
• Reasons for this include highly parallel vector processing, fast onboard memory, and pipeline constraints that let data stream without stalls.

Evolution
• GPU performance has approximately doubled every 6 months since the mid-1990s.
• CPU performance doubles every 18 months on average (Moore’s law).
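
The two doubling rates compound very differently over time. As a rough sketch, assuming idealized exponential growth at the rates quoted above (the function name `growth_factor` and the constants are illustrative, not taken from the source), the performance multiplier after a given number of months could be estimated like this:

```python
def growth_factor(months, doubling_period_months):
    """Performance multiplier after `months`, assuming idealized doubling."""
    return 2.0 ** (months / doubling_period_months)


# Doubling periods quoted on the slide (assumed to hold exactly).
GPU_DOUBLING_MONTHS = 6
CPU_DOUBLING_MONTHS = 18

if __name__ == "__main__":
    for years in (1, 3, 5):
        months = 12 * years
        gpu = growth_factor(months, GPU_DOUBLING_MONTHS)
        cpu = growth_factor(months, CPU_DOUBLING_MONTHS)
        print(f"{years} yr: GPU x{gpu:.1f}, CPU x{cpu:.1f}")
```

Under these assumptions, three years of growth gives a factor of 64 for a GPU against a factor of 4 for a CPU.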

Current trends How we use GPUs.

Alternative applications
• New trends show GPUs being used in scientific computing with data-parallel algorithms. Examples include:

Clustering: GPU clustering to simulate the dispersion of airborne contaminants in New York City.

Image Stitching: Fast seamless stitching and tone-mapping of gigapixel images (~1 hour on a notebook PC).

Molecular Dynamics: Molecular dynamics to evaluate forces between atoms that do not share bonds.
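
To illustrate why the molecular dynamics case is data-parallel, here is a minimal CPU-side sketch of a non-bonded force computation. It assumes a Lennard-Jones interaction, which the source does not specify. The function name `nonbonded_forces` and the parameters `epsilon` and `sigma` are hypothetical. The point is that the force on each atom `i` depends only on read-only positions, so each iteration of the outer loop could in principle run as an independent GPU thread.

```python
def nonbonded_forces(positions, bonded_pairs, epsilon=1.0, sigma=1.0):
    """Return the force on each atom from all atoms it is NOT bonded to.

    positions    : list of (x, y, z) tuples
    bonded_pairs : set of frozenset({i, j}) index pairs to exclude
    Assumes a Lennard-Jones potential (an illustrative choice).
    """
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):  # each i is independent: one thread per atom on a GPU
        for j in range(n):
            if i == j or frozenset((i, j)) in bonded_pairs:
                continue
            # Vector from atom j to atom i and its squared length.
            d = [positions[i][k] - positions[j][k] for k in range(3)]
            r2 = sum(c * c for c in d)
            s6 = (sigma * sigma / r2) ** 3  # (sigma / r)^6
            # F_i = 24*eps * (2*(sigma/r)^12 - (sigma/r)^6) / r^2 * d
            scale = 24.0 * epsilon * (2.0 * s6 * s6 - s6) / r2
            for k in range(3):
                forces[i][k] += scale * d[k]
    return forces


if __name__ == "__main__":
    atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    print(nonbonded_forces(atoms, set()))
    print(nonbonded_forces(atoms, {frozenset((0, 1))}))
```

With two atoms one `sigma` apart the forces are equal and opposite, pushing the atoms apart. Marking the pair as bonded excludes it, so both forces are zero.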

Architecture How it is built.

Key differences
TYPICAL GPU
• Ordered sequence of rendering steps.
• Fixed hardware dedicated to each step.
LARRABEE
• Runs most of its pipeline in software on multiple general-purpose x86 cores.
• This allows the rendering pipeline to be reconfigured dynamically, so steps can be skipped or extra resources allocated when required.
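
The idea of a pipeline defined in software can be sketched as an ordinary list of functions. This is only a toy model: the stage names `transform`, `cull`, and `shade` and the string-based "primitives" are hypothetical stand-ins, not Larrabee's actual stages. It shows how a software pipeline can drop a step at run time, which fixed-function hardware cannot do.

```python
def transform(prims):
    # Stand-in for a geometry transformation stage.
    return [p + ":transformed" for p in prims]


def cull(prims):
    # Stand-in for a culling stage: drop primitives flagged as hidden.
    return [p for p in prims if not p.startswith("hidden")]


def shade(prims):
    # Stand-in for a shading stage.
    return [p + ":shaded" for p in prims]


def run_pipeline(stages, prims):
    """Apply each stage in order; the stage list itself is just data."""
    for stage in stages:
        prims = stage(prims)
    return prims


full_pipeline = [transform, cull, shade]
# Reconfigure dynamically: skip the culling step.
no_cull_pipeline = [s for s in full_pipeline if s is not cull]

if __name__ == "__main__":
    scene = ["tri0", "hidden1"]
    print(run_pipeline(full_pipeline, scene))
    print(run_pipeline(no_cull_pipeline, scene))
```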

Larrabee CPU Core
• The Larrabee core is “derived” from the Pentium processor.
• 1 scalar unit for single operations and 1 vector unit for multiple operations.
• 32 KB L1 data and instruction cache.
• 256 KB L2 cache per core; the L2 caches share a ring network.

Details
• The 32 KB L1 cache is 4 times larger than the original Pentium's 8 KB.
• The extra capacity is needed because each core performs four-way multithreading, which reduces thread-switching overhead. (Not to be confused with simultaneous multithreading.)
• The 256 KB L2 caches share a ring network. If a core cannot find data in its own L2 cache, it places a request on the ring network, and the data is eventually delivered to its L2.
• Uses a rendering technique called binning, which divides the screen into regions and renders polygons region by region.
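
Binning can be illustrated with a small sketch that assigns triangles to screen tiles. It makes simplifying assumptions that are not from the source: integer pixel coordinates, square tiles, and a conservative bounding-box test, so a triangle may be listed in a tile it does not actually touch. The function name `bin_triangles` and its parameters are hypothetical.

```python
def bin_triangles(triangles, screen_w, screen_h, tile_size):
    """Map each tile (tx, ty) to the indices of triangles whose
    bounding box overlaps that tile (a conservative test)."""
    tiles_x = (screen_w + tile_size - 1) // tile_size
    tiles_y = (screen_h + tile_size - 1) // tile_size
    bins = {(tx, ty): [] for ty in range(tiles_y) for tx in range(tiles_x)}

    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Skip triangles that lie entirely off screen.
        if max(xs) < 0 or max(ys) < 0 or min(xs) >= screen_w or min(ys) >= screen_h:
            continue
        # Clamp the bounding box to the screen, then convert to tile indices.
        tx0 = max(0, min(xs)) // tile_size
        tx1 = min(screen_w - 1, max(xs)) // tile_size
        ty0 = max(0, min(ys)) // tile_size
        ty1 = min(screen_h - 1, max(ys)) // tile_size
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return bins


if __name__ == "__main__":
    tris = [
        [(10, 10), (20, 10), (10, 20)],    # fits inside the top-left tile
        [(50, 50), (100, 50), (50, 100)],  # bounding box spans all four tiles
    ]
    for tile, members in sorted(bin_triangles(tris, 128, 128, 64).items()):
        print(tile, members)
```

Once the triangles are binned, each tile can be rendered independently from its own list, which is what makes the approach suit many cores working in parallel.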

Benefits of Larrabee
• Game physics
• Real-time ray tracing
• Image and video processing
• Physical simulation
• Extended rendering capabilities

References
[1] Zhe Fan, Feng Qiu, A. Kaufman, S. Yoakum-Stover. GPU Cluster for High Performance Computing. ACM/IEEE Supercomputing Conference 2004, November 6-12, 2004, Pittsburgh, PA.
[2] L. Seiler et al. Larrabee: A Many-Core x86 Architecture for Visual Computing. ACM Transactions on Graphics, vol. 27, no. 3, Article 18, August 2008.