
Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing
Luiz A. Barroso et al. (Compaq Computer Corporation)
Presented by: Nick Kirchem
Feb 13, 2004

Target and Motivation
- Commercial applications (databases, OLTP)
  - Most important market for high-performance servers
  - Data-dependent computation (low ILP)
  - Little gained by complex multiple-issue out-of-order processors
- Complexity of current processors
  - Long design times
  - High development costs
  - Better use of transistors?

Project Goals
- Design a Chip Multiprocessing (CMP) System
  - Integrate 8 simple processor cores on a single chip
  - Exploit thread-level parallelism instead of ILP
- High Performance, Low Cost
  - Achieve superior performance on commercial workloads
  - Small team, modest investment, short design time

Architecture Overview

Architecture Elements
- Simple Processors (500 MHz, In-Order)
- No I/O capability on chip (separate I/O nodes)
- Up to 1024 nodes in a system
- Individual L1 Caches (64 KB, 2-way set-assoc)
- One Logical L2 Cache, interleaved, 1 MB
- Intra-Chip Switch
  - Unidirectional crossbar
  - Transaction based, atomic transfers
  - Bandwidth ~3x memory bandwidth
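To make "one logical L2 cache, interleaved" concrete, here is a minimal sketch of how an address could be mapped to an L2 bank. The bank count (8), the line size (64 bytes), and interleaving on the low-order line-number bits are assumptions for illustration; the slide only states that the 1 MB L2 is interleaved.

```python
LINE_BYTES = 64   # assumed cache-line size (not stated on the slide)
NUM_BANKS = 8     # assumed number of L2 banks (not stated on the slide)

def l2_bank(addr: int) -> int:
    """Return the L2 bank that would hold the line containing `addr`."""
    line_number = addr // LINE_BYTES
    return line_number % NUM_BANKS

# Consecutive cache lines land in consecutive banks, spreading traffic.
banks = [l2_bank(line * LINE_BYTES) for line in range(10)]
print(banks)  # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]
```

Under this mapping, a sequential scan touches every bank in turn, so the banks together behave as one logical cache while serving requests in parallel.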

Intra-Chip Cache Coherence
- MESI protocol
- No Inclusion (1 MB aggregate L1, 1 MB L2)
  - But L2 holds a copy of L1 tags and state (no snooping required at L1)
  - L1 filled directly from memory (L2 = victim cache)
- Coherence handled by L2 controllers
  - Can service request directly, forward to owner L1, forward to protocol engine, or obtain from memory
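The four outcomes available to an L2 controller can be sketched as a small dispatch function. This is a simplified illustration, not the actual Piranha controller logic: the function name, the data structures, the `is_local_memory` predicate, and the order in which the cases are checked are all hypothetical.

```python
from enum import Enum, auto

class Action(Enum):
    SERVICE_FROM_L2 = auto()
    FORWARD_TO_OWNER_L1 = auto()
    FORWARD_TO_PROTOCOL_ENGINE = auto()
    FETCH_FROM_MEMORY = auto()

def handle_l1_miss(line, requester, l2_lines, l1_owner, is_local_memory):
    """Pick one of the four outcomes listed on the slide for an L1 miss.

    l2_lines: set of line addresses currently held in this L2 bank.
    l1_owner: dict mapping line address -> id of the L1 that owns it,
              standing in for the duplicate L1 tags/state kept at the L2.
    is_local_memory: hypothetical predicate, True if this node is the
              home of the line.
    """
    owner = l1_owner.get(line)
    if owner is not None and owner != requester:
        # Another on-chip L1 owns the line; no L1 snooping is needed
        # because the L2 already knows the owner from its tag copy.
        return Action.FORWARD_TO_OWNER_L1
    if line in l2_lines:
        return Action.SERVICE_FROM_L2
    if not is_local_memory(line):
        # Line lives on another node: hand off to the protocol engine.
        return Action.FORWARD_TO_PROTOCOL_ENGINE
    return Action.FETCH_FROM_MEMORY

l2_lines = {0x100}
l1_owner = {0x200: 3}
is_local = lambda line: line < 0x1000   # toy home-node test
print(handle_l1_miss(0x200, 0, l2_lines, l1_owner, is_local))
```

The sketch ignores the case where a locally homed line is held by a remote node, which would also involve the protocol engine.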

Inter-Node Coherence
- Protocol Engines (microprogrammable controllers)
  - Home: exports local memory
  - Remote: imports remote memory
- Directory Storage
  - Compute ECC at coarse granularity, use extra bits for directory info (no memory space overhead)
  - Directory granularity = 1 node (not individual processor)
- Interconnect: I/O queues, router (point-to-point, 4 links)
- No NAKs: avoid deadlock by sufficient buffering, and guarantee forwarded requests can be serviced
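The claim that coarser ECC frees bits for the directory can be checked with a short calculation. The sketch below assumes a 64-byte line, a SEC-DED Hamming code, and a move from 64-bit to 256-bit ECC granularity; none of these figures appear on the slide, so treat them as illustrative assumptions.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a single-error-correct, double-error-detect Hamming code."""
    r = 1
    while 2 ** r < data_bits + r + 1:   # Hamming bound for single-error correction
        r += 1
    return r + 1                        # one extra parity bit for double-error detection

LINE_BITS = 512          # assumed 64-byte cache line
FINE, COARSE = 64, 256   # assumed ECC granularities, in bits

fine_total = (LINE_BITS // FINE) * secded_check_bits(FINE)
coarse_total = (LINE_BITS // COARSE) * secded_check_bits(COARSE)
freed_bits = fine_total - coarse_total
print(fine_total, coarse_total, freed_bits)  # 64 20 44
```

Under these assumptions, each line gains 44 spare bits in storage that was already provisioned for ECC, which is why the directory adds no memory space overhead.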

Performance Evaluation
- OLTP and DSS workloads: TPC-B/D, Oracle database
- SimOS-Alpha environment
- Compared:
  - Piranha (P8) @ 500 MHz and Full-Custom (P8F) @ 1.25 GHz
  - Next-generation Microprocessor (OOO) @ 1 GHz
- Single Chip Evaluation
  - OOO outperforms P1 (individual proc) by 2.3x
  - P8 outperforms OOO by 3x
  - Speedup of P8 over P1 = 7x
- Multi-chip Configurations
  - Four chips (only 4 CPUs per chip?!)
  - Results show that Piranha scales better than OOO
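The three single-chip figures are mutually consistent, as a quick calculation shows. The sketch normalizes everything to P1 and uses only the rounded ratios from the slide; the "efficiency" figure is a derived illustration, not a number reported in the presentation.

```python
p1 = 1.0            # single Piranha core, used as the baseline
ooo = 2.3 * p1      # OOO outperforms P1 by 2.3x
p8 = 3.0 * ooo      # P8 outperforms OOO by 3x

speedup_p8_over_p1 = p8 / p1
efficiency = speedup_p8_over_p1 / 8   # fraction of ideal 8-core scaling
print(round(speedup_p8_over_p1, 1), round(efficiency, 2))  # 6.9 0.86
```

The product 2.3 x 3 = 6.9 matches the quoted speedup of about 7x for eight cores.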

Questions/Concerns
- Would the Piranha design be worthwhile if there were a well-designed SMT processor (with 4 or 8 threads)?
- Is reliability better or worse with multiple processors per chip?
- Power consumption?