7th Workshop on Scalable Shared Memory Multiprocessors

25th Annual International Symposium on Computer Architecture

Memory System Performance of High End SMPs, PCs and Clusters of PCs

Ch. Kurmann, T. Stricker
Laboratory for Computer Systems
ETHZ - Swiss Institute of Technology
CH-8092 Zurich

Color Slides: http://www.cs.inf.ethz.ch/CoPs/isca98ws/

Eidgenössische Technische Hochschule Zürich
Ecole polytechnique fédérale de Zurich
Politecnico federale di Zurigo
Swiss Federal Institute of Technology Zurich

Memory Systems
- Low End designs in PCs:
  - extremely low cost
  - standard I/O interface
- High End designs in “Killer” Workstations:
  - well engineered memory systems
  - support for additional data streams
  - better I/O buses
- Are Low End SMPs the universal compute nodes for parallel and distributed systems?

Contribution
- The answer probably depends on memory system performance.
- How significant are the differences in memory system performance?
- Limitations of Low End memory systems:
  - for local computation (e.g. in scientific applications)
  - for inter-node communication (e.g. in databases)

Extended Copy Transfer Characterization
ECT is a method to characterize the performance of memory systems (ISCA 95 and HPCA 97):
- Categories:
  - Access pattern, stride (spatial locality)
  - Working set (temporal locality)
- Value:
  - Transfer bandwidth (large amount of data)
- Same chart resulting from one microbenchmark:
  - Local and Remote transfers
  - compute and communicate accesses
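To make the two categories concrete, the following is a minimal sketch of a strided-load measurement for one (working set, stride) point. It is not the authors' ECT benchmark: the function names `strided_load` and `load_bandwidth_mbytes`, the use of `clock()` as the timer, and the choice to count only the words actually loaded are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Load every `stride`-th 64-bit word of a working set of `words` words,
 * `passes` times. The sum is returned so the loads cannot be discarded. */
static uint64_t strided_load(const uint64_t *buf, size_t words,
                             size_t stride, size_t passes)
{
    uint64_t sum = 0;
    for (size_t p = 0; p < passes; p++)
        for (size_t i = 0; i < words; i += stride)
            sum += buf[i];
    return sum;
}

/* Hypothetical helper: bandwidth in MByte/s for one chart point.
 * Returns 0.0 when the run is too short for clock() to resolve. */
static double load_bandwidth_mbytes(size_t working_set_bytes,
                                    size_t stride, size_t passes)
{
    assert(stride >= 1 && passes >= 1);
    assert(working_set_bytes >= sizeof(uint64_t));

    size_t words = working_set_bytes / sizeof(uint64_t);
    uint64_t *buf = malloc(words * sizeof *buf);
    assert(buf != NULL);
    for (size_t i = 0; i < words; i++)   /* touch all pages before timing */
        buf[i] = i;

    clock_t t0 = clock();                /* CPU time; assumed adequate here */
    volatile uint64_t sink = strided_load(buf, words, stride, passes);
    clock_t t1 = clock();
    (void)sink;
    free(buf);

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    size_t loads = ((words + stride - 1) / stride) * passes;
    double bytes = (double)loads * sizeof(uint64_t);
    return secs > 0.0 ? bytes / secs / 1e6 : 0.0;
}
```

Sweeping `working_set_bytes` over powers of two and `stride` over values such as 1, 2, 3, ... 128 would give one value per chart cell. For small working sets `passes` has to be large enough that the timed interval is well above the timer resolution.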

Measurement Problems
Some parameter combinations are hard to measure, even with carefully tuned C code:
- Reduced performance for large strides and small working sets in L1 caches is a measurement artifact and not architecture related.
- Compilers occasionally generate suboptimal instruction schedules for loads / stores.

Local Load Access: Pentium Pro PC
[Figure: Pentium Pro FX, one processor, 200 MHz. Load bandwidth (MByte/s, axis 0 to 600) versus access pattern (stride between 64-bit words, 1 to 128) and working set (0.5 K to 16 M). Labeled regions: L1, L2, DRAM.]

Local Load Access: SGI Origin
[Figure: SGI Origin 10000, one processor, 195 MHz. Load bandwidth (MByte/s, axis 0 to 1600) versus access pattern (stride between 64-bit words, 1 to 128) and working set (0.5 K to 64 M). Labeled regions: L1, L2.]

Local Load Access: DEC 8400
[Figure: DEC Alpha 8400, one processor, 300 MHz. Load bandwidth (MByte/s, axis 0 to 1200) versus access pattern (stride between 64-bit words, 1 to 128) and working set (0.5 K to 64 M). Labeled regions: L1, L2, L3.]

Local Load Access: Sun Enterprise
[Figure: Sun Ultra Enterprise, one UltraSPARC II, 248 MHz. Load bandwidth (MByte/s, axis 0 to 700) versus access pattern (stride between 64-bit words, 1 to 128) and working set (0.5 K to 16 M). Labeled regions: L1, L2, DRAM.]

Local Load Access: SGI Cray T3E
[Figure: one processor, 300 MHz. Load bandwidth (MByte/s, axis 0 to 1200) versus access pattern (stride between 64-bit words, 1 to 128) and working set (0.5 K to 16 M). Labeled regions: L1, L2, DRAM.]

Comparison - Local Access

Performance in an SMP setting
- Copy bandwidth decreases for simultaneous access with 1, 2, 4 and 8 processors
- Topics of interest:
  - small working sets in caches: performance remains the same
  - large working sets in memory: interesting differences
  - behavior for even/uneven strides
- “Gather copy stream” (strided load / contiguous store)
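The "gather copy stream" named on this slide can be sketched as a loop that loads with a stride and stores contiguously. This is an illustrative sketch only; the function name `gather_copy` and its signature are assumptions, and the slides do not show the authors' tuned copy loops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Gather copy: load every `stride`-th 64-bit word from `src` and store
 * the loaded words back to back in `dst`. Returns the number of words
 * copied. `dst` must have room for ceil(src_words / stride) words. */
static size_t gather_copy(uint64_t *dst, const uint64_t *src,
                          size_t src_words, size_t stride)
{
    assert(stride >= 1);
    size_t n = 0;
    for (size_t i = 0; i < src_words; i += stride)
        dst[n++] = src[i];   /* strided load, contiguous store */
    return n;
}
```

With stride 1 this degenerates to a plain contiguous copy. In an SMP measurement each processor would run such a loop on its own buffers at the same time, so that the per-processor copy bandwidth can be compared for 1, 2, 4 and 8 processors.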

Local Copy: Pentium Pro SMP

Local Copy: SGI Origin CC-NUMA

Local Copy: DEC 8400 SMP

Local Copy: Sun Enterprise SMP

Remote in Parallel Computers
[Diagram: two system organizations built from processors (P), caches (C) and memory (M), connected by a bus or network.]
- Parallel & Network Computers: SGI Cray T3E, SGI Origin, Clusters of PCs (CoPs)
- Symmetric Multiprocessors: DEC 8400, Sun Enterprise, Pentium Pro SMPs
Legend: P = Processor, C = Caches, M = Memory

Remote Transfers: CoPs Pentium Pro with SCI / Myrinet
[Figure: remote copy bandwidth (MByte/s, axis 0 to 80) versus access pattern (stride between 64-bit words, 1 to 128). Three curves: local copy, remote copy by Myrinet, remote copy by SCI.]

Remote Transfers: SGI Origin

Remote Transfers: DEC 8400

Remote Transfers: SGI Cray T3E

Comparison - Remote Transfers

Improvement of PC Chipsets
- Intel 440BX AGP Chip Set, 400 MHz / 100 MHz
- Intel 440LX AGP Chip Set, 233 MHz / 66 MHz
- Intel 440FX Natoma Chip Set, 200 MHz / 66 MHz

Conclusion
- ECT characterizations for different memory systems:
  - T3E (MPP node), Origin (NUMA), DEC 8400 (SMP)
  - CoPs: Intel P6 SMPs and Clusters
- High End SMP vs. Low End SMP:
  - Less than half the performance on two-processor PCs.
- Fast communication puts high demands on the memory system:
  - Unlike in traditional SMPs and CC-NUMAs, fine-grained remote accesses do not perform at all in PC SMPs and CoPs.
- Adding more commodity microprocessors without reinforcing the memory system is therefore questionable.