SR8000 Concept
Tim Lanfear
Hitachi Europe Gmb

SR8000 Model Range
All Rights Reserved. Copyright © 2000 Hitachi Europe Ltd.

SR8000 Appearance

Compact Model

Vector vs SMP vs MPP
[Table: columns Feature, Vector, SMP, MPP; rows Single Node Performance, Scalability, Programming]

System Architecture
[Diagram labels: Cross-bar Inter-node Network; Node (PRN); Node (ION); CPU; Main Memory; Network Control; PCI; Ether, ATM, HIPPI; RAID Disk; System Control; Service Processor; Console]

Programming Models

CPU Architecture
• 16 bytes/cycle memory BW
• 128 Kbyte L1 cache
• Pre-fetch and pre-load instructions
• 160 f.p. registers
• 2 f.p. pipelines
• 4 flops/cycle
[Diagram labels: Main Memory, Memory Switch, Cache, Floating Point Registers, Arithmetic Unit; data paths Pre-fetch, Pre-load, Load]
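
As a rough worked example of what these figures imply, the ratio of memory bandwidth to peak floating-point rate follows directly from the two numbers on this slide. The 8-byte operand size is an assumption (double precision); the slide does not state it.

```latex
\[
\frac{16\ \text{bytes/cycle}}{4\ \text{flops/cycle}} = 4\ \text{bytes/flop}
\quad\Longrightarrow\quad
\frac{4\ \text{bytes/flop}}{8\ \text{bytes/word}} = 0.5\ \text{words/flop}
\]
```

On this reading, a loop that needs more than one 8-byte operand from memory for every two floating-point operations would be limited by memory bandwidth rather than by the arithmetic pipelines.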

Slide Window Registers
Sliding part: 0 to 127
Global part: 128 to 159
• Registers for all instructions
• Registers for extended instructions only
• Fixed registers: 4, 8, 16, 32 (16 illustrated)
• Fixed + sliding = 128
[Figure: mapping of logical to physical registers, shown for Base=2 and Base=4 with 16 fixed registers]

Instruction Set Extensions
• Load and store with extended registers
• Floating point arithmetic with extended registers
• Slide window control
• Pre-fetch and pre-load
• Thread start-up and finish
• Predicate instructions

SR8000 Programming
Instruction Level Parallelism (Pseudo-vector Processing: PVP)

Pre-fetch and Pre-load
• Pre-fetch: load cache line from memory to cache
• Pre-load: load one word from memory to register
• 16 streams
[Diagram labels: Main Memory, Switch, Cache, Floating Point Registers, Arithmetic Unit; data paths Pre-fetch, Pre-load, Load]

Pre-fetch
• Pre-fetch 128 bytes to cache
• Followed by LD to register
[Timing diagram: iterations 1 to 6; each pre-fetch (PF) is followed, after the memory latency, by loads (LD) and use of the data]
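
A small worked figure may help here. Assuming 8-byte (double precision) array elements accessed with unit stride, neither of which the slide states, one 128-byte pre-fetch supplies the data for several loop iterations:

```latex
\[
\frac{128\ \text{bytes per pre-fetch}}{8\ \text{bytes per element}} = 16\ \text{elements per pre-fetch}
\]
```

Under those assumptions a single PF can be followed by up to 16 LD instructions that hit in the cache before the next line is needed.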

Pre-load
[Timing diagram: iterations 1 to 6; each pre-load (PL) is followed, after the memory latency, by use of the data]

Software Pipelining
Resources: registers, f.p. units, instruction issue, memory bandwidth etc.
[Diagram: schedules of iterations I=1, I=2, I=3 for three cases (no SWPL, infinite resource, finite resource) with the initiation interval marked; a further panel labelled Recurrence shows the statements a= and =a in iterations I=1, I=2, I=3]
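
The benefit of overlapping iterations can be put into a simple idealised timing model. The symbols are introduced here for illustration and do not appear on the slide: $N$ is the number of iterations, $L$ the latency of one iteration in cycles, $II$ the initiation interval, and $II_{\mathrm{res}}$, $II_{\mathrm{rec}}$ the lower bounds imposed by finite resources and by a recurrence.

```latex
\[
T_{\text{no SWPL}} = N \cdot L,
\qquad
T_{\text{SWPL}} = L + (N-1)\cdot II,
\qquad
II \ge \max\left(II_{\mathrm{res}},\ II_{\mathrm{rec}}\right)
\]
```

For example, with the purely illustrative values $L = 10$, $II = 2$ and $N = 100$, the model gives $1000$ cycles without software pipelining and $10 + 99 \cdot 2 = 208$ cycles with it.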

Pseudo-vector Processing
A(:) = A(:) + N
[Diagram: pseudo-vector execution; surviving stage labels: PF, Lat, LD, +]

Effect of PVP
Dot product: S = A(1:N)*B(1:N)
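
For reference, the dot product named on this slide can be written as an explicit loop, which is the form a compiler would have to pipeline. This is only a minimal sketch: the vector length and the test data are assumptions chosen so that the result can be checked exactly, and nothing here is specific to the SR8000 compiler.

```fortran
program dot_product_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000        ! illustrative vector length (assumption)
  real(dp) :: a(n), b(n), s
  integer :: i

  ! Fill the vectors with simple, exactly representable test data.
  do i = 1, n
    a(i) = real(i, dp)
    b(i) = 2.0_dp
  end do

  ! The kernel: two loads, one multiply and one add per iteration.
  s = 0.0_dp
  do i = 1, n
    s = s + a(i)*b(i)
  end do

  ! Cross-check: the sum is 2*(1 + 2 + ... + n) = n*(n+1).
  if (s /= dot_product(a, b)) error stop 'loop and intrinsic disagree'
  if (s /= real(n*(n+1), dp)) error stop 'unexpected result'
  print *, 'S =', s
end program dot_product_demo
```

Each iteration performs two floating-point operations on two operands loaded from memory, so for long vectors the loop depends on how well memory latency is hidden.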

SR8000 Programming
Multi-thread Parallelism (Cooperative Microprocessors in a Single Address Space: COMPAS)

COMPAS
COMPAS: Co-operative Micro-Processors in single Address Space
IP: Instruction Processor
[Diagram: nodes on a multi-dimensional crossbar network, each node with several IPs and shared main memory. Under automatic parallel processing, a process on one IP issues the COMPAS start instruction; threads on the IPs each execute pre-fetch, load, arithmetic, store and branch; the COMPAS end instruction follows]

Hardware Support
IP: Instruction Processor
SC: Storage Controller
MS: Main Storage
[Diagram, software view: scalar part on one IP while the other IPs wait for start-up; start parallel instruction; loop part on all IPs; end parallel instruction; scalar part]
[Diagram, hardware view: IPs, SC, barrier synchronization mechanism, MS]

Loop Parallelisation
i loop parallelisation
  Before:
    DO i=1, N
      A(i)=B(i)+C(i)
    ENDDO
  After:
    [fork]
    DO i=start, end
      A(i)=B(i)+C(i)
    ENDDO
    [join]
j loop parallelisation
  Before:
    DO j=1, M
      W(j)=C(j)+D(j)
      DO i=1, N
        A(i, j)=B(i, j)+W(j)
      ENDDO
    ENDDO
  After:
    [fork]
    DO j=start, end
      W(j)=C(j)+D(j)
      DO i=1, N
        A(i, j)=B(i, j)+W(j)
      ENDDO
    ENDDO
    [join]
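
The slide does not say how the start and end bounds of each thread are chosen. One common choice is a contiguous block partition, sketched below; the routine name chunk_bounds, the thread count of 8 and the trip count of 10 are all hypothetical, and the threads are emulated by an ordinary loop so that the sketch runs anywhere. The variables first and last play the role of start and end on the slide.

```fortran
program chunk_demo
  implicit none
  integer, parameter :: n = 10          ! loop trip count (illustrative)
  integer, parameter :: nthreads = 8    ! hypothetical number of threads
  integer :: tid, first, last, covered

  covered = 0
  do tid = 0, nthreads - 1
    call chunk_bounds(n, nthreads, tid, first, last)
    print '(a,i0,a,i0,a,i0)', 'thread ', tid, ': i = ', first, ' to ', last
    covered = covered + (last - first + 1)
  end do
  ! Every iteration must be assigned exactly once.
  if (covered /= n) error stop 'chunks do not cover 1..n'

contains

  ! Contiguous block partition of 1..n over nthreads threads.
  ! The first mod(n, nthreads) threads receive one extra iteration.
  subroutine chunk_bounds(n, nthreads, tid, first, last)
    integer, intent(in)  :: n, nthreads, tid
    integer, intent(out) :: first, last
    integer :: base, extra
    base  = n / nthreads
    extra = mod(n, nthreads)
    first = tid*base + min(tid, extra) + 1
    last  = first + base - 1
    if (tid < extra) last = last + 1
  end subroutine chunk_bounds

end program chunk_demo
```

With these values the first two threads get two iterations each and the remaining six get one, so the chunk sizes differ by at most one iteration.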

Loop Parallelisation
i loop parallelisation
  Before:
    DO j=2, M
      DO i=1, N
        A(i, j) = A(i, j-1)+A(i, j)
      ENDDO
    ENDDO
  After:
    [fork]
    DO j=2, M
      DO i=start, end
        A(i, j) = A(i, j-1)+A(i, j)
      ENDDO
    ENDDO
    [join]
j loop parallelisation
  Before:
    DO i=1, N
      A(i) = B(i)+C(i)
    ENDDO
    DO j=1, M
      D(j) = E(j)*F(j)
    ENDDO
  After:
    [fork]
    DO i=start, end
      A(i) = B(i)+C(i)
    ENDDO
    DO j=start, end
      D(j) = E(j)*F(j)
    ENDDO
    [join]

Loop Parallelisation
  Before:
    DO i = 1, N
      CALL sub(a, b, i)
    ENDDO
  Directives:
    *poption parallel            force parallelisation
    *poption tlocal(a, b, i)     thread local variables
  After:
    [fork]
    DO i = 1, N
      CALL sub(a, b, i)
    ENDDO
    [join]

Section Parallelisation
Execution of independent blocks of code in different threads (sections are always single threaded)
    *poption parallel_sections
    *poption section
    CALL SUB1
    *poption section
    CALL SUB2
    *poption end_parallel_sections
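
For readers without access to the Hitachi compiler, the same structure can be tried with OpenMP sections. This is a swapped-in portable analogue, not the *poption syntax shown on the slide, and sub1 and sub2 are hypothetical stand-ins for SUB1 and SUB2.

```fortran
program sections_demo
  implicit none
  integer :: r1, r2

  r1 = 0
  r2 = 0
  ! Without OpenMP enabled the directives are plain comments and the
  ! two calls simply run one after the other.
  !$omp parallel sections
  !$omp section
  call sub1(r1)
  !$omp section
  call sub2(r2)
  !$omp end parallel sections

  if (r1 /= 1 .or. r2 /= 2) error stop 'a section did not run'
  print *, 'sub1 ->', r1, ' sub2 ->', r2

contains

  subroutine sub1(r)      ! hypothetical stand-in for SUB1
    integer, intent(out) :: r
    r = 1
  end subroutine sub1

  subroutine sub2(r)      ! hypothetical stand-in for SUB2
    integer, intent(out) :: r
    r = 2
  end subroutine sub2

end program sections_demo
```

Each section writes to its own variable, so the two calls are independent and may run in different threads, while each section itself runs in a single thread.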

Effect of COMPAS
Dot product: S = A(1:N)*B(1:N)

SR8000 Programming
Message Passing (MPI)

Remote DMA
Normal transfer: protocol processing, context switch, interrupt handling
Remote DMA transfer: no buffering in kernel, no OS system call
[Diagram: nodes connected by the crossbar network; labels: program data, memory copy, OS data, send buffer, receive buffer]

Inter-node MPI
One MPI process per node; RDMA transfer
[Diagram: MPI processes on separate nodes connected by the cross-bar inter-node network]

Intra-node MPI
One MPI process
[Diagram: MPI processes within each node communicating through shared memory; nodes connected by the cross-bar inter-node network]

MPI Ping-pong
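
The measurements from this slide are not reproduced here, but a ping-pong test is usually structured as in the following minimal sketch. It uses only standard MPI calls; the message size and repetition count are assumptions, and it is not the benchmark code behind the slide.

```fortran
program pingpong
  use mpi
  implicit none
  integer, parameter :: nbytes = 1024   ! message size (illustrative)
  integer, parameter :: nreps  = 1000   ! round trips to average over (illustrative)
  character(len=1) :: buf(nbytes)
  integer :: rank, nprocs, ierr, i
  integer :: status(MPI_STATUS_SIZE)
  double precision :: t0, t1

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  if (nprocs < 2) then
    if (rank == 0) print *, 'run with at least 2 processes'
    call MPI_Finalize(ierr)
    stop
  end if

  buf = 'x'
  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  t0 = MPI_Wtime()
  do i = 1, nreps
    if (rank == 0) then
      ! Rank 0 sends the message and waits for it to come back.
      call MPI_Send(buf, nbytes, MPI_CHARACTER, 1, 0, MPI_COMM_WORLD, ierr)
      call MPI_Recv(buf, nbytes, MPI_CHARACTER, 1, 0, MPI_COMM_WORLD, status, ierr)
    else if (rank == 1) then
      ! Rank 1 echoes every message straight back.
      call MPI_Recv(buf, nbytes, MPI_CHARACTER, 0, 0, MPI_COMM_WORLD, status, ierr)
      call MPI_Send(buf, nbytes, MPI_CHARACTER, 0, 0, MPI_COMM_WORLD, ierr)
    end if
  end do
  t1 = MPI_Wtime()

  if (rank == 0) then
    ! One-way time is half the averaged round-trip time.
    print *, 'one-way time (s):    ', (t1 - t0) / (2.0d0*nreps)
    print *, 'bandwidth (bytes/s): ', 2.0d0*nreps*nbytes / (t1 - t0)
  end if
  call MPI_Finalize(ierr)
end program pingpong
```

Running the two ranks on different nodes exercises the inter-node path, while placing both on one node exercises the intra-node path; how ranks are placed depends on the local MPI launcher.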

SR8000 Parallelism
Instruction level (PVP)
Message passing (MPI)
Multi-thread (COMPAS)
[Diagram: the three levels of parallelism; surviving label: Node 1]

SR8000 Programming
Memory Architecture

Memory Hierarchy
• fp registers (128+32)
• L1 cache (128 Kbyte, 4-way)
• Store buffer (16 entries)
• Switch
• Memory (2 to 16 Gbyte, 512 banks)
[Diagram: data paths between the levels labelled 32 bytes/cycle, 16 bytes/cycle and 16 bytes/cycle; further label: Other IPs]

Address Translation
[Diagram: virtual address divided into virtual page number and page offset; surviving labels: Main, Cache recently used entries]

Large TLB
• Large TLB covers whole address space with 256 entries
• Page size 16 Mbyte to 128 Mbyte
[Diagram: virtual address divided into virtual page number and page offset; large TLB; large page table in main memory]
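
The coverage claim can be checked with simple arithmetic. This reads the page sizes on the slide as megabytes, which is an assumption about the slide's units:

```latex
\[
256 \times 16\ \text{Mbyte} = 4096\ \text{Mbyte} = 4\ \text{Gbyte},
\qquad
256 \times 128\ \text{Mbyte} = 32768\ \text{Mbyte} = 32\ \text{Gbyte}
\]
```

On this reading, 256 entries of the largest page size map more than the maximum node memory quoted on the Memory Hierarchy slide, so a program touching all of memory need not miss in the TLB.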

Memory Address Hashing
[Diagram: address bits 32 to 63; labels: xor, memory controller data path, storage controller data path]

Key Features of SR8000
• High performance RISC CPU with PVP
• High performance node with COMPAS
• High sustained memory bandwidth
• High scalability with fast network
• Low energy and space requirements

SR8000 Programming
Performance

Top 500 – June 2000
Rank  Manufacturer  Computer            Rmax  Installation Site
1     Intel         ASCI Red            2379  Sandia National Lab
2     IBM           ASCI Blue Pacific   2144  Lawrence Livermore National Lab
3     SGI           ASCI Blue Mountain  1608  Los Alamos National Lab
4     IBM           SP Power3 375 MHz   1417  NAVOCEANO
5     Hitachi       SR8000-F1/112       1035  LRZ Munich
6     Hitachi       SR8000-F1/100        917  KEK Tsukuba
7     Cray Inc      T3E/1200             891  US Government
8     Cray Inc      T3E/1200             891  US Army HPC Research Center
9     Hitachi       SR8000/128           873  University of Tokyo
10    Cray Inc      T3E/1200             815  US Government

Linpack Performance
Nodes  Gflops
1       10.88
2       20.50
4       40.76
8       80.25
16     159.51
32     313.32
60     577.49
64     605.30
100    917.15
[Chart: GFlops (0 to 1000) against number of nodes (0 to 120)]
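
One way to read these figures is as a scaling efficiency relative to the single-node result. The definition below is introduced here for illustration, with $R(p)$ the Linpack rate on $p$ nodes; the slide does not give the problem sizes, which normally grow with the node count, so this is a scaled rather than a fixed-size efficiency.

```latex
\[
E(p) = \frac{R(p)}{p \cdot R(1)},
\qquad
E(2) = \frac{20.50}{2 \times 10.88} \approx 0.94,
\quad
E(64) = \frac{605.30}{64 \times 10.88} \approx 0.87,
\quad
E(100) = \frac{917.15}{100 \times 10.88} \approx 0.84
\]
```

By this measure the 100-node run retains roughly 84% of the per-node rate achieved on a single node.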

NAS Parallel FT
[Bar chart: GFlops (0 to 35) against number of nodes (1, 2, 4, 8) for Class A, Class B and Class C]
Bar labels (GFlops): 5.14, 5.39; 8.37, 7.92, 8.31; 14.84, 15.10, 14.01; 28.78, 27.95, 26.16