Supercomputers Special Course of Computer Architecture H. Amano

Supercomputers: Contents
• What are supercomputers?
• Architecture of supercomputers
• Representative supercomputers
• Exa-scale supercomputer project

Defining Supercomputers
• High-performance computers mainly for scientific computation.
• Huge amounts of computation for biochemistry, physics, astronomy, meteorology, etc.
• Very expensive: developed and managed with national funding.
• High-level techniques are required to develop and manage them.
• The USA, Japan and China compete for the No. 1 supercomputer.
• Because a large amount of national funding is used, they tend to become political news → for example, in Japan the supercomputer project became a target of the budget review in Dec. 2009.

FLOPS
• Floating-point Operations Per Second
• Floating-point number: (mantissa) × 2^(exponent)
• Double precision: 64 bit, single precision: 32 bit.
• The IEEE standard defines the format and rounding:
  Format   sign  exponent  mantissa
  Single   1     8         23
  Double   1     11        52

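To make the bit layout concrete, here is a minimal Python sketch (the helper name and example value are mine) that unpacks a double-precision value into its 1-bit sign, 11-bit exponent and 52-bit mantissa fields:

```python
import struct

def decompose_double(x):
    # Pack as an IEEE 754 double (big-endian) and view the raw 64 bits.
    bits = int.from_bytes(struct.pack(">d", x), "big")
    sign     = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF        # 11 bits, biased by 1023
    mantissa = bits & ((1 << 52) - 1)      # 52 bits (fraction part)
    return sign, exponent, mantissa

# -6.5 = -1.625 * 2^2 -> sign=1, exponent=1023+2=1025, mantissa=0.625*2^52
print(decompose_double(-6.5))
```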
The range of performance
• 10^6 (million): M (Mega), 10^9 (billion): G (Giga), 10^12 (trillion): T (Tera), 10^15 (quadrillion): P (Peta), 10^18 (quintillion): E (Exa)
• iPhone 4S: 140 MFLOPS
• High-end PC: 50-80 GFLOPS
• Powerful GPU: Tera-FLOPS class
• Supercomputers: 10 TFLOPS-30 PFLOPS, growth rate about 1.9 times/year
• 10 PFLOPS = 1 kei (京, 10^16) operations per second in Japanese → the name 「K」 comes from this.

How to select the top 1?
• Top 500 / Green 500: performance of executing Linpack
  • Linpack is a kernel for matrix computation.
  • Scale free.
  • Performance centric.
• Gordon Bell Prize
  • Peak Performance, Price/Performance, Special Achievement
• HPC Challenge
  • Global HPL (matrix computation): computation
  • Global Random Access (random memory access): communication
  • EP Stream per system (heavy-load memory access): memory performance
  • Global FFT: a complicated problem requiring both memory and communication performance
• Nov.: ACM/IEEE Supercomputing Conference
  • Top 500, Gordon Bell Prize, HPC Challenge, Green 500
• Jun.: International Supercomputing Conference
  • Top 500, Green 500
  • This year, now ongoing in Frankfurt

From SACSIS 2012 Invited Speech.

Top 500, July 2015 (the same as 2014)
1. Tianhe-2 (天河) (China), National University of Defence Technology. Hardware: Intel Xeon E5-2692 12C 2.2 GHz, TH Express-2, Intel Xeon Phi 31S1P. Cores: 3,120,000. Linpack: 33,862.7 TFLOPS (peak 54,902.4). Power: 17,808 kW.
2. Titan (USA), DOE/SC/Oak Ridge National Lab. Hardware: Cray XK7, Opteron 6274 16C 2.2 GHz, Cray Gemini interconnect, NVIDIA K20x. Cores: 560,640. Linpack: 17,590 TFLOPS (peak 27,112.5). Power: 8,209 kW.
3. Sequoia (USA), DOE/NNSA/LLNL. Hardware: BlueGene/Q, Power BQC 16C 1.6 GHz. Cores: 1,572,864. Linpack: 17,173.2 TFLOPS (peak 20,132.7). Power: 7,890 kW.
4. K (京) (Japan), RIKEN AICS. Hardware: SPARC64 VIIIfx 2.0 GHz, Tofu interconnect, Fujitsu. Cores: 705,024. Linpack: 10,510 TFLOPS (peak 11,280). Power: 12,659.9 kW.
5. Mira (USA), DOE/SC/Argonne National Lab. Hardware: BlueGene/Q, Power BQC 1.6 GHz. Cores: 786,432. Linpack: 8,586.6 TFLOPS (peak 10,066.3). Power: 3,945 kW.
The top 5 have not changed since June 2013.

Top 500, July 2016
1. TaihuLight (太湖之光) (China), National Supercomputing Center in Wuxi. Hardware: ShenWei (神威) SW26010, NRCPC. Cores: 10,649,600. Linpack: 93,014.6 TFLOPS. Power: 15,371 kW.
2. Tianhe-2 (天河) (China), National University of Defence Technology. Hardware: Intel Xeon E5-2692 12C 2.2 GHz, TH Express-2, Intel Xeon Phi 31S1P. Cores: 3,120,000. Linpack: 33,862.7 TFLOPS (peak 54,902.4). Power: 17,808 kW.
3. Titan (USA), DOE/SC/Oak Ridge National Lab. Hardware: Cray XK7, Opteron 6274 16C 2.2 GHz, Cray Gemini interconnect, NVIDIA K20x. Cores: 560,640. Linpack: 17,590 TFLOPS (peak 27,112.5). Power: 8,209 kW.
4. Sequoia (USA), DOE/NNSA/LLNL. Hardware: BlueGene/Q, Power BQC 16C 1.6 GHz. Cores: 1,572,864. Linpack: 17,173.2 TFLOPS (peak 20,132.7). Power: 7,890 kW.
5. K (京) (Japan), RIKEN AICS. Hardware: SPARC64 VIIIfx 2.0 GHz, Tofu interconnect, Fujitsu. Cores: 705,024. Linpack: 10,510 TFLOPS (peak 11,280). Power: 12,659.9 kW.
TaihuLight took the No. 1 position, the first change at the top in 3 years.

Sunway TaihuLight (太湖之光)
• Based on a Chinese original processor, the ShenWei (神威) SW26010 with 260 cores.
• Homogeneous type using a dedicated processor.
• High energy efficiency: 3rd place in the Green 500.

Homogeneous vs. Heterogeneous
• Homogeneous
  • A special multi-core processor is used.
  • Sequoia, Mira: BlueGene/Q
  • K: SPARC64 VIIIfx
  • TaihuLight: ShenWei
  • Easy programming
  • Wide range of target applications
  • 5/6-dimensional torus
• Heterogeneous
  • CPU + accelerators
  • Tianhe-2: Intel MIC (Xeon Phi)
  • Titan: GPU (NVIDIA Kepler)
  • Highly energy efficient, especially when Kepler is used.
  • Programming is difficult.
  • The target application must make use of the accelerator.
  • Infiniband + fat tree

Green 500, Nov. 2014
1. L-CSC (Intel Xeon E5 + AMD FirePro), GSI Helmholtz Center: 5,271.81 MFLOPS/W, 57.15 kW total.
2. Suiren (Xeon E5 + PEZY-SC), KEK (高エネ研, High Energy Accelerator Research Organization): 4,945.63 MFLOPS/W, 37.83 kW total.
3. TSUBAME-KFC (Intel Xeon E5 + NVIDIA K20x, Infiniband FDR), Tokyo Institute of Technology: 4,447.58 MFLOPS/W, 35.39 kW total.
4. Storm1 (Xeon E5 + NVIDIA K20), Cray Inc.: 3,962.73 MFLOPS/W, 44.54 kW total.
5. Wilkes (Dell T620 cluster, Intel Xeon E5 + NVIDIA K20, Infiniband FDR), Cambridge University: 3,631.70 MFLOPS/W, 52.62 kW total.
All of the top 5 systems pair Intel Xeon CPUs with accelerators; three of them use NVIDIA Kepler K20 GPUs. PEZY-SC is an original accelerator.

Green 500, 2016
• The full list is still being checked, but the top three have been announced.
1. Shoubu, Riken/PEZY: 6.7774 GFLOPS/W
2. Satsuki, Riken/PEZY: 6.195 GFLOPS/W
3. TaihuLight, NSC in Wuxi: 6.051 GFLOPS/W

PEZY-SC [Torii 2015]
• A 3-level hierarchical MIMD many-core: 4 PE × 4 (Village) × 16 (City) × 4 (Prefecture) = 1,024 PEs, plus 2 ARM host cores.
• Cache hierarchy: 2 KB L1 D-cache per PE, 64 KB L2 D-cache per City, 2 MB L3 cache per Prefecture; DDR4 memory, host I/F and inter-processor I/F.
• (Figure: block diagram of the SFU/Village/City/Prefecture hierarchy.)

Homogeneous supercomputers
• NUMA-style multiprocessors.
• A remote DMA mechanism is provided, but coherent caches are not supported.
• Dedicated (RISC) CPUs are used.
  • PowerPC-based multicore: BlueGene/Q
  • SPARC-based multicore: K
• Special networks: multi-dimensional torus
  • 6-dimensional torus: K
  • 5-dimensional torus: BlueGene/Q

Why are supercomputers so fast?
× Because they use a high-frequency clock.
• Clock frequencies of high-end PCs: Pentium 4 3.2 GHz, Nehalem 3.3 GHz. Supercomputers and GPUs: K 2 GHz, Sequoia 1.6 GHz, Fermi 1.3 GHz, Kepler 732 MHz. (Alpha 21064: 150 MHz in 1992; clocks grew about 40%/year until then.)
• The speed-up of the clock saturated around 2003 because of power and heat dissipation.
• The clock frequency of K and Sequoia is lower than that of common PCs.

NUMA without coherent cache
• (Figure: Nodes 0-3 connected to a shared memory through an interconnection network.)
• Processors which can work independently.

Remote DMA (user level)
• (Figure: on the local node, the sender's user-level data source transfers data through an RDMA buffer, the host I/F and the protocol engine of the network interface directly to the data sink on the remote node; the conventional path goes through a system call and a kernel agent's buffer.)

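A conceptual sketch of the figure's user-level path in plain Python. Every name here (NetworkInterface, register_buffer, post_rdma_write, RemoteNode) is hypothetical, not a real RDMA API; the point is only that after a one-time, kernel-assisted setup, each transfer moves data from the source buffer to the remote data sink without a system call:

```python
class RemoteNode:
    def __init__(self, size):
        self.sink = bytearray(size)        # the remote data sink

class NetworkInterface:
    def __init__(self):
        self.registered = {}               # buffer id -> (pinned buffer, remote node)

    def register_buffer(self, buf_id, buf, remote_node):
        # One-time setup (done with kernel help in a real system): pin the RDMA
        # buffer and tell the protocol engine where the remote sink lives.
        self.registered[buf_id] = (buf, remote_node)

    def post_rdma_write(self, buf_id, data):
        # User-level fast path: the protocol engine copies from the source
        # buffer to the remote sink, bypassing the kernel on both sides.
        buf, remote_node = self.registered[buf_id]
        buf[:len(data)] = data
        remote_node.sink[:len(data)] = buf[:len(data)]

remote = RemoteNode(16)
nic = NetworkInterface()
nic.register_buffer("rdma_buf", bytearray(16), remote)
nic.post_rdma_write("rdma_buf", b"hello")
print(bytes(remote.sink[:5]))              # b'hello'
```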
IBM's BlueGene/Q
• Successor of Blue Gene/L and Blue Gene/P.
• Sequoia consists of BlueGene/Q.
• 18 Power processor cores (16 computational, 1 control and 1 redundant) and the network interfaces are provided on a chip.
• The on-chip interconnect is a crossbar switch.
• 5-dimensional mesh/torus.
• 1.6 GHz clock.

Supercomputer 「K」
• SPARC64 VIIIfx chip: cores, L2 cache, memory and interconnect controller.
• 4 nodes/board, 24 boards/rack, 96 nodes/rack.
• Tofu interconnect: 6-D torus/mesh with an RDMA mechanism.
• NUMA, or UMA + NORMA.

SACSIS 2012 Invited speech

SACSIS 2012 invited speech

water cooling system

Racks of K

6-dimensional torus: Tofu

3-ary 4-cube (node address groups 0***, 1***, 2***)

3-ary 5-cube (node address groups 0****, 1****, 2****). Degree: 2n, diameter: (k-1)n.

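As a quick check of these formulas, a small Python sketch (the function name is mine) computing node count, degree and diameter of a k-ary n-cube; the slide's (k-1)*n diameter corresponds to a mesh without wrap-around links, while a full torus with wrap-around halves it:

```python
def kary_ncube(k, n):
    nodes = k ** n                 # each node labelled by n digits in base k
    degree = 2 * n                 # +1/-1 neighbour in each of the n dimensions
    diameter_mesh = (k - 1) * n    # without wrap-around links (the slide's formula)
    diameter_torus = (k // 2) * n  # with wrap-around (torus) links
    return nodes, degree, diameter_mesh, diameter_torus

print(kary_ncube(3, 5))   # (243, 10, 10, 5) for the 3-ary 5-cube
```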
Heterogeneous supercomputers
• Accelerators:
  • GPUs have been introduced in this lecture.
  • The most recent Kepler is highly energy efficient.
  • Intel MIC (Xeon Phi), used in Tianhe-2, is a many-core accelerator that runs the x86 ISA.
• Infiniband with a fat-tree topology is mostly used for the interconnection network.

Kepler K20
• All of the top 14 machines in the Green 500 use the K20m/c/X as their accelerators.
• K20m with fan: stand-alone workstation.
• K20c without fan: for rack mount.
• K20X: high performance.
• 732 MHz operational clock, lower than the previous Fermi (1.3 GHz).
• 2688 single-precision CUDA cores (3.94 TFLOPS).
• 896 double-precision CUDA cores (1.31 TFLOPS).
• Highly energy-efficient accelerators.

Intel Xeon Phi (MIC: Many Integrated Core)
• An accelerator, but it can also run in stand-alone mode.
• x86-compatible instruction set.
• 60-80 cores, 512-bit SIMD instructions.
• 8 double-precision operations can be executed per cycle.
• 1.1 GHz clock, more than 1 TFLOPS per card.

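A back-of-the-envelope peak-FLOPS sketch for such a card: the core count of 60 and the convention of counting a fused multiply-add as two operations are assumptions of mine, but they reproduce the "more than 1 TFLOPS per card" figure:

```python
# Peak double-precision estimate (sketch; 60 cores and FMA-counts-as-2 are assumptions).
cores = 60
clock_ghz = 1.1
dp_lanes = 8            # 512-bit SIMD holds 8 x 64-bit doubles
ops_per_lane = 2        # a fused multiply-add counted as two floating-point operations
peak_gflops = cores * clock_ghz * dp_lanes * ops_per_lane
print(f"{peak_gflops:.0f} GFLOPS")   # 1056 GFLOPS, i.e. just over 1 TFLOPS per card
```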
Xeon Phi microarchitecture
• All cores are connected through a ring interconnect; the GDDR memory controllers also sit on the ring.
• All L2 caches are kept coherent with directory-based management (TD: tag directory).
• So Xeon Phi is classified as CC (cache-coherent) NUMA.
• Of course, all cores are multithreaded and provide 512-bit SIMD instructions.

Peak performance vs. Linpack performance
• (Figure, PFLOPS scale: K (Japan, homogeneous), Tianhe (天河, China, using GPUs), Nebulae (China), Jaguar (USA), Tsubame (Japan).)
• The difference is large in machines with accelerators.
• The accelerator type is energy efficient.

Arithmetic Intensity
• The number of floating-point operations per byte of data read.
From Hennessy & Patterson's textbook.

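For example, a minimal sketch (the kernel choice and byte-counting convention are mine) of the arithmetic intensity of a daxpy-style loop y[i] = a*x[i] + y[i]:

```python
# Arithmetic intensity of y[i] = a*x[i] + y[i] on double-precision data.
n = 1_000_000
flops = 2 * n                # one multiply and one add per element
bytes_moved = 3 * 8 * n      # read x[i], read y[i], write y[i]; 8 bytes each
print(flops / bytes_moved)   # ~0.083 FLOPs/byte: strongly memory bound
```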
Roofline model
• Performance versus arithmetic intensity: the low-intensity region is memory bound, the high-intensity region is computing bound.
From Hennessy & Patterson's textbook.

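The model itself is just the minimum of two limits; the peak and bandwidth numbers below are placeholders of mine, not measurements of any machine in these slides:

```python
# Roofline: attainable GFLOPS = min(peak compute, memory bandwidth * arithmetic intensity).
def roofline(intensity, peak_gflops=1000.0, bandwidth_gb_s=200.0):
    return min(peak_gflops, bandwidth_gb_s * intensity)

for ai in (0.083, 1.0, 5.0, 10.0):
    print(ai, roofline(ai))  # below 5 FLOPs/byte: memory bound; above: computing bound
```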
Which is a good accelerator?
• (Figure, y-axis: floating-point performance in GFLOPS/s.)
• An ideal accelerator gives high performance for problems with strong arithmetic intensity, and middle-level performance over a wide range of arithmetic intensity.

Example of roofline models of GPUs and multi-cores, from Hennessy & Patterson's textbook.

Infiniband
• Point-to-point direct serial interconnection.
• Uses 8b/10b coding.
• Various types of topologies can be supported.
• Multicast and atomic transactions are supported.
• Maximum throughput:
        1X         4X          12X
  SDR   2 Gbit/s   8 Gbit/s    24 Gbit/s
  DDR   4 Gbit/s   16 Gbit/s   48 Gbit/s
  QDR   8 Gbit/s   32 Gbit/s   96 Gbit/s

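The table follows directly from the lane count, the per-lane signalling rate and the 8b/10b overhead. A small sketch (function name mine; the 2.5/5/10 Gbit/s signalling rates are the standard SDR/DDR/QDR figures):

```python
# Data rate = lanes * signalling rate * 8/10 (8b/10b: 8 data bits per 10 line bits).
def ib_data_rate_gbps(lanes, generation):
    signalling = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}[generation]  # Gbit/s per lane
    return lanes * signalling * 8 / 10

for gen in ("SDR", "DDR", "QDR"):
    print(gen, [ib_data_rate_gbps(lanes, gen) for lanes in (1, 4, 12)])
# SDR [2.0, 8.0, 24.0], DDR [4.0, 16.0, 48.0], QDR [8.0, 32.0, 96.0]
```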
Fat Tree: Myrinet-Clos is actually a type of fat tree.

Myrinet-Clos (1/2) • 128 nodes (Clos 128)

Why Top 1?
• Top 1 is just a measure of matrix computation.
• Linpack is a weak-scaling benchmark with high arithmetic intensity.
• Top 1 of the Green 500, the Gordon Bell Prize, top 1 of each HPC Challenge program → all of these machines are valuable. TV and newspapers focus too much on the Top 500.
• However, most top-1 computers also won the Gordon Bell Prize and HPC Challenge top 1 (K and Sequoia).
• The impact of being Top 1 is great!

Why exa-scale supercomputers?
• The ratio of the serial part becomes small for large-scale problems.
• Linpack is a scale-free benchmark.
• Serial execution part 1 day + parallel execution part 10 years → 1 day + 1 day: a big impact.
• Are there any big programs which cannot be solved by K but can be solved by an exa-scale supercomputer?
  • The number of such programs will decrease.
  • Can we find new areas of application?
• It is important that such a big computing power is open for research.

Amdahl's law
• Serial part 1%, parallel part 99%, accelerated by parallel processing on p cores: speedup = 1 / (0.01 + 0.99/p).
• 50 times with 100 cores, 91 times with 1000 cores (a short sketch follows below).
• If there is even a small serial execution part, the performance improvement is limited.

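A two-line sketch reproducing the numbers above (the function name is mine):

```python
# Amdahl's law with a 1% serial part.
def speedup(p, serial=0.01):
    return 1.0 / (serial + (1.0 - serial) / p)

print(speedup(100))    # ~50.3x with 100 cores
print(speedup(1000))   # ~91.0x with 1000 cores
```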
Strong Scaling vs. Weak Scaling
• Strong scaling:
  • The size of the problem (the size of the data treated) is fixed.
  • It is difficult to improve performance, because of Amdahl's law.
• Weak scaling:
  • The size of the problem is scaled with the size of the computer.
  • Linpack is a weak-scaling benchmark.
• Discussion (see the sketch below):
  • For evaluation of computer architecture, weak scaling is misleading!
  • Extremely large supercomputers are developed for extremely large problems.

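To see why weak-scaling numbers look so much better, here is a hedged sketch under simple assumptions of mine (serial time 1, parallel work 99 for the base problem, and the problem growing in proportion to p under weak scaling); it is an illustration, not a statement about any particular benchmark:

```python
# Strong scaling: fixed problem size; weak scaling: problem grows with the core count p.
def strong_speedup(p, serial=1.0, parallel=99.0):
    return (serial + parallel) / (serial + parallel / p)

def weak_speedup(p, serial=1.0, parallel=99.0):
    # p times more total work finished in (almost) the same time as the base run.
    return (serial + p * parallel) / (serial + parallel)

for p in (10, 100, 1000):
    print(p, round(strong_speedup(p), 1), round(weak_speedup(p), 1))
# strong: 9.2, 50.3, 91.0    weak: 9.9, 99.0, 990.0
```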
Japanese exa-scale computer
• The Japanese national project for an exa-scale computer has started.
• Riken is developing an exa-scale (post-peta) computer by 2020.
• The architecture is now under planning.
• For exa-scale, about 70,000,000 cores are needed.
• The budget limitation is more severe than the technical limits.

Motivation and limitation
• Integrated computer technologies including architecture, hardware, software, dependability techniques, semiconductors and applications.
• A flagship and a symbol.
  • No computer development other than supercomputers remains in Japan.
• A super computing power that is open for peaceful research.
  • It is a tool which makes impossible analyses possible.
  • What needs infinite computing power?
• Is it a Japanese supercomputer if all cores and accelerators are made in the USA?
• Does a floating-point-centric supercomputer built to solve Linpack as fast as possible really fit the demand? Look at the exa-scale computer project!

Should we develop floating-point-computation-centric supercomputers?
• What do people want big supercomputers to do?
  • Finding new medicines: pattern matching.
  • Simulation of earthquakes; meteorology for analyzing global warming.
  • Big data
  • Artificial intelligence
• Most of these are not suitable for floating-point-computation-centric supercomputers.
• "Supercomputers for big data" or "super-cloud computers" might be required.

What is a WSC?
• WSC (Warehouse Scale Computing): Google, Amazon, Yahoo, etc.
• An extremely large cluster with more than 50,000 nodes.
• Consists of economical, homogeneous components.
• Reliability is mainly kept by software.
• The power supply and cooling system are important design factors.
• Cloud computing is supported by such WSCs.

A WSC is not a simple big data center
• Homogeneous structure
  • In a data center, various types of clusters are used, and software and application packages are also various.
  • In a WSC, homogeneous tailored hardware is used, and software is custom made or free software.
• Cost
  • In a data center, the largest cost is often the people who maintain it.
  • In a WSC, the server hardware is the greatest cost.
• A WSC is more like a supercomputer than like the clusters in data centers.

Figure 6.5 Hierarchy of switches in a WSC. (Based on Figure 1.2 of Barroso and Hölzle [2009].)

Figure 6.8 The Layer 3 network used to link arrays together and to the Internet [Greenberg et al. 2009]. Some WSCs use a separate border router to connect the Internet to the datacenter Layer 3 switches.

Figure 6.19 Google customizes a standard 1AAA container: 40 x 8 x 9.5 feet (12.2 x 2.4 x 2.9 meters). The servers are stacked up to 20 high in racks that form two long rows of 29 racks each, with one row on each side of the container. The cool aisle goes down the middle of the container, with the hot air return being on the outside. The hanging rack structure makes it easier to repair the cooling system without removing the servers. To allow people inside the container to repair components, it contains safety systems for fire detection and mist-based suppression, emergency egress and lighting, and emergency power shut-off. Containers also have many sensors: temperature, airflow pressure, air leak detection, and motion-sensing lighting. A video tour of the datacenter can be found at http://www.google.com/corporate/green/datacenters/summit.html. Microsoft, Yahoo!, and many others are now building modular datacenters based upon these ideas, but they have stopped using ISO standard containers since the size is inconvenient.

Exercise
• A target program: serial computation part: 1, parallel computation part: N^3.
• K: 700,000 cores; Exa: 70,000,000 cores.
• What N makes Exa 10 times faster than K? (A numerical sketch follows below.)

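A numerical sketch of the exercise under the stated core counts, assuming each core processes one unit of the parallel part per unit time (treat it as a check, not the model solution):

```python
# Execution time model: serial part 1 + parallel part N^3 divided across the cores.
def exec_time(n, cores):
    return 1.0 + n ** 3 / cores

def ratio(n, k_cores=700_000, exa_cores=70_000_000):
    return exec_time(n, k_cores) / exec_time(n, exa_cores)

n = 1
while ratio(n) < 10.0:     # find the smallest N where Exa is 10x faster than K
    n += 1
print(n, ratio(n))         # around N = 192 under these assumptions
```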
Japanese supercomputers
• K supercomputer
  • A homogeneous, scalar-type, massively parallel computer.
• Earth Simulator
  • Vector computers; the difference between peak and Linpack performance is small.
• TIT's Tsubame
  • A lot of GPUs are used. An energy-efficient supercomputer.
• Nagasaki University's DEGIMA
  • A lot of GPUs are used. A hand-made supercomputer with high cost-performance; Gordon Bell Prize (price/performance) winner.
• GRAPE projects
  • Dedicated SIMD supercomputers for astronomy; various versions won the Gordon Bell Prize.

SACSIS 2012 Invited Speech

The Earth Simulator
• Peak performance 40 TFLOPS.
• (Figure: 640 nodes (Node 0 to Node 639), each with 8 vector processors (0-7) sharing 16 GB of memory, connected by an interconnection network with 16 GB/s × 2 links.)

Pipeline processing
• Each stage sends its result and receives new input every clock cycle (the figure shows stages 1-6).
• N stages give up to N times the performance.
• Data dependencies cause RAW hazards and degrade the performance.
• If a large array is treated, the many stages can work efficiently.

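A tiny sketch of why long arrays keep the stages busy (the stage count of 6 is taken from the figure; the element count is mine):

```python
# With s pipeline stages, n elements finish in about n + s - 1 cycles
# instead of n * s cycles, so long vectors approach an s-fold speedup.
def pipelined(n, s):
    return n + s - 1

def unpipelined(n, s):
    return n * s

n, s = 1000, 6
print(unpipelined(n, s) / pipelined(n, s))   # ~5.97, close to the 6-stage ideal
```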
Vector computers
• (Figure: vector registers a0, a1, a2, ... and b0, b1, b2, ... feed a pipelined multiplier computing X[i] = A[i]*B[i], chained to an adder computing Y = Y + X[i].)
• The classic style of supercomputer since the Cray-1. The Earth Simulator is also a vector supercomputer.

(The following slides animate the same figure: each clock cycle, the next elements a[i] and b[i] enter the multiplier, and the products flow on into the chained adder.)

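The chained multiply-add from the figure, written as an ordinary scalar loop (a sketch of what the vector pipelines compute; the element values are mine):

```python
# X[i] = A[i] * B[i] in the multiplier pipeline, Y accumulated in the chained adder.
A = [1.0, 2.0, 3.0, 4.0]
B = [0.5, 0.5, 0.5, 0.5]
X = [0.0] * len(A)
Y = 0.0
for i in range(len(A)):
    X[i] = A[i] * B[i]   # multiplier
    Y = Y + X[i]         # adder, chained to the multiplier output
print(X, Y)              # [0.5, 1.0, 1.5, 2.0] 5.0
```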
Data transfer methods for vector computers
• Stride data access
  • Data located at a fixed interval (stride) in memory are accessed continuously.
  • Used for sparse/flexible matrix computation.
• Gather/Scatter
  • Gather: data scattered in the memory system are loaded into the vector registers with one continuous access.
  • Scatter: the results in the vector registers are distributed back into memory with one continuous access.
• Both functions need a large number of memory banks: a powerful memory system is essential for vector machines.
• Cache is not so efficient here, but is still effective for reducing the start-up time.

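The three access patterns expressed with NumPy indexing, as a sketch of the semantics rather than of any particular vector ISA (array sizes and indices are mine):

```python
import numpy as np

mem = np.arange(32, dtype=np.float64)   # a toy "memory"

# Stride access: every 4th element loaded into a vector register.
vreg = mem[0::4]

# Gather: elements at arbitrary addresses collected into a vector register.
idx = np.array([3, 17, 5, 30])
gathered = mem[idx]

# Scatter: vector-register contents written back to scattered addresses.
mem[idx] = gathered * 2.0
print(vreg, gathered, mem[idx])
```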
The Earth Simulator: a simple NUMA

TIT's Tsubame: a well-balanced supercomputer with GPUs

Nagasaki Univ’s DEGIMA

GRAPE-DR: Kei Hiraki, "GRAPE-DR", http://www.fpl.org (FPL 2007)