apeNEXT
The apeNEXT multi-TFlops LGT supercomputer: architecture description and project status report

The apeNEXT project
Roberto De Pietri (depietri@fis.unipr.it)
Università di Parma & INFN, gruppo collegato di Parma
San Diego, March 27th 2003, CHEP03

The APE family
Our line of Home Made Computers

Machines (year): APE (1988), APE100 (1993), APEmille (1999), apeNEXT (2003)
• Architecture: SIMD++
• # nodes: 16, 2048, 4096
• Topology: flexible 1D, rigid 3D, flexible 3D, flexible 3D
• Memory: 256 MB, 8 GB, 64 GB, 1 TB
• # registers (word size): 64 (x32), 128 (x32), 512 (x64)
• Clock speed: 8 MHz, 25 MHz, 66 MHz, 200 MHz
• Total computing power of all …: ~1.5 GFlops, ~250 GFlops, ~2 TFlops, ~8-20 TFlops

APE ('88): 1 GFlops

The APE paradigm
• Very efficient for LQCD
  – The normal operation, a x b + c (complex numbers), as a basic operation
  – Native implementation of the complex type
• Large number of registers
  – Efficient optimizations
• VLIW (very long instruction word)
• Reliable and safe HW solution
• Easy to program software tools
  – APEse, TAO
  – Machine simulator

Since APE100
• Our own designed VLSI
  – Pipelined normal operation on a chip (MAD)
• 3D topology
  – Remote I/O and X link on cable
  – Y and Z links on the backplane
• Large number of APEmille installations in Europe
  – 30 crates (~65 GFlops each)
  – Almost 2 TeraFlops of computing power

APEmille installations
• Sites: Bielefeld, Zeuthen, Milan, Bari, Trento, Pisa, Rome 1, Rome 2, Orsay, Swansea
• Installed power, in the order listed: 130 GF (2 crates), 520 GF (8 crates), 130 GF (2 crates), 65 GF (1 crate), 325 GF (5 crates), 520 GF (8 crates), 130 GF (2 crates), 16 GF (1/4 crate), 65 GF (1 crate)
• Grand total: ~1966 GF
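As a rough consistency check, the listed installation sizes follow from the per-crate figure of about 65 GFlops quoted earlier in the talk. The sketch below is illustrative only: it assumes that the performance values and the crate counts correspond in the order in which they are listed, and the names `GF_PER_CRATE`, `listed` and `expected_gf` are hypothetical.

```python
from fractions import Fraction

GF_PER_CRATE = 65  # approximate APEmille crate performance quoted in the talk

# (listed GFlops, number of crates), paired in the order listed (an assumption)
listed = [
    (130, Fraction(2)), (520, Fraction(8)), (130, Fraction(2)),
    (65, Fraction(1)), (325, Fraction(5)), (520, Fraction(8)),
    (130, Fraction(2)), (16, Fraction(1, 4)), (65, Fraction(1)),
]

def expected_gf(crates):
    """Performance implied by a crate count, rounded to whole GFlops."""
    return round(crates * GF_PER_CRATE)

# Entries whose listed performance differs from crates * 65 GFlops
mismatches = [(gf, crates) for gf, crates in listed if expected_gf(crates) != gf]
print(mismatches)
```

Exact fractions are used so that the quarter crate is handled without floating-point rounding surprises.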

The apeNEXT architecture
• 3D mesh of computing nodes
• Each node is a complete, self-sufficient computing engine (1.6 GFlops)
[Figure: processing board with nodes 0-15; X+ links via cables, Y+ and Z+ links via the backplane (bp); each node is a J&T chip with DDR memory and links X+ … Z-]

The apeNEXT architecture (2)
• Two directions (Y, Z) on the backplane
• Direction X through front panel cables
• System topologies:
  – Processing Board: 4 x 2 x 2, ~26 GF
  – subCrate (16 PB): 4 x 8 x 8, ~0.4 TF
  – Crate (32 PB): 8 x 8 x 8, ~0.8 TF
  – Large systems: (8*n) x 8 x 8
[Figure: processing board, nodes 0-15, with X+ (cables), Y+ and Z+ (backplane) links]
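The quoted system sizes can be reproduced from the node count and the 1.6 GFlops peak per node. This is a minimal sketch, assuming the board is the 4 x 2 x 2 arrangement of 16 nodes described for the PB and that the subcrate and crate are 4 x 8 x 8 and 8 x 8 x 8 meshes; the function and dictionary names are hypothetical.

```python
NODE_PEAK_GFLOPS = 1.6  # peak per node, as stated in the talk

def nodes(x, y, z):
    """Number of nodes in an x * y * z mesh."""
    return x * y * z

def peak_gflops(x, y, z):
    """Aggregate peak performance of an x * y * z mesh, in GFlops."""
    return nodes(x, y, z) * NODE_PEAK_GFLOPS

# Mesh dimensions assumed for each system size
systems = {
    "processing board": (4, 2, 2),
    "subcrate (16 PB)": (4, 8, 8),
    "crate (32 PB)": (8, 8, 8),
}

for name, dims in systems.items():
    print(f"{name}: {nodes(*dims)} nodes, {peak_gflops(*dims):.1f} GFlops peak")
```

Under these assumptions the three systems come out at 16, 256 and 512 nodes, that is roughly 26 GFlops, 0.4 TFlops and 0.8 TFlops.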

Components (1): The CHIP
The J&T chip is the core of apeNEXT and everything is built around it!

Components (2)
• J&T Module
  – 1 J&T chip
  – 9 DRAM chips
• 256 Mbit memory chips
• 1024 Mbit memory chips (supported)

Components (3)
• Processing Board
[Figure: processing board layout, nodes 0-15, with X+ (cables), Y+ and Z+ (backplane) links]

Components (4)
• Back Plane
  – Z+, Z- links
  – Y+, Y- links
[Figure: processing board layout, nodes 0-15, with X+ (cables), Y+ and Z+ (backplane) links]

Components (5): The Cabinet
• Standard 48 Volt power supplies
• Standard 1U rack-mounted PC

Host Interface
• 7th link (200 MB/s)
• I2C: bootstrap & control

Host I/O Interface
• PCI interface, 64 bit, 66 MHz
• Altera APEX II based
• PCI Master mode for the 7th link interface
• PCI Target mode for the I2C interface
• 7th link: 1(2) bidir. chan. (200*9 M/s)
• I2C: 4 independent ports
• QuadDataRate memory (x32)
• PCI form factor
[Block diagram: Altera APEX II with PCI interface (PLDA), PCI master and target controllers, QDR memory controller and bank, FIFOs, 7th link controllers and port, I2C controller (x4); PCI (64 bit, 66 MHz)]

PB
• Collaboration with NEURICAM spa
• 16 nodes, 3D-interconnected
• 4 x 2 x 2 topology: 26 GFlops, 4.6 GB memory
• Light system:
  – J&T Module connectors
  – Glue logic (clock tree 10 MHz)
  – Global signal interconnection (FPGA)
  – DC-DC converters (48 V to 3.3/2.5/1.8 V)
• Dominant technologies:
  – LVDS: 1728 (16*6*2*9) differential signals at 200 MB/s, 144 routed via cables, 576 via backplane, on 12 controlled-impedance (100 Ω) layers
  – High-speed differential connectors:
    • Samtec QTS (J&T Module)
    • Erni ERMET-ZD (Backplane)
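Two of the board figures can be checked with simple arithmetic. The sketch below assumes that each of the 16 nodes carries the 9 DRAM chips of 256 Mbit listed for the J&T Module, and it reads the factors in 16*6*2*9 as nodes, links per node, directions and signals per link direction; that reading of the last two factors is an assumption, as are all the names used.

```python
NODES_PER_PB = 16
DRAM_CHIPS_PER_NODE = 9      # per J&T Module, as listed in the talk
CHIP_CAPACITY_MBIT = 256

def board_memory_mbyte():
    """Total DRAM on one processing board, in MByte."""
    total_mbit = NODES_PER_PB * DRAM_CHIPS_PER_NODE * CHIP_CAPACITY_MBIT
    return total_mbit // 8

def lvds_signal_count(nodes=16, links_per_node=6, directions=2, signals=9):
    """Differential signal count; the meaning of the last two factors is assumed."""
    return nodes * links_per_node * directions * signals

print(board_memory_mbyte())   # MByte of DRAM per board
print(lvds_signal_count())    # LVDS differential signals per board
```

The memory total of 4608 MByte is consistent with the quoted 4.6 GB, and the product of the four factors gives the quoted 1728 signals.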

J&T Module
• J&T
• 9 DDR-SDRAM, 256 Mbit (x16) memory chips
• 6 LVDS links, up to 400 MB/s
• Host fast I/O link (7th link)
• I2C link (slow control network)
• Dual power 2.5 V + 1.8 V, 7-10 W estimated
• Dominant technologies:
  – SSTL-II (memory interface)
  – LVDS (network interface + I/O)

Overview of the J&T Architecture
• Peak floating point performance of about 1.6 GFlops
  – IEEE compliant double precision
• Integer arithmetic performance of about 400 MIPS
• Link bandwidth of about 200 MByte/s each
  – full duplex
  – 7 links: X+, X-, Y+, Y-, Z+, Z- and the 7th link
• Support for current generation DDR memory
• Memory bandwidth of 3.2 GByte/s
  – 400 Mword/s
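The two memory figures imply a word size, and they can be cross-checked against the 128 bit local memory channel and 200 MHz clock given in the J&T summary. This is a back-of-the-envelope sketch; the assumption of one 128 bit transfer per clock cycle is mine, and the names are hypothetical.

```python
MEM_BANDWIDTH_BYTES_PER_S = 3_200_000_000  # 3.2 GByte/s, from the talk
WORD_RATE_PER_S = 400_000_000              # 400 Mword/s, from the talk

# Word size implied by the two quoted figures
bytes_per_word = MEM_BANDWIDTH_BYTES_PER_S // WORD_RATE_PER_S

def channel_bandwidth_bytes_per_s(width_bits=128, clock_hz=200_000_000):
    """Bandwidth of a memory channel, assuming one full-width transfer per clock."""
    return (width_bits // 8) * clock_hz

print(bytes_per_word)                   # bytes per word
print(channel_bandwidth_bytes_per_s())  # bytes per second
```

Under that assumption the word is 8 bytes (64 bit) and the channel delivers the quoted 3.2 GByte/s.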

J&T
• Computing & control integrated
  – No glue logic
  – Reduced time for design, simulation and test of the prototype

J&T: Top Level Diagram

The J&T Arithmetic BOX
• Pipelined complex "normal" a*b+c (8 flops) per cycle
• 4 multipliers, 4 adder/sub
• At 200 MHz (fully piped) = 1.6 GFlops
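The count of 8 flops per complex normal operation can be seen by writing a*b+c out in real arithmetic. The sketch below is an illustration of the operation count only, not a description of the hardware pipeline; the function name `normal_op` and the constants are hypothetical.

```python
CLOCK_MHZ = 200  # J&T clock frequency, from the talk

def normal_op(a, b, c):
    """Complex a*b + c expanded into real operations; returns (result, flops)."""
    ar, ai = a.real, a.imag
    br, bi = b.real, b.imag
    cr, ci = c.real, c.imag
    # 4 real multiplications
    p1 = ar * br
    p2 = ai * bi
    p3 = ar * bi
    p4 = ai * br
    # 4 real additions/subtractions
    re = (p1 - p2) + cr
    im = (p3 + p4) + ci
    return complex(re, im), 4 + 4

result, flops = normal_op(1 + 2j, 3 - 1j, 0.5 + 0.5j)
peak_gflops = flops * CLOCK_MHZ / 1000  # one normal operation per cycle
print(result, flops, peak_gflops)
```

Four multiplications and four additions or subtractions match the four multipliers and four adder/subtractors on the slide, and one such operation per cycle at 200 MHz gives the 1.6 GFlops peak.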

The J&T remote IO
• FIFO-based communication:
  – LVDS
  – 1.6 Gb/s per link (8 bit @ 200 MHz)
  – 6 (+1) independent links
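The per-link rate follows from the link width and clock. A minimal sketch, assuming one 8 bit transfer per clock cycle in each direction; the aggregate over the six mesh links is derived here and is not a figure quoted in the talk, and the names are hypothetical.

```python
LINK_WIDTH_BITS = 8
LINK_CLOCK_MHZ = 200
MESH_LINKS = 6  # X+, X-, Y+, Y-, Z+, Z- (the 7th link is counted separately)

def link_rate_mbit_s():
    """Raw rate of one link direction in Mbit/s."""
    return LINK_WIDTH_BITS * LINK_CLOCK_MHZ

def link_rate_mbyte_s():
    """Raw rate of one link direction in MByte/s."""
    return link_rate_mbit_s() // 8

def mesh_rate_mbyte_s():
    """Aggregate outgoing rate over the six mesh links (derived, not quoted)."""
    return MESH_LINKS * link_rate_mbyte_s()

print(link_rate_mbit_s(), link_rate_mbyte_s(), mesh_rate_mbyte_s())
```

This reproduces the 1.6 Gb/s per link on this slide and the 200 MB/s per link quoted elsewhere in the talk.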

J&T summary
• CMOS 0.18 µm, 7 metal (ATMEL)
• 200 MHz
• Double precision complex normal operation
• 64 bit AGU
• 8 KW program cache
• 128 bit local memory channel
• 6+1 LVDS 200 MB/s links
• BGA package, 600 pins

Key steps of the J&T design
✔ January 2001: VHDL design starts
✔ May 2001: Contract with Atmel established
✔ November 2001: First placement experiment started
✔ February 2002: Major rework on the network protocol (to increase robustness against transmission errors)
✔ April 2002: Network OK, re-start placement exercises
✔ June 2002 (end): Good placement available. Satisfactory routing available
✔ July 2002 (beginning): Power routing not OK and 5% of "random logic" removed
✔ July 2002 (end): Both problems solved
(Continues on next slide)

Key steps of the J&T design (2)
✔ September 2002: New placement available (with new power layout)
✔ September 2002: Excessive congestion OR very bad timing closure
✔ October 2002: Satisfactory placement OK
✔ November 2002: Successful routing completed. Timing analysis reasonably satisfactory
✔ Dec. 9th 2002: Simulations with back annotation OK
✔ January 2003: Analysis of critical path (dangerous and not)
✔ January 2003: Hammering down remaining timing problems
✔ February 2003: Careful analysis of all risky corners
✔ February 2003: Transfer of simulation data to Atmel
✔ End of March: Final sign off (Laura …. is working on it…..)

Timing
• J&T ready June 03
  – We will receive between 300 and 600 chips
  – We need 256 processors to assemble a crate!
• We expect them to work!
  – The same team designed 7 ASICs of similar complexity
  – Impressive fully detailed simulations of multiple J&T systems
  – The more one simulates, the less one has to test!
• Everything else ready and tested
  – Within days/weeks the first working apeNEXT computer will operate
• September '03: mass production will start (hopefully) at Neuricam
  – INFN has already funded 8 TFlops of computing power!

Mechanics
PB constraints:
• Power consumption: up to 340 W
• PB-BP insertion force: 80-150 kg (!)
• Fully populated PB weight: 4-5 kg
Detailed study of airflow
Custom design of card frame and insertion tool
[Figure: top view (local) of the apeNEXT PB showing the air-flow channels, DC/DC converters, J&T Module, board-to-board connector and J&T Module frame]

PB Prototype
• T, V, I monitored
• Interfaced to I2C control network

PB (preliminary) test
• Next test-bed: metal frame with power supply
• I2C test, i.e. test of the "slow-control" I/O interface
  – minimal set of components assembled
  – simple/short test (1 week)
  – done successfully (Dec 01)
• Clock distribution test
• PB LVDS characterization

PB Status
Activity | Status | Who | Cost
• PB development (inc. feasibility study and LVDS EVB) | Done | Neuricam | 67 KEuro
• PB ver. 1 prototypes (3) | Done | Neuricam, DDI | 10 KEuro
• PB ver. 2 prototypes (3) | | Neuricam, SOMACIS | 10 KEuro
• J&T Module develop. | Done | Neuricam | 23 KEuro

NEXT BackPlane
• 16 PB slots + root slot
• Size 447 x 600 mm²
• 4600 LVDS differential signals, point-to-point, up to 600 Mb/s
• 16 controlled-impedance layers (32 total)
• Press-fit only
• Erni/Tyco connectors
  – ERMET-ZD
• Providers: APW (primary), ERNI (2nd source)

Activity | Status | Who | Cost | Note
• BP development | Done | APW (ERNI) | 32 KEuro | connector kit cost: 7 KEuro (!)
• BP prototypes (3) | Done | APW | 41 KEuro | PB insertion force: 80-150 kg (!)

Host I/O Interface
• Altera APEX II based
• PCI interface, 64 bit, 66 MHz
• PCI Master mode for the 7th link interface
• PCI Target mode for the I2C interface
• 7th link: 1(2) bidir. chan. (200*9 M/s)
• I2C: 4 independent ports
• QuadDataRate memory (x32)
• PCI form factor
[Block diagram: Altera APEX II with PCI interface (PLDA), PCI master and target controllers, QDR memory controller and bank, FIFOs, 7th link controllers and port, I2C controller (x4); PCI (64 bit, 66 MHz)]

Activity | Status | Who
• Altera design | Done | INFN
• PCB design and prototypes | Done | NEURICAM
Cost: 3 KE

Cabinets
• Problem:
  – PB weight: 4-5 kg, PB consumption: 340 W (est.)
  – 32 PB + 2 root boards (2 independent subcrates)
  – Power supply: (< 48 V x 150 A per subcrate)
  – Integrated host PCs
  – Forced air cooling
  – Robust, expandable/modular, CE, EMC ...
• Solution:
  – 42U rack (h: 2.10 m): EMC proof, efficient cable routing
  – 19"-1U slots for 9 "host PCs" (rack mounted)
  – Hot-swap power supply cabinet (modular)
  – Custom design of "card cage" and "tie bar"
  – Custom design of cooling system

Activity | Status | Who | Cost
• Design of rack (inc. selection of power supply) | Done (Apr '02) | APW (NEURICAM) | 50 KEuro
• Full rack prototype | Done (Sept '02) | APW | 8-10 KEuro

Software
• TAO compilers and linker ….. READY
  – All existing APE programs will run with no change
  – Physics code has already been run on the simulator
• Kernel of PHYSICS codes
  – used to benchmark the efficiencies of the FP unit
• C COMPILER
  – gcc (2.93) and lcc have been retargeted
  – lcc WORKS (almost). Factor 5 on performance
  http://www.cs.princeton.edu/software/lcc/

Project Costs
• Total development cost of 1700 k€uro
  – 1050 k€uro for VLSI development
  – 550 k€uro non-VLSI
• Manpower involved = 20 man/year
• Mass production cost ~0.5 €uro/MFlops
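The per-MFlops figure gives a rough feel for the price of an installation. The sketch below is only an estimate under a strong assumption, namely that the ~0.5 €uro/MFlops figure applies linearly to peak performance; the function name is hypothetical, and the 8 TFlops input is the INFN installation size mentioned in the talk.

```python
COST_EURO_PER_MFLOPS = 0.5  # approximate mass production cost, from the talk

def production_cost_keuro(peak_tflops):
    """Rough production cost in kEuro for a given peak performance in TFlops."""
    peak_mflops = peak_tflops * 1_000_000
    return peak_mflops * COST_EURO_PER_MFLOPS / 1000

print(production_cost_keuro(8))  # rough cost in kEuro for 8 TFlops of peak
```

With these inputs, 8 TFlops of peak performance would cost on the order of 4000 k€uro.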

Conclusions
• J&T ready June 03 (300 … 600 chips)
• Everything else ready and tested!
• If tests OK
  – mass production starting September '03 at Neuricam
• All components over-dimensioned
  – Cooling, LVDS tested @ 400 Mb/s, power supply on boards …
• Makes possible a technology step with no extra design and test effort

Conclusions (2)
• Installation plans
  – INFN: 8 TFlops (10 cabinets) already approved (on delivery of a working machine)
  – DESY: considering between 8 TFlops and 16 TFlops
  – Paris ……….
• Inversion of Dirac operator (APEmille program): 54% efficiency on the VHDL hardware simulator
  – Communications, memory refresh, synchronization wait …… all included …
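To give a sense of scale, the 54% efficiency can be applied to a peak figure. This is a hypothetical extrapolation: the talk reports the efficiency on the VHDL simulator only, and applying it unchanged to a full crate of 512 nodes at 1.6 GFlops each is an assumption made here for illustration. The function name is also hypothetical.

```python
DIRAC_EFFICIENCY = 0.54  # measured on the VHDL hardware simulator, from the talk

def sustained_gflops(peak_gflops, efficiency=DIRAC_EFFICIENCY):
    """Sustained performance implied by a peak figure and an efficiency."""
    return peak_gflops * efficiency

crate_peak = 512 * 1.6  # GFlops, assuming a crate of 512 nodes
print(f"{sustained_gflops(crate_peak):.1f} GFlops sustained per crate (extrapolated)")
```

Under that assumption a crate would sustain somewhat over 440 GFlops on the Dirac operator inversion.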

apeNEXT vs. cluster
[Figure: comparison with labels 409.6 GFlops, 72.5 GFlops and 819.2 GFlops (1.6*16*16*2 GFlops)]

ASICs of similar complexity
• ADD322
  – 3 input integer adder. Prototype for APE100, integrated into ZCPU
• MAD
  – APE100 floating point engine
• ZCPU
  – APE100 sequencer + integer ALU + AGU
• Commuter
  – APE100 communication device
• T1000
  – APEmille integer ALU + AGU + program controller
• J1000
  – APEmille floating point engine
• COMM1000
  – APEmille communication device