Configuration and Programming of Heterogeneous Multiprocessors on a Multi-FPGA System Using TMD-MPI

Manuel Saldaña, Daniel Nunes, Emanuel Ramalho, and Paul Chow
University of Toronto, Department of Electrical and Computer Engineering

3rd International Conference on Reconfigurable Computing and FPGAs (ReConFig 06)
San Luis Potosi, Mexico, September 2006

Agenda
• Motivation
• Background
  ◦ TMD-MPI
  ◦ Classes of HPC
  ◦ Design Flow
• New Developments
• Example Application
  ◦ Heterogeneity test
  ◦ Scalability test
• Conclusions

Motivation: How Do We Program This?
• 64-MicroBlaze MPSoC
• (Ring, 2D-Mesh) topologies
• XC4VLX160 (not the largest one!)

Motivation: How Do We Program This?
• 512-MicroBlaze multiprocessor system (connected by a network)

Background: Classes of HPC Machines
• Class 1 machines: supercomputers or clusters of workstations
• Class 2 machines: hybrid network of CPU and FPGA hardware; the FPGA acts as an external co-processor to the CPU
• Class 3 machines: FPGA-based multiprocessors; a recent area of academic and industrial focus

Background: MPSoC and MPI
• MPSoC (Class 3) has many similarities to typical multiprocessor computers (Class 1), but also many special requirements
• Similar concepts but different implementations
• MPI for MPSoC is desirable (TIMA labs, OpenFPGA, Berkeley BEE2, U. of Queensland, U. Rey Juan Carlos, UofT TMD, ...)
• MPI is a broad standard designed for big machines
• MPI implementations are too big for embedded systems

Background: TMD-MPI
The same code runs on an MPSoC using TMD-MPI and on a Linux cluster using MPICH. [Diagram: µPs connected by a network on both platforms]
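
As an illustration, a minimal program like the following, which uses only calls from TMD-MPI's implemented subset, could compile unchanged against MPICH on a cluster and against TMD-MPI on the MPSoC (this is our sketch, not the authors' code):

    /* Minimal sketch: each processor reports its rank. Only functions
     * from TMD-MPI's subset (MPI_Init, MPI_Comm_rank, MPI_Comm_size,
     * MPI_Finalize) are used, so the same source serves both targets. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("processor %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }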

Background: TMD-MPI
Use multiple chips to have massive resources; TMD-MPI hides the complexity. [Diagram: µPs spread across several networked chips]

Background: TMD-MPI Implementation Layers (top to bottom)
• Application (e.g., calls MPI_Barrier)
• MPI Application Interface: MPI_Send/MPI_Recv
• Point-to-Point MPI Communication Functions: csend/send
• Hardware Access Functions: fsl_cput/fsl_put (macros), put/get (assembly instructions)
• Hardware
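
To make the layering concrete, here is a hedged sketch, ours rather than TMD-MPI's actual internals, of how a blocking send could descend through these layers on a MicroBlaze; the helper names, header framing, and FSL channel are illustrative assumptions:

    /* Sketch of the layer stack; names and framing are illustrative. */

    /* Layer 1, hardware access: a wrapper in the spirit of the fsl_put
     * macros, which expand to MicroBlaze 'put' instructions on a Fast
     * Simplex Link (channel 0 is an arbitrary choice here). */
    static inline void hw_put_word(unsigned int w)
    {
        /* e.g., putfsl(w, 0); on a real MicroBlaze toolchain */
        (void)w;
    }

    /* Layer 2, point-to-point: send a header word, then the payload. */
    static void csend_sketch(const unsigned int *buf, int len, int dest)
    {
        hw_put_word(((unsigned int)dest << 16) | (unsigned int)len);
        for (int i = 0; i < len; i++)
            hw_put_word(buf[i]);
    }

    /* Layer 3, MPI application interface: a send reduces to csend. */
    int mpi_send_sketch(const void *buf, int count, int dest)
    {
        csend_sketch((const unsigned int *)buf, count, dest);
        return 0;
    }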

Background: TMD-MPI Functions Implemented
• Miscellaneous: MPI_Init, MPI_Finalize, MPI_Comm_Rank, MPI_Comm_Size, MPI_Wtime
• Point-to-Point: MPI_Send, MPI_Recv
• Collective Operations: MPI_Barrier, MPI_Bcast, MPI_Gather, MPI_Reduce
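
This subset is enough for a typical broadcast-compute-reduce pattern; a sketch of ours using only the listed calls:

    /* Sketch using only collectives from the implemented subset:
     * broadcast a parameter, compute a local share, and reduce the
     * partial results onto rank 0. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size, n = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0)
            n = 100;                                  /* problem size */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); /* share it */

        int local = n / size, total = 0;              /* local share */
        MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0)
            printf("total = %d\n", total);
        MPI_Finalize();
        return 0;
    }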

Background: Design Flow
Flexible hardware-software co-design flow. Previous work:
• Patel et al. [1] (FCCM 2006)
• Saldaña et al. [2] (FPL 2006)
• This work (ReConFig 06)

New Developments
• TMD-MPI for MicroBlaze
• TMD-MPI for PowerPC 405
• TMD-MPE for hardware engines

New Developments: TMD-MPE and TMD-MPI light
[Diagram: a µP runs TMD-MPI or TMD-MPI light; a hardware engine gains message passing through the TMD-MPE]

New Developments
TMD-MPE uses the Rendezvous message-passing protocol.
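
A minimal sketch of a rendezvous exchange follows; the net_* helpers are hypothetical transport primitives invented for illustration, not TMD-MPE's interface. The key property is that the sender moves no data until the receiver has posted a matching receive, so large intermediate buffers are unnecessary:

    /* Rendezvous sketch. The net_* helpers below are hypothetical
     * word-level transport primitives, declared but not defined here. */
    enum msg_type { REQUEST_TO_SEND, CLEAR_TO_SEND, DATA };

    void net_send(int peer, enum msg_type t, int len);
    void net_wait(int peer, enum msg_type t);
    void net_send_data(int peer, const int *buf, int len);
    void net_recv_data(int peer, int *buf, int len);

    void rendezvous_send(int dest, const int *buf, int len)
    {
        net_send(dest, REQUEST_TO_SEND, len); /* 1. announce message, size  */
        net_wait(dest, CLEAR_TO_SEND);        /* 2. block until peer ready  */
        net_send_data(dest, buf, len);        /* 3. transfer the payload    */
    }

    void rendezvous_recv(int src, int *buf, int len)
    {
        net_wait(src, REQUEST_TO_SEND);       /* matching request arrives   */
        net_send(src, CLEAR_TO_SEND, 0);      /* grant: receive is posted   */
        net_recv_data(src, buf, len);         /* payload lands directly in
                                                 the user buffer            */
    }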

New Developments
TMD-MPE includes:
• message queues to keep track of unexpected messages
• packetizing/depacketizing logic to handle large messages
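
As a software analogue of the packetizing logic, a long message can be split into bounded packets like this (a sketch under an assumed packet size and invented helper names; the TMD-MPE implements this in hardware):

    /* Packetizing sketch: split a long message into bounded packets.
     * PKT_PAYLOAD_WORDS and the send_* helpers are assumptions made
     * for illustration. */
    #define PKT_PAYLOAD_WORDS 16

    void send_packet_header(int dest, int payload_words);
    void send_packet_payload(const int *words, int n);

    void packetize(int dest, const int *msg, int words)
    {
        int sent = 0;
        while (sent < words) {
            int chunk = words - sent;
            if (chunk > PKT_PAYLOAD_WORDS)
                chunk = PKT_PAYLOAD_WORDS;
            send_packet_header(dest, chunk);        /* per-packet header */
            send_packet_payload(msg + sent, chunk); /* bounded payload   */
            sent += chunk;
        }
    }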

Heterogeneity Test
Heat equation application / Jacobi iterations: observe the change of temperature distribution over time. [Diagrams: processing elements connected through TMD-MPI and TMD-MPE]
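
The kernel each processing element runs is a Jacobi update of its sub-grid; a minimal serial version (our sketch, with an arbitrary grid size and the boundary exchange omitted) looks like:

    /* Jacobi update sketch for the heat equation: each interior point
     * becomes the average of its four neighbours. N is an illustrative
     * grid size; the real application partitions the grid across
     * processing elements and exchanges boundary rows with
     * MPI_Send/MPI_Recv. */
    #define N 64

    void jacobi_step(double u_new[N][N], const double u[N][N])
    {
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                u_new[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                      u[i][j-1] + u[i][j+1]);
    }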

Heterogeneity Test
MPSoC heterogeneous configurations (9 processing elements, single FPGA)

Heterogeneity Test
Execution time [chart comparing Jacobi hardware engines, PPC405s, and MicroBlazes]

Scalability Test
• Heat equation application
• 5 FPGAs (XC2VP100), with 7 MicroBlazes + 2 PPC405s per FPGA
• 45 processing elements (35 MicroBlazes + 10 PPC405s)

Scalability Test
Fixed-size speedup up to 45 processors [chart]

UofT TMD Prototype

Conclusions
• TMD-MPI and TMD-MPE enable the parallel programming of heterogeneous MPSoCs across multiple FPGAs, including hardware engines
• TMD-MPI hides the complexity of using heterogeneous links
• The heat equation application code was executed on a Linux cluster and on our multi-FPGA system with minimal changes
• TMD-MPI can be adapted to a particular architecture
• The TMD prototype is a good platform for further research on MPSoCs

References
[1] Arun Patel, Christopher Madill, Manuel Saldaña, Christopher Comis, Régis Pomès, and Paul Chow. A Scalable FPGA-based Multiprocessor. In IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'06), April 2006.
[2] Manuel Saldaña and Paul Chow. TMD-MPI: An MPI Implementation for Multiple Processors Across Multiple FPGAs. In IEEE International Conference on Field-Programmable Logic and Applications (FPL 2006), August 2006.

Thank you! (¡Gracias!)

Rendezvous Overhead
Rendezvous synchronization overhead [chart]

Testing the Functionality
TMD-MPIbench round-trip tests exercise on-chip and off-chip communication, with both internal RAM (BRAM) and external RAM (DDR).

TMD-MPI Implementation
TMD-MPI communication protocols [diagram]

Communication Tests
TMD-MPIbench.c measures:
• round trips
• bisection bandwidth
  ◦ round trips with congestion
  ◦ worst-case traffic scenario
• all-node broadcasts
• synchronization performance (barriers/sec)
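
A round-trip measurement in the spirit of TMD-MPIbench.c (our sketch, not the benchmark's actual source) bounces a message between two ranks and times the exchange with MPI_Wtime:

    /* Round-trip sketch: rank 0 sends to rank 1 and waits for the echo. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, buf = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        if (rank == 0) {
            MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("round trip: %f s\n", MPI_Wtime() - t0);
        } else if (rank == 1) {
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }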

Communication Tests
Latency:
• Testbed (internal link) @ 40 MHz: 17 µs
• Testbed (external link) @ 40 MHz: 22 µs
• P3-NOW (100 Mb/s Ethernet): 75 µs
• P4-Cluster (1000 Mb/s Gigabit Ethernet): 92 µs

Communication Tests
MicroBlaze throughput limit with external RAM [chart]

Communication Tests
[Chart annotating memory access time and the MicroBlaze throughput limits with internal and external RAM]

Communication Tests
Measured bandwidth @ 40 MHz [chart; annotations: startup overhead, frequency, P4-Cluster, P3-NOW]

Communication Tests [chart]

Many variables are involved…

Background: TMD-MPI provides a parallel programming model for MPSoCs in FPGAs with the following features:
• Portability: application unaffected by changes in HW
• Flexibility: to move from generic to application-specific
• Scalability: for large-scale applications
• Reusability: no need to learn a new API for similar applications

Testing the Functionality
Hardware testbed [photos]

New Developments: TMD-MPE use and the network [diagram]

Background: TMD-MPI
TMD-MPI:
• is a lightweight subset of the MPI standard
• is tailored to a particular application
• does not require an operating system
• has a small memory footprint (~8.7 KB)
• uses a simple protocol

New Developments: TMD-MPE and TMD-MPI light
[Diagram: TMD-MPI light]