Automated Systolic Array Architecture Synthesis for High Throughput CNN Inference on AWS F1 FPGA

Automated Systolic Array Architecture Synthesis for High Throughput CNN Inference on AWS F1 FPGA

Xuechao Wei, Peng Zhang, Cody Hao Yu, and Jim Wu

  • Center for Energy-efficient Computing and Applications, School of EECS, Peking University, China
  • Computer Science Department, University of California, Los Angeles, CA, USA
  • Falcon Computing Solutions, Inc, Los Angeles, CA, USA

Falcon Computing Solutions

  • An early-stage company focused on FPGA-based acceleration solutions, with offices in Santa Clara, Los Angeles, and Beijing
  • Vision: provide seamless acceleration solutions that deliver high performance and energy efficiency for compute-intensive applications on-premises or in the cloud
  • Leveraging years of research under co-founder Dr. Jason Cong
      • Chancellor's Professor and Director of the Center for Domain-Specific Computing at UCLA
  • Has raised more than $10M in venture funding in the past 2 years
  • Executive team from Intel, Altera, Xilinx, Synopsys, Magma
      • >30 years in the FPGA industry, >30 years of university research

DNN Design Challenges on FPGAs

  • High-performance, high-throughput DNN architecture
  • Maximum resource utilization and frequency on SSI devices
  • Irregularities in different DNN layers
  • Design portability across different DNN models and FPGA devices

Systolic Array Architecture
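
As a rough illustration of the dataflow a systolic array implements, the sketch below simulates a small output-stationary array computing a matrix product cycle by cycle. It is a generic textbook-style model, not the architecture from this talk: the sizes `ROWS`, `COLS`, `DEPTH`, the register names, and all function names are assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical sizes: C = A (ROWS x DEPTH) * B (DEPTH x COLS). */
enum { ROWS = 3, COLS = 4, DEPTH = 5 };

/* Cycles until the last PE has consumed its last operand pair. */
static int cycles_needed(void) { return DEPTH + ROWS + COLS - 2; }

/* Output-stationary array: PE(i,j) keeps acc[i][j] in place,
 * A values move one PE to the right per cycle, B values move one PE down. */
static void systolic_matmul(int A[ROWS][DEPTH], int B[DEPTH][COLS],
                            int acc[ROWS][COLS]) {
    int a_reg[ROWS][COLS] = {{0}};  /* A operand currently held by each PE */
    int b_reg[ROWS][COLS] = {{0}};  /* B operand currently held by each PE */

    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            acc[i][j] = 0;

    for (int t = 0; t < cycles_needed(); t++) {
        /* Shift A to the right; row i is fed with a skew of i cycles. */
        for (int i = 0; i < ROWS; i++) {
            for (int j = COLS - 1; j > 0; j--)
                a_reg[i][j] = a_reg[i][j - 1];
            int k = t - i;
            a_reg[i][0] = (k >= 0 && k < DEPTH) ? A[i][k] : 0;
        }
        /* Shift B downward; column j is fed with a skew of j cycles. */
        for (int j = 0; j < COLS; j++) {
            for (int i = ROWS - 1; i > 0; i--)
                b_reg[i][j] = b_reg[i - 1][j];
            int k = t - j;
            b_reg[0][j] = (k >= 0 && k < DEPTH) ? B[k][j] : 0;
        }
        /* Every PE performs one multiply-accumulate on its local operands. */
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                acc[i][j] += a_reg[i][j] * b_reg[i][j];
    }
}

/* Compare the simulated array against a plain triple-loop product. */
static void check_systolic_against_reference(void) {
    int A[ROWS][DEPTH], B[DEPTH][COLS], got[ROWS][COLS];
    for (int i = 0; i < ROWS; i++)
        for (int k = 0; k < DEPTH; k++)
            A[i][k] = i + 2 * k + 1;
    for (int k = 0; k < DEPTH; k++)
        for (int j = 0; j < COLS; j++)
            B[k][j] = 3 * k - j;

    systolic_matmul(A, B, got);

    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++) {
            int want = 0;
            for (int k = 0; k < DEPTH; k++)
                want += A[i][k] * B[k][j];
            assert(got[i][j] == want);
        }
}
```

Calling `check_systolic_against_reference()` from a `main` function runs the simulation and checks it against the reference product. Each PE only ever talks to its left and upper neighbours, which is the property that keeps wiring local and makes this style of array attractive on FPGAs.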

Stacked Systolic Array Architecture

Two-Phase Design Space Exploration

    for (i = 0; i < 128; i++)
      for (o = 0; o < 192; o++)
        for (c = 0; c < 13; c++)
          for (r = 0; r < 13; r++)
            for (p = 0; p < 3; p++)
              for (q = 0; q < 3; q++)
                out[o][r][c] += w[o][i][p][q] * in[i][r+p][c+q];

  • Determine the single systolic array structure
  • Determine the number of systolic arrays
  • Stream buffer management
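
To make the first step more concrete, here is a minimal sketch of how the shape of a single array could be searched for the loop nest above. It is not the authors' algorithm. It assumes a hypothetical mapping (loop `o` to PE rows, loop `c` to PE columns, loop `i` to SIMD lanes inside each PE), one multiply-accumulate per lane per cycle, no memory stalls, and a PE budget supplied by the caller; `Shape`, `layer_cycles`, and `best_shape` are illustrative names.

```c
#include <assert.h>

/* Loop bounds taken from the loop nest above. */
enum { I = 128, O = 192, R = 13, C = 13, P = 3, Q = 3 };

typedef struct {
    int rows, cols, vec;  /* PE rows, PE columns, SIMD lanes per PE */
    long cycles;          /* estimated cycles for the layer, -1 if none found */
} Shape;

static long ceil_div(long a, long b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Estimated cycles under the assumed mapping: o -> rows, c -> cols, i -> lanes.
 * Loops that do not divide evenly waste part of the array on the last pass. */
static long layer_cycles(int rows, int cols, int vec) {
    return ceil_div(O, rows) * ceil_div(C, cols) * ceil_div(I, vec)
           * (long)R * P * Q;
}

/* Exhaustive search over shapes that fit within pe_budget multipliers. */
static Shape best_shape(long pe_budget) {
    Shape best = {0, 0, 0, -1};
    for (int rows = 1; rows <= O; rows++)
        for (int cols = 1; cols <= C; cols++)
            for (int vec = 1; vec <= I; vec++) {
                if ((long)rows * cols * vec > pe_budget)
                    continue;
                long cyc = layer_cycles(rows, cols, vec);
                if (best.cycles < 0 || cyc < best.cycles) {
                    best.rows = rows;
                    best.cols = cols;
                    best.vec = vec;
                    best.cycles = cyc;
                }
            }
    return best;
}
```

Under these assumptions, `layer_cycles(16, 13, 8)` describes an array of 1,664 multipliers that needs 12 × 1 × 16 × 117 = 22,464 cycles for the layer. A shape with 10 columns would instead need two passes over `c` and leave 7 of its 20 column slots idle, which is the kind of inefficiency a shape search is meant to avoid.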

Programming Model

Merlin Compiler

  • Pure C/C++ based flow enabling SW programmers to develop FPGA-accelerated applications
  • Highly integrated flow with automatic optimization, greatly improving productivity
  • Advanced code transformation delivering the highest QoR without FPGA expertise

[Flow diagram labels: C/C++, Kernel Code to Accelerate, Merlin Compiler, GCC, Library, Merlin Optimization, FPGA Binary, CPU, FPGA]

Experiment Result

  • Baseline: single-layer implementation on AWS F1
  • Irregularity: stacked systolic array without floorplanning gives higher computation efficiency
  • Frequency: stacked systolic array with floorplanning gives higher clock frequency

Summary

  • A low-latency DNN accelerator design based on stacked systolic arrays, achieving 2 TOPs on an AWS F1 FPGA
  • An automated resource partitioning algorithm between systolic arrays and FPGA dies for multiple DNN layers, achieving 90% resource utilization and 240 MHz
  • An end-to-end automation flow from high-level C code to FPGA-accelerated DNNs in datacenters. We implement a push-button automation

THANK YOU!