CS 295: Modern Systems
What Are FPGAs and Why Should You Care
Sang-Woo Jun
Spring, 2019

What Are FPGAs
• Field-Programmable Gate Array
• Can be configured to act like any circuit (more later!)
• Can do many things, but we focus on computation acceleration

FPGAs Come In Many Forms
• PCIe-Attached
• CPU Integrated
• In-Storage
• In-Network

How Is It Different From CPU/GPUs
• GPU: The other major accelerator
• CPU/GPU hardware is fixed
  o “General purpose”
  o We write programs (sequence of instructions) for them
• FPGA hardware is not fixed
  o “Special purpose”
  o Hardware can be whatever we want
  o Will our hardware require/support software? Maybe!
• Optimized hardware is very efficient
  o GPU-level performance**
  o 10x power efficiency (300 W vs 30 W)

Analogy
• CPU/GPU comes with fixed circuits
• FPGA gives you a big bag of components to build whatever
  o Could be a CPU/GPU!
[Images: “The Z-Berry”; “Experimental Investigations on Radiation Characteristics of IC Chips”; benryves.com “Z80 Computer”; Shadi Soundation: Homebrew 4 bit CPU]

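Fine-Grained Parallelism of Special-Purpose Circuits
• 4 cycles with basic operations
  o Cycle 1: $A = G \times m_1$, $C = x_1 - x_2$, $E = y_1 - y_2$
  o Cycle 2: $B = A \times m_2$, $D = C^2$, $F = E^2$
  o Cycle 3: $H = D + F$
  o Cycle 4: $\mathrm{Ret} = B / H$
• 3 cycles with compound operations
  o Cycle 1: $A = G \times m_1 \times m_2$, $B = (x_1 - x_2)^2$, $C = (y_1 - y_2)^2$
  o Cycle 2: $D = B + C$
  o Cycle 3: $\mathrm{Ret} = A / D$
• 1 cycle with even further compound operations
  o $\mathrm{Ret} = (G \times m_1 \times m_2) / ((x_1 - x_2)^2 + (y_1 - y_2)^2)$
• Compound operations may slow down the clock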
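The cycle counts on this slide follow from the longest dependency chain among the operations. Below is a minimal Python sketch that computes it, assuming every operation takes exactly one cycle and that independent operations run in parallel. The dictionaries and function names are illustrative, not from the slides. Here `H` names the sum `D + F` so that it does not clash with the constant `G`, and the compound version divides `A` by `D`.

```python
# Each operation maps to the operations it depends on.
# Raw inputs (G, m1, m2, x1, x2, y1, y2) are ready at cycle 0 and are not listed.
basic_ops = {
    "A": [],            # A = G * m1
    "C": [],            # C = x1 - x2
    "E": [],            # E = y1 - y2
    "B": ["A"],         # B = A * m2
    "D": ["C"],         # D = C^2
    "F": ["E"],         # F = E^2
    "H": ["D", "F"],    # H = D + F
    "Ret": ["B", "H"],  # Ret = B / H
}

compound_ops = {
    "A": [],            # A = G * m1 * m2
    "B": [],            # B = (x1 - x2)^2
    "C": [],            # C = (y1 - y2)^2
    "D": ["B", "C"],    # D = B + C
    "Ret": ["A", "D"],  # Ret = A / D
}


def schedule(ops):
    """Return the earliest cycle in which each operation can complete."""
    cycle = {}

    def finish(name):
        if name not in cycle:
            # An operation completes one cycle after its slowest dependency.
            cycle[name] = 1 + max((finish(dep) for dep in ops[name]), default=0)
        return cycle[name]

    for name in ops:
        finish(name)
    return cycle


def cycles_needed(ops):
    """Total latency is the depth of the longest dependency chain."""
    return max(schedule(ops).values())


print(cycles_needed(basic_ops), cycles_needed(compound_ops))
```

Folding more work into one operation shortens the chain, but each operation then takes longer to settle, which is why compound operations may slow down the clock.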

Coarse-Grained Parallelism of Special-Purpose Circuits
• Typical unit of parallelism for general-purpose units is threads ~= cores
• Special-purpose processing units can also be replicated for parallelism
  o Large, complex processing units: Few can fit on a chip
  o Small, simple processing units: Many can fit on a chip
• Only generates hardware useful for the application
  o Instruction? Decoding? Cache? Coherence?
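The trade-off between few large units and many small units can be sketched as simple resource arithmetic. All numbers below (the chip budget and per-unit costs) are made-up assumptions for illustration and do not describe any real device.

```python
def max_replicas(chip, unit):
    """Largest number of copies of `unit` that fit within every chip resource."""
    return min(chip[res] // cost for res, cost in unit.items() if cost > 0)


# Hypothetical resource budget of one FPGA (assumed numbers).
chip = {"luts": 500_000, "dsps": 2_000, "brams": 1_500}

# Hypothetical per-unit costs (assumed numbers).
large_unit = {"luts": 120_000, "dsps": 400, "brams": 300}
small_unit = {"luts": 2_000, "dsps": 4, "brams": 2}

print(max_replicas(chip, large_unit))  # limited by LUTs
print(max_replicas(chip, small_unit))  # limited by LUTs
```

The scarcest resource sets the replication limit, so a unit that avoids hardware it does not need leaves room for more copies.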

How Is It Different From ASICs
• ASIC (Application-Specific Integrated Circuit)
  o Special chip purpose-built for an application
  o E.g., ASIC bitcoin miner, Intel neural network accelerator
  o Function cannot be changed once expensively built
• (+) FPGAs can be field-programmed
  o Function can be changed completely whenever
  o FPGA fabric emulates custom circuits
• (-) Emulated circuits are not as efficient as bare-metal
  o ASICs have ~10x the performance (larger circuits, faster clock)
  o ASICs have ~10x the power efficiency

Basic FPGA Architecture
• “Configurable logic block (CLB)”
  o 6-input look-up table (LUT)
  o Latch / flip-flop (FF): sequential circuit construction
• Programmable interconnect
• I/O block
• Ex) 2-LUT for “AND”

  Input 1   Input 2   Output
     0         0         0
     0         1         0
     1         0         0
     1         1         1
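A look-up table stores one output bit per input combination, so "programming" it means filling in a truth table. The sketch below models that idea in Python. `make_lut` is an illustrative helper, not a vendor API, and it assumes the first input is the most significant index bit.

```python
def make_lut(truth_table):
    """Build a LUT: the input bits form an index into the stored truth table."""
    def lut(*bits):
        index = 0
        for bit in bits:
            index = (index << 1) | (bit & 1)  # first input is the most significant bit
        return truth_table[index]
    return lut


# The same "hardware" becomes a different gate just by changing the table contents.
and2 = make_lut([0, 0, 0, 1])  # rows 00, 01, 10, 11
xor2 = make_lut([0, 1, 1, 0])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, and2(a, b), xor2(a, b))
```

A 6-input LUT works the same way with a 64-entry table, which is why one LUT can implement any Boolean function of up to six inputs.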

Basic FPGA Architecture – DSP Blocks
• CLBs act as gates: many needed to implement high-level logic
• Arithmetic operation provided as efficient ALU blocks
  o “Digital Signal Processing (DSP) blocks”
  o Each block provides an adder + multiplier
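To get a feel for why many gates are needed for high-level logic, the sketch below builds an adder out of two-input gates and counts them. It is a simplified gate-level model for illustration only. It assumes five gates per full adder, and it is not how a vendor tool would actually map an adder onto an FPGA.

```python
GATES_PER_FULL_ADDER = 5  # 2 XOR + 2 AND + 1 OR in this simple construction


def full_adder(a, b, carry_in):
    """One-bit full adder built from two-input gates."""
    partial = a ^ b                               # XOR
    total = partial ^ carry_in                    # XOR
    carry_out = (a & b) | (partial & carry_in)    # AND, AND, OR
    return total, carry_out


def ripple_add(x, y, width):
    """Add two `width`-bit numbers bit by bit; also report the gate count."""
    carry = 0
    result = 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry, width * GATES_PER_FULL_ADDER


print(ripple_add(100, 55, 8))
```

Under these assumptions even a single 8-bit addition costs dozens of gates, and a multiplier costs far more, which is the motivation for dedicated arithmetic blocks.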

Basic FPGA Architecture – Block RAM
• CLB can act as flip-flops
  o (~1 bit/block): tiny!
• Some on-chip SRAM provided as blocks
  o ~18/36 Kbit/block, MBs per chip
  o Massively parallel access to data → multi-TB/s bandwidth
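The capacity and bandwidth claims can be checked with back-of-the-envelope arithmetic. The block count, port width, and clock frequency below are assumptions chosen for illustration and do not describe a specific device. Only the 36 Kbit block size comes from the slide.

```python
def bram_capacity_bytes(num_blocks, kbit_per_block):
    """Total on-chip block RAM capacity in bytes."""
    return num_blocks * kbit_per_block * 1024 // 8


def bram_bandwidth_bytes_per_s(num_blocks, port_width_bits, clock_hz):
    """Aggregate bandwidth if every block delivers one word per cycle."""
    return num_blocks * port_width_bits * clock_hz // 8


NUM_BLOCKS = 2_000          # assumed number of blocks on the chip
PORT_WIDTH_BITS = 72        # assumed width of one access per block
CLOCK_HZ = 300_000_000      # assumed 300 MHz clock

capacity = bram_capacity_bytes(NUM_BLOCKS, 36)
bandwidth = bram_bandwidth_bytes_per_s(NUM_BLOCKS, PORT_WIDTH_BITS, CLOCK_HZ)
print(capacity / 2**20, "MiB")
print(bandwidth / 1e12, "TB/s")
```

With these assumed numbers the chip holds a few megabytes but can read all blocks at once, which is where the multi-TB/s figure comes from.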

Basic FPGA Architecture – Hard Cores
• Some functions are provided as efficient, non-configurable “hard cores”
  o Multi-core ARM cores (“Zynq” series)
  o Multi-Gigabit Transceivers
  o PCIe/Ethernet PHY
  o Memory controllers
  o …

Example Accelerator Card Architecture
• “FPGA Mezzanine Card” (FMC) expansion
  o Network Ports, Memory, Storage, PCIe, …
[Diagram: FPGA connected to DRAM (two banks), 1 GbE, 40 GbE, PCIe, and FMC, via general-purpose I/O pins and multi-gigabit transceivers]

Example Accelerator Card (VCU108)

Programming FPGAs
• Languages and tools overlap with ASIC/VLSI design
• Programming FPGAs for acceleration is typically done with either
  o Hardware Description Languages (HDL): Register-Transfer Level (RTL) languages
  o High-Level Synthesis: Compiler translates software programming languages to RTL
• RTL models a circuit using:
  o Registers (state), and
  o Combinational logic (computation)
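The RTL view of a circuit, registers plus combinational logic, can be mimicked in software as a pure next-state function applied once per clock cycle. The sketch below models a hypothetical 4-bit counter with an enable input. The example circuit and its names are assumptions for illustration.

```python
def next_state(count, enable):
    """Combinational logic: compute the next register value from state and inputs."""
    return (count + 1) % 16 if enable else count


def simulate(enables, initial=0):
    """Register: holds the state and takes the new value once per clock cycle."""
    count = initial
    trace = []
    for enable in enables:
        count = next_state(count, enable)  # models the clock edge
        trace.append(count)
    return trace


print(simulate([1, 1, 0, 1]))
```

In an actual RTL design the same split appears as a register declaration plus the logic that drives its input. The loop here only stands in for the clock.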

Hardware Description Language
• Software programming languages: Describes process

    #include <queue>

    // Exists in memory
    std::queue<float> input_queue;
    std::queue<float> output_queue;
    float factor;

    // Instructions for CPU
    while (true) {
        if (!input_queue.empty()) {
            float ret = input_queue.front() * factor;
            output_queue.push(ret);
            input_queue.pop();
        }
    }

• Hardware description languages: Describes structure

    // Exists on chip
    FIFO#(Float) input_queue <- mkFIFO;
    FIFO#(Float) output_queue <- mkFIFO;
    Reg#(Float) factor <- mkRegU;
    FloatMultIfc mult <- mkFloatMult;

    // Creates circuits
    rule in;
        mult.enq(factor, input_queue.first);
        input_queue.deq;
    endrule
    rule out;
        let ret <- mult.result;
        output_queue.enq(ret);
    endrule
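The behaviour of the two hardware rules can be approximated in software by checking, every cycle, whether each rule is able to fire. The sketch below is a rough model under stated assumptions. The `Multiplier` class and its two-cycle latency are hypothetical, and the rules are evaluated one after another here, whereas on the chip they would fire concurrently.

```python
from collections import deque


class Multiplier:
    """Hypothetical pipelined multiplier: a result is ready `latency` cycles after enq."""

    def __init__(self, latency=2):
        self.latency = latency
        self.pipe = deque()  # entries are [cycles_remaining, product]

    def enq(self, a, b):
        self.pipe.append([self.latency, a * b])

    def result_ready(self):
        return bool(self.pipe) and self.pipe[0][0] <= 0

    def result(self):
        return self.pipe.popleft()[1]

    def tick(self):
        """Advance every in-flight multiplication by one clock cycle."""
        for entry in self.pipe:
            entry[0] -= 1


def run(inputs, factor, cycles):
    input_queue, output_queue = deque(inputs), deque()
    mult = Multiplier()
    for _ in range(cycles):
        # Rule "out": fires whenever the multiplier has a finished product.
        if mult.result_ready():
            output_queue.append(mult.result())
        # Rule "in": fires whenever an input element is available.
        if input_queue:
            mult.enq(factor, input_queue.popleft())
        mult.tick()
    return list(output_queue)


print(run([1.0, 2.0, 3.0], 2.0, 10))
```

Once the pipeline is full, one input enters and one result leaves per cycle. This is the structural parallelism that the hardware description expresses and the sequential C++ loop does not.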

Major Hardware Description Languages
• Verilog: Most widely used in industry
  o Relatively low-level language supported by everyone
• Chisel: Compiles to Verilog
  o Relatively high-level language from Berkeley
  o Embedded in the Scala programming language
  o Prominently used in RISC-V development (Rocket core, etc.)
• Bluespec: Compiles to Verilog
  o Relatively high-level language from MIT
  o Supports types, interfaces, etc.
  o Also used in active RISC-V development (Piccolo, etc.)

High-Level Synthesis
• Compiler translates software programming languages to RTL
• High-Level Synthesis compiler from Xilinx, Altera/Intel
  o Compiles C/C++, annotated with #pragmas, into RTL
  o Theory/history behind it is a complex can of worms we won’t go into
  o Personal experience: needs to be HEAVILY annotated to get performance
  o Anecdote: Naïve RISC-V in Vivado HLS achieves IPC of 0.0002 [1], 0.04 after optimizations [2]
• OpenCL
  o Inherently parallel language more efficiently translated to hardware
  o Stable software interface

[1] http://msyksphinz.hatenablog.com/entry/2019/02/20/040000
[2] http://msyksphinz.hatenablog.com/entry/2019/02/27/040000

FPGA Compilation Toolchain
• High-level HDL code → language compiler (high-level language vendor tool) → Verilog/VHDL
• Verilog/VHDL + constraint file → FPGA vendor toolchain (few open source)
  o Synthesize → Netlist → Map/Place/Route → Bitfile
• Constraint file
  o “Which transceiver instance should top_transceiver_01 map to?”
  o And so, so much more…
• Simulation
  o Functional simulation
  o Cycle-level simulation

Programming/Using an FPGA Accelerator
• Bitfile is programmed to FPGA over “JTAG” interface
  o Typically used over USB cable
  o Supports FPGA programming, limited debugging access, etc.
• PCIe-attached FPGA accelerator card is typically used similarly to GPUs
  o Program FPGA, execute software
  o Software copies data to FPGA board and notifies FPGA → FPGA logic performs computations → Software copies data back from FPGA
• FPGA flexibility gives immense freedom of usage patterns
  o Streaming, coherent memory, …
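The GPU-like usage pattern (copy in, notify, compute, copy back) can be sketched with a mock device. `MockFpgaCard` and all of its methods are hypothetical stand-ins invented for this illustration. They are not a real driver or vendor API, and the "kernel" is just a Python function playing the role of the programmed bitfile.

```python
class MockFpgaCard:
    """Hypothetical stand-in for a PCIe-attached FPGA card (not a real API)."""

    def __init__(self, kernel):
        self.kernel = kernel      # plays the role of the programmed bitfile
        self.board_memory = {}    # plays the role of on-board DRAM

    def copy_to_board(self, name, data):
        self.board_memory[name] = list(data)

    def start(self, src, dst):
        # "Notify FPGA": the FPGA logic processes the data held in board memory.
        self.board_memory[dst] = [self.kernel(v) for v in self.board_memory[src]]

    def copy_from_board(self, name):
        return list(self.board_memory[name])


def offload(card, data):
    """Host-side flow: copy data in, start the computation, copy results back."""
    card.copy_to_board("input", data)
    card.start("input", "output")
    return card.copy_from_board("output")


card = MockFpgaCard(kernel=lambda v: v * 2.0)  # step 1: "program" the FPGA
print(offload(card, [1.0, 2.0, 3.0]))          # step 2: execute software
```

Other usage patterns such as streaming would replace the bulk copies with a continuous flow of data through the FPGA logic.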

Partial Reconfiguration
• Parts of the FPGA (sub-components) can be swapped out dynamically without turning off the FPGA
  o Physical area is drawn on chip
• Used in Amazon F1, etc.
• Toolchain support for isolation

FPGAs In The Cloud
• Amazon EC2 F1 instance (1–4 FPGAs)
• Microsoft Azure, etc…