COE 561 Programmable Logic and Storage Devices
Dr. Aiman H. El-Maleh
Computer Engineering Department
King Fahd University of Petroleum & Minerals
Outline
• History of Computational Fabrics
• ASIC vs. FPGA
• Reconfigurable Logic
• Anti-Fuse-Based Approach (Actel)
• RAM-Based Field Programmable Logic (Xilinx)
• CLBs
• Carry & Control Logic
• FPGA Memory Implementation
History of Computational Fabrics
• Discrete devices: relays, transistors (1940s-50s)
• Discrete logic gates (1950s-60s)
• Integrated circuits (1960s-70s)
  • e.g. TTL packages: data book for 100s of different parts
• Gate arrays (IBM, 1970s)
  • Transistors are pre-placed on the chip, and place-and-route software puts the chip together automatically; only the interconnect is programmed (mask programming)
• Software-based schemes (1970s-present)
  • Run instructions on a general-purpose core
History of Computational Fabrics
• ASIC design (1980s to present)
  • Turn Verilog directly into layout using a library of standard cells
  • Effective for high-volume production and efficient use of silicon area
• Programmable logic (1980s to present)
  • A chip that is reprogrammed after it has been fabricated
  • Examples: PALs, PLAs, EPROM, EEPROM, PLDs, FPGAs
  • Excellent support for mapping from Verilog
What is an FPGA?
• A field-programmable gate array (FPGA) is a reprogrammable silicon chip.
• Using prebuilt logic blocks and programmable routing resources, you can configure these chips to implement custom hardware functionality without ever having to pick up a breadboard or soldering iron.
• You develop digital computing tasks in software and compile them down to a configuration file, or bitstream, that contains information on how the components should be wired together.
ASIC vs. FPGA
ASIC (Application-Specific Integrated Circuit)
• Designs must be sent for expensive and time-consuming fabrication in a semiconductor foundry
• Designed all the way from behavioral description to physical layout
FPGA (Field-Programmable Gate Array)
• Bought off the shelf and reconfigured by designers themselves
• No physical layout design; design ends with a bitstream used to configure a device
ASIC vs. FPGA
ASICs
• High performance
• Low power
• Low cost in high volumes
FPGAs
• Off-the-shelf
• Low development cost
• Short time to market
• Reconfigurability
Other FPGA Advantages
• The manufacturing cycle for an ASIC is very costly and lengthy, and engages a lot of manpower
  • Mistakes not detected at design time have a large impact on development time and cost
  • FPGAs are perfect for rapid prototyping of digital circuits
• Easy upgrades, as with software
• FPGAs provide a flexible platform for implementing digital computing
• A rich set of macros and I/Os is supported (multipliers, block RAMs, ROMs, high-speed I/O)
• A wide range of applications, from prototyping (to validate a design before ASIC mapping) to high-performance spatial computing
How are FPGAs Used?
• Prototyping
  • Ensemble of gate arrays used to emulate a circuit to be manufactured
  • Get more/better/faster debugging done than with simulation
• Reconfigurable hardware
  • One hardware block used to implement more than one function
• Special-purpose computation engines
  • Hardware dedicated to solving one problem (or class of problems)
  • Accelerators attached to general-purpose computers (e.g., in a cell phone!)
FPGA Market
• The FPGA market was valued at USD 5.34 billion in 2016 and is expected to reach USD 9.50 billion by 2023, growing at a CAGR of 8.5% between 2017 and 2023.
• The growing demand for advanced driver-assistance systems (ADAS), the growth of IoT, and the reduction in time-to-market are the key driving factors for the market.
• A third of cloud service providers will be deploying FPGAs in data centers by 2020.
Major FPGA Vendors
SRAM-based FPGAs
• Xilinx, Inc.: 49% of the market
• Intel (Altera Corp.): 40% of the market
• Lattice Semiconductor: 6% of the market
Flash & antifuse FPGAs
• Microsemi (Actel): 4% of the market
• QuickLogic Corp.: 1% of the market
Reconfigurable Logic
Anti-Fuse-Based Approach (Actel)
Actel Logic Module [figure panels: Example Gate Mapping, Combinational Block, S-R Latch]
Actel Routing & Programming
RAM-Based Field Programmable Logic - Xilinx
Xilinx FPGA Families
• Old families
  • XC3000, XC4000, XC5200
  • Old 0.5µm, 0.35µm and 0.25µm technology. Not recommended for modern designs.
• High-performance families
  • Virtex (0.22µm)
  • Virtex-E, Virtex-EM (0.18µm)
  • Virtex-II, Virtex-II Pro (0.13µm)
  • Virtex-4 (0.09µm)
• Low-cost family
  • Spartan/XL: derived from XC4000
  • Spartan-II: derived from Virtex
  • Spartan-IIE: derived from Virtex-E
  • Spartan-3, Spartan-6
FPGA Nomenclature
Device Part Marking
The Xilinx 4000 CLB
Two 4-Input Functions, Registered Output and a Two-Input Function
5-Input Function, Combinational Output
5-Input Functions Implemented Using Two LUTs
LUT Mapping
• An N-LUT is a direct implementation of a truth table: it can realize any function of N inputs.
• An N-LUT requires 2^N storage elements (latches).
• The N inputs select one latch location (like a memory address).
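As a rough illustration of the LUT-as-memory idea, the behavioral Verilog sketch below treats the LUT contents as a 2^N-bit vector indexed by the inputs. The module and parameter names (lut_n, N, INIT) are made up for this example and are not vendor primitives.

```verilog
// Behavioral sketch of an N-input LUT (illustrative names, not a vendor primitive).
module lut_n #(
    parameter N = 4,
    parameter [(1<<N)-1:0] INIT = 16'h8000   // default: 4-input AND (only entry 15 is 1)
)(
    input  wire [N-1:0] in,
    output wire         out
);
    // The 2^N configuration bits form the truth table.
    wire [(1<<N)-1:0] table_bits = INIT;

    // The inputs act as an address that selects one stored bit.
    assign out = table_bits[in];
endmodule
```

With the default INIT the sketch behaves as a 4-input AND; overriding INIT with 16'h6996 would give a 4-input XOR, since bit i of that constant is the parity of i.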
Configuring the CLB as a RAM
Xilinx 4000 Interconnect
Xilinx 4000 Interconnect Details
Xilinx 4000 Flexible IOB
Basic I/O Block Structure
IOB Functionality
• IOB provides the interface between the package pins and CLBs
• Each IOB can work as uni- or bi-directional I/O
• Outputs can be forced into high impedance
• Inputs and outputs can be registered
• Inputs can be delayed
  • advised for high-performance I/O
Additional Features in Modern FPGAs
Spartan-3 Xilinx FPGA Block Diagram
CLB Structure
CLB Slice Structure
• Each slice contains two sets of the following:
  • Four-input LUT
    • Any 4-input logic function,
    • or 16-bit x 1 sync RAM
    • or 16-bit shift register
  • Carry & Control
    • Fast arithmetic logic
    • Multiplier logic
    • Multiplexer logic
  • Storage element
    • Latch or flip-flop
    • Set and reset
    • True or inverted inputs
    • Sync. or async. control
Xilinx Multipurpose LUT (MLUT) [figure label: 16 x 1 ROM (logic)]
5-Input Functions Implemented Using Two LUTs
• One CLB slice can implement any function of 5 inputs
• The logic function is partitioned between two LUTs
• The F5 multiplexer selects the LUT
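The partitioning described here is a Shannon expansion on the fifth input: one 4-input LUT holds the truth table for in[4] = 0, the other for in[4] = 1, and a 2:1 multiplexer picks between them. A minimal behavioral sketch, with hypothetical module and parameter names (func5_two_luts, INIT_F, INIT_G):

```verilog
// Sketch: any 5-input function built from two 4-input LUTs and a 2:1 mux.
// Names are illustrative; this is not a vendor primitive.
module func5_two_luts #(
    parameter [15:0] INIT_F = 16'h0000,  // truth table used when in[4] = 0
    parameter [15:0] INIT_G = 16'h0000   // truth table used when in[4] = 1
)(
    input  wire [4:0] in,
    output wire       out
);
    wire [15:0] f_bits = INIT_F;
    wire [15:0] g_bits = INIT_G;

    // Both LUTs share the four low-order inputs.
    wire f = f_bits[in[3:0]];
    wire g = g_bits[in[3:0]];

    // The fifth input drives the F5-style multiplexer select.
    assign out = in[4] ? g : f;
endmodule
```

For a 32-entry truth table T, INIT_F would hold T[15:0] and INIT_G would hold T[31:16].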
Distributed RAM
• CLB LUT configurable as distributed RAM
  • A LUT equals a 16 x 1 RAM
  • Cascade LUTs to increase RAM size
• Synchronous write
• Synchronous/asynchronous read
  • Accompanying flip-flops used for synchronous read
• Two LUTs can make
  • 32 x 1 single-port RAM
  • 16 x 2 single-port RAM
  • 16 x 1 dual-port RAM
Shift Register
• Each LUT can be configured as a shift register
  • Serial in, serial out
• Dynamically addressable delay of up to 16 cycles
• For programmable pipelines
• Cascade for greater cycle delays
• Use CLB flip-flops to add depth
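A behavioral sketch of such an addressable shift register is shown below. It assumes a 16-bit chain with a clock enable and a 4-bit tap select; the module name srl16_sketch and its port names are illustrative, and whether a given tool maps this description onto a LUT-based shift register is tool-dependent.

```verilog
// Sketch of a 16-stage serial-in, serial-out shift register with a
// dynamically selectable tap (illustrative names).
module srl16_sketch (
    input  wire       clk,
    input  wire       ce,    // shift enable
    input  wire       d,     // serial input
    input  wire [3:0] a,     // tap select: output is d delayed by a+1 cycles
    output wire       q
);
    reg [15:0] sr = 16'h0000;

    // Shift one position per enabled clock edge; newest bit enters at sr[0].
    always @(posedge clk)
        if (ce) sr <= {sr[14:0], d};

    // Changing "a" changes the delay without reloading the register.
    assign q = sr[a];
endmodule
```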
Shift Register
• Register-rich FPGA
  • Allows for addition of pipeline stages to increase throughput
• Data paths must be balanced to keep desired functionality
Carry & Control Logic
Fast Carry Logic
• Each CLB contains separate logic and routing for the fast generation of sum and carry signals
  • Increases efficiency and performance of adders, subtractors, accumulators, comparators, and counters
• Carry logic is independent of normal logic and routing resources
• All major synthesis tools can infer carry logic for arithmetic functions
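As an example of arithmetic that synthesis tools can typically map onto the dedicated carry logic, the sketch below describes an adder with the plain "+" operator instead of instantiating gates. The module name and WIDTH parameter are illustrative, and the actual mapping depends on the tool and target device.

```verilog
// Sketch: an adder written with "+" so the synthesis tool is free to
// infer the fast carry chain (illustrative names).
module adder_sketch #(
    parameter WIDTH = 8
)(
    input  wire [WIDTH-1:0] a,
    input  wire [WIDTH-1:0] b,
    input  wire             cin,
    output wire [WIDTH-1:0] sum,
    output wire             cout
);
    // The (WIDTH+1)-bit left-hand side keeps the carry out of the MSB.
    assign {cout, sum} = a + b + cin;
endmodule
```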
The Virtex-II CLB (Half Slice Shown)
Adder Implementation
Carry Chain
18 x 18 Embedded Multiplier
• Embedded 18-bit x 18-bit multiplier
  • 2's complement signed operation
• Multipliers are organized in columns
• Fast arithmetic functions
  • Optimized to implement multiply/accumulate modules
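A multiply/accumulate module of the kind mentioned here might be described as in the sketch below. The 48-bit accumulator width, the synchronous clear, and all names are assumptions made for the example; whether the multiply lands in an embedded multiplier block depends on the tool and device.

```verilog
// Sketch of a signed 18 x 18 multiply-accumulate (illustrative names;
// the 48-bit accumulator width is an assumption).
module mac18_sketch (
    input  wire               clk,
    input  wire               clr,   // synchronous clear of the accumulator
    input  wire signed [17:0] a,
    input  wire signed [17:0] b,
    output reg  signed [47:0] acc
);
    // 2's complement product of two 18-bit operands needs 36 bits.
    wire signed [35:0] product = a * b;

    always @(posedge clk)
        if (clr) acc <= 48'sd0;
        else     acc <= acc + product;   // product is sign-extended to 48 bits
endmodule
```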
Design Flow - Mapping
• Technology mapping: schematic/HDL to physical logic units
  • Compile functions into basic LUT-based groups (function of target architecture)
Design Flow - Placement & Route
• Placement: assign each logic block a location on a particular device
• Routing: iterative process to connect CLB inputs/outputs and IOBs. Optimizes critical-path delay; can take hours or days for large, dense designs.
• Challenge: the full chip cannot be used at reasonable speeds (wires are not ideal). Typically no more than 50% utilization.
Example: Verilog to FPGA
Memory Types
Single-Port and Dual-Port Distributed RAM
LUT-Based RAMs
LUT-Based RAM Modules
Instantiating LUT-Based RAM Modules
// RAM initialization ("0" by default) for functional simulation:
defparam U_RAM16X1S.INIT = 16'hA2F5; // Addr 0=1; Addr 1=0; Addr 2=1; Addr 3=0

// Distributed RAM instantiation
RAM16X1S U_RAM16X1S (
    .D(),     // insert input signal
    .WE(),    // insert Write Enable signal
    .WCLK(),  // insert Write Clock signal
    .A0(),    // insert Address 0 signal
    .A1(),    // insert Address 1 signal
    .A2(),    // insert Address 2 signal
    .A3(),    // insert Address 3 signal
    .O()      // insert output signal
);
Instantiating LUT-Based RAM Modules
defparam U_RAM16X1D.INIT = 16'h0000;

// Distributed SelectRAM instantiation
RAM16X1D U_RAM16X1D (
    .D(),      // insert input signal
    .WE(),     // insert Write Enable signal
    .WCLK(),   // insert Write Clock signal
    .A0(),     // insert Address 0 signal port SPO
    .A1(),     // insert Address 1 signal port SPO
    .A2(),     // insert Address 2 signal port SPO
    .A3(),     // insert Address 3 signal port SPO
    .DPRA0(),  // insert Address 0 signal port DPO
    .DPRA1(),  // insert Address 1 signal port DPO
    .DPRA2(),  // insert Address 2 signal port DPO
    .DPRA3(),  // insert Address 3 signal port DPO
    .SPO(),    // insert output signal
    .DPO()     // insert output signal
);
Example of Inferred Memory
module Memory_Unit #(parameter word_size=8, address_size=4, memory_size=16)(
    output [word_size-1:0] data_out,
    input  [word_size-1:0] data_in,
    input  [address_size-1:0] address,
    input  clk, write);

    reg [word_size-1:0] memory[memory_size-1:0];

    initial begin
        memory[0]=1; memory[1]=2;
        memory[2]=3; memory[3]=4;
    end

    // Asynchronous read
    assign data_out = memory[address];

    // Synchronous write
    always @ (posedge clk)
        if (write) memory[address] <= data_in;
endmodule

Synthesis report:
Synthesizing Unit <Memory_Unit>.
    Related source file is "Memory.v".
    Found 16x8-bit single-port RAM <Mram_memory> for signal <memory>.
    Summary: inferred 1 RAM(s).
HDL Synthesis Report
Macro Statistics
# RAMs : 1
  16x8-bit single-port distributed RAM : 1
Example of Inferred Memory
module Memory_Unit2 #(parameter word_size=8, address_size=4, memory_size=16)(
    output reg [word_size-1:0] data_out,
    input  [word_size-1:0] data_in,
    input  [address_size-1:0] address,
    input  clk, write);

    reg [word_size-1:0] memory[memory_size-1:0];

    initial begin
        memory[0]=1; memory[1]=2;
        memory[2]=3; memory[3]=4;
    end

    // Synchronous write and registered (synchronous) read
    always @ (posedge clk) begin
        if (write) memory[address] <= data_in;
        data_out <= memory[address];
    end
endmodule

Synthesis report:
Synthesizing Unit <Memory_Unit>.
    Related source file is "Memory.v".
    Found 16x8-bit single-port RAM <Mram_memory> for signal <memory>.
    Found 8-bit register for signal data_out
    Summary: inferred 1 RAM(s). inferred 8 D-type flip-flop(s).
HDL Synthesis Report
Macro Statistics
# RAMs : 1
  16x8-bit single-port distributed RAM : 1
# Registers : 1
  8-bit register : 1
Example of Inferred Memory
module Memory_Unit3 #(parameter word_size=8, address_size=8, memory_size=256)(
    output reg [word_size-1:0] data_out,
    input  [word_size-1:0] data_in,
    input  [address_size-1:0] address,
    input  clk, write);

    reg [word_size-1:0] memory[memory_size-1:0];

    initial begin
        memory[0]=1; memory[1]=2;
        memory[2]=3; memory[3]=4;
    end

    // Synchronous write and registered (synchronous) read
    always @ (posedge clk) begin
        if (write) memory[address] <= data_in;
        data_out <= memory[address];
    end
endmodule

Synthesis report:
Synthesizing Unit <Memory_Unit>.
    Related source file is "Memory.v".
    Found 256x8 single-port RAM <Mram_memory> for signal <memory>.
    Found 8-bit register for signal data_out
    Summary: inferred 1 RAM(s). inferred 8 D-type flip-flop(s).
HDL Synthesis Report
Macro Statistics
# RAMs : 1
  256x8-bit single-port block RAM : 1
Example of Inferred Memory
module Memory_Unit4 #(parameter word_size=8, address_size=8, memory_size=256)(
    output [word_size-1:0] data_out1,
    output [word_size-1:0] data_out2,
    input  [word_size-1:0] data_in,
    input  [address_size-1:0] address1,
    input  [address_size-1:0] address2,
    input  clk, write);

    reg [word_size-1:0] memory[memory_size-1:0];

    // Two asynchronous read ports
    assign data_out1 = memory[address1];
    assign data_out2 = memory[address2];

    // One synchronous write port
    always @ (posedge clk) begin
        if (write) memory[address1] <= data_in;
    end
endmodule

Synthesis report:
Synthesizing Unit <Memory_Unit>.
    Related source file is "Memory.v".
    Found 256x8 dual-port RAM <Mram_memory> for signal <memory>.
    Summary: inferred 1 RAM(s).
HDL Synthesis Report
Macro Statistics
# RAMs : 1
  256x8-bit dual-port distributed RAM : 1
Example of Inferred Memory
module Memory_Unit5 #(parameter word_size=8, address_size=8, memory_size=256)(
    output reg [word_size-1:0] data_out1,
    output reg [word_size-1:0] data_out2,
    input  [word_size-1:0] data_in,
    input  [address_size-1:0] address1,
    input  [address_size-1:0] address2,
    input  clk, write);

    reg [word_size-1:0] memory[memory_size-1:0];

    // Port 1 either writes or reads; port 2 always reads
    always @ (posedge clk) begin
        if (write) memory[address1] <= data_in;
        else data_out1 <= memory[address1];
        data_out2 <= memory[address2];
    end
endmodule

Synthesis report:
Synthesizing Unit <Memory_Unit>.
    Related source file is "Memory.v".
    Found 256x8 dual-port RAM <Mram_memory> for signal <memory>.
    Found 8-bit register for signal <data_out2>.
    Found 8-bit register for signal <data_out1>.
    Summary: inferred 1 RAM(s). inferred 16 D-type flip-flop(s)
HDL Synthesis Report
Macro Statistics
# RAMs : 1
  256x8-bit dual-port block RAM : 1
Block RAM
• Most efficient memory implementation
  • Dedicated blocks of memory
• Ideal for most memory requirements
  • 4 to 104 memory blocks
  • 18 kbits = 18,432 bits per block (16 k without parity bits)
  • Use multiple blocks for larger memories
• Builds both single and true dual-port RAMs
• Synchronous write and read (different from distributed RAM)
Block RAM
• Supports two independent 9 Kb blocks, or a single 18 Kb block RAM.
• Simple or true dual-port mode can be used.
• Simple dual-port mode is defined as having one read-only port and one write-only port with independent clocks.
• 18- or 36-bit wide ports can have an individual write enable per byte. This feature is popular for interfacing to an on-chip microprocessor.
• All inputs are registered with the port clock and have a setup-to-clock timing specification.
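The per-byte write enable can be sketched behaviorally as below. This example assumes a 32-bit data word without parity bits and a 4-bit write-enable vector; the module name, port names, and depth are illustrative, and inference of a byte-enabled block RAM from this style depends on the synthesis tool.

```verilog
// Sketch of a single-port RAM with one write enable per byte
// (illustrative names; 32-bit word without parity is an assumption).
module ram_byte_we_sketch #(
    parameter ADDR_WIDTH = 9
)(
    input  wire                  clk,
    input  wire [3:0]            we,     // we[i] enables writing byte i
    input  wire [ADDR_WIDTH-1:0] addr,
    input  wire [31:0]           din,
    output reg  [31:0]           dout
);
    reg [31:0] mem [0:(1<<ADDR_WIDTH)-1];
    integer i;

    always @(posedge clk) begin
        // Only the enabled bytes of the addressed word are updated.
        for (i = 0; i < 4; i = i + 1)
            if (we[i]) mem[addr][8*i +: 8] <= din[8*i +: 8];
        dout <= mem[addr];   // registered read of the previous contents
    end
endmodule
```

A processor storing a single byte would assert only one bit of we, leaving the other three bytes of the word untouched.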
Block RAM
• A write operation requires one clock edge.
• A read operation requires one clock edge.
• All output ports are latched. The state of the output port does not change until the port executes another read or write operation. The default block RAM output is latch mode.
• The output data path has an optional internal pipeline register. Using the register mode is strongly recommended: it allows a higher clock rate, but adds one clock cycle of latency.
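The effect of the optional output register can be seen in the following behavioral sketch, where a parameter selects between the latched output (data one clock edge after the address) and the pipelined output (one further cycle). The OUTPUT_REG parameter and all other names are hypothetical and only model the latency; they are not the primitive's actual attributes.

```verilog
// Sketch: synchronous-read RAM with an optional output pipeline register
// (illustrative names; models latency only).
module bram_outreg_sketch #(
    parameter OUTPUT_REG = 1,   // 0: latch mode, 1: extra pipeline register
    parameter ADDR_WIDTH = 8,
    parameter DATA_WIDTH = 8
)(
    input  wire                  clk,
    input  wire                  we,
    input  wire [ADDR_WIDTH-1:0] addr,
    input  wire [DATA_WIDTH-1:0] din,
    output wire [DATA_WIDTH-1:0] dout
);
    reg [DATA_WIDTH-1:0] mem [0:(1<<ADDR_WIDTH)-1];
    reg [DATA_WIDTH-1:0] latch_q;  // valid one clock edge after the address
    reg [DATA_WIDTH-1:0] reg_q;    // valid one cycle later still

    always @(posedge clk) begin
        if (we) mem[addr] <= din;
        latch_q <= mem[addr];
        reg_q   <= latch_q;
    end

    assign dout = OUTPUT_REG ? reg_q : latch_q;
endmodule
```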
Block RAM
Block RAM Logic Diagram
Block RAM Data Combinations and ADDR Locations
Block RAM Port Aspect Ratios
Dual-Port Bus Flexibility
• Each port can be configured with a different data bus width
• Provides easy data width conversion without any additional logic
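To make the width-conversion idea concrete, the sketch below models a memory written through an 8-bit port and read through a 16-bit port. It is a behavioral model under several assumptions: a single shared clock, little-endian byte packing (the low byte sits at the even address), and made-up module and port names. It is not a guaranteed inference template for any particular tool.

```verilog
// Sketch: 8-bit write port, 16-bit read port on the same storage
// (illustrative names; byte order and shared clock are assumptions).
module ram_w8_r16_sketch #(
    parameter ADDR_WIDTH_B = 7              // 128 x 16 read side = 256 x 8 write side
)(
    input  wire                    clk,
    input  wire                    we_a,
    input  wire [ADDR_WIDTH_B:0]   addr_a,  // byte address (one extra bit)
    input  wire [7:0]              din_a,
    input  wire [ADDR_WIDTH_B-1:0] addr_b,  // 16-bit word address
    output reg  [15:0]             dout_b
);
    reg [7:0] mem [0:(1<<(ADDR_WIDTH_B+1))-1];

    always @(posedge clk) begin
        if (we_a) mem[addr_a] <= din_a;
        // Two consecutive bytes are read out as one 16-bit word.
        dout_b <= {mem[{addr_b, 1'b1}], mem[{addr_b, 1'b0}]};
    end
endmodule
```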
Simple Dual-Port Mode Allowed Combinations for 9 Kb Block RAM
True Dual-Port Mode Allowed Combinations for 9 Kb Block RAM
18 Kb Block RAM: True Dual-Port Operation
Read & Write Operations
• Read operation
  • In latch mode, the read operation uses one clock edge. The read address is registered on the read port, and the stored data is loaded into the output latches after the RAM access time.
  • When using the output register, the read operation will take one extra latency cycle to arrive at the output.
• Write operation
  • A write operation is a single clock-edge operation. The write address is registered on the write port, and the data input is stored in memory.
Write Modes
• Three settings of the write mode determine the behavior of the data available on the output latches after a write clock edge: WRITE_FIRST, READ_FIRST, and NO_CHANGE.
• The write mode attribute can be individually selected for each port. The default mode is WRITE_FIRST.
• WRITE_FIRST outputs the newly written data onto the output bus.
• READ_FIRST outputs the previously stored data while new data is being written.
• NO_CHANGE maintains the output previously generated by a read operation.
WRITE_FIRST or Transparent Mode (Default)
• In WRITE_FIRST mode, the input data is simultaneously written into memory and stored in the data output (transparent write).
READ_FIRST or Read-Before-Write Mode
• In READ_FIRST mode, data previously stored at the write address appears on the output latches, while the input data is being stored in memory (read before write).
NO_CHANGE Mode
• In NO_CHANGE mode, the output latches remain unchanged during a write operation.
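The three write modes differ only in what the output register receives on a write clock edge. The sketch below puts the three behaviors side by side in one single-port module, selected by a string parameter; the module name, the WRITE_MODE parameter as used here, and the port names are illustrative and model the behavior rather than the block RAM primitive itself.

```verilog
// Sketch: single-port synchronous RAM showing the three write-mode
// behaviors (illustrative names).
module bram_write_modes_sketch #(
    parameter WRITE_MODE = "WRITE_FIRST",  // or "READ_FIRST", "NO_CHANGE"
    parameter ADDR_WIDTH = 8,
    parameter DATA_WIDTH = 8
)(
    input  wire                  clk,
    input  wire                  we,
    input  wire [ADDR_WIDTH-1:0] addr,
    input  wire [DATA_WIDTH-1:0] din,
    output reg  [DATA_WIDTH-1:0] dout
);
    reg [DATA_WIDTH-1:0] mem [0:(1<<ADDR_WIDTH)-1];

    generate
        if (WRITE_MODE == "READ_FIRST") begin : g_read_first
            always @(posedge clk) begin
                if (we) mem[addr] <= din;
                dout <= mem[addr];          // previously stored data
            end
        end else if (WRITE_MODE == "NO_CHANGE") begin : g_no_change
            always @(posedge clk) begin
                if (we) mem[addr] <= din;   // output holds its last read value
                else    dout <= mem[addr];
            end
        end else begin : g_write_first
            always @(posedge clk) begin
                if (we) begin
                    mem[addr] <= din;
                    dout      <= din;       // new data appears on the output
                end else
                    dout <= mem[addr];
            end
        end
    endgenerate
endmodule
```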
Conflict Avoidance
• Block RAM memory is a true dual-port RAM where both ports can access any memory location at any time.
• When accessing the same memory location from both ports, the user must, however, observe certain restrictions.
• There are no timing restrictions when both ports perform a read operation.
• When one port performs a write operation, the other port must not read or write the exact same memory location.
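In simulation, the restriction above can be monitored with a small checker such as the sketch below. It assumes both ports run from one common clock and have enable signals, which simplifies the real situation of independent port clocks; the module and signal names are hypothetical.

```verilog
// Simulation-only sketch: flags cycles in which one port writes while the
// other port accesses the same address (illustrative names; a common
// clock for both ports is an assumption).
module bram_collision_check_sketch #(
    parameter ADDR_WIDTH = 8
)(
    input  wire                  clk,
    input  wire                  en_a,
    input  wire                  we_a,
    input  wire [ADDR_WIDTH-1:0] addr_a,
    input  wire                  en_b,
    input  wire                  we_b,
    input  wire [ADDR_WIDTH-1:0] addr_b,
    output wire                  collision
);
    // Read/read on the same address is allowed; anything involving a write is not.
    assign collision = en_a && en_b && (we_a || we_b) && (addr_a == addr_b);

    always @(posedge clk)
        if (collision)
            $display("%0t: block RAM address collision at address %h", $time, addr_a);
endmodule
```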
Spartan-3 Block RAM Amounts
Spartan-3 FPGA Family Members
Virtex-II 1.5 V Architecture
Virtex-II 1.5 V
Device     CLB Array   Slices   Maximum I/O   BlockRAM (18 kb)   Multiplier Blocks   Distributed RAM bits
XC2V40     8 x 8       256      88            4                  4                   8,192
XC2V80     16 x 8      512      120           8                  8                   16,384
XC2V250    24 x 16     1,536    200           24                 24                  49,152
XC2V500    32 x 24     3,072    264           32                 32                  98,304
XC2V1000   40 x 32     5,120    432           40                 40                  163,840
XC2V1500   48 x 40     7,680    528           48                 48                  245,760
XC2V2000   56 x 48     10,752   624           56                 56                  344,064
XC2V3000   64 x 56     14,336   720           96                 96                  458,752
XC2V4000   80 x 72     23,040   912           120                120                 737,280
XC2V6000   96 x 88     33,792   1,104         144                144                 1,081,344
XC2V8000   112 x 104   46,592   1,108         168                168                 1,490,944
7 Series FPGA Families
CLB Structure
• Two side-by-side slices per CLB
  • Slice_M are memory-capable
  • Slice_L are logic and carry only
• Four 6-input LUTs per slice
  • Consistent with previous architectures
  • Single LUT in Slice_M can be a 32-bit shift register or 64 x 1 RAM
• Two flip-flops per LUT
  • Excellent for heavily pipelined designs
Block RAM
• 36K/18K block RAM
  • All Xilinx 7 series FPGA families use the same block RAM as Virtex-6 FPGAs
• Configurations same as Virtex-6 FPGAs
  • 32K x 1 to 512 x 72 in one 36K block
  • Simple dual-port and true dual-port configurations
  • Built-in FIFO logic
  • 64-bit error correction coding per 36K block
  • Adjacent blocks combine to 64K x 1 without extra logic
DSP Slice
• All 7 series FPGAs share the same DSP slice
  • 25 x 18 multiplier
  • 25-bit pre-adder
  • Flexible pipeline
  • Cascade in and out
  • Carry in and out
  • 96-bit MACC
  • SIMD support
  • 48-bit ALU
  • Pattern detect
  • 17-bit shifter
  • Dynamic operation (cycle by cycle)
Using Core Generator
Single Port BRAM
Dual Port BRAM
Distributed RAM
Memory Initialization
Example of a Dual Port Block Memory .COE file:
; Sample memory initialization file for Dual Port Block
; Memory. This .COE file specifies the contents for a block
; memory of depth=16, and width=4. In this case, values
; are specified in binary format.
memory_initialization_radix=2;
memory_initialization_vector=
1111, 1111, 0000, 0101, 0011, 0000, 1111, 1111, 1111;
Memory Initialization
Example of a Single Port Block Memory .COE file:
; Sample memory initialization file for Single Port Block
; Memory, v3.0 or later. This .COE file specifies
; initialization values for a block memory of depth=16, and
; width=8. In this case, values are specified in hexadecimal
; format.
memory_initialization_radix=16;
memory_initialization_vector=
ff, ab, f0, 11, 00, 01, aa, bb, cc, dd, ef, ee, ff, 00, ff;
Simulating Memory Generated by CoreGen
`timescale 1ns / 1ps
module BramTest;
    reg clka;
    reg [0:0] wea;
    reg [3:0] addra;
    reg [7:0] dina;
    wire [7:0] douta;

    // Unit under test: the CoreGen-generated RAM
    ram uut (
        .clka(clka),
        .wea(wea),
        .addra(addra),
        .dina(dina),
        .douta(douta)
    );

    // Clock with a 100 ns period
    initial begin
        clka = 0;
        forever #50 clka = ~clka;
    end

    initial begin
        // Initialize inputs
        wea = 0; addra = 0; dina = 0;
        // Step through read addresses, one per clock period
        #100 addra=1;
        #100 addra=2;
        #100 addra=3;
        #100 addra=5;
        #100 addra=6;
    end
endmodule