Japanese 2nd Generation Dynamically Reconfigurable Processors
ERSA 2009 Invited Speech
Hideharu Amano, Keio Univ.
Commercial Products using Dynamically Reconfigurable Processors
• SONY PMW EX-1/3 (professional camcorder): NEC electronics' STP engine
• Multifunction printers: IPFlex's DAPDNA-2
• Panasonic's professional camcorder: DFabric
• SONY PSP: VME (Virtual Mobile Engine)
Short history of Dynamically Reconfigurable Processors
[Figure: timeline from 1990 to 2005 (marks at 1990, 1995, 2000, 2005)]
Categories shown: FPGA with Dynamic Reconfiguration; Processor with Reconfigurable Instructions; The 1st Generation; The 2nd Generation
Systems shown: MPLD (Fujitsu), WASMII (Keio), Time Multiplexed FPGA (Xilinx), DFabric (Elixent), DAPDNA/2 (IPFlex), DAPDNA/IMX (IPFlex), Xpp (PACT), DRL (NEC), CS2112 (Chameleon), FE-GA (Hitachi), DRP (NEC elec.), X-bridge (NEC elec.), PipeRench (CMU), Kilocore (Rapport), S-5 (Stretch), S-6 (Stretch), GARP (UCB), CHIMAERA (Northwestern Univ.), DISC (Brigham Young Univ.)
A lot of commercial systems
Most Japanese semiconductor companies have their own projects!

Product            Vendor           Context          Data    PE
D-Fabrix           Panasonic        Deliver          4       Homo
Xpp                PACT             Deliver          24      Homo
S5/S6 engine       Stretch          Deliver          4/8     Hetero
CS2112             Chameleon        Multi-C(8)       16/32   Homo
DAPDNA-2           IPFlex           Multi-C(4)       32      Hetero
DRP-1              NEC electronics  Multi-C(16)      8       Homo
STP-engine         NEC electronics  Multi-C(32)      8       Homo
Kilocore           Rapport          Multi-C          8       Homo
ADRES              IMEC             Multi-C(32)      16      Homo
FE-GA              Hitachi          Multi-C          16      Hetero
For Car-tuners     SANYO            Multi-C(4)       24      Homo
FlexSword (SAKE)   Toshiba          Multi-C(4/16)    16      Homo
Cluster            Fujitsu          Multi-C          16      Hetero
Outline
• Why Dynamically Reconfigurable Processors?
– A solution to recent SoC problems.
• What is a Dynamically Reconfigurable Processor?
– Coarse Grain Structure
– Dynamic Reconfiguration
– C-level programming
• What are the main advantages/limitations?
– Comparison with other architectures
– Low power consumption
• The 2nd generation examples
Why Dynamically Reconfigurable Processors?
A solution to problems on SoC (System-on-a-Chip)
[Figure: SoC block diagram with CPU, I/O, Application Specific Hardware, and Memory]
The SoC is the brain of various IT products, e.g. cellular phones, network controllers, mobile terminals, video cameras, car electronics…
Problem!
• The performance depends on the Application Specific Hardware.
• Various new techniques keep coming up.
• The design/mask cost of leading-edge semiconductor processes has increased greatly.
A powerful but flexible, low power/cost off-load engine is required!
How about using common FPGAs?
[Figure: SoC block diagram with CPU, I/O, Memory, and a common FPGA in place of the Application Specific Hardware]
A common FPGA is flexible.
Xilinx's FPGAs (e.g. Virtex-4/FX) with PowerPC are popularly used. Of course, Altera's are also popular.
But
• A System on a Programmable Device tends to be too expensive and too power consuming for most consumer products.
• These drawbacks come from the static, fine-grained architecture.
What is a Dynamically Reconfigurable Processor?
Flexible accelerators in SoCs
[Figure: SoC block diagram with CPU, I/O, Memory, and a Dynamically Reconfigurable Processor in place of the Application Specific Hardware]
• Coarse Grain Structure → High performance
• Dynamic reconfiguration → High area efficiency
• C-level programming → Easy to design
1. Coarse Grain Structure
An example of PE array: MuCCRA-1 by Keio Univ. (ASSCC 2007)
[Figure: array of PEs, each paired with a switching element (SE), with MULT and MEM modules at the edges]
• Island style like FPGAs
• Various types of array structures are used
An example of PE (Processing Element): PE of MuCCRA-1
[Figure: PE datapath connecting the RFile, ALU, and SMU through input and output selectors]
• 24 bit data, 2 bit carry
• ALU: Add/Sub/Mult/CMP
• SMU: Shift/Mask/Constant
• RFile: Register Files
2. Dynamic Reconfiguration
• The operations of PEs and interconnections are defined by the configuration data stored in the configuration memory, as in FPGAs.
• Changing configuration data dynamically → The data path for various applications can be switched quickly.
• How are configuration data changed?
– High speed delivery from the central configuration memory.
– Multicontext dynamic reconfiguration → One clock dynamic reconfiguration
Quick delivery of instructions/configuration from on-chip memory
[Figure: on-chip memory delivering configuration data to the PE/SE array]
• Delivery in tens of microseconds
• PACT Xpp
• Panasonic (Elixent's) DFabric
Dynamic reconfiguration is done mainly for task switching.
Multicontext
[Figure: a PE/SE whose function is selected by a multiplexer from SRAM slots 1, 2, …, n according to the context; input data enters and output data leaves the PE/SE]
• A number of configuration memory slots are provided.
• They can be switched in a clock → Hardware structure is changed in a clock → Hardware context switching
Practical implementation of multicontext structure
[Figure: a PE or Switching Element configured from the Context Memory entry selected by the Context Pointer]
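The context pointer scheme on the multicontext slides can be sketched in a few lines of Python. This is only an illustrative model, not the circuit of any chip named in the talk: the class names `Context` and `MulticontextPE`, the three example operations, and the choice of a 24-bit datapath (borrowed from the MuCCRA-1 PE slide) are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Context:
    """One configuration memory slot: the operation the PE performs (hypothetical model)."""
    name: str
    op: Callable[[int, int], int]


class MulticontextPE:
    """A PE whose behavior is selected by a context pointer into its context memory."""

    def __init__(self, contexts: List[Context], width: int = 24):
        self.contexts = contexts          # context memory (all slots preloaded)
        self.pointer = 0                  # context pointer
        self.mask = (1 << width) - 1      # assumed datapath width

    def switch(self, index: int) -> None:
        """Model of a one-clock context switch: only the pointer changes."""
        if not 0 <= index < len(self.contexts):
            raise IndexError("context index out of range")
        self.pointer = index

    def execute(self, a: int, b: int) -> int:
        """Apply the operation of the currently selected context."""
        return self.contexts[self.pointer].op(a, b) & self.mask


pe = MulticontextPE([
    Context("add", lambda a, b: a + b),
    Context("sub", lambda a, b: a - b),
    Context("mul", lambda a, b: a * b),
])

results = []
for ctx in range(3):
    pe.switch(ctx)                        # no configuration data is moved here
    results.append(pe.execute(6, 3))
print(results)
```

The point of the model is that `switch` touches nothing but the pointer, which is why a multicontext device can change its hardware structure in a single clock, while delivery-type devices must move configuration data first.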
3. C-level programming
• The programming environment is a mixture of a traditional C compiler and an FPGA design tool.
• The C code is divided into data flow and control.
• The assignment of the contexts, PEs, and memory modules can be done automatically.
• The place-and-route sometimes takes a long time, as in FPGA design.
• The programming is easy only if the data to be processed can be mapped onto the memory modules.
Example: DRP Compiler (NEC)
• Compiling C source code (Behavioral Description Language, BDL) into DRP object code
• High level synthesis: generates finite state machines (FSMs) and associated datapath planes
– The ASIC behavioral design tool Cyber is modified and used.
• Mapper: maps FSMs and datapath planes to the STC (State Transition Controller) and PEs, respectively
• Place & Router: physically locates the PEs, memories, and the interconnections between them
Flow: C Source Code → High Level Synthesis (FSM, Datapath) → Technology Mapper → Place & Router → Code Generation → Object Code
Dynamically Reconfigurable Processors vs. other architectures
vs. Multi-core/Many-core architectures
– No instruction fetch/cache mechanism
– Less flexible but much smaller area → 16 PEs in 1.5 mm square/90 nm (MuCCRA-2)
vs. SIMD (Single Instruction Stream, Multiple Data Streams)
– The operations and interconnections can be customized for each PE and SE. → Efficient for complicated algorithms.
– The number of instructions/contexts is small
vs. VLIW (Very Long Instruction Word)
– A larger degree of parallelism can be utilized. → Higher performance can be obtained.
– The number of instructions/contexts is small
MuCCRA-2 Floor Plan
• ASPLA's 90 nm
• 2.5 mm × 2.5 mm (Core: 1.5 mm × 1.5 mm)
• The total PE array (16 PEs) < one PE of recent multi/many-core processors
Granularity vs. Num. of Cores vs. Num. of HW-contexts
[Figure: three-axis chart. Axes: granularity of core (4 bit to 32 bit), number of cores (up to 1000), number of HW-contexts (1, 3, 8, 16, 32, Many). Plotted: FPGA, FPGA extension, Common Processor, VLIW, Multi-Core processor, and the Dynamically Reconfigurable Processors DRL, Xpp, DFabric, DRP, Xbridge, CS2112, FE-GA, DAPDNA-2]
Main Advantage: Low power consumption
Why low power?
1. No redundant hardware
– There are no instruction fetch mechanisms, caches, TLBs, etc. → Of course, it cannot be a general purpose engine, but it is enough for an accelerator.
– A bare datapath works only for computation.
2. Parallel execution with a number of PEs
– A much lower clock frequency can be used to achieve the same performance as other architectures.
– The main problem is leakage power, but it can be suppressed by power gating techniques.
10× more energy efficient than DSPs, 5-50× more than FPGAs. Sometimes similar to hardwired logic.
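The slide does not spell out why a lower clock frequency saves power. One common textbook argument, assumed here rather than taken from the talk, is that a slower clock permits a lower supply voltage, and CMOS dynamic power scales as activity × C × Vdd² × f. The sketch below compares one unit at full clock with 16 units at 1/16 of the clock; every number in it (activity, capacitance, voltages, frequency) is hypothetical.

```python
def dynamic_power(activity: float, capacitance: float, vdd: float, freq: float) -> float:
    """Textbook CMOS dynamic power: P = activity * C * Vdd^2 * f."""
    return activity * capacitance * vdd ** 2 * freq


# Hypothetical operating points, chosen only for illustration.
N_PES = 16            # number of parallel PEs
ACTIVITY = 0.2        # switching activity factor
CAP = 1e-9            # switched capacitance per unit, in farads
F_SERIAL = 400e6      # clock of the single fast unit, in Hz

# One unit at full clock and nominal voltage.
serial = dynamic_power(ACTIVITY, CAP, 1.2, F_SERIAL)

# N units at 1/N of the clock (same throughput), with an assumed lower voltage.
parallel = N_PES * dynamic_power(ACTIVITY, CAP, 0.9, F_SERIAL / N_PES)

ratio = parallel / serial
print(f"parallel / serial dynamic power = {ratio:.4f}")
```

Under these assumptions the frequency terms cancel and the ratio reduces to (0.9 / 1.2)², about 0.56. The saving in this model comes entirely from the assumed voltage reduction, and leakage, which the slide names as the main remaining problem, is not modeled at all.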
Energy consumption (nJ)
The comparison using 0.18 µm implementation
[Figure: energy consumption comparison chart]
The main limitations as an accelerator in SoCs
• The data must be stored in the memory modules placed around the PE array.
– If the data exceed the memory capacity, they are hard to handle.
• If more contexts are required than the context memory can hold, the operational speed is much degraded.
– The virtual hardware mechanism is provided, but it has certain limitations.
• The performance is not improved much for problems without parallelism.
Dynamically Reconfigurable Processors: The 2nd generation
• Customized for a specific target application area
– SANYO car tuner → Tuner
– Fujitsu → Wireless communication
– Toshiba SAKE → Multi-media
– NEC electronics X-bridge → Multi-media
• Multi-core structure with small PE arrays rather than a big array
– Cooperation with various types of cores
• Integrated design environment
• Low power design → The main advantage!
X-bridge: NEC electronics (2008)
From invited talk in Design Gaia 2008
[Figure: X-bridge block diagram. Labeled blocks include a MIPS CPU (I-C, D-C), INTC, DMA, Work RAM, PCIexp (1-lane), Periph I/F, 10/100 Ether MAC, PCI Host/Target, DDR2 SDRAM CTR, general ports (UART, CSI, GPIO, JTAG), a 64 bit on-chip bus (266 MHz), a 64 bit memory switch (266 MHz), and the STP Engine with SPL and DMA units]
• STP Engine: dynamically reconfigurable core, 256 PEs (8 bit), 32 contexts
• Provides the virtual hardware mechanism
• The DMA controller hides the communication overhead
Mixture of SIMD and DRP units: Toshiba's FlexSword
From FPT 2007 Tutorial session
[Figure: architecture block diagram. Labeled blocks: Host Processor, Host I/F (code, data), System Memory, I/O Buffer (Data RAM), Code Buffer (Code RAM), Write Control, Formatter 0, Formatter 1, Inter-Unit Buffer (Data Registers), SIMD Units, AUX 0, AUX 1]
• Dynamically reconfigurable units optimized for stream processing (independently controlled)
The Architecture (Formatter)
From FPT 2007 Tutorial session
[Figure: Formatter datapath. Labeled blocks: Cfg Controller, Xbar In, Shuffle, PE (16-bit ALU × 8), PE Cfg. Mem, Code Mem, Xbar Out]
Simple hardware
• Pipeline registers only
• No intra-PE data transfer
• PE: 4 cfgs, Xbar: 16 cfgs
• ALU, shift & absolute ops only
Suitable for butterfly operations
Xbar In: Formatter 0 only; Xbar Out: Formatter 1 only
SANYO's Car tuner DRP
[Figure: block diagram. Labeled blocks: ALU array, command memory, sequencer, main memory; signals: In, Out, Feedback]
Fine carrier frequency offset estimation/correction
[Figure: mapping of the I/Q data flow onto Clusters 0 to 6 in three configurations]
a) Fine carrier frequency offset estimation for LT1
b) Fine carrier frequency offset estimation for LT2
c) Fine carrier frequency offset correction for SIGNAL and DATA
Operations labeled in the figure: self-correlation, phase offset calculation (DIV, ATAN), correction offset calculation in phase polar, complex multiply, data out control & clip, output to FFT
Hitachi's FE-GA
[Figure: FE-GA block diagram. Labeled blocks: Computational Cell Array (ALU and MLT cells), Load/Store Cells (LS), Local Memory (MEM), Crossbar Network, Sequence Manager, Configuration Manager, Bus Interface, I/O port, Interrupt/DMA request]
Heterogeneous Multi-Core using FE-GA
[Figure: multi-core block diagram. Labeled blocks: CPU 0 to CPU 3 (SH-4), DRP 0 to DRP 3 (FE-GA), DTU, LPM, LDM, FVR, DSM, Network Interface, On-Chip CSM]
The codes are generated by a parallelizing compiler and standard APIs.
Summary
• The 2nd generation Dynamically Reconfigurable Processors are going to be embedded into consumer electronics products.
• The main advantage is low power consumption.
• The main limitation is data memory → limited to a kind of stream computing.
• Especially active in Japan
– Major Japanese consumer electronics companies are all trying to develop such systems.
Thank you!
A part of our own project will be presented in the later sessions.
Yes. Japanese Culture Loves Dynamic Reconfiguration!
PE architecture
• Simple structure
• Up to 4 instructions can be executed in parallel
[Figure: PE block diagram. Labeled blocks: Input Switch, Delay Adjustment, ALU (Arithmetic, Logical, Flow Control), SFT (Shift), THR (Data Control), Output Switch, Transfer Register (TREG), Configuration Register (×4); 8-bit × 4 data with valid bit and 1-bit × 4 control bus; links from/to the upper, lower, left, and right cells]
DRP Programming
[Figure: data input flowing through contexts 0 to 5 to data output]
1. Context switching
2. Parallel processing in a context
3. Sequential execution in a context
• The context is controlled with a state machine.
• 3-dimensional flexibility.
• The functional optimizer works efficiently.
→ Efficient C-level programming
Time multiplexed execution
[Figure: target hardware mapped onto smaller real hardware]
• A single task can be executed with multiple contexts.
• Area becomes 1/n, but performance also becomes 1/n.
Time multiplexed execution
[Figure: target hardware vs. real hardware]
In most cases, only a part of the hardware is working at any time. → Area efficiency is improved!
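The 1/n trade-off of time multiplexed execution can be written down as a tiny model. The function name `time_multiplex` and its idealizing assumptions (every context takes one clock, and the target hardware divides evenly into fully busy slices) are mine, not the talk's; the second slide's point is precisely that real designs are rarely fully busy, so the real performance loss is often smaller than this model predicts.

```python
import math


def time_multiplex(target_units: int, real_units: int):
    """Idealized trade-off when target hardware is folded onto fewer real units.

    Returns (contexts needed, relative area, relative performance), assuming
    one clock per context and that every unit is busy in every context.
    """
    contexts = math.ceil(target_units / real_units)   # slices executed in turn
    relative_area = real_units / target_units         # silicon actually built
    relative_performance = 1 / contexts               # one result every `contexts` clocks
    return contexts, relative_area, relative_performance


full = time_multiplex(16, 16)     # spatial mapping: everything built
folded = time_multiplex(16, 4)    # the same design on a quarter of the hardware
print(full, folded)
```

With 16 target units on 4 real units the model gives 4 contexts, a quarter of the area, and a quarter of the performance, which is the n = 4 case of the slide's statement.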
A wide research field of reconfigurable architectures
• Two major extremes of multiple-core architectures:
– FPGAs
  • Fine-grained multiple-core architectures with a huge number of cores
  • Basically static: 1 hardware context
– Many-core processors
  • Very coarse-grained multiple-core architectures
  • Fully programmable: infinite number of hardware contexts
• Between them: a WIDE RESEARCH FIELD
Our environment for architectural exploration
MuCCRA array design environment [FPL 07]
• Inputs: application programs, architecture parameters, template library, CMOS standard cell library
• Retargetable compiler: Black Diamond
• DRPA Verilog-HDL generator → Verilog HDL description, test bench and test vector
• Logic synthesis: Synopsys Design Compiler → netlist
• Placement and routing: Synopsys Astro → GDSII, netlist
• RTL/Net/Chip simulation: Cadence NC-Verilog
• Timing analysis: Synopsys PrimeTime
Extremely Low Power Design
• Now the major benefit of Dynamically Reconfigurable Processors
– 1/8 to 1/10 compared with DSPs [ASSCC 07]
– The main reason why SONY uses VME (Virtual Mobile Engine) in the PSP (PlayStation Portable) and X-bridge in professional video systems.
• Applying traditional techniques/Reducing the overhead of context switching [FPL 08]
– Operand isolation is quite effective
• Context oriented voltage control [Schweizer: FPT 07]
• Fine-grained power gating [FPT 08 Poster]
• Dual Vth
Network on Chips for reconfigurable systems
• For inner-core connection
– Island style/direct interconnection
– A new style of interconnection?
• For inter-core connection
– Could a network similar to those of many-core systems be used?
• Three dimensional/Wireless
– A new possibility
3 Dimensional wireless connected dies: MuCCRA-Cube
• Each plane corresponds to an array like MuCCRA-2 (4 × 4 PEs)
• 4 planes are connected with a very high speed inductive wireless interconnection (3 Gbit/sec per channel)
• Planes are connected in the flipped direction
• 16 channels are provided in the 3-D direction
[Figure: stacked planes showing the direction of planes and the channels]
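As a quick piece of arithmetic on the figures above: if all 16 channels in the 3-D direction run at the quoted 3 Gbit/sec at the same time, the peak aggregate rate follows directly. Whether the 16 channels are counted per pair of adjacent planes or for the whole stack is not stated on the slide, so the result below is only an upper bound under that simultaneous-use assumption; the function name `aggregate_gbps` is hypothetical.

```python
CHANNEL_RATE_GBPS = 3   # per channel, from the slide
CHANNELS = 16           # channels in the 3-D direction, from the slide


def aggregate_gbps(channels: int = CHANNELS, rate: float = CHANNEL_RATE_GBPS) -> float:
    """Peak aggregate rate if every channel transfers simultaneously (assumption)."""
    return channels * rate


total = aggregate_gbps()
print(total, "Gbit/sec peak, i.e.", total / 8, "GByte/sec")
```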
MuCCRA-Cube Prototype
• STARC/ASPLA 90 nm
• 2.5 mm × 5 mm die
• Verilog-HDL is used for design
• Synthesis: Synopsys Design Compiler 2006.06-SP2
• Place & Route: Synopsys Astro 2007.03 SP3
• Simulation: Cadence Verilog-XL 5.7
[Figure: die layout. Labeled blocks: Transceiver (Data), Transceiver (CLK), CSC, TCC, PE/SE, DATA MEM]
Summary
• There is a wide field for architectural exploration between FPGAs and many-core processors
• Keywords
– Application Configurable
– Low power Techniques
– Interconnection Networks including Three dimensional/Wireless
– Integrated Design Environment
IMEC ADRES
[Figure: ADRES block diagram. VLIW view: Instruction Fetch, Instruction Dispatch, Instruction Decode, Data Cache, RF, and a row of FUs. Reconfigurable array view: an array of FUs, each with an RF]
Rapport Kilocore
[Figure: Kilocore block diagram. Labeled blocks: Input Controller, Configuration Controller, Output Controller, and a fabric of 16 PEs × 16 PEs organized in stripes of PEs with interconnect; bus widths of 32, 128, and 672 bits are marked]
Stretch S5 engine
[Figure: S5 block diagram. Labeled blocks: Inst Cache, Data Cache, MMU, Inst Unit, Load/Store Unit, FR, AR, WR, FPU (FP Unit), ALU (Integer Unit), ISEF (Extension Unit)]
Implementation Screen Shot of Context Menu
Implementation Screen Shot of Code Menu Library Function User Function
Implementation Screen Shot of Pointer Output Input
Implementation Screen Shot of MuCCRA-1
[Figure labels: Multiply Modules, PE, Switching Element, Memory Modules]
Implementation Screen Shot of MuCCRA-2
[Figure labels: PE, Switching Element, Memory Modules]
DRP Tile structure
[Figure: tile layout. Labeled blocks: PE array, State Transition Controller, VMEM, VMEM ctrl, HMEM]
• VMEM: 2 port, 8 bit × 256 entries
• HMEM: 1 port, 8 bit × 8K entries
Task and context control in MuCCRA [FPL 08]
[Figure: a PE (ALU, RFile, SMU) with a Context Memory of entries 0 to 63 selected by the Context Pointer; the CSC (Context Switching Controller) issues control signals to the MuCCRA PE array; the TCC (Task Configuration Controller) loads configuration data (contexts) from the Configuration Data Memory, which holds target tasks A, B, C, D]
• Context control
– Multicontext switching with a Context Pointer
• Task control
– Multiple tasks, each consisting of multiple contexts, are loaded from the centralized memory
– A virtual hardware mechanism
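The virtual hardware mechanism described above can be modeled as a small context memory into which whole tasks are loaded on demand. The sketch below is a guess at the general idea only: the 64 slots match the entries 0 to 63 in the figure, but the class `ContextMemory`, the oldest-first replacement policy, and the context counts of tasks A to D are all assumptions, not details from [FPL 08].

```python
class ContextMemory:
    """Toy model of on-demand task loading into a fixed-size context memory."""

    def __init__(self, slots: int = 64):
        self.slots = slots
        self.resident = {}     # task name -> contexts held (dict keeps load order)
        self.loads = 0         # total contexts fetched from the central memory

    def used(self) -> int:
        return sum(self.resident.values())

    def activate(self, task: str, n_contexts: int) -> bool:
        """Make a task resident; return True if it had to be loaded."""
        if n_contexts > self.slots:
            raise ValueError("task does not fit in the context memory")
        if task in self.resident:
            return False                       # already loaded: switch immediately
        while self.used() + n_contexts > self.slots:
            oldest = next(iter(self.resident)) # assumed policy: evict oldest task
            del self.resident[oldest]
        self.resident[task] = n_contexts
        self.loads += n_contexts               # loading cost grows with contexts moved
        return True


# Hypothetical context counts for the tasks A to D shown in the figure.
sizes = {"A": 30, "B": 30, "C": 20, "D": 10}
cm = ContextMemory()
history = [cm.activate(t, sizes[t]) for t in ["A", "B", "C", "A", "D", "A"]]
print(history, cm.loads, list(cm.resident))
```

In this run task A is evicted to make room for C and must be reloaded later, which illustrates the limitation noted earlier in the talk: once the required contexts exceed the context memory, time is spent moving configuration data instead of computing.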
Granularity vs. Num. of Cores vs. Num. of HW-contexts
[Figure: three-axis chart. Axes: granularity of core (4 bit to 32 bit), number of cores (up to 1000), number of HW-contexts (1, 3, 8, 16, 32, Many). Plotted: FPGA, Common Processor, VLIW, Multi-Core processor, DRL, DRP, Xbridge, CS2112, FE-GA, DAPDNA-2]