- Slides: 34
Automated Design of Custom Architecture
Tulika Mitra
http://www.comp.nus.edu.sg/~tulika
Motivation
- An embedded system is designed for a specific application or a class of applications
- Design a processor for an application domain
  - Processor ISA and micro-architecture optimized for that application domain
  - Multiple optimization criteria: time, power, cost, ...
- Stringent time-to-market constraint
- Design of a domain-specific processor should be (semi-)automated
Architecture Synthesis
[Figure: Application plus power, size, performance, and timing constraints -> Automatic Tool -> Customized Architecture]
Tensilica: Xtensa Architecture
[Figure. Copyright: Tensilica]
Design Framework
[Figure. Copyright: Tensilica]
Design Framework: Key Steps
- Instantiate parameters for the core processor to optimize performance, power, or cost
- Identify useful and feasible ISA extensions
- Implement the domain-specific processor
- Implement compilers, assembler, simulator, debugger, ...
Silicon Choices
- ASIC implementation of processor
  - Fast but not flexible
  - Time-intensive design process
  - Example: Tensilica, HP-STMicroelectronics
- Processor core in ASIC but instruction set extensions in reconfigurable logic
  - Medium speed but flexible
  - Fast design process
  - Example: Triscend Configurable System-on-Chip
Triscend Configurable SoC
[Figure. Copyright: Triscend]
Reconfigurable Computing 101
- Higher performance than software with a higher level of flexibility than hardware
- e.g. Field Programmable Gate Arrays (FPGA)
  - Logic blocks
    - Array of computational elements whose functionality is determined through multiple SRAM configuration bits
  - Interconnection
    - Logic blocks are connected using programmable routing resources
  - Any custom circuit can be mapped to an FPGA by computing logic functions within logic blocks and using configurable routing to connect the logic blocks together
- Dynamically reconfigurable logic
  - Logic reconfiguration during application execution
  - Temporal partitioning of software reduces logic area
  - Overhead for reconfiguration
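
To make the "functionality determined through SRAM configuration bits" point concrete, here is a minimal Python sketch of a lookup-table style logic block. It assumes the common LUT organization in which a k-input block stores 2**k configuration bits. The class name `LUT` and the helper `half_adder` are hypothetical names chosen for illustration and do not come from the slides.

```python
class LUT:
    """Hypothetical k-input logic block: 2**k configuration bits, one output bit."""

    def __init__(self, k, config_bits):
        if len(config_bits) != 2 ** k:
            raise ValueError("a k-input LUT needs 2**k configuration bits")
        self.k = k
        self.config_bits = list(config_bits)

    def evaluate(self, *inputs):
        if len(inputs) != self.k:
            raise ValueError("wrong number of inputs")
        # The input bits form an address into the configuration memory;
        # the first input is the most significant address bit.
        address = 0
        for bit in inputs:
            address = (address << 1) | (bit & 1)
        return self.config_bits[address]


# Two logic blocks configured as XOR and AND.
xor2 = LUT(2, [0, 1, 1, 0])
and2 = LUT(2, [0, 0, 0, 1])


def half_adder(a, b):
    """'Routing' is modeled by ordinary function calls: returns (sum, carry)."""
    return xor2.evaluate(a, b), and2.evaluate(a, b)
```

Rewriting `xor2.config_bits` changes the function the block computes without changing the block itself, which is the sense in which the logic is reconfigurable.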
Use of Reconfigurable Computing
- Two choices
  - Map both control and datapath to RC
  - Map only datapath to RC
- Granularity of reconfigurable logic
  - Bit
  - Multiple bits
  - ALU
RC Coupled to I/O System Bus
[Figure: CPU, memory, local bus, PCI bus, and I/O, with the RC attached to the I/O (PCI) bus]
- Most common form of commercial RC
- Overhead of data transfer between CPU and RC
- Requires large granularity of computation on RC
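
The link between transfer overhead and granularity can be sketched with a simple cost model. The function below is an illustrative assumption, not a model taken from the slides: it charges a fixed bus latency plus a bandwidth-limited transfer time on top of the RC compute time. All parameter names and the numbers in the usage lines are hypothetical.

```python
def offload_speedup(sw_time, rc_time, bytes_moved, bus_bandwidth, bus_latency):
    """Speedup of running a kernel on the RC instead of in software.

    Times are in seconds, bus_bandwidth in bytes per second. The transfer
    cost model (latency + bytes / bandwidth) is a simplifying assumption.
    """
    transfer_time = bus_latency + bytes_moved / bus_bandwidth
    return sw_time / (rc_time + transfer_time)


# Hypothetical numbers: a coarse-grained kernel amortizes the bus overhead...
large_kernel = offload_speedup(1e-3, 1e-4, 4096, 100e6, 10e-6)
# ...while a fine-grained kernel is dominated by it.
small_kernel = offload_speedup(1e-5, 1e-6, 64, 100e6, 10e-6)
```

Under these assumed numbers the large kernel still gains from offloading, whereas the small kernel runs slower on the RC than in software, even though the RC computes it ten times faster.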
RC Coupled to Local Bus
[Figure: CPU, memory, local bus, PCI bus, and I/O, with the RC attached to the local bus]
- Pilchard from Chinese University of Hong Kong
- Still requires large granularity of computation
PICO Architecture
[Figure. Copyright: Bob Rau et al.]
Design Framework
[Figure. Copyright: Bob Rau et al.]
Design Flow
[Figure. Copyright: Bob Rau et al.]
Hardware/Software Co-design
- Well-studied problem. Then what's new?
- High Level Synthesis (HLS)
  - Time-to-market constraint forces automated generation of the reconfigurable bitstream from a high-level specification or software
  - Automated generation of interface
- Spatial and temporal partitioning
  - Partitioning among multiple configurable devices
  - Map a function that exceeds the available space of the reconfigurable device using time sharing
- Requires new compilation techniques
High Level Synthesis-1
- High-level hardware description language
- Start from a software programming language and add support for
  - Parallelism via threads
  - Message passing
  - Examples: Handel-C, SystemC
- Make current HDL more abstract
  - Superlog, SystemVerilog
- Still requires the user to find parallelism
High Level Synthesis-2
- Combine research in two different fields: compiler and design automation
- Traditional HLS techniques target ASIC implementation
  - RC does not have the layout freedom
  - Objective of RC is to minimize execution time
- Temporal partitioning if insufficient area
- Hardware library of operators or structures commonly used by software programs
High Level Synthesis-3
- Concentrate on loops
- Leverage parallelizing compiler technology combined with high level synthesis
  - Parallelize computation
  - Optimize external memory access
  - Loop transformation: area versus performance
    - Unroll and jam
    - Loop unrolling
    - Software pipelining
    - Loop-invariant code motion
    - Data layout
- Hardware-specific optimizations
  - Bitwidth reduction
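
As an illustration of the area versus performance trade-off in loop transformation, the sketch below applies unroll and jam to a matrix-vector product. It is written in Python purely to show the transformation; the function names are hypothetical and the kernel is an assumed example, not one from the slides.

```python
def matvec(a, x):
    """Reference loop nest: y[i] = sum over j of a[i][j] * x[j]."""
    rows, cols = len(a), len(x)
    y = [0] * rows
    for i in range(rows):
        for j in range(cols):
            y[i] += a[i][j] * x[j]
    return y


def matvec_unroll_and_jam(a, x):
    """Outer loop unrolled by 2, with the two inner-loop copies fused (jammed)."""
    rows, cols = len(a), len(x)
    y = [0] * rows
    i = 0
    while i + 1 < rows:
        for j in range(cols):
            xj = x[j]  # fetched once, shared by both multiply-accumulates
            y[i] += a[i][j] * xj
            y[i + 1] += a[i + 1][j] * xj
        i += 2
    if i < rows:  # leftover row when the row count is odd
        for j in range(cols):
            y[i] += a[i][j] * x[j]
    return y
```

In a hardware mapping, the jammed body would need two multiply-accumulate units instead of one (more area) but would halve the number of inner-loop iterations and the reads of `x` (better performance).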
RC Coupled to CPU as Coprocessor
[Figure: CPU, memory, local bus, PCI bus, and I/O, with the RC attached to the CPU as a coprocessor]
- Tight coupling between CPU and RC
- RC can execute ISA extensions
- CPU and RC cannot share register file
RC Integrated in Processor Datapath
[Figure: CPU, memory, local bus, PCI bus, and I/O, with the RC placed inside the CPU datapath]
- Tightest coupling between CPU and RC
- RC implements custom functional units for ISA extensions
- CPU and RC share register file
Custom Functional Unit
[Figure: register file feeding ADD, MUL, LD/ST, and CFU 1 units; LD/ST connects to memory]
- Typically no restriction on the number of inputs and outputs for a CFU
Changing Role of Compiler
- Standard compiler generates code for a fixed ISA and micro-architecture
- Retargetable compiler accepts ISA + micro-architecture as input and generates code
- Compiler for domain-specific processor
  - First search and define the optimal ISA
  - Generate code for the optimal ISA
- Defining the optimal ISA requires hardware knowledge but is more similar to traditional s/w compiler analysis than h/w synthesis
- This process will work for ASIC as well, but
  - No dynamic reconfigurability
  - Different choice of ISA due to speed difference
Instruction Set Extension
[Figure: data flow graph over operands a, b, c, d with +, x, and - operations; node labels m, n]
FU(i1, i2, i3) = (i1 + i2) x i3
- Two options to identify instruction set extensions
  - Static data flow graph (DFG)
  - Dynamic execution trace
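
The effect of an extension such as FU(i1, i2, i3) = (i1 + i2) x i3 on generated code can be sketched as a small rewrite over three-address instructions. This is a minimal sketch under stated assumptions: code is straight-line and in single-assignment form, each instruction is a hypothetical `(dest, op, sources)` tuple, and the opcode names `add`, `mul`, and `fu` are invented for the example.

```python
from collections import Counter


def fuse_add_mul(code):
    """Replace `t = add a, b; d = mul t, c` with `d = fu a, b, c`.

    Assumes single-assignment, straight-line code and that the add result
    has exactly one use (so it is not needed anywhere else).
    """
    uses = Counter(src for _, _, srcs in code for src in srcs)
    producers = {dest: (op, srcs) for dest, op, srcs in code}
    rewritten, dead = [], set()
    for dest, op, srcs in code:
        fused = None
        if op == "mul":
            for k, src in enumerate(srcs):
                producer = producers.get(src)
                if producer and producer[0] == "add" and uses[src] == 1:
                    # FU operands: the two add inputs, then the other mul input.
                    fused = (dest, "fu", producer[1] + (srcs[1 - k],))
                    dead.add(src)
                    break
        rewritten.append(fused if fused else (dest, op, srcs))
    # Drop the add instructions that were folded into an FU instruction.
    return [ins for ins in rewritten if ins[0] not in dead]


program = [
    ("t1", "add", ("a", "b")),
    ("m", "mul", ("t1", "c")),
    ("n", "sub", ("m", "d")),
]
extended = fuse_add_mul(program)
```

Here the add and the multiply collapse into a single `fu` instruction, so the three-instruction sequence becomes two instructions.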
Static Data Flow Graph
- Identify a special sub-graph called MaxMISO to be collapsed into an instruction set extension
- MaxMISO: a maximal multiple input single output sub-graph
[Figure: two candidate sub-graphs over +, <<, >>, and | operations, one labeled Correct and one labeled Incorrect]
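
One way to enumerate such sub-graphs is sketched below. It assumes the usual reading of the definition: starting from a root operation, a producer is absorbed only when every consumer of its result is already inside the sub-graph, so the group keeps a single output. The DFG encoding (a dictionary from each operation to its operand names) and the node names are hypothetical; `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def max_misos(dfg):
    """Partition a DFG into maximal multiple-input single-output groups.

    dfg maps each operation node to the list of its operand names; names
    that are not keys of dfg are primary inputs or constants.
    """
    consumers = {node: set() for node in dfg}
    for node, operands in dfg.items():
        for operand in operands:
            if operand in dfg:
                consumers[operand].add(node)

    # static_order() lists producers before consumers; reverse it so that
    # every node is visited after all nodes that consume its result.
    producers_first = [n for n in TopologicalSorter(dfg).static_order() if n in dfg]
    assigned, groups = set(), []
    for root in reversed(producers_first):
        if root in assigned:
            continue
        group, worklist = {root}, [root]
        while worklist:
            node = worklist.pop()
            for operand in dfg[node]:
                if (operand in dfg and operand not in group
                        and operand not in assigned
                        and consumers[operand] <= group):
                    group.add(operand)
                    worklist.append(operand)
        assigned |= group
        groups.append(group)
    return groups


example_dfg = {
    "t1": ["a", "b"],    # t1 = a + b
    "t2": ["t1", "2"],   # t2 = t1 << 2
    "t3": ["c", "d"],    # t3 = c >> d
    "t4": ["t2", "t3"],  # t4 = t2 | t3
    "t5": ["t4", "e"],   # t5 = t4 + e
    "t6": ["t4", "f"],   # t6 = t4 - f
}
groups = max_misos(example_dfg)
```

In this example `t4` feeds two consumers, so it cannot be absorbed into either `t5` or `t6`; it becomes the root of its own group together with `t1`, `t2`, and `t3`.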
MaxMISO Limitations
- Only multiple input single output FU
- Cannot go beyond control flow boundaries
- Execution frequency not taken into account
- No dynamic reconfigurability
- MaxMISO may not be the optimal choice
  - Too big sub-graph
  - Too small sub-graph
Dynamic Execution Trace
- Identify and isolate frequently occurring patterns of operations
  - Pattern matching algorithm
  - On-the-fly construction of pattern library
  - Select matches for minimum cover
- Evaluate the most frequently occurring operation patterns in terms of how useful it would be to implement them as custom operations
- Downside: high-complexity algorithm
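
A heavily simplified sketch of the trace-based idea follows. It treats the trace as a flat list of opcodes and counts contiguous opcode sequences, whereas the approach summarized above matches dataflow patterns and selects a minimum cover, which this sketch does not attempt. The benefit score (occurrences times operations saved) and all names are assumptions made for illustration.

```python
from collections import Counter


def frequent_patterns(trace, max_len=3):
    """Rank contiguous opcode sequences of length 2..max_len in a trace.

    Returns (pattern, score) pairs sorted by a hypothetical benefit score:
    occurrences * (pattern length - 1), i.e. instructions saved if the
    pattern became one custom operation. Overlapping occurrences are
    counted independently, which overestimates the real benefit.
    """
    counts = Counter()
    for length in range(2, max_len + 1):
        for start in range(len(trace) - length + 1):
            counts[tuple(trace[start:start + length])] += 1
    scored = [(pattern, n * (len(pattern) - 1)) for pattern, n in counts.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


trace = ["ld", "mul", "add", "ld", "mul", "add", "st", "ld", "mul", "add"]
ranking = frequent_patterns(trace)
best_pattern, best_score = ranking[0]
```

Even this toy version enumerates every window of every length, which hints at why matching real dataflow patterns over long traces is expensive.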
Pattern Construction & Matching
[Figure: operation graph over inputs a to f with *, +, and ld nodes, covered by patterns P1(a, b, c), P2(c, d, e), P3(d, f), and P4(a, b, d, e), which are collected in a pattern library]
Dynamic Reconfiguration
[Figure: embedded processor and cache coupled to an FPGA-style fabric of functional units (FU) and registers (RG) around a reconfigurable interconnection]
- Functional units organized around an interconnection network create a programmable datapath
- Synthesize a datapath for each loop
- Small reconfiguration time due to coarse logic blocks
Datapath Merging
- Merge the different loop datapaths into a single reconfigurable datapath
- Reuse hardware blocks and interconnections across the loop datapaths as much as possible
- Datapath merging problem
  - Identify similarities among loop datapaths and produce a merged datapath with the minimum number of hardware blocks and interconnects
Datapath Merging: Example
[Figure: two loop datapaths over operands a, b, c, d with +, x, and - operations (node labels m, n), merged into a single datapath]
Datapath Merging Algorithm-1
[Figure: compatibility graph with nodes labeled A11, A12, A21, A22, A23, B11, B21, C11, C21]
- Find maximum clique in compatibility graph
Datapath Merging Algorithm-2
[Figure: compatibility graph with nodes labeled A11, A12, A21, A22, A23, B11, B21, C11, C21]
- Find maximum clique in compatibility graph
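
The clique formulation can be sketched as follows. This is one possible reading, not necessarily the construction shown in the figures: each node of the compatibility graph is a candidate mapping of an interconnect in one datapath onto an interconnect in the other, two mappings are compatible when together they pair hardware blocks one-to-one, and a maximum clique gives the largest set of interconnects that can be shared. The brute-force clique search is exponential and only suitable for tiny examples; all names and the example datapaths are hypothetical.

```python
from itertools import combinations


def edge_mappings(g1, g2, kind):
    """Pairs of interconnects (one per datapath) whose endpoint block types agree."""
    return [(e1, e2) for e1 in g1 for e2 in g2
            if kind[e1[0]] == kind[e2[0]] and kind[e1[1]] == kind[e2[1]]]


def block_pairs(mappings):
    """Block-to-block pairings implied by a collection of edge mappings."""
    pairs = set()
    for (u1, v1), (u2, v2) in mappings:
        pairs.add((u1, u2))
        pairs.add((v1, v2))
    return pairs


def compatible(m1, m2):
    """True if the two mappings never pair one block with two different blocks."""
    pairs = block_pairs((m1, m2))
    lefts = [p[0] for p in pairs]
    rights = [p[1] for p in pairs]
    return len(set(lefts)) == len(lefts) and len(set(rights)) == len(rights)


def max_clique(nodes):
    """Brute-force maximum clique: try the largest subsets first."""
    for size in range(len(nodes), 0, -1):
        for subset in combinations(nodes, size):
            if all(compatible(a, b) for a, b in combinations(subset, 2)):
                return list(subset)
    return []


kind = {"add1": "add", "mul1": "mul", "sub1": "sub",
        "add2": "add", "mul2": "mul", "sub2": "sub"}
loop1 = [("add1", "mul1"), ("mul1", "sub1")]
loop2 = [("add2", "mul2"), ("mul2", "sub2"), ("add2", "sub2")]
clique = max_clique(edge_mappings(loop1, loop2, kind))
shared_blocks = block_pairs(clique)
```

For these two hypothetical loop datapaths the clique contains two mappings, so the merged datapath can share three blocks and two interconnects and only needs one extra interconnect (add to sub) for the second loop.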
Summary
- High level synthesis technique for loops is not yet mature
- Very little research on compilation techniques for dynamic reconfigurability of loops
- Instruction set extension:
  - Static DFG-based technique has limitations
  - Dynamic trace-based technique is too slow
- Techniques for dynamic reconfiguration of custom FUs are yet to be developed