Automated Design of Custom Architecture
Tulika Mitra
http://www.comp.nus.edu.sg/~tulika

Motivation
• Embedded system is designed for a specific application or a class of applications
• Design a processor for an application domain
  - Processor ISA and micro-architecture optimized for that application domain
  - Multiple optimization criteria: time, power, cost, ...
• Stringent time-to-market constraint
• Design of domain-specific processor should be (semi-)automated

Architecture Synthesis
[Figure: diagram with Application, Power, Size, Performance, Timing, Automatic Tool, and Customized Architecture]

Tensilica: Xtensa Architecture
[Figure. Copyright: Tensilica]

Design Framework
[Figure. Copyright: Tensilica]

Design Framework: Key Steps
• Instantiate parameters for the core processor to optimize performance, power, or cost
• Identify useful and feasible ISA extensions
• Implement the domain-specific processor
• Implement compilers, assembler, simulator, debugger, …

Silicon Choices
• ASIC implementation of processor
  - Fast but not flexible
  - Time-intensive design process
  - Example: Tensilica, HP-STMicroelectronics
• Processor core in ASIC but instruction set extensions in reconfigurable logic
  - Medium speed but flexible
  - Fast design process
  - Example: Triscend Configurable System-on-Chip

Triscend Configurable SoC
[Figure. Copyright: Triscend]

Reconfigurable Computing 101
• Higher performance than software with higher level of flexibility than hardware
• e.g. Field Programmable Gate Arrays (FPGA)
  - Logic blocks
    - Array of computational elements whose functionality is determined through multiple SRAM configuration bits
  - Interconnection
    - Logic blocks are connected using programmable routing resources
  - Any custom circuit can be mapped to FPGA by computing logic functions within logic blocks and using configurable routing to connect the logic blocks together
• Dynamically reconfigurable logic
  - Logic reconfiguration during application execution
  - Temporal partitioning of software reduces logic area
  - Overhead for reconfiguration
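
The idea that a logic block's function is set by SRAM configuration bits can be illustrated with a small software model. The sketch below assumes the logic block is a k-input lookup table (LUT), which is one common FPGA organization; the class name `Lut` and its methods are hypothetical and not taken from any vendor tool.

```python
class Lut:
    """Software model of a k-input lookup table (hypothetical names)."""

    def __init__(self, k, config_bits):
        if len(config_bits) != 2 ** k:
            raise ValueError("a k-input LUT needs 2**k configuration bits")
        self.k = k
        self.config_bits = list(config_bits)

    def evaluate(self, *inputs):
        # The input bits form an address into the configuration memory.
        if len(inputs) != self.k:
            raise ValueError("wrong number of inputs")
        address = 0
        for bit in inputs:
            address = (address << 1) | (bit & 1)
        return self.config_bits[address]


# Same structure, two different functions: only the configuration bits change.
and2 = Lut(2, [0, 0, 0, 1])
xor2 = Lut(2, [0, 1, 1, 0])
```

Reconfiguring the device corresponds to rewriting `config_bits`; the evaluation logic itself never changes.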

Use of Reconfigurable Computing
• Two choices
  - Map both control and datapath to RC
  - Map only datapath to RC
• Granularity of reconfigurable logic
  - Bit
  - Multiple bits
  - ALU

RC Coupled to I/O
[Figure: block diagram with CPU, Memory, System Bus, Local Bus, PCI Bus, RC, and I/O]
• Most common form of commercial RC
• Overhead of data transfer between CPU and RC
• Requires large granularity of computation on RC
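
Why transfer overhead pushes toward coarse-grained kernels can be seen with a back-of-the-envelope cost model. This is a minimal sketch under simplifying assumptions (no overlap of transfer and computation, a fixed per-call setup cost); the function name and all numbers are hypothetical, chosen only for illustration.

```python
def offload_speedup(sw_time, hw_time, bytes_moved, bus_bandwidth, setup_time=0.0):
    """Speedup of running a kernel on the RC instead of the CPU.

    Times are in seconds, bus_bandwidth in bytes per second.
    Assumes transfer and computation do not overlap.
    """
    transfer_time = setup_time + bytes_moved / bus_bandwidth
    return sw_time / (hw_time + transfer_time)


# Fine-grained kernel: the hardware is 10x faster, but the bus cost dominates.
fine = offload_speedup(1e-6, 1e-7, 64, 100e6, setup_time=5e-6)

# Coarse-grained kernel: same 10x hardware advantage, overhead is amortized.
coarse = offload_speedup(1e-2, 1e-3, 64_000, 100e6, setup_time=5e-6)
```

With these made-up figures the fine-grained call is slower than staying on the CPU, while the coarse-grained one still benefits.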

RC Coupled to Local Bus
[Figure: block diagram with CPU, Memory, Local Bus, PCI Bus, RC, and I/O]
• Pilchard from Chinese University of Hong Kong
• Still requires large granularity of computation

PICO Architecture
[Figure. Copyright: Bob Rau et al.]

Design Framework
[Figure. Copyright: Bob Rau et al.]

Design Flow
[Figure. Copyright: Bob Rau et al.]

Hardware/Software Co-design
• Well-studied problem. Then what's new?
• High Level Synthesis (HLS)
  - Time-to-market constraint forces automated generation of reconfigurable bitstream from high-level specification or software
  - Automated generation of interface
• Spatial and temporal partitioning
  - Partitioning among multiple configurable devices
  - Map a function that exceeds the available space of the reconfigurable device using time sharing
• Requires new compilation techniques

High Level Synthesis-1
• High-level hardware description language
• Start from software programming language and add support for
  - Parallelism via threads
  - Message passing
  - Examples: Handel-C, SystemC
• Make current HDL more abstract
  - Superlog, SystemVerilog
• Still requires user to find parallelism

High Level Synthesis-2
• Combine research in two different fields: compiler and design automation
• Traditional HLS techniques target ASIC implementation
  - RC does not have the layout freedom
  - Objective of RC is to minimize execution time
• Temporal partitioning if insufficient area
• Hardware library of operators or structures commonly used by software programs

High Level Synthesis-3
• Concentrate on loops
• Leverage parallelizing compiler technology combined with high level synthesis
  - Parallelize computation
  - Optimize external memory access
  - Loop transformation: area versus performance
    - Unroll and jam
    - Loop unrolling
    - Software pipelining
    - Loop-invariant code motion
    - Data layout
• Hardware-specific optimizations
  - Bitwidth reduction
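
As an illustration of one of the listed loop transformations, here is a sketch of unroll and jam applied to a matrix-vector product. The kernel and the unroll factor of 2 are assumptions made for this example; the slides do not name a specific kernel.

```python
def matvec(a, x):
    """Baseline: y[i] = sum over j of a[i][j] * x[j]."""
    n, m = len(a), len(x)
    y = [0] * n
    for i in range(n):
        for j in range(m):
            y[i] += a[i][j] * x[j]
    return y


def matvec_unroll_and_jam(a, x):
    """Outer loop unrolled by 2, with the two inner-loop copies fused (jammed)."""
    n, m = len(a), len(x)
    y = [0] * n
    i = 0
    while i + 1 < n:
        for j in range(m):
            xj = x[j]                      # one fetch of x[j] feeds two multiply-adds
            y[i] += a[i][j] * xj
            y[i + 1] += a[i + 1][j] * xj
        i += 2
    if i < n:                              # leftover row when n is odd
        for j in range(m):
            y[i] += a[i][j] * x[j]
    return y
```

Mapped to hardware, the jammed body would use two multiply-accumulate units (more area) while sharing each `x[j]` access (less memory traffic), which is the area versus performance trade-off the slide refers to.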

RC Coupled to CPU as Coprocessor
[Figure: block diagram with CPU, Memory, Local Bus, PCI Bus, RC, and I/O]
• Tight coupling between CPU and RC
• RC can execute ISA extensions
• CPU and RC cannot share register file

RC Integrated in Processor Datapath
[Figure: block diagram with CPU, RC, Memory, Local Bus, PCI Bus, and I/O]
• Tightest coupling between CPU and RC
• RC implements custom functional units for ISA extensions
• CPU and RC share register file

Custom Functional Unit
[Figure: block diagram with Register File, ADD, MUL, LD/ST, CFU 1, and Memory]
• Typically no restriction on the number of inputs and outputs of a CFU

Changing Role of Compiler
• Standard compiler generates code for fixed ISA and micro-architecture
• Retargetable compiler accepts ISA + micro-architecture as input and generates code
• Compiler for domain-specific processor
  - First search and define the optimal ISA
  - Generate code for the optimal ISA
• Defining optimal ISA requires hardware knowledge but is more similar to traditional s/w compiler analysis than h/w synthesis
• This process will work for ASIC as well, but
  - No dynamic reconfigurability
  - Different choice of ISA due to speed difference

Instruction Set Extension
[Figure: data flow graph with inputs a, b, c, d, operations +, x, -, and outputs m, n]
• FU(i1, i2, i3) = (i1 + i2) x i3
• Two options to identify instruction set extensions
  - Static data flow graph (DFG)
  - Dynamic execution trace
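
The custom operation on this slide can be written out to show what an instruction set extension replaces. This is only a sketch: the function names are hypothetical and the instruction mnemonics in the comments are generic placeholders, not a real ISA.

```python
def fu(i1, i2, i3):
    """Custom operation from the slide: FU(i1, i2, i3) = (i1 + i2) x i3."""
    return (i1 + i2) * i3


def base_isa(i1, i2, i3):
    """The same computation with two base instructions and a temporary."""
    t = i1 + i2      # ADD t, i1, i2
    return t * i3    # MUL r, t, i3
```

Collapsing the add and the multiply into one functional unit removes one instruction issue and the intermediate register write for every occurrence of the pattern.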

Static Data Flow Graph
• Identify a special sub-graph called MaxMISO to be collapsed into instruction set extension
• MaxMISO: a maximal multiple-input single-output sub-graph
[Figure: two example sub-graphs built from +, <<, >>, and | operations, one labelled Correct and one labelled Incorrect]
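
One way to compute MaxMISOs is to grow each sub-graph backwards from an output node, absorbing a producer only when all of its consumers are already inside. The sketch below follows that idea under stated assumptions: the DFG is acyclic, it is given as a hypothetical `operands` dictionary, and only nodes without consumers count as outputs. It is not claimed to be the exact procedure the slides are based on.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def max_misos(operands):
    """Partition the operation nodes of a DFG into MaxMISOs.

    operands maps each operation node to the list of nodes it reads.
    Names that never appear as keys are treated as primary inputs.
    """
    ops = set(operands)
    consumers = {n: set() for n in ops}
    for node, srcs in operands.items():
        for s in srcs:
            if s in ops:
                consumers[s].add(node)

    # Visit consumers before producers (reverse topological order).
    order = [n for n in TopologicalSorter(operands).static_order() if n in ops]
    order.reverse()

    covered, result = set(), []
    for root in order:
        if root in covered:
            continue
        group, worklist = {root}, [root]
        while worklist:
            node = worklist.pop()
            for s in operands[node]:
                # Absorb a producer only if every consumer is already inside,
                # so the sub-graph keeps a single output (the root).
                if s in ops and s not in group and consumers[s] <= group:
                    group.add(s)
                    worklist.append(s)
        covered |= group
        result.append(group)
    return result


# Hypothetical DFG: "shl" feeds both "or" and "shr", so it cannot join either.
dfg = {"add": ["a", "b"], "shl": ["add", "c"], "or": ["shl", "d"], "shr": ["shl", "e"]}
groups = max_misos(dfg)
```

In this example `add` is merged with `shl` because `shl` is its only consumer, while `or` and `shr` each form their own single-node MaxMISO.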

MaxMISO Limitations
• Only multiple-input single-output FU
• Cannot go beyond control flow boundaries
• Execution frequency not taken into account
• No dynamic reconfigurability
• MaxMISO may not be the optimal choice
  - Sub-graph too big
  - Sub-graph too small

Dynamic Execution Trace
• Identify and isolate frequently occurring patterns of operations
  - Pattern matching algorithm
  - On-the-fly construction of pattern library
  - Select matches for minimum cover
• Evaluate the most frequently occurring operation patterns in terms of how useful it would be to implement them as custom operations
• Downside: high-complexity algorithm
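
A much simplified version of the first step, finding frequent operation patterns, can be sketched by counting short opcode sequences in a trace. This is an assumption-laden stand-in: it treats the trace as a flat opcode list rather than a data flow graph, and it does not attempt the minimum-cover selection. The function name and the trace are hypothetical.

```python
from collections import Counter


def frequent_patterns(trace, max_len=3, top=3):
    """Count every run of 2..max_len consecutive opcodes in the trace."""
    counts = Counter()
    for length in range(2, max_len + 1):
        for start in range(len(trace) - length + 1):
            counts[tuple(trace[start:start + length])] += 1
    return counts.most_common(top)


# Hypothetical dynamic trace of executed opcodes.
trace = ["ld", "mul", "add", "ld", "mul", "add", "st"]
patterns = frequent_patterns(trace)
```

Each returned pattern would then be evaluated as a candidate custom operation, for instance by weighing its frequency against the hardware cost of implementing it.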

Pattern Construction & Matching
[Figure: data flow graph over inputs a to f with *, +, and ld operations, annotated with patterns P1(a, b, c), P2(c, d, e), P3(d, f), and P4(a, b, d, e) collected in a pattern library]

Dynamic Reconfiguration
[Figure: block diagram with Embedded Processor, Cache, FPGA, functional units (FU), RG blocks, and Reconfigurable Interconnection]
• Functional units organized around an interconnection network create a programmable datapath
• Synthesize a datapath for each loop
• Small reconfiguration time due to coarse logic blocks

Datapath Merging
• Merge the different loop datapaths into a single reconfigurable datapath
• Reuse hardware blocks and interconnections across the loop datapaths as much as possible
• Datapath merging problem
  - Identify similarities among loop datapaths and produce a merged datapath with minimum number of hardware blocks and interconnects

Datapath Merging: Example
[Figure: two datapaths over inputs a, b, c, d with +, x, and - operations and outputs m, n, and the merged datapath]

Datapath Merging Algorithm-1
[Figure: datapaths and compatibility graph with nodes labelled A11, A12, A21, A22, A23, B11, B21, C11, C21]
• Find maximum clique in compatibility graph

Datapath Merging Algorithm-2
[Figure: compatibility graph and resulting merge with nodes labelled A11, A12, A21, A22, A23, B11, B21, C11, C21]
• Find maximum clique in compatibility graph
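
The maximum clique step can be sketched with an exhaustive Bron-Kerbosch search, which is practical only for small graphs. The sketch assumes each vertex stands for one candidate mapping between elements of two datapaths and each edge marks two mappings that can coexist; the vertex names `m1` to `m5` and the example graph are hypothetical, not the graph from the slides.

```python
def maximum_clique(adjacency):
    """Return one maximum clique of an undirected graph.

    adjacency maps each vertex to the set of its neighbours.
    """
    best = set()

    def expand(clique, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            # clique is maximal; keep it if it is the largest seen so far.
            if len(clique) > len(best):
                best = set(clique)
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & adjacency[v], excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adjacency), set())
    return best


# Hypothetical compatibility graph.
compat = {
    "m1": {"m2", "m3"},
    "m2": {"m1", "m3"},
    "m3": {"m1", "m2", "m4"},
    "m4": {"m3", "m5"},
    "m5": {"m4"},
}
chosen = maximum_clique(compat)
```

Under the stated assumption, the vertices in `chosen` are the largest set of mutually compatible mappings, that is, the sharing decisions that can all be applied in the merged datapath.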

Summary
• High level synthesis techniques for loops are not yet mature
• Very little research on compilation techniques for dynamic reconfigurability of loops
• Instruction set extension:
  - Static DFG-based technique has limitations
  - Dynamic trace-based technique is too slow
• Techniques for dynamic reconfiguration of custom FUs are yet to be developed