ECE 565 High-Level Synthesis: An Introduction
Shantanu Dutt, ECE Dept., UIC

HLS Flow
• Code/Algorithm → Architecture (functional units (FUs) and memory units (MUs) interconnected via muxes, demuxes, tristate buffers, buses, and dedicated interconnects)
• Classically, these 3 stages were performed sequentially, but currently they are performed together (which leads to better optimization)

HLS Flow (contd)

HLS Flow (contd)
• Taken into consideration during register allocation (post scheduling).
• Taken into consideration during scheduling.
• (Binding) Allocation: simple counting of FUs after the above 2 stages.

Simple HLS Examples

Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 (X) and 1 (+), w/ X delay of 2 cc's and + delay of 1 cc
(a) Scheduling
i) Non-overlapped pipelined scheduling: schedule an operation when its i/p data and an FU are available (may need to break ties between competing operations)
[Figure: schedule chart for the X and + FUs over cc's 1 to 6, showing c1(1), c2(1), c3(1), c1(2), c2(2), c3(2)]
(b) Arch. Synthesis: Binding & FU, reg, mux/demux allocation and interconnection
[Figure: datapath with registers a, b, c, d, x, y, z (load signals lda, ldb, ldc, ldd, ldx, ldy, ldz), mux1, mux2, the X and + FUs, and a demux]
(c) Controller FSM Synthesis
Controller FSM (entered on Reset):
  cc 3i: lda=1, ldb=1, ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1 [z ← x+y] (c3)
  cc 3i+1: mux1=0, mux2=0, demux=0, ldy=1 [y ← c+d] (c2)
  cc 3i+2: ldx=1 [x ← a × b] (c1)
Note: Unspecified control signals (cs) have either an inactive value or, if such a concept doesn't exist for the cs, the don't-care value.
Note: A register is loaded at the +ve/-ve edge (in a +ve/-ve edge triggered system) of the cc after the one in which its load signal is asserted (e.g., lda = 1 in one cc; reg. "a" is loaded at the edge of the next cc).

Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 (X) and 1 (+) (cont'd)
(a) Scheduling
ii) Overlapped pipelined scheduling
[Figure: schedule chart for the X and + FUs over cc's 1 to 6, showing c1(1), c2(1), c3(1), c1(2), c2(2), c3(2)]
(b) Arch. Synthesis
[Figure: datapath with registers a, b, c, d, x, y, z (load signals lda, ldb, ldc, ldd, ldx, ldy, ldz), mux1, mux2, the X and + FUs, and a demux]
(c) Controller FSM Synthesis
Controller FSM (entered on Reset), state labels as on the slide:
  cc 3i: lda=1, ldb=1, mux1=0, mux2=0, demux=0, ldy=1, ldx=1 [y ← c+d, x ← a × b] (c1, c2)
  cc 3i+1: ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1 [z ← x+y] (c3)
• For 4 iterations, the overlapped schedule takes 9 cc's versus 12 cc's for the non-overlapped sched.
• Overlap. sched: Time for n iterations = 2n+1; Throughput = n/(2n+1) ~ 0.5 outputs/cc
• Nonoverlap. sched: Time for n iterations = 3n; Throughput = n/(3n) ~ 0.33 outputs/cc
• Going from ~0.33 to ~0.5 outputs/cc is a ~50% throughput improvement using an overlapped schedule (equivalently, ~33% fewer cc's per output).
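The cycle counts and throughputs quoted for the two schedules can be checked with a few lines of arithmetic. This is a minimal sketch that takes the formulas 2n+1 and 3n from the slide as given; the function names are hypothetical.

```python
def overlapped_cycles(n):
    """Clock cycles for n iterations with the overlapped schedule (2n + 1)."""
    return 2 * n + 1


def non_overlapped_cycles(n):
    """Clock cycles for n iterations with the non-overlapped schedule (3n)."""
    return 3 * n


def throughput(n, cycles):
    """Outputs per clock cycle for n iterations under the given cycle formula."""
    return n / cycles(n)


for n in (4, 100, 10_000):
    t_ov = throughput(n, overlapped_cycles)
    t_non = throughput(n, non_overlapped_cycles)
    print(n, overlapped_cycles(n), non_overlapped_cycles(n),
          round(t_ov, 3), round(t_non, 3), round(t_ov / t_non, 3))
```

The last column is the throughput ratio 3n/(2n+1), which approaches 1.5 as n grows.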

Simple HLS Examples (contd)
• Some DFG control operation nodes:
  Selector: inputs in1 and in2, a Condition (T/F) input that picks one of them, and one output out
  Distributor: one input in, a Condition (T/F) input, and outputs out1 (T) and out2 (F)
• Conditional code: if (a > b) then c ← a-b; else c ← b-a;
• Possible DFGs corresponding to the above conditional code: [Figure: two alternative DFGs]
• Note that the 2 subs in the left DFG do not mean that an HLS algorithm will use 2 subtractors/adders. A good one will use 1, which will be shared in a mutually exclusive way between the two subs, which are in two different sections of an if-then-else.
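The behavior of the two control nodes, and the sharing of one subtractor between the mutually exclusive branches, can be sketched as follows. The function names are hypothetical, and the mapping of in1 to the T branch of the selector is an assumption (the figure is not available here); the operand-swapping arrangement is one possible way to share the subtractor, not necessarily the DFG on the slide.

```python
def selector(cond, in1, in2):
    """Selector node: forwards in1 when cond is True, else in2 (assumed mapping)."""
    return in1 if cond else in2


def distributor(cond, value):
    """Distributor node: routes value to out1 on True, out2 on False.

    The output that is not driven is modeled as None.
    """
    return (value, None) if cond else (None, value)


def conditional_diff(a, b):
    """if (a > b) then c = a - b else c = b - a, using a single subtraction."""
    cond = a > b
    left = selector(cond, a, b)    # minuend: a on the T branch, b on the F branch
    right = selector(cond, b, a)   # subtrahend: b on the T branch, a on the F branch
    return left - right            # the one shared subtractor


print(conditional_diff(7, 3), conditional_diff(3, 7), distributor(True, 5))
```

Because only one branch of the if-then-else is taken in any execution, the two selectors simply steer the right operands into the single subtractor.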

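Simple HLS Examples (contd)
• Iterative code: while (a > b) a ← a-b;
[Figure: DFG with a selector (sel), the comparison > (c1), a distributor (dist), and the subtraction - (c2); the selector condition is initialized to F]
(a) Scheduling (using only 1 adder/sub)
Scheduling & binding: c1 and c2 alternate on the single adder (+) in successive cc's
(b) Arch. Synthesis
[Figure: datapath with registers a, b, r1, and "final a" (load signals lda, ldb, ldr1, ldfina), a mux, the adder with inputs a, b' and cin = 1, and a demux; (s xor ovfl) goes to the FSM]
Notes from the figure:
• b' + 1 = 2's compl. representation of -b
• s xor ovfl = 1: result is -ve; s xor ovfl = 0: result is +ve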
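The two datapath facts used here, that a - b is computed on the adder as a + b' + 1 and that the true sign of the result is s xor ovfl, can be sketched in fixed-width arithmetic. The 8-bit width and the function name `sub_with_sign` are assumptions made for the illustration.

```python
def sub_with_sign(a, b, width=8):
    """Compute a - b on an adder as a + b' + 1 and return (result_bits, negative).

    negative = s xor ovfl, where s is the sign bit of the adder output and
    ovfl is the signed-overflow flag.
    """
    mask = (1 << width) - 1
    ua = a & mask                      # a as a width-bit pattern
    nb = ~(b & mask) & mask            # b' : bitwise complement of b
    res = (ua + nb + 1) & mask         # adder with cin = 1
    s = (res >> (width - 1)) & 1       # sign bit of the adder output
    sa = (ua >> (width - 1)) & 1       # sign bit of operand a
    snb = (nb >> (width - 1)) & 1      # sign bit of operand b'
    # Signed overflow: both adder operands share a sign that the result lacks
    ovfl = 1 if (sa == snb and s != sa) else 0
    return res, s ^ ovfl


print(sub_with_sign(5, 3))       # small positive difference
print(sub_with_sign(3, 5))       # negative difference
print(sub_with_sign(100, -100))  # true difference 200 overflows 8 bits, still +ve
```

Using s alone would misreport the sign whenever the subtraction overflows; xor-ing with ovfl corrects exactly those cases.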

Delay Nodes in DFGs
A delay node is generally implemented as a register (or a series of registers if clock period < T0); a delay node thus becomes a state variable.

Delay Nodes in DFGs (contd)
• Transformation in the DFG: [Figure: delay node replaced by a register]
• Mapping to the architecture w/ the register decoupling input and output, s.t. register i/p = o/p of combinational part and register o/p = i/p of combinational part. These can be treated as independent of each other, as their availabilities are in different time steps (e.g., clock cycles).

Detailed HLS Example

Detailed HLS Example (contd)
[Figure: different paths (i/p → o/p) in the DFG]
Scheduling heuristic: Among the available opers, schedule on the available FUs those whose delay to o/p is the highest, breaking ties in favor of those opers u whose "sibling" o/ps (o/ps to the same children) are avail. or will be available closest to u's earliest finish (i.e., asap time of child is earliest); otherwise the FU(s) will be unnecessarily idle, leading to a larger latency (this will also reduce lifetimes of sibling o/ps).
(a) Scheduling w/ one X (2 cc's) & one + (1 cc); Goal: Minimize latency
(b) Reg. alloc. for o/p of operations. For WAR constraint: can't store in d1, as would be natural, as d1's current data is yet to be consumed by c6, which has not been scheduled yet.
(c) Arch. synthesis: the synthesized architecture
Note: The above register allocation for the adder has been done w/ separate regs for multiplier and adder o/ps. It is sub-optimal (4 non-primary i/p regs. needed).
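The core of the scheduling heuristic (pick, among the ready operations, the one with the highest delay to the output) can be sketched as a small list scheduler. This sketch assumes one non-pipelined X with a 2-cc delay and one + with a 1-cc delay; the DFG `example_dfg`, the operation names, and the function `list_schedule` are hypothetical and are not the DFG of the slide. The sibling-based tie-break is not implemented; ties fall to the first operation in dictionary order.

```python
from functools import lru_cache

DELAY = {"*": 2, "+": 1}   # one multiplier (2 cc's) and one adder (1 cc)


def list_schedule(ops):
    """ops: name -> (kind, [predecessor names]). Returns (start, finish) cycles."""
    succs = {u: [] for u in ops}
    for u, (_, preds) in ops.items():
        for p in preds:
            succs[p].append(u)

    @lru_cache(maxsize=None)
    def delay_to_output(u):
        # Longest path delay from the start of u to a primary output
        return DELAY[ops[u][0]] + max((delay_to_output(v) for v in succs[u]), default=0)

    start, finish = {}, {}
    fu_free = {kind: 1 for kind in DELAY}        # first cycle each FU is free
    cc = 1
    while len(start) < len(ops):
        for kind in DELAY:
            if fu_free[kind] > cc:
                continue                          # FU still busy in this cycle
            ready = [u for u, (k, preds) in ops.items()
                     if k == kind and u not in start
                     and all(p in finish and finish[p] < cc for p in preds)]
            if ready:
                u = max(ready, key=delay_to_output)   # highest delay to o/p first
                start[u] = cc
                finish[u] = cc + DELAY[kind] - 1
                fu_free[kind] = finish[u] + 1
        cc += 1
    return start, finish


example_dfg = {
    "m1": ("*", []),
    "m2": ("*", []),
    "a1": ("+", ["m1"]),
    "a2": ("+", []),
    "a3": ("+", ["a1", "m2"]),
    "a4": ("+", ["a3", "a2"]),
}
start, finish = list_schedule(example_dfg)
print(start, "latency =", max(finish.values()))
```

Here m1 is scheduled before m2 because its path to the output (through a1, a3, a4) is longer, which is the effect the heuristic aims for.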

Detailed HLS Example (contd)

Detailed HLS Example—Register Allocation

Detailed HLS Example: Register Allocation (contd)
Scheduling heuristic: As stated earlier. 3 non-primary i/p regs. needed.
• In the conflict graph, there is an edge between 2 variable nodes if their lifetimes overlap (indicating that different registers need to be allocated to them). There is one conflict graph per FU (as here) if regs are grouped by FU, one per FU type if regs are shared across each FU type, or only one (global) if regs are shared across all FUs.
• Graph coloring (using the min. # of colors to color the nodes s.t. connected node pairs have different colors) is in general NP-hard.
• The above type of conflict graph is called an interval graph (derived from the 1-dimensional intervals of the lifetimes).
• Min. graph coloring can be solved optimally in linear time for interval graphs, once the intervals are sorted by left endpoint (using the left-edge algorithm that we will see later for channel routing).
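Register allocation on an interval graph can be sketched with a greedy pass in the spirit of the left-edge algorithm: sort the lifetimes by left endpoint and put each variable into the first register whose previous occupant has already died. The variable names, the lifetimes, and the function name `left_edge_allocate` are hypothetical; lifetimes are treated as half-open intervals [start, end), which is an assumption. This simple version scans all registers per variable, so it is not the linear-time formulation.

```python
def left_edge_allocate(lifetimes):
    """lifetimes: var -> (start, end), live over the half-open interval [start, end).

    Returns var -> register index, using the minimum number of registers
    for interval lifetimes.
    """
    reg_end = []        # reg_end[r] = end time of the last variable placed in register r
    assignment = {}
    for var, (s, e) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for r, last_end in enumerate(reg_end):
            if last_end <= s:           # register r is free again: reuse it
                reg_end[r] = e
                assignment[var] = r
                break
        else:                           # every register is still occupied
            reg_end.append(e)
            assignment[var] = len(reg_end) - 1
    return assignment


example_lifetimes = {"v1": (1, 3), "v2": (2, 5), "v3": (3, 6), "v4": (5, 7), "v5": (6, 8)}
allocation = left_edge_allocate(example_lifetimes)
print(allocation, "registers used:", len(set(allocation.values())))
```

In this example at most two lifetimes overlap at any time, and the greedy pass indeed uses two registers, which matches the lower bound set by the maximum overlap.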

Detailed HLS Example: Register Allocation (contd)
Scheduling heuristic: Among the available opers, schedule on the available FUs those whose delay to o/p is the highest, breaking ties arbitrarily. 3 non-primary i/p regs. needed.
B's lifetime increases, but D's (dep. of B) decreases similarly; the heuristic should be based on more global information.