Platform Design: ASIP (Application Specific Instruction-set Processor), TU/e

Platform Design: ASIP (Application Specific Instruction-set Processor)
TU/e 5kk70
Henk Corporaal, Bart Mesman
6/17/2021 Platform Design H. Corporaal and B. Mesman

Application domain specific processors (ADSP or ASIP)
[Figure: processor spectrum with the labels programmable CPU, programmable DSP, application domain specific processor, and application specific processor, plotted against two axes: flexibility and efficiency.]

Application domain specific processors (ADSP or ASIP)
An ADSP takes a well-defined application domain as its starting point:
• exploits characteristics of the domain (computation kernels)
• still programmable within the domain
e.g. MPEG-2 coding uses an 8*8 DCT transform; other domains: DECT, GSM, etc.

GP implementation of an application domain:
• performance: clock speed + ILP
• flexible development (new applications)
• problems: manual design, large effort

ADSP implementation of an application domain:
• performance: ILP + tuning to domain
• cost effective (high volume)
• problems: specification, design time and effort => synthesized cores

www.adelantetech.com

Outline
• design process
• retargetable code generation (problem statement)
• ADSP/VLIW architectures (Mistral 2 / A|RT designer)
• low power aspects (Mistral 2 / A|RT designer)
• discussion
• conclusion

Design process
[Flowchart: the application(s) and an instance of the processor model (e.g. VLIW with shared RFs, set through parameters) feed two paths. SW (code generation) yields estimations of cycles/algorithm and occupation; HW design yields estimations of nsec/cycle, area, and power/instruction. Decision nodes: "OK?" and "more appl.?"; the loop ends with "go to phase 2".]
3 phases:
1. exploration
2. HW design (layout) + processing
3. design of application SW
Fast, accurate and early feedback.

Problem statement
A compiler is retargetable if it can generate code for a 'new' processor architecture specified in a machine description file.
A guarded register transfer pattern (GRTP) is a register transfer pattern (RTP) together with the control bits of the instruction word that control the RTP, e.g.
  a := b + c | instr = xxxx 0101
GRTPs contain all inter-RT-conflict information.
Instruction set extraction (ISE) is the process of generating all possible GRTPs for a specific processor.
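The statement that GRTPs carry all inter-RT-conflict information can be illustrated with a small sketch. It assumes that each guard is a plain bit pattern with don't-care positions, as in the `xxxx 0101` example; the class name `GRTP`, the function `conflicts`, and the second and third encodings are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GRTP:
    """A register transfer pattern plus the instruction bits that guard it."""
    rtp: str    # e.g. "a := b + c"
    guard: str  # one character per instruction bit: '0', '1', or 'x' (don't care)


def conflicts(p: GRTP, q: GRTP) -> bool:
    """True if some instruction bit must be 0 for one pattern and 1 for the other."""
    if len(p.guard) != len(q.guard):
        raise ValueError("guards must cover the same instruction word")
    return any(a != b and "x" not in (a, b) for a, b in zip(p.guard, q.guard))


add = GRTP("a := b + c", "xxxx0101")    # encoding from the slide
sub = GRTP("a := b - c", "xxxx0110")    # hypothetical encoding
load = GRTP("d := mem[e]", "01xxxxxx")  # hypothetical encoding

print(conflicts(add, sub))   # True: the two low bits disagree
print(conflicts(add, load))  # False: the fixed bits never overlap
```

Under this assumption, two register transfers can share one instruction word exactly when their guards do not conflict, which is the information a scheduler needs from ISE.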

Problem statement
[Figure: the algorithm spec passes through the front end (FE) to a CDFG; the processor spec (instance) passes through ISE to GRTPs; code generation takes the CDFG and the GRTPs and produces machine code. Slide note: "in ch. 4 this is part of the code generator".]

Example: Simple processor [Leupers]
[Figure: datapath with the blocks Inp, PC, +1, IM, RAM, REG, and outp; the instruction word I.(20:0) is split into the fields I.(20:13), I.(12:5), I.(4), I.(3:2), and I.(1:0).]

Example: Simple processor [Leupers] (continued; figure only)

ASIP/VLIW architectures
A|RT designer template as an example (= set of rules, a model)
Differences with the VLIW processors of ch. 4:
1. Parallel FUs
  • ASUs = complex application-specific FUs (beyond subword parallelism), e.g. biquad, median, DCT, etc.
  • larger grain size, more heterogeneous, more pipelines
2. Register files
  • many register files (>5 vs. 1 or 2)
  • limited number of ports (3 vs. 15)
  • limited size (<16 vs. 128)
3. Issue slots
  • all in parallel vs. 5

[Figure: architecture with register files RF1 to RF8, functional units FU1 to FU4, flags, instruction registers IR1 to IR4, instruction memory, and control.]

ASIP/VLIW architectures
Additional characteristics of the A|RT designer template:
• interconnect network: busses + input multiplexers
  – mux control is part of the instruction
  – control can change every clock cycle
  – network can be incomplete
  – busses can be merged
• memories are modeled as FUs
  – separate data in and data out
  – 2 inputs (data in and address) and 1 output
• each FU can generate one or more flags
• instruction format (per issue slot): read address RF1 | write address RF2 | mux 1 | mux 2 | control FU | output drivers

ASIP/VLIW architectures: example
[Figure: RF1 and RF2 with an ALU, RF3 and RF4 with a MAC, connected through bus 1, bus 2, and input multiplexers (mux 2, mux 3). The instruction word is drawn with bit positions 19, 10, 9, and 0 marked and holds read/write fields for RF1 to RF4, the mux fields, the ALU instr., and the MAC instr.]

ASIP/VLIW architectures: example (continued; figure only)

ASIP/VLIW architectures: design flow
Pragmas:
  assign (a+b, ALU, fu_alu1)
  assign (a+_, ALU, fu_alu2)
  assign (_+_, ALU, fu_alu3)
[Flowchart: algorithm spec -> datapath synthesis -> RTs -> controller synthesis -> estimations (area, power, timing) -> "OK?"; on "no", change the pragmas and repeat; on "yes", done.]
Example RT:
  RF1:x = RF2:y, RF3:z | ALU = ADD, Inmux = bus2
VLIW makes relatively simple code selection possible.

ASIP/VLIW architectures: list scheduling
[Figure: list scheduling of a data-flow graph of * and + operations (numbered 1 to 10) onto the resources IPB, MULT, ALU, and OPB. Each step takes the candidate list, applies the conflict and priority computation, and selects the scheduled operation.]
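The list-scheduling loop named on this slide (candidate list, conflict and priority computation, scheduled operation) can be sketched as follows. This is a minimal illustration, not the Mistral 2 / A|RT designer algorithm: it assumes unit latency for every operation, uses the longest path to a sink as the priority, and treats a conflict as a lack of free units on a resource. The function name `list_schedule`, the operation names, and the small example graph are hypothetical; only the resource names MULT and ALU come from the slide.

```python
from collections import defaultdict


def list_schedule(ops, deps, capacity):
    """Schedule a data-flow graph with resource constraints.

    ops:      {operation: resource}
    deps:     {operation: set of predecessor operations}
    capacity: {resource: units available per cycle}, each at least 1
    Returns {operation: cycle}. Assumes an acyclic graph and unit latency.
    """
    succs = defaultdict(set)
    for op, preds in deps.items():
        for p in preds:
            succs[p].add(op)

    prio = {}

    def height(op):
        # Priority = length of the longest path from op to a sink.
        if op not in prio:
            prio[op] = 1 + max((height(s) for s in succs[op]), default=0)
        return prio[op]

    for op in ops:
        height(op)

    schedule, cycle = {}, 0
    while len(schedule) < len(ops):
        # Candidate list: unscheduled ops whose predecessors ran in earlier cycles.
        candidates = [
            op for op in ops
            if op not in schedule
            and all(p in schedule and schedule[p] < cycle for p in deps.get(op, ()))
        ]
        candidates.sort(key=lambda op: (-prio[op], op))
        used = defaultdict(int)
        for op in candidates:
            res = ops[op]
            if used[res] < capacity[res]:  # conflict check: free unit this cycle?
                schedule[op] = cycle
                used[res] += 1
        cycle += 1
    return schedule


ops = {"m1": "MULT", "m2": "MULT", "m3": "MULT", "a1": "ALU", "a2": "ALU"}
deps = {"a1": {"m1", "m2"}, "a2": {"a1", "m3"}}
result = list_schedule(ops, deps, {"MULT": 1, "ALU": 1})
print(result)
```

With one multiplier and one ALU, the three multiplications are serialized in cycles 0 to 2 by priority, `a1` runs in cycle 2 alongside `m3`, and `a2` finishes the schedule in cycle 3.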

ASIP/VLIW architectures: feedback (figure only)

Low power aspects
• Estimation of area, speed + power
[Figure: blocks labeled Implementation Independent Design Database, Mistral 2, Architecture, and Estimation Database.]

GSM Viterbi decoder: default solution (13750 cycles)

EXU     alu_1   romctrl_1   acu_1   ipb_1   opb_1   ctrl     total
ACTIV   96%     48%         26%     5%      23%
AREA    3469    39          327     131     1804    9821     15591
POWER   46196   259         1209    105     5801    135035   188605

• controller responsible for 70% of power consumption
  – maximum resource-sharing
  – heavy decision-making: "main" loop with 16 metrics-computations per iteration
• EXU numbers include registers for local storage

GSM Viterbi decoder: no loop-folding (14247 cycles)

EXU     alu_1   romctrl_1   acu_1   ipb_1   opb_1   ctrl    total
ACTIV   92%     45%         25%     5%      22%
AREA    3411    39          294     107     1661    4919    10431
POWER   45073   255         1087    86      5340    70087   121928

• area down by 33%
• power down by 35%
• next step: reduce the number of program steps with a second ALU

GSM Viterbi decoder: 2 ALUs (9739 cycles)

EXU     alu_1   alu_2   romctrl_1   acu_1   ipb_1   opb_1   ctrl    total
ACTIV   69%     65%     67%         37%     8%      33%
AREA    1797    1393    39          294     149     2136    8957    14766
POWER   12248   8916    255         1087    119     6871    87235   116731

• cycle count down 30%
• area up 42%
• power down by 5%
• next step: introduce an ASU to reduce the ALU load

GSM Viterbi decoder: 1 x ACS-ASU (1930 cycles)

  func ACS (M1, M2, d) MS, MS8 =
  begin
    MS  = if (M1 + d > M2 - d) -> (M1 + d) || (M2 - d) fi;
    MS8 = if (M1 - d > M2 + d) -> (M1 - d) || (M2 + d) fi;
  end;

EXU     alu_1   acs_asu_1   or_asu_1   romctrl_1   acu_1   ipb_1   opb_1   ctrl   total
ACTIV   20%     83%         10%        16%         36%     20%     11%
AREA    261     2382        611        65          294     107     163     1864   5747
POWER   105     3816        122        21          205     43      35      3597   7944

• cycle count down 5x
• power down 20x!
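The ACS (add-compare-select) function that this ASU implements in hardware can be restated as a short software sketch. It follows the two guarded selections on the slide literally: each output is the larger of two candidate metrics. The function name `acs` and the sample arguments are illustrative choices, and integer metrics are an assumption.

```python
from typing import Tuple


def acs(m1: int, m2: int, d: int) -> Tuple[int, int]:
    """Add-compare-select for one butterfly, following the slide's ACS function."""
    # MS: add d to M1 and subtract d from M2, then keep the larger candidate.
    ms = m1 + d if m1 + d > m2 - d else m2 - d
    # MS8: the mirrored case, subtract d from M1 and add d to M2.
    ms8 = m1 - d if m1 - d > m2 + d else m2 + d
    return ms, ms8


print(acs(10, 12, 3))  # (13, 15)
```

On an ALU this takes four additions or subtractions, two comparisons, and two selections per call, which is the load the ASU removes from the ALU.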

GSM Viterbi decoder: 4 x ACS-ASU (425 cycles)

EXU:   alu_1 acs_asu_2 acs_asu_3 acs_asu_4 split_asu_1 or_asu_1 romctrl_1 acu_1 ipb_1 opb_1 ctrl total
ACTIV: 94% 95% 95% 47% 28% 98% 23% 50%
AREA:  243 1041 90 592 48 212 60 369 1306, total 7084
POWER: 97 420 420 18 118 6 85 6 80 555, total 2645

• cycle count down another 5x
• area up 23%
• power down another 3x!

GSM Viterbi example: summary
[Figure: Implementation Independent Design Database and Mistral 2; overall result marked as "72 x !".]
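The step-by-step claims on the GSM Viterbi slides can be checked against the table totals with a few lines of arithmetic. The sketch below copies the total area and total power of each design point from the tables; reading the bare number on each slide (13750, 14247, 9739, 1930, 425) as a cycle count is an assumption, and the names `STEPS`, `pct_change`, and `factor` are illustrative.

```python
# (label, cycles, total area, total power) copied from the slide tables.
# Treating the bare number on each slide as a cycle count is an assumption.
STEPS = [
    ("default",         13750, 15591, 188605),
    ("no loop-folding", 14247, 10431, 121928),
    ("2 ALUs",           9739, 14766, 116731),
    ("1 x ACS-ASU",      1930,  5747,   7944),
    ("4 x ACS-ASU",       425,  7084,   2645),
]


def pct_change(old, new):
    """Signed percentage change from old to new."""
    return 100.0 * (new - old) / old


def factor(old, new):
    """Reduction factor old/new (greater than 1 means a reduction)."""
    return old / new


for (_, c0, a0, p0), (name, c1, a1, p1) in zip(STEPS, STEPS[1:]):
    print(f"{name:16s} cycles {pct_change(c0, c1):+6.1f}%  "
          f"area {pct_change(a0, a1):+6.1f}%  power {pct_change(p0, p1):+6.1f}%  "
          f"(cycles {factor(c0, c1):.2f}x, power {factor(p0, p1):.2f}x)")

overall_power = factor(STEPS[0][3], STEPS[-1][3])
print(f"overall power reduction: {overall_power:.1f}x")
```

Most computed values agree with the slide bullets: area down 33.1% and power down 35.4% without loop-folding, area up 41.6% with two ALUs, a cycle-count factor of about 5.0 for one ACS-ASU, and area up 23.3% with a power factor of about 3.0 for four. Two values differ from the slide text: the power factor for the single ACS-ASU step comes out at about 14.7x from these totals (the slide says 20x), and the overall power reduction is about 71.3x (the summary slide says 72x).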

Discussion: phase 3
[Flowchart: in the exploration phase, the application(s) and the processor model drive SW (code generation) and HW design, with "OK?" and "more appl.?" checks. The processor model is then frozen, after which new application(s) go through SW (code generation) only, again with an "OK?" check.]
Application software development: constraint driven compilation.

Discussion: problems with VLIWs
Code size and instruction bandwidth:
• code compaction = reduce code size after scheduling
  – possible compaction ratio? e.g. p0 = 0.9 and p1 = 0.1
  – information content (entropy) = -Σ pi log2 pi = 0.47
  – maximum compression factor ≈ 2
• control parallelism during scheduling = switch between different processor models (10% of code = 90% of runtime)
• architecture: reduce the number of control bits for operand addresses
  – e.g. 128 registers (TM) -> 28 bits/issue slot for addresses only
  – => use stacks and FIFOs
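The entropy figure and the address-bit count on this slide can be reproduced with a short calculation. The sketch assumes the slide models each instruction bit as an independent binary source with probabilities p0 and p1, so that the entropy is the minimum average number of bits per original bit. The function name `entropy_bits` is illustrative, and reading 28 bits as four 7-bit register addresses per issue slot is an inference from 28 / 7, not something the slide states.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits: -sum(p * log2(p)) over the nonzero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


h = entropy_bits([0.9, 0.1])
print(f"entropy per bit:        {h:.2f}")      # 0.47
print(f"max compression factor: {1 / h:.2f}")  # 2.13

# Operand addressing cost for a 128-entry register file.
bits_per_address = math.ceil(math.log2(128))   # 7 bits select one of 128 registers
addresses_per_slot = 28 // bits_per_address    # 4, inferred from the slide's 28 bits
print(bits_per_address, addresses_per_slot)
```

The compression bound of about 2.1 is what the slide rounds to 2. The second part shows why smaller register files, stacks, or FIFOs shorten the instruction word: each explicit operand address costs log2 of the register-file size in control bits.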

[Figure: architecture with register files RF1 to RF4, functional units FU1 to FU4, flags, instruction registers IR1 to IR4, instruction memory, and control.]

Conclusions
• ASIPs provide efficient solutions for well-defined application domains (2 orders of magnitude higher efficiency).
• The methodology is interesting for IP creation.
• The key problem is retargetable compilation.
• A (distributed) VLIW model is a good compromise between HW and SW.
• Although an automatic process can generate a default solution, the process is usually interactive and iterative for efficiency reasons. The key is fast and accurate feedback.