Roadmap
• Background: Decoupled Spatial Architectures
• Decoupled-Spatial Architecture Programming
• Compiling High-Level Lang. to DSA
• Advanced DSA Programming
• Composing a Spatial Architecture
• The Scala Embedded DSL
Hardware Stack: Composing Spatial Architecture
• Overview of DSAGEN H/W Stack (3 mins)
• Goal: Build Your Own Spatial Architecture (25 mins)
  • A Scala Embedded DSL – Build A Simple CGRA (10 mins)
  • Generate IR/RTL (3 mins)
  • Power/Area Estimation (2 mins)
  • Build your own CGRA (10 mins)
• Advanced Hardware Features (5 mins)
• An End-to-End Approach (5 mins)
• Q & A
DSAGEN Hardware Stack
1. Define your own architecture
2. Design-Space Exploration*
3. Generate the IR for the compiler
4. Generate the RTL for hardware research
• ChipYard Integration
  • Run Stream-Dataflow Program Binary on DSAGEN
  • Integrate with other IPs
*Design-Space Exploration will not be covered today
A Scala Embedded DSL – Build A Simple CGRA
• Our Spatial Architecture:
  • CGRA
    • Processing Element
    • Switch
    • Vector Port
  • Memory System
  • RISC-V CPU (Rocket Core)
[Figure: block diagram labeled RISC-V CPU, Memory, Mem. Controller, and Recurrence; streams a[0:n] and b[0:n] feed a "+" that produces c[0:n]]
A Scala Embedded DSL – Build A Simple CGRA
    val vport_i1 = new ssnode("vector port")
    val vport_i2 = new ssnode("vector port")
    val vport_o1 = new ssnode("vector port")
    val sw1 = new ssnode("switch")
    val sw2 = new ssnode("switch")
    val sw3 = new ssnode("switch")
    val sw4 = new ssnode("switch")
    val pe = new ssnode("processing element")
    pe("instructions", Seq("Add", "Mul"))
[Figure: the fabric nodes vport_i1, vport_i2, sw1, sw2, pe, sw3, sw4, vport_o1]
A Scala Embedded DSL – Build A Simple CGRA
    // define the CGRA fabric
    val my_cgra = new ssfabric
    my_cgra("default_data_width", 64)
    // define connection
    my_cgra(
      vport_i1 --> sw1)(
      vport_i2 --> sw2)(
      sw3 --> vport_o1)(
      sw1 <-> sw2)(
      sw1 <-> sw3)(
      sw2 <-> sw4)(
      sw3 <-> sw4)(
      sw1 --> pe)(
      sw2 --> pe)(
      sw3 --> pe)(
      sw4 <-> pe
    )
[Figure: the connected fabric: vport_i1, vport_i2, sw1, pe, sw3, sw4, vport_o1]
A Scala Embedded DSL – Build A Simple CGRA
    // dsa-cgra-gen/src/main/real/micro_demo
    package real
    import dsl._
    object micro_demo extends App {
      // define vector port
      val vport_i1 = new ssnode("vector port")
      val vport_i2 = new ssnode("vector port")
      val vport_o1 = new ssnode("vector port")
      // define switch
      val sw1 = new ssnode("switch")
      val sw2 = new ssnode("switch")
      val sw3 = new ssnode("switch")
      val sw4 = new ssnode("switch")
      // define processing element that can do Add and Mul
      val pe = new ssnode("processing element")
      pe("instructions", Seq("Add", "Mul"))
      // define the CGRA fabric
      val my_cgra = new ssfabric
      my_cgra("default_data_width", 64)
      // define connection
      my_cgra(
        vport_i1 --> sw1)(vport_i2 --> sw2)(
        sw3 --> vport_o1)(
        sw1 <-> sw2)(sw1 <-> sw3)(sw2 <-> sw4)(sw3 <-> sw4)(
        sw1 --> pe)(sw2 --> pe)(sw3 --> pe)(sw4 <-> pe
      )
    }
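To see how a Scala embedded DSL can support syntax such as `a --> b` and chained `fabric(...)(...)` calls, here is a minimal, self-contained sketch. It is a hypothetical stand-in, not the dsa-cgra-gen implementation: `MiniFabric`, `Node`, `Fabric`, `allLinks`, `fanIn`, and `fanOut` are illustrative names that do not come from the slides. It rebuilds the topology of micro_demo as a plain edge list.

```scala
// Hypothetical stand-in for the DSL; NOT the dsa-cgra-gen implementation.
object MiniFabric {
  final case class Node(name: String, kind: String) {
    // a --> b : one directed link from a to b
    def -->(that: Node): Seq[(Node, Node)] = Seq((this, that))
    // a <-> b : two directed links, one per direction
    def <->(that: Node): Seq[(Node, Node)] = Seq((this, that), (that, this))
  }

  final class Fabric {
    private var edges = Vector.empty[(Node, Node)]
    // Returning `this` is what makes fabric(a --> b)(b <-> c) chainable.
    def apply(links: Seq[(Node, Node)]): Fabric = { edges = edges ++ links; this }
    def allLinks: Vector[(Node, Node)] = edges
    def fanIn(n: Node): Int = edges.count(_._2 == n)
    def fanOut(n: Node): Int = edges.count(_._1 == n)
  }

  val vportI1 = Node("vport_i1", "vector port")
  val vportI2 = Node("vport_i2", "vector port")
  val vportO1 = Node("vport_o1", "vector port")
  val sw1 = Node("sw1", "switch")
  val sw2 = Node("sw2", "switch")
  val sw3 = Node("sw3", "switch")
  val sw4 = Node("sw4", "switch")
  val pe  = Node("pe", "processing element")

  // Same connections as micro_demo, expressed with the stand-in operators.
  val fabric: Fabric = (new Fabric)(
    vportI1 --> sw1)(vportI2 --> sw2)(sw3 --> vportO1)(
    sw1 <-> sw2)(sw1 <-> sw3)(sw2 <-> sw4)(sw3 <-> sw4)(
    sw1 --> pe)(sw2 --> pe)(sw3 --> pe)(sw4 <-> pe)
}
```

In this model the processing element has four incoming links (from sw1 to sw4) and one outgoing link (to sw4), which is a quick way to sanity-check a connection list before generating IR.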
Before we start...
• Tool(s) I will use: ss_sched (in dsa-release.zip)
• Repositories: dsa-framework/dsa-cgra-gen, dsa-examples
• dsa-framework/dsa-cgra-gen
    $ cd <your workspace>
    $ git clone https://github.com/PolyArch/dsa-framework.git
    $ cd dsa-framework
    $ git submodule update --init dsa-cgra-gen
• dsa-examples
    $ cd <your workspace>
    $ git clone https://github.com/PolyArch/dsa-examples.git
• Optional tools:
  • Graphviz: $ sudo apt-get install graphviz
Generate IR/RTL
1. Add printIR to the end of the CGRA description file
    // dsa-cgra-gen/src/main/real/micro_demo
    package real
    import dsl._
    object micro_demo extends App {
      // define vector port
      …………
      // define switch
      …………
      // define processing element that can do Add and Mul
      …………
      // define the CGRA fabric
      …………
      // define connection
      …………
      // print IR
      my_cgra.printIR(filename = "my_cgra")
    }
2. Generate IR through sbt
    $ cd dsa-framework/dsa-cgra-gen
    $ sbt:dsa-cgra-gen> runMain real.micro_demo
3. Check the generated IR
    // dsa-cgra-gen/my_cgra.json
    {
      "default_data_width" : 64,
      "identifier" : [ "id", "nodeType" ],
      ..
      "module_type" : "cgra.fabric.cgra_fabric"
    }
4. Generate RTL (Verilog)
    $ cd dsa-framework/dsa-cgra-gen && sbt
    $ sbt:dsa-cgra-gen> cgra.driver.generator my_cgra.json
Power/Area Estimation
1. Power and area estimation is done via the spatial scheduler
    $ cp dsa-framework/dsa-cgra-gen/my_cgra.json dsa-examples/hw
    $ cd dsa-examples/hw
    $ ss_sched my_cgra.json
2. Power/area estimation results from the spatial scheduler
    ----- Sample results like below -----
    FU:       422075 um^2   447.399 mW
    Network:   91593 um^2   22.8983 mW
    Sync:     139971 um^2   96.2 mW
    Memory:    39325 um^2   12.14 mW
    ----- What's your result? -----
Build Your Own CGRA
Can you build a CGRA like this?
[Figure: the simple CGRA extended with vport_i3, sw5, and a Div processing element beside the Add/Mul one; other nodes: vport_i1, vport_i2, sw1, sw2, sw3, sw4, vport_o1]
Answer:
    // dsa-cgra-gen/src/main/scala/real/micro_demo_answer
    package real
    import dsl._
    object micro_demo_sihao extends App {
      ....
      // put your code here
      // Add a new vector port
      val vport_i3 = new ssnode("vector port")
      // Add a new switch
      val new_sw = new ssnode("switch")
      // Add a new processing element
      val pe_div = new ssnode("processing element")
      pe_div("instructions", Seq("UDiv"))
      // Add connections to fabric
      my_cgra(vport_i3 --> new_sw)(
        sw1 --> pe_div)(
        new_sw --> pe_div)(
        pe_div --> sw3)
      // print IR
      my_cgra.printIR(filename = "my_cgra_div")
    }
Visualize Your Own CGRA
• Perform spatial schedule
    $ cd <your workspace>/dsa-examples/hw
    $ ss_sched my_cgra_div.json add-div.dfg
• Visualize the schedule result (graphviz required)
    $ cd <your workspace>/dsa-examples/hw/viz
    $ neato -Goverlap=false -Gstart=self -Gepsilon=.0000001 -Tpng -o add-div_my-cgra.png ./add-div.my_cgra_div.gv
Basic Concept: Switch
    // Define Switch
    val my_sw = new ssnode("switch")
• Simplest switch
  • 4 input ports
  • 3 output ports
  • One 9-bit config register
• Each output needs 3 bits to select among 5 sources (inputs 1~4 + ground); 3 bits x 3 outputs = 9-bit config
• Decode config. information
[Figure: a 4-input, 3-output switch: one MUX per output, driven by the config register]
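The config-width arithmetic for the simplest switch can be sketched as follows. This is only an illustration of the counting argument on the slide (one extra "ground" source per output, one select field per output); `SwitchConfig`, `selectBits`, and `configBits` are hypothetical names, not part of dsa-cgra-gen.

```scala
// Illustrative arithmetic only; names are hypothetical.
object SwitchConfig {
  // Smallest number of bits that can distinguish `choices` values.
  def selectBits(choices: Int): Int = {
    require(choices >= 1, "need at least one choice")
    var bits = 0
    while ((1 << bits) < choices) bits += 1
    bits
  }

  // Each output selects among all inputs plus ground,
  // and every output carries its own select field.
  def configBits(inputs: Int, outputs: Int): Int =
    selectBits(inputs + 1) * outputs
}
```

For the 4-input, 3-output switch above, `SwitchConfig.configBits(4, 3)` gives 9, matching the 9-bit config register.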
Basic Concept: Processing Element
    // Define Your Processing Elements with Add / Sub / Mul
    val my_fu = new ssnode("processing elements")
    my_fu(
      "instructions", Seq("Add", "Sub", "Mul"))(
      "data_width", 64)
• The number of operands is decided by the ALU, which affects the width of the config. file
• Hardware should be reused if some operations are subsets of a bigger operation (like ADD16x4 and ADD64)
[Figure: processing element with input MUXes, Instruction Scheduler, Instruction Buffer, Function Unit, and Register File]
Basic Concept: Decomposability
[Figure: an example DFG (16-bit inputs A, B, C, D feeding two Mul nodes and a Cat node, producing a 32-bit result) and its hardware support on Mul and Cat PEs. Red means data is valid. In one mapping the data is shifted by 16 bits; in the other the data is kept at its original place.]
Advanced Features: Decomposability in Switch
    // Define Decomposable Switch
    val my_switch = new ssnode("switch")
    my_switch(
      "data_width", 64)(
      "granularity", 16)(
      "subnet_offset", Seq(0, 1, 2))
Sub-net shifter:
    offset 0           0x1234__5678__9abc__de01
    subnet_offset = 1  0x5678__9abc__de01__1234
    subnet_offset = 2  0x9abc__de01__1234__5678
• To achieve decomposability, we add one subnet shifter for each output
• The default offset is zero for each output
[Figure: a 4-input, 3-output switch with a subnet shifter behind each output MUX]
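The subnet shifter example can be read as a rotation of the 64-bit word by `granularity * subnet_offset` bits. The sketch below assumes a left rotation, because that reproduces the values on the slide; the actual hardware behavior may differ, and `SubnetShifter` and `rotate` are hypothetical names.

```scala
// A sketch assuming the subnet shifter is a left rotation by whole granules.
object SubnetShifter {
  def rotate(data: Long, granularity: Int, offset: Int): Long = {
    require(granularity > 0 && 64 % granularity == 0, "granularity must divide 64")
    val shift = (granularity * offset) % 64
    java.lang.Long.rotateLeft(data, shift)
  }
}
```

With `granularity = 16`, rotating `0x123456789abcde01L` by offset 1 gives `0x56789abcde011234L`, and by offset 2 gives `0x9abcde0112345678L`.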
Advanced Features: Decomposability in PE
    // Define a decomposable processing element with Add
    val my_fu = new ssnode("processing element")
    my_fu(
      "instructions", Seq("Add"))(
      "data_width", 64)(
      // Here we decompose the processing element
      "granularity", 32)
• All you need to do is change the granularity from 64 to 32.
[Figure: the processing element at "granularity", 64 and at "granularity", 32]
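One way to picture what a smaller granularity means for an Add is lane-wise arithmetic: the 64-bit word is split into independent lanes, and carries do not cross lane boundaries. The sketch below is a software model under that assumption, not the generated hardware; `LaneAdd` and `add` are hypothetical names.

```scala
// Software model of a decomposed adder; assumes independent lanes with no carry between them.
object LaneAdd {
  def add(a: Long, b: Long, granularity: Int): Long = {
    require(granularity > 0 && 64 % granularity == 0, "granularity must divide 64")
    if (granularity == 64) a + b
    else {
      val mask = (1L << granularity) - 1
      (0 until 64 by granularity).foldLeft(0L) { (acc, lo) =>
        // Add the two lanes and drop any carry out of the lane.
        val sum = (((a >>> lo) & mask) + ((b >>> lo) & mask)) & mask
        acc | (sum << lo)
      }
    }
  }
}
```

For example, adding `0x00000001FFFFFFFFL` and `1L` at granularity 64 carries into the upper half (`0x0000000200000000L`), while at granularity 32 the low lane wraps to zero and the high lane is untouched (`0x0000000100000000L`).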
Advanced Features: Dynamic Flow Control
    // Define Decomposable Dynamic Switch
    val my_switch = new ssnode("switch")
    my_switch(
      "data_width", 64)(
      "granularity", 16)(
      "flow_control", true
    )
• To build a dynamic switch, we add a queue for every output.
• For every I/O port, two more wires (Valid/Ready) are added to implement backpressure.
[Figure: a 4-in, 3-out decomposable dynamic switch with Valid/Ready signals on each port]
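The valid/ready handshake around an output queue can be modeled in a few lines of software. This is a behavioral sketch of backpressure in general, not the RTL that dsa-cgra-gen emits; `ReadyValidQueue`, `BoundedQueue`, and its methods are hypothetical names.

```scala
import scala.collection.mutable

// Behavioral model of a bounded queue with valid/ready semantics (illustrative only).
object ReadyValidQueue {
  final class BoundedQueue[A](depth: Int) {
    require(depth > 0, "depth must be positive")
    private val buf = mutable.Queue.empty[A]

    // ready: the queue can accept data from upstream.
    def ready: Boolean = buf.size < depth
    // valid: the queue holds data for downstream.
    def valid: Boolean = buf.nonEmpty

    // A transfer only happens when ready is high; otherwise upstream must retry.
    def push(a: A): Boolean = if (ready) { buf.enqueue(a); true } else false
    def pop(): Option[A] = if (valid) Some(buf.dequeue()) else None
  }
}
```

When the queue is full, `push` returns false, which is the software analogue of deasserting Ready so that the upstream node stalls.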
Advanced Features: Data-Controlled
    // Define Your own Data-dependent Controlled
    // Processing Elements with GE / LE and Add
    val my_fu = new ssnode("processing elements")
    my_fu(
      "instructions", Seq("SelfCtrlGE", "SelfCtrlLE", "InputCtrlMin"))(
      "data_width", 64)
• A data-dependent controlled processing element is created based on its `instructions` field
• Self-Control (`SelfCtrl`) instructions use the PE's own computed result to access the CLT
• Input-Control (`InputCtrl`) instructions use the input from upstream to access the CLT
[Figure: data-dependent control processing element: ports SW, SE, NW, NE; Data-Flow-Ctrl; Stream-Join; FIFO 0, FIFO 1, FIFO 2; Func. Unit; ACC; CLT; control signals reuse, reset, discard]
Advanced Features: Mesh Topology
    // Keys that distinguish nodes
    identifier("row_idx", "col_idx")
    // Building a 5 x 5 array of processing elements
    val peArray = Array( ... )
[The body of peArray was lost in extraction. The surviving fragments show nested Array(...) rows built from pe_spc, pe_add, and another_pe.]
• By default a node is identified by its ID, but you can add more identifiers.
• For a mesh topology, the natural identifiers are the `row index` and the `column index`.
[Figure: mesh of PE nodes]
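The idea of identifying mesh nodes by row and column index can be illustrated independently of the DSL. The sketch below is a hypothetical model (`MeshSketch`, `Pe`, `build`, and `links` are not from dsa-cgra-gen): it tabulates a grid of nodes keyed by `(rowIdx, colIdx)` and lists the nearest-neighbor links of the mesh.

```scala
// Hypothetical model of a mesh keyed by row/column index; not the dsa-cgra-gen API.
object MeshSketch {
  final case class Pe(rowIdx: Int, colIdx: Int, kind: String)

  // Build a rows x cols grid; `kindAt` decides which kind of PE sits at each position.
  def build(rows: Int, cols: Int)(kindAt: (Int, Int) => String): Vector[Vector[Pe]] =
    Vector.tabulate(rows, cols)((r, c) => Pe(r, c, kindAt(r, c)))

  // Link every node to its right and lower neighbor (each mesh link listed once).
  def links(mesh: Vector[Vector[Pe]]): Seq[(Pe, Pe)] = {
    val rows = mesh.length
    val cols = if (rows == 0) 0 else mesh.head.length
    for {
      r <- 0 until rows
      c <- 0 until cols
      (dr, dc) <- Seq((0, 1), (1, 0))
      if r + dr < rows && c + dc < cols
    } yield (mesh(r)(c), mesh(r + dr)(c + dc))
  }
}
```

A usage example: `MeshSketch.build(5, 5)((r, c) => if (r == c) "spc" else "add")` yields 25 nodes, and `links` on that grid returns the 40 neighbor links of a 5 x 5 mesh. The "spc"/"add" placement here is arbitrary and does not reproduce the array on the slide.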
Advanced Features: Hardware/Software Tradeoff
• Compare the hardware tradeoff of different delay FIFO depths
    -> % ss_sched my_cgra_delay_depth_4.json
    // Input FIFO (depth = 4)
    ..
    FU:       4990.85 um^2   5.23123 mW
    Network:  1891.03 um^2   0.472758 mW
    Sync:     5238 um^2      3.6 mW
    Memory:   39325 um^2     12.14 mW

    -> % ss_sched my_cgra_delay_depth_16.json
    // Input FIFO (depth = 16)
    ..
    FU:       4990.85 um^2   5.23123 mW
    Network:  1891.03 um^2   0.472758 mW
    Sync:     8730 um^2      6 mW
    Memory:   39325 um^2     12.14 mW
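The two result sets differ only in the Sync row. A small comparison makes that explicit; the numbers are copied from the slide, while `DelayFifoTradeoff`, `Cost`, `changed`, and `ratio` are hypothetical helper names introduced for this sketch.

```scala
// Compares the two sample reports from the slide; helper names are hypothetical.
object DelayFifoTradeoff {
  final case class Cost(areaUm2: Double, powerMw: Double)

  val depth4: Map[String, Cost] = Map(
    "FU"      -> Cost(4990.85, 5.23123),
    "Network" -> Cost(1891.03, 0.472758),
    "Sync"    -> Cost(5238.0, 3.6),
    "Memory"  -> Cost(39325.0, 12.14)
  )

  val depth16: Map[String, Cost] = depth4.updated("Sync", Cost(8730.0, 6.0))

  // Components whose cost changes between the two configurations.
  def changed: Set[String] = depth4.keySet.filter(k => depth4(k) != depth16(k))

  // Depth-16 cost relative to depth-4 cost for one component.
  def ratio(component: String): Cost = {
    val (small, large) = (depth4(component), depth16(component))
    Cost(large.areaUm2 / small.areaUm2, large.powerMw / small.powerMw)
  }
}
```

In these sample numbers, going from depth 4 to depth 16 raises both Sync area and Sync power by a factor of about 1.67, while FU, Network, and Memory are unchanged.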
An End-to-End Approach
• Add DSAGEN Hardware Generator (dsa-cgra-gen) into ChipYard
    class SSCgraConfig extends Config(
      new chipyard.iobinders.WithUARTAdapter ++
      new chipyard.iobinders.WithTieOffInterrupts ++
      new chipyard.iobinders.WithBlackBoxSimMem ++
      new chipyard.iobinders.WithTiedOffDebug ++
      new chipyard.iobinders.WithSimSerial ++
      new testchipip.WithTSI ++
      new chipyard.config.WithBootROM ++
      new chipyard.config.WithUART ++
      new chipyard.config.WithL2TLBs(1024) ++
      new freechips.rocketchip.subsystem.WithNoMMIOPort ++
      new freechips.rocketchip.subsystem.WithNoSlavePort ++
      new freechips.rocketchip.subsystem.WithInclusiveCache ++
      new ss_cgra_gen.WithSsCgra ++ // mixed with Stream-Dataflow CGRA
      new freechips.rocketchip.subsystem.WithNExtTopInterrupts(0) ++
      new freechips.rocketchip.subsystem.WithNBigCores(1) ++
      new freechips.rocketchip.subsystem.WithCoherentBusTopology ++
      new freechips.rocketchip.system.BaseConfig
    )
Notice: ss-chipyard (not yet released in dsa-framework) has been integrated with dsa-cgra-gen
An End-to-End Approach
• Generate the binary from a C program with Stream-Dataflow instructions
    $ riscv64-unknown-elf-gcc -fno-common -fno-builtin-printf -specs=htif_nano.specs -o vec_add-ss.dsa-gnu.o -c vec_add-ss.cc
• Link the kernel via the chipyard riscv-toolchain
    $ riscv64-unknown-elf-gcc -static -specs=htif_nano.specs vec_add-ss.dsa-gnu.o -o vec_add-ss.cy-gnu.riscv
• Run the binary on the hardware simulator (verilator)
    $ make run-binary-debug SUB_PROJECT=ss_cgra_gen BINARY=<dir to binary>/vec_add-ss.cy-gnu.riscv
Q & A
Any questions?