AdapNoC: A Fast and Flexible FPGA-based NoC Simulator
26th International Conference on Field-Programmable Logic and Applications
Lausanne, Switzerland, 29th Aug. – 2nd Sep., 2016
Hadi Mardani Kamali and Shaahin Hessabi
Department of Computer Engineering, Sharif University of Technology, IR

Outline
• Motivation
• Approach
• Proposed Architecture
  ∞ Router µArchitecture
  ∞ Dual-clock TDMA-based Virtualization
  ∞ TGs/TRs migration to system-side
• Summary
  ∞ Configurable parameters
• Evaluation Results
• Conclusion

Motivation
• Increasing the number of cores
  ∞ Many-core systems
  ∞ Approximately 100 to 1000 cores
• Inefficient software simulators
  ∞ Low throughput
  ∞ Inability to simulate many-core systems
  ∞ Difficulty in implementing cycle accuracy
• Inflexible FPGA-based simulators
  ∞ Restricted configurable parameters
  ∞ Design and run complexity
Motivation Approach Router µArch. Dual-clock TGs/TRs Migr. Summary Eval. Result

Motivation (Cont.)
• Software-based simulators
  ∞ Easy to develop, modify, and run
  ∞ Integration capability with full-system simulators
  ∞ BookSim
    • Cycle accuracy
    • Integration with gem5
  ∞ SICOSYS
    • Integration with RSIM
    • Integration with GEMS
  ∞ Too slow for larger networks
  ∞ Inability to simulate and assess many-core systems
Motivation (Cont.)
• FPGA-based simulators
  ∞ High throughput
  ∞ Scalability (size has no effect on throughput)
  ∞ Ability to implement many-core systems
  ∞ Inflexibility
  ∞ Design and simulation complexity

| NoC Simulator | Topology | Routing | Traffic Pattern | Router µArch. | HW/SW Interface | SW-side Tool | Speedup (/BookSim) | Re-synth. Solution | Virtualization |
|---|---|---|---|---|---|---|---|---|---|
| Papamichael [2011] | 4×4 Mesh, Torus | DoR (XY) | Syn. | 4/8/12/16-port, 2/4/8-VC | RS232 | C on host PC | 28x | - | TDMA based |
| AcENoCs [2011] | 5×5 Mesh | DoR (XY) | Syn. + Trace | 5-port, 2-VC, 1-stage pipe | RS232 | on-chip soft-core | 14x-47x | - | - |
| DART [2014] | 7×7 Mesh, irregular | DoR (XY) | Syn. + Trace | up to 8-port, up to 4-VC, 5-stage pipe | PCIe | C on host PC | 100x | Yes | TDMA based |
| FOLCS [2015] | 16×16 Mesh (virt.) | DoR (XY) | Syn. | 5-port, 4-VC, 5-stage pipe | RS232 | C on host PC | 17x-22x | - | TDMA based |
| AdapNoC | ?!? | | | | | | | | |

AdapNoC Approach
• Simulating adaptive routing algorithms
  ∞ Centralized traffic aggregator
  ∞ Gathering traffic information from intermediate queues
  ∞ Implementing a sample adaptive routing algorithm (ATDOR)
• Dual-clock virtualization methodology
  ∞ Minimizing the time overhead of the serialized (non-parallel) process
  ∞ Implementing a sharable Time Division Multiplexing scheme
• Migration of traffic generators and receptors (TGs/TRs) to the software side
  ∞ Maximizing the feasible number of simultaneous nodes on the FPGA
  ∞ Handling trace-driven (dynamic) traffic

AdapNoC Overall Comparison
• FPGA-based simulators

| NoC Simulator | Topology | Routing | Traffic Pattern | Router µArch. | HW/SW Interface | SW-side Tool | Speedup (/BookSim) | Re-synth. Solution | Virtualization |
|---|---|---|---|---|---|---|---|---|---|
| Papamichael [2011] | 4×4 Mesh, Torus | DoR (XY) | Syn. | 4/8/12/16-port, 2/4/8-VC | RS232 | C on host PC | 28x | - | TDMA based |
| AcENoCs [2011] | 5×5 Mesh | DoR (XY) | Syn. + Trace | 5-port, 2-VC, 1-stage pipe | RS232 | on-chip soft-core | 14x-47x | - | - |
| DART [2014] | 7×7 Mesh, irregular | DoR (XY) | Syn. + Trace | up to 8-port, up to 4-VC, 5-stage pipe | PCIe | C on host PC | 100x | Yes | TDMA based |
| FOLCS [2015] | 16×16 Mesh (virt.) | DoR (XY) | Syn. | 5-port, 4-VC, 5-stage pipe | RS232 | C on host PC | 17x-22x | - | TDMA based |
| AdapNoC | 32×32 Mesh, Torus (virt.) | DoR + Adaptive | Syn. + Trace | 5-port, up to 4-VC, (3+)-stage pipe | PCIe | on-chip soft-core | 53x-180x | Yes | Dual-clock TDMA based |

AdapNoC Overall Architecture
• Hybrid architecture
  ∞ Hardware (logic side): modeling network components
    • Interconnection network components
    • Router microarchitecture: crossbar, routing algorithm, flit queues
  ∞ Software (system side): gathering statistical information
    • Dual-processor architecture: MB1, MB2
    • Host + trace
    • TGs + TRs

Router µ-Architecture
• Centralized vector-based architecture
  ∞ Aggregator
    • Monitors the traffic in the routers' buffers (downstream flit queues in each port)
    • Sample scaling
    • Aggregates the info of all ports
    • Sends it via a dedicated port
    • Stores it in the traffic load aggregator register bank
  ∞ Updater
    • Picks up the traffic load info of all related nodes
    • Calculates the routing for all source-destination pairs
    • Considers all congestion
    • Renews the routing table info
    • Sends it back to the routers

Router µ-Architecture
• Centralized vector-based architecture
  ∞ Dedicated vector-based traffic aggregation
    • Dedicated port alongside the N/E/W/S ports
  ∞ Source-based routing
    • Table-based adaptive routing algorithms
    • Complexity of adaptive deadlock-free routing algorithms
  ∞ Adaptive Toggle DOR (ATDOR)
    • Toggling between XY and YX
    • Related to traffic congestion

Router µ-Architecture
• Centralized vector-based architecture
  ∞ Adaptive Toggle DOR (ATDOR) structure
    • Source-destination overall toggling
    • Considers the current congestion info of both XY and YX
    • Based on global traffic congestion
  ∞ Toggle state machine
    • Current_Route == XY: if (CongsPathXY >= CongsPathYX) Next_Route = YX; if (CongsPathXY <= CongsPathYX) Next_Route = XY
    • Current_Route == YX: if (CongsPathXY >= CongsPathYX) Next_Route = YX; if (CongsPathXY <= CongsPathYX) Next_Route = XY
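The toggle rule on this slide can be sketched as a small function. This is a minimal sketch, not the authors' hardware: the function name and the integer congestion inputs are hypothetical, and because the slide's two conditions overlap when both paths are equally congested, the sketch assumes the current route is kept on a tie.

```python
def next_route(current_route: str, congestion_xy: int, congestion_yx: int) -> str:
    """Choose the dimension order for one source-destination pair.

    congestion_xy / congestion_yx are the congestion values of the XY and
    YX paths (how they are measured is outside this sketch).
    """
    if current_route not in ("XY", "YX"):
        raise ValueError("current_route must be 'XY' or 'YX'")
    if congestion_xy > congestion_yx:
        return "YX"  # XY path is more congested: route Y first
    if congestion_xy < congestion_yx:
        return "XY"  # YX path is more congested: route X first
    # Assumption: on a tie the slide's conditions overlap, so keep the route.
    return current_route
```

For example, `next_route("XY", 5, 2)` toggles to `"YX"`, while `next_route("YX", 3, 3)` stays on `"YX"` under the tie assumption.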

Router µ-Architecture
• Centralized vector-based architecture
  ∞ Traffic load info tables
    • For each node!
    • Four traffic load tables: E, W, S, N
    • Scaled number of waiting flits in queues
  ∞ Mesh example
[Figure: 4×4 mesh example with one traffic load table per port (N, S, E, W), each indexed by X and Y from 1 to 4, plus the source routing table. A toggle flag loads the current routing table and generates the new routing table.]
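The slides do not state how the per-port load tables are combined into a path congestion value, so the following is only one plausible reading: add up the load of the output port used at every hop of the path. The table layout `load[port][y][x]` (zero-based), the convention that E and N increase x and y, and all names are assumptions made for illustration.

```python
def make_tables(width: int, height: int) -> dict:
    """Four zero-filled load tables, one per port, indexed as load[port][y][x]."""
    return {p: [[0] * width for _ in range(height)] for p in "EWNS"}


def path_congestion(load: dict, src: tuple, dst: tuple, order: str) -> int:
    """Sum the output-port loads along a dimension-ordered path.

    order is "XY" (travel along X first) or "YX" (travel along Y first).
    """
    x, y = src
    dst_x, dst_y = dst
    total = 0
    for axis in order:
        if axis == "X":
            while x != dst_x:
                port = "E" if dst_x > x else "W"
                total += load[port][y][x]
                x += 1 if port == "E" else -1
        else:
            while y != dst_y:
                port = "N" if dst_y > y else "S"
                total += load[port][y][x]
                y += 1 if port == "N" else -1
    return total
```

With these two values in hand, an updater could compare the XY and YX results for each source-destination pair and rewrite the routing table entry accordingly.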

Router µ-Architecture
• Overall router µ-architecture
  ∞ Updatable routing table
  ∞ Five ports with a (3+)-stage pipeline
    • Switch setup
    • Switch traversal
    • Link traversal
  ∞ Up to 4 VCs per port
  ∞ Credit-based flow control
  ∞ Wormhole VC
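Credit-based flow control, as listed above, is commonly modeled with one credit counter per virtual channel on the upstream side. The sketch below shows that generic bookkeeping only; the class name, the default of 4 VCs, and the buffer depth are illustrative choices, not details of the AdapNoC router.

```python
class CreditCounter:
    """Upstream view of the free buffer slots in a downstream router port."""

    def __init__(self, num_vcs: int = 4, buffer_depth: int = 4):
        self.buffer_depth = buffer_depth
        # One credit per free downstream buffer slot, tracked per VC.
        self.credits = [buffer_depth] * num_vcs

    def can_send(self, vc: int) -> bool:
        return self.credits[vc] > 0

    def send_flit(self, vc: int) -> None:
        """Consume a credit when a flit leaves on this VC."""
        if not self.can_send(vc):
            raise RuntimeError("no credit left on this VC")
        self.credits[vc] -= 1

    def return_credit(self, vc: int) -> None:
        """Called when the downstream router frees a buffer slot."""
        if self.credits[vc] >= self.buffer_depth:
            raise RuntimeError("more credits returned than buffer slots")
        self.credits[vc] += 1
```

A sender checks `can_send(vc)` before each flit, so a full downstream buffer stalls the flit instead of dropping it.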

Dual-clock TDMA-based Virtualization
• Virtualization approach
  ∞ Limited FPGA resources
    • Restriction on the size of the simulated network
    • P&R complexity and difficulty in the implementation phase
    • Compulsory reduction in maximum frequency
    • Throughput degradation
  ∞ Virtualization methodology
    • Time Division Multiplexing (TDM) based structure
    • Clusters are exclusively implemented on the FPGA
    • Each cluster has a definite number of nodes
    • Serialized simulation
    • Each cluster is placed on the FPGA in a separate time-slot

Dual-clock TDMA-based Virtualization
• Cluster-based virtualization
  ∞ Up to 16 nodes in each cluster
  ∞ Two transmission categories
    • Intra-cluster: handled in each time-slot
    • Inter-cluster: handled by non-virtualized buffering (BRAM-based queues)
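To make the intra-cluster versus inter-cluster distinction concrete, the sketch below maps mesh coordinates to a cluster index and classifies a transfer. It assumes square 4×4 clusters (16 nodes) tiled over a 32×32 mesh; the slides only give the 16-node limit, so the tile shape, the numbering, and the function names are hypothetical.

```python
def cluster_of(x: int, y: int, cluster_width: int = 4,
               cluster_height: int = 4, mesh_width: int = 32) -> int:
    """Index of the cluster containing node (x, y), numbered row by row."""
    clusters_per_row = mesh_width // cluster_width
    return (y // cluster_height) * clusters_per_row + (x // cluster_width)


def is_intra_cluster(src: tuple, dst: tuple) -> bool:
    """True if both nodes fall in the same cluster (same time-slot)."""
    return cluster_of(*src) == cluster_of(*dst)
```

Under these assumptions a flit moving from (3, 0) to (4, 0) crosses a cluster boundary, so it would go through the non-virtualized inter-cluster queues.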

Dual-clock TDMA-based Virtualization
• Sharing capability of IDLEs in TDM
  ∞ Using the contents of the traffic load tables
  ∞ Detecting IDLE clusters from the downstream queues
  ∞ Sharing the time-slots of IDLE clusters
  ∞ IDLE cluster: no intra-cluster transmission
  ∞ High percentage of IDLEness
    • Low injection
    • Corner nodes
    • Less than 5% channel utilization in real-world multi-core applications [Hesse-NoCS-2012]
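One simple way to picture time-slot sharing is a round-robin schedule that hands the slots of IDLE clusters to the busy ones. The sketch below assumes a per-cluster idle flag is already available (for example derived from the traffic load tables) and that the idle set does not change during the window; the function name and the round-robin policy are illustrative assumptions.

```python
def tdm_schedule(cluster_is_idle: list, num_slots: int) -> list:
    """Return the cluster served in each of num_slots consecutive time-slots.

    IDLE clusters are skipped, so their slots go to the remaining clusters
    in round-robin order.
    """
    active = [i for i, idle in enumerate(cluster_is_idle) if not idle]
    if not active:
        return []  # nothing to simulate while every cluster is IDLE
    return [active[slot % len(active)] for slot in range(num_slots)]
```

With four clusters of which two are IDLE, `tdm_schedule([False, True, False, True], 4)` serves clusters 0 and 2 twice each instead of spending two slots on idle clusters.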

Dual-clock TDMA-based Virtualization
• Dual-clock context switching
  ∞ Time overhead in TDM
    • Wasted clock cycles in context switching
    • Especially for a high cluster-to-node ratio
    • Sharing time-slots in IDLE clusters
  ∞ Dual-clock virtualization structure
    • State-handler clock
    • System clock
    • State-handler freq. / sys. freq. = 2/1
    • Context switching via the state handler
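A rough way to read the 2/1 clock ratio: if the state handler needs a fixed number of its own cycles to swap a cluster's context, the stall seen from the system clock is about half that number. The sketch below only does this arithmetic. The function name, the example cycle counts, and the assumption that the whole switch runs in the state-handler clock domain are illustrative and not taken from the slides.

```python
def stall_in_system_cycles(handler_cycles: int, clock_ratio: int = 2) -> int:
    """System-clock cycles spent waiting for a context switch.

    handler_cycles: cycles the switch takes in the state-handler domain.
    clock_ratio: state-handler frequency divided by system frequency.
    """
    if handler_cycles < 0 or clock_ratio < 1:
        raise ValueError("handler_cycles must be >= 0 and clock_ratio >= 1")
    # Integer ceiling division: a partial system cycle still costs a full one.
    return (handler_cycles + clock_ratio - 1) // clock_ratio
```

Under these assumptions an 8-cycle switch costs 4 system cycles at the 2/1 ratio, compared with 8 when both domains share one clock.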

Dual-clock TDMA-based Virtualization
• Dual-clock context switching
  ∞ Dual-clock virtualization structure
  ∞ State-handler freq. / sys. freq. = 2/1

TGs/TRs Migration to Software Side
• Traffic Generators (TGs)
  ∞ Associated with each cluster
  ∞ Synthetic
    • Random
    • Bit-complement
    • Transpose
  ∞ Dynamic (trace-driven)
    • PARSEC
  ∞ Dedicated MB for packet injection + statistics
    • Generating synthetic traffic
    • Receiving and calculating statistical information
  ∞ Dedicated MB for trace-driven traffic
    • Receiving and decoding dynamic traffic
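The three synthetic patterns named above have standard textbook definitions, sketched here as destination functions on node IDs. The sketch assumes a power-of-two node count with IDs numbered as `y * width + x`; the function names are hypothetical and this is not the code that runs on the MicroBlaze.

```python
import random


def _address_bits(num_nodes: int) -> int:
    if num_nodes < 2 or num_nodes & (num_nodes - 1):
        raise ValueError("num_nodes must be a power of two")
    return num_nodes.bit_length() - 1


def bit_complement(src: int, num_nodes: int) -> int:
    """Destination is the bitwise complement of the source address."""
    _address_bits(num_nodes)
    return ~src & (num_nodes - 1)


def transpose(src: int, num_nodes: int) -> int:
    """Swap the upper and lower halves of the address: (x, y) -> (y, x)."""
    bits = _address_bits(num_nodes)
    if bits % 2:
        raise ValueError("transpose needs an even number of address bits")
    half = bits // 2
    low_mask = (1 << half) - 1
    return ((src & low_mask) << half) | (src >> half)


def uniform_random(src: int, num_nodes: int, rng: random.Random) -> int:
    """Destination drawn uniformly from all nodes (may equal src)."""
    return rng.randrange(num_nodes)
```

On an 8×8 mesh (64 nodes), node 1 at (x=1, y=0) sends to node 8 at (x=0, y=1) under transpose and to node 62 under bit-complement.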

TGs/TRs Migration to Software Side
• Flit (source) queues
  ∞ Associated with each port
  ∞ FIFO structure as a subset of the TG
  ∞ Dynamically allocated at run-time
  ∞ Equivalent to the flit size
• Traffic Receptors (TRs)
  ∞ Associated with each cluster
  ∞ Decoding received packet information
  ∞ Calculating statistical information
    • Packet latency
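The statistics side of a traffic receptor can be sketched as a running latency accumulator. This sketch assumes each received packet carries its injection cycle; the class and method names are hypothetical, and other statistics a real receptor might keep are left out.

```python
class TrafficReceptor:
    """Accumulates packet latency for the packets received by one cluster."""

    def __init__(self):
        self.packets = 0
        self.total_latency = 0

    def receive(self, injected_cycle: int, received_cycle: int) -> None:
        """Record one packet; latency is measured in simulation cycles."""
        if received_cycle < injected_cycle:
            raise ValueError("packet received before it was injected")
        self.packets += 1
        self.total_latency += received_cycle - injected_cycle

    def average_latency(self) -> float:
        """Mean packet latency, or 0.0 if nothing has been received yet."""
        return self.total_latency / self.packets if self.packets else 0.0
```

Keeping only a count and a sum means the receptor needs constant memory regardless of how many packets the simulation delivers.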

AdapNoC HW Side Summary
• A wide range of configurable parameters
  ∞ Topologies
    • Mesh
    • Torus
  ∞ Routing algorithms
    • Deterministic DOR (XY and YX)
    • Adaptive (ATDOR)
  ∞ VC/Switch
    • Up to 4 VCs per port
    • Variable-delay link traversal
  ∞ Layout
    • Virtualized
    • Pure

AdapNoC HW Side Summary
• A wide range of configurable parameters
  ∞ (3+)-stage pipelined router µ-architecture
  ∞ Number of nodes
    • Up to 1024 virtualized nodes
    • Up to 64 pure nodes
  ∞ Traffic
    • Dynamic: PARSEC
    • Synthetic: Random, Bit-complement, Transpose


Evaluation and Results
• HW implementation considerations
  ∞ Verilog HDL
  ∞ PLI (DPI)-based debug process
  ∞ MicroBlaze IP core
  ∞ Xilinx Virtex-6 ML605 evaluation board
    • XC6VLX240T
    • AXI-based interconnection
    • PIO-based PCIe interface
    • Using DDR3 RAM

Evaluation and Results
• Average latency
  ∞ 4.6% error
  ∞ Random attribute
• Packet details
  ∞ 20K warmup
  ∞ 45K measurement

| Case 1 Parameter | Value |
|---|---|
| Topology | 3×3 mesh |
| Link Latency | 1 cycle |
| Routing Algorithm | XY |
| Traffic Pattern | random |
| Number of VCs | 2 |
| Input VC buffer size | 4 |
| Packet Size | 2 |

[Figure: Average Latency (flit cycles) versus Injection Rate (flits/node/cycle) for AdapNoC, DART [22], and BookSim 2.0 [1].]

Evaluation and Results
• Adaptive evaluation

| Case 2 Parameter | Value |
|---|---|
| Topology | 8×8 mesh |
| Link Latency | 1 cycle |
| Routing Algorithm | ATDOR |
| Traffic Pattern | random |
| Number of VCs | 2 |
| Input VC buffer size | 4 |
| Packet Size | 16 |

[Figure: Average Latency (cycles) versus Injection Rate (flits/node/cycle). Legend as extracted: AdapNoC adaptive, XY O1Turn [33], AdapNoC deterministic XY, Centralized ATDOR [29].]

Evaluation and Results
• Resource utilization
  ∞ 8×8 mesh
  ∞ Non-virtualized
  ∞ Just 72% resource utilization

| Component | Slices | Registers | % of Utilization |
|---|---|---|---|
| Flit Queue | 5564 | 2472 | 13% |
| Router | 11456 | 4856 | 18% |
| Crossbar Switch + Arbiter | 4391 | 3857 | 9.5% |
| Traffic Aggregator and Updater | 3265 | 1650 | 7.5% |
| PCIe and AXI4 Core | 2126 | 205 | 2% |
| MicroBlaze controller | 1154 | 165 | 1% |

Evaluation and Results
• Different network sizes
  ∞ Mesh: 2×2, 3×3, 4×4, 5×5, 6×6, 7×7, 8×8
• Virtualization impact!
[Figure: Simulation Speed (thousands of cycles per second) versus Injection Rate, one series per mesh size from 2×2 to 8×8.]

Evaluation and Results
• Virtualization
  ∞ Sharable time-slot
  ∞ Dual-clock architecture
[Figure: Context-switching Simulation Cycles versus Injection Rate (flits/node/cycle) for AdapNoC and DART [22].]

Evaluation and Results
• Simulation speed
• Comparison with baselines
  ∞ DART
  ∞ AcENoCs
[Figure: Simulation Speed (thousands of cycles per second) versus Injection Rate for AdapNoC, AcENoCs [20], and DART [22].]

Summary
• Software-configurable FPGA-based NoC simulator
• Simulation of adaptive routing algorithms
  ∞ Centralized traffic aggregator
  ∞ Gathering traffic information from intermediate queues
  ∞ Implementing a sample adaptive routing algorithm (ATDOR)
• Dual-clock virtualization methodology
  ∞ Minimizing the time overhead of the serialized (non-parallel) process
  ∞ Implementing a sharable Time Division Multiplexing scheme
• TGs/TRs migration to software side
  ∞ Maximizing the feasible number of simultaneous nodes on the FPGA
  ∞ Handling trace-driven (dynamic) traffic
