Packet Switching on Raw Research Qualifying Exam Gleb A Chuvpilo January 28, 2005

Project Publications • High-Bandwidth Packet Switching on the Raw General-Purpose Architecture, Gleb A. Chuvpilo and Saman Amarasinghe, In Proceedings of the International Conference on Parallel Processing (ICPP-03), Kaohsiung, Taiwan, Republic of China, October 6-9, 2003. • High-Bandwidth Packet Switching on the Raw General-Purpose Architecture, Gleb A. Chuvpilo, S.M. Thesis, Massachusetts Institute of Technology, Cambridge, Massachusetts, August 2002. • Raw.Net: Network Processing on the Raw Processor, David Wentzlaff, Gleb A. Chuvpilo, Arvind Saraf, Saman Amarasinghe, and Anant Agarwal, In Research Abstracts of the MIT Laboratory for Computer Science, Cambridge, Massachusetts, March 2002. • Gigabit IP Routing on Raw, Gleb A. Chuvpilo, David Wentzlaff, and Saman Amarasinghe, In Proceedings of the 1st HPCA Workshop on Network Processors, Cambridge, Massachusetts, February 3, 2002. • Also, unpublished work on Network Calculus at the Computer Engineering and Networks Laboratory of the ETH Swiss Federal Institute of Technology

Outline • Introduction – Raw Processor Overview – Internet Router Overview • Packet Switching on Raw – Raw Router Architecture – Rotating Crossbar Design for Switch Fabric – Distributed Scheduling Algorithm – Minimization and Scheduling • Results • Conclusion

Introduction

Goal • Build an IP router on a general-purpose processor • Why? – Flexibility: new protocols and services – Price: economies of scale

Raw

Raw Processor • A scalable computation fabric – 4 x 4 mesh of tiles, each tile is a RISC microprocessor • Ultra fast interconnect network – Exposes the wires to the compiler – Compiler orchestrates the communication

Raw Facts • Performance – 16 OPS/FLOPS per cycle – 230 Gb/s of on-chip “bisection bandwidth” – 201 Gb/s off-chip I/O bandwidth – 57 GB/s of on-chip memory bandwidth

Raw Facts • Layout – Longest wire is the length of a tile, which allows fast clocking – Each tile: MIPS R4000 + router + interconnect; 32 KB IMEM; 32 KB data cache; 64 KB SMEM – 2 MB total per chip

Raw Facts • Instruction Set Architecture – Eight-stage pipeline: FETCH, DECODE, RF/STALL, EXE, MUL, MEM, FPU – MIPS instruction set – 28 general-purpose registers – 4 register-mapped network ports – 2-way set-associative cache, 3-cycle latency, 32-byte lines

Raw Facts • Implementation – ASIC @ 250 MHz worst case – 122 million transistors (P4: 43 million) – 18.2 mm x 18.2 mm die (P4: 15 mm x 15 mm) – 1080 signal I/O pins – 25 Watts – IBM SA-27E 6-layer copper metal 0.15μ process (P4: 0.13μ)

Raw Layout

Communication Mechanisms • 2 static networks • 2 dynamic networks

Static Networks • Destinations known at compile time • Message size known at compile time • Cycle-by-cycle switch schedule • Three-cycle nearest-neighbor send-to-use latency • No processing overhead
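As a rough illustration of what a cycle-by-cycle switch schedule means, the sketch below models a single switch whose routes are fixed before execution. The port names, the `SCHEDULE` table, and `run_switch` are hypothetical and do not reproduce Raw's switch instruction set; the only point is that nothing is decided from message headers at run time.

```python
# Hypothetical compile-time schedule for one switch: the entry for each cycle
# lists the (source_port, destination_port) routes performed in that cycle.
SCHEDULE = [
    [("west", "proc")],                       # cycle 0
    [("proc", "east")],                       # cycle 1
    [("west", "east"), ("north", "proc")],    # cycle 2: two routes at once
]


def run_switch(schedule, inputs):
    """Apply a fixed schedule to per-port input queues.

    inputs maps a port name to a list of words waiting on that port.
    Returns a dict mapping each destination port to the words delivered.
    """
    queues = {port: list(words) for port, words in inputs.items()}
    outputs = {}
    for routes in schedule:
        for src, dst in routes:
            word = queues[src].pop(0)          # the word must already be there
            outputs.setdefault(dst, []).append(word)
    return outputs


delivered = run_switch(SCHEDULE, {"west": [1, 2], "proc": [10], "north": [7]})
print(delivered)
```

Because the schedule is a plain table, a missing input word is a scheduling error rather than something the switch can recover from, which is why destinations and message sizes have to be known at compile time.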

Static Network: Send

Static Network: Receive

Dynamic Networks • Unpredictable events – External asynchronous interrupts – Cache misses • 15- to 30-cycle nearest-neighbor send-to-use latency (message header processing overhead) • Wormhole routed, two-stage pipelined, dimension-ordered
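Dimension-ordered routing can be sketched in a few lines. The example below assumes the X dimension is resolved before Y and uses (x, y) tile coordinates on the 4 x 4 mesh; both are assumptions for illustration, since the slides do not state the order or the coordinate scheme.

```python
def dimension_ordered_route(src, dst):
    """Return the tiles visited from src to dst, moving along X first, then Y.

    src and dst are (x, y) tuples. X-before-Y is an assumed ordering.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                  # finish the X dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                  # then finish the Y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


print(dimension_ordered_route((0, 0), (2, 1)))
```

Every message between the same pair of tiles takes the same path, so the hop count is simply the Manhattan distance between the two tiles.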

Routing

What is Routing? RM OSI…

IP Router [Figure: Network Processor, Forwarding Engines, Interfaces, Switch Fabric]

Switch Fabric

Click Modular Router • Modular software router • MIT Parallel and Distributed OS Group • 435,000 64-byte packets per second on a 700 MHz Pentium III (commodity hardware) • Flexible, configurable, and easy to understand • Interconnected collection of modules called elements
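For scale, the Click packet rate can be converted to a bit rate. This is plain arithmetic on the two numbers quoted on the slide; the variable names are illustrative.

```python
PACKETS_PER_SECOND = 435_000   # Click forwarding rate quoted on the slide
PACKET_BYTES = 64              # packet size quoted on the slide

bits_per_second = PACKETS_PER_SECOND * PACKET_BYTES * 8
print(f"{bits_per_second / 1e6:.2f} Mb/s")
```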

Click Modular Router

Packet Switching on Raw

Problem: Four Networks…

… and Sixteen Tiles:

What is the Mapping? Dynamic Communication to Static Interconnect

Solution: Rotating Crossbar [Figure: PORT 0 to PORT 3 arranged around the ROTATING CROSSBAR; labels: Ingress Processor (In), Lookup Processor, Crossbar Processor, Egress Processor (Out)]

Switch Fabric Design • The idea of a Token Ring network gives absolute fairness • Algorithm uses two static networks; dynamic networks are idle • All deadlock-free configurations are scheduled at compile time • Four headers and token location define a global configuration • Global configuration is computed in a distributed manner at run time
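The token idea can be illustrated with a deliberately simplified arbitration sketch. It only resolves conflicts over output ports, giving priority in ring order starting from the token holder; it ignores contention for the links between crossbar processors, which the real design must also handle. The function names and the header encoding (`None` for "no packet") are assumptions for illustration, not the author's implementation.

```python
def choose_config(headers, token):
    """Pick which inputs may send, given four headers and the token position.

    headers[i] is the output port requested by input i, or None if input i
    has no packet. The token holder is served first, then the others in
    ring order. Returns a dict mapping granted input -> output port.
    """
    n = len(headers)
    granted = {}
    taken = set()
    for k in range(n):
        port = (token + k) % n          # walk the ring from the token holder
        out = headers[port]
        if out is not None and out not in taken:
            granted[port] = out
            taken.add(out)
    return granted


def next_token(token, n=4):
    """Rotate the token to the next port after each routing round."""
    return (token + 1) % n


headers = [2, 2, None, 0]               # inputs 0 and 1 both want output 2
print(choose_config(headers, token=0))
print(choose_config(headers, token=next_token(0)))
```

Because the token advances every round, two inputs that repeatedly contend for the same output take turns winning, which is the fairness property the slide attributes to the Token Ring idea.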

Rotating Crossbar Illustrated [Figure: PORT 0 to PORT 3 arranged around the ROTATING CROSSBAR; labels: Ingress Processor, Lookup Processor, Crossbar Processor, Egress Processor]

Phases of the Algorithm [Figure: TILE PROCESSOR and SWITCH PROCESSOR timelines; phases: headers_request, headers, send_prev_config, choose_new_config, route_body, update_token, confirm]

Distributed Scheduling Algorithm • Let’s enumerate the configurations: SPACE = |Hdr0| x … x |Hdr3| x |Token|, where |Hdr0| = … = |Hdr3| = 5 and |Token| = 4; therefore SPACE = 5^4 x 4 = 2,500 distinct configurations
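The count can be checked by brute-force enumeration. The sketch below assumes the five header values are the four output ports plus an "empty" marker; the slide gives only the cardinality 5, so that interpretation is an assumption.

```python
from itertools import product

HEADER_VALUES = [0, 1, 2, 3, None]   # assumed: four output ports or no packet
TOKEN_POSITIONS = range(4)           # the token sits at one of the four ports

# A global configuration is (hdr0, hdr1, hdr2, hdr3, token).
configurations = [
    headers + (token,)
    for headers in product(HEADER_VALUES, repeat=4)
    for token in TOKEN_POSITIONS
]

print(len(configurations))
```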

So What? • Each tile has 8,192 words of instruction memory, and so does each switch • 8,192 / 2,500 = 3.3 instructions per configuration: not enough! • Would need to use off-chip memory: slow! • Need to minimize SPACE

Minimization [Figure: a Crossbar Processor at PORT 0 with its Ingress Processor, Egress Processor, and neighboring Crossbar Processors; connections labeled in, out, cwnext, ccwprev, ccwnext]

Clients and Servers of a Crossbar Processor • Servers: out, cwnext, ccwnext • Clients: in, cwprev, ccwprev

Minimization and Scheduling • We cut down the number of configurations by 78 times: there are now only 32 entries, so the program fits in the local instruction memory • Code is generated by an automatic compile-time scheduler • In addition, the assembly code of the crossbar's switch processors is software-pipelined and loop-unrolled to avoid deadlock
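The instruction-memory budget before and after minimization follows directly from the numbers on the slides. The variable names below are illustrative only.

```python
IMEM_WORDS = 8192            # instruction words per tile (and per switch)
FULL_SPACE = 5**4 * 4        # 2,500 configurations before minimization
MINIMIZED_SPACE = 32         # entries after minimization

reduction = FULL_SPACE / MINIMIZED_SPACE            # about 78x
words_per_config_before = IMEM_WORDS / FULL_SPACE   # about 3.3
words_per_config_after = IMEM_WORDS // MINIMIZED_SPACE

print(f"reduction: {reduction:.1f}x")
print(f"before: {words_per_config_before:.1f} words per configuration")
print(f"after:  {words_per_config_after} words per configuration")
```

With 256 instruction words available per entry instead of about 3, each configuration has room for a real routing routine, which is what lets the whole schedule stay on chip.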

Scheduler Output
/* AUTOGENERATED SCHEDULE FOR PORT 0 */
/* Tile Processor */
/* … */
conf_1_0303:
    mtsri SW_PC, %lo(sw_conf_1000)
    j conf_done
conf_1_0304:
    mtsri SW_PC, %lo(sw_conf_1000)
    j conf_done
conf_1_0310:
    mtsri SW_PC, %lo(sw_conf_2001)
    j conf_done
conf_1_0311:
    mtsri SW_PC, %lo(sw_conf_1210)
    j conf_done
/* … */
/* HAND-CODED SCHEDULE FOR PORT 0 */
/* Switch Processor */
/* … */
/* in->out, prev->next, dist=1 */
sw_conf_1210:
    nop route $IN->$OUT, $PREV->$NEXT
    nop route $IN->$OUT, $PREV->$NEXT
    route $IN->$OUT, $PREV->$NEXT

Results

Implementation • Raw Router was tested in a cycle-accurate simulator of the Raw processor • Raw prototype clock speed is assumed to be 250 MHz • The focus of research is on switch fabric, NOT on route lookup, etc. • Over 75,000 lines of assembly code, many of them hand-coded

Raw Router Results • Features – 4-port edge router – 3.3 Mpps – 26.9 Gbps – Uses Raw static networks to stream data
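A quick sanity check relates the two headline figures. The sketch assumes both numbers describe the same workload and that the 250 MHz clock from the implementation slide applies; neither is stated on this slide, so the derived values are only indicative.

```python
PACKET_RATE = 3.3e6      # packets per second (3.3 Mpps)
BIT_RATE = 26.9e9        # bits per second (26.9 Gbps)
CLOCK_HZ = 250e6         # assumed Raw clock speed

# Implied average packet size if both figures come from the same run.
bytes_per_packet = BIT_RATE / PACKET_RATE / 8

# Aggregate cycle budget per packet across the whole 4-port router.
cycles_per_packet = CLOCK_HZ / PACKET_RATE

print(f"{bytes_per_packet:.0f} bytes per packet")
print(f"{cycles_per_packet:.1f} cycles per packet")
```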

Conclusion

Conclusion • Implemented a gigabit switch on Raw • Mapped dynamic communication to static interconnect • Can intermix switch fabric with computation • High-bandwidth I/O allows performance comparable to that of custom ASIC processors

Future Work + Critique • Take advantage of dynamic networks • Implement IP route lookup • Add computation on data (encryption) • Add support of multicast traffic • Implement Quality of Service • Add virtual output queueing • Explore larger router configurations

End of the “official” part!

Current Research • Probabilistic Robotics with Prof. John Leonard • Robust Feature-Relative Navigation for Autonomous Underwater Vehicles

Robotic Kayaks

Questions?