Packet Switching on Raw
Research Qualifying Exam
Gleb A. Chuvpilo
January 28, 2005
Project Publications
• High-Bandwidth Packet Switching on the Raw General-Purpose Architecture, Gleb A. Chuvpilo and Saman Amarasinghe, in Proceedings of the International Conference on Parallel Processing (ICPP-03), Kaohsiung, Taiwan, Republic of China, October 6-9, 2003.
• High-Bandwidth Packet Switching on the Raw General-Purpose Architecture, Gleb A. Chuvpilo, S.M. Thesis, Massachusetts Institute of Technology, Cambridge, Massachusetts, August 2002.
• Raw.Net: Network Processing on the Raw Processor, David Wentzlaff, Gleb A. Chuvpilo, Arvind Saraf, Saman Amarasinghe, and Anant Agarwal, in Research Abstracts of the MIT Laboratory for Computer Science, Cambridge, Massachusetts, March 2002.
• Gigabit IP Routing on Raw, Gleb A. Chuvpilo, David Wentzlaff, and Saman Amarasinghe, in Proceedings of the 1st HPCA Workshop on Network Processors, Cambridge, Massachusetts, February 3, 2002.
• Also, unpublished work on Network Calculus at the Computer Engineering and Networks Laboratory of the ETH Swiss Federal Institute of Technology.
Outline
• Introduction
  – Raw Processor Overview
  – Internet Router Overview
• Packet Switching on Raw
  – Raw Router Architecture
  – Rotating Crossbar Design for Switch Fabric
  – Distributed Scheduling Algorithm
  – Minimization and Scheduling
• Results
• Conclusion
Introduction
Goal
• Build an IP router on a general-purpose processor
• Why?
  – Flexibility: new protocols and services
  – Price: economies of scale
Raw
Raw Processor
• A scalable computation fabric
  – 4 x 4 mesh of tiles; each tile is a RISC microprocessor
• Ultra-fast interconnect network
  – Exposes the wires to the compiler
  – Compiler orchestrates the communication
Raw Facts
• Performance
  – 16 OPS/FLOPS per cycle
  – 230 Gb/s of on-chip “bisection bandwidth”
  – 201 Gb/s off-chip I/O bandwidth
  – 57 GB/s of on-chip memory bandwidth
Raw Facts
• Layout
  – Longest wire is the length of a tile, which allows fast clocking
  – Each tile:
    • MIPS R4000 + router + interconnect
    • 32 KB IMEM
    • 32 KB data cache
    • 64 KB SMEM
  – 2 MB total per chip
Raw Facts
• Instruction Set Architecture
  – Eight-stage pipeline: FETCH, DECODE, RF/STALL, EXE, MUL, MEM, FPU
  – MIPS instruction set
  – 28 general-purpose registers
  – 4 register-mapped network ports
  – 2-way set-associative cache, 3-cycle latency, 32-byte lines
Raw Facts
• Implementation
  – ASIC at 250 MHz (worst case)
  – 122 million transistors (P4: 43 million)
  – 18.2 mm x 18.2 mm die (P4: 15 mm x 15 mm)
  – 1080 signal I/O pins
  – 25 Watts
  – IBM SA-27E, 6-layer metal, copper, 0.15μ process (P4: 0.13μ)
Raw Layout
Communication Mechanisms
• 2 static networks
• 2 dynamic networks
Static Networks
• Destinations known at compile time
• Message size known at compile time
• Cycle-by-cycle switch schedule
• Three-cycle nearest-neighbor send-to-use latency
• No processing overhead
Static Network: Send
Static Network: Receive
Dynamic Networks
• Unpredictable events
  – External asynchronous interrupts
  – Cache misses
• 15- to 30-cycle nearest-neighbor send-to-use latency (message header processing overhead)
• Wormhole routed, two-stage pipelined, dimension-ordered
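To make "dimension-ordered" concrete, here is a minimal sketch of dimension-ordered routing on a mesh of tiles. The (x, y) tile coordinates, the X-before-Y order, and the function name are assumptions made for illustration; the slides do not say which dimension Raw resolves first.

```python
def dimension_ordered_route(src, dst):
    """Return the tiles visited from src to dst, fully resolving X before Y.

    Tiles are (x, y) coordinates on a mesh. The X-first order is an
    illustrative assumption, not a statement about the Raw hardware.
    """
    x, y = src
    path = [(x, y)]
    # Travel along the X dimension until the column matches.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Then travel along the Y dimension until the row matches.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


route = dimension_ordered_route((0, 0), (3, 2))
print(route)           # tiles visited, source and destination included
print(len(route) - 1)  # number of hops
```

Because every message finishes one dimension before starting the other, the path between two tiles is fixed by their coordinates alone.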
Routing
What is Routing? RM OSI…
IP Router
[Figure: block diagram with the labels Network Processor, Forwarding Engine, Interface, and Switch Fabric]
Switch Fabric
Click Modular Router
• Modular software router
• MIT Parallel and Distributed OS Group
• 435,000 64-byte packets per second on a 700 MHz Pentium III (commodity hardware)
• Flexible, configurable, and easy to understand
• Interconnected collection of modules called elements
Click Modular Router
Packet Switching on Raw
Problem: Four Networks…
[Figure: four networks, labeled 1 to 4]
… and Sixteen Tiles:
What is the Mapping?
Dynamic Communication → Static Interconnect
Solution: Rotating Crossbar
[Figure: four ports (PORT 0 to PORT 3) arranged around the rotating crossbar; each port has an Ingress Processor (In), a Lookup Processor, a Crossbar Processor, and an Egress Processor (Out)]
Switch Fabric Design
• Borrows the idea of a Token Ring network: absolute fairness
• Algorithm uses the two static networks; the dynamic networks are idle
• All deadlock-free configurations are scheduled at compile time
• Four headers and the token location define a global configuration
• Global configuration is computed in a distributed manner at run time
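The token idea can be sketched as rotating-priority arbitration: the port holding the token chooses first, the others follow in ring order, and the token then moves on. This is a simplified illustration, not the Raw router's scheduler. It assumes each input requests at most one output per round, and it ignores contention for the ring links between crossbar processors. The names `arbitrate` and `next_token` are hypothetical.

```python
NUM_PORTS = 4


def arbitrate(requests, token):
    """Grant outputs to inputs in ring order, starting at the token holder.

    requests[i] is the output port wanted by input i, or None if idle.
    Returns a dict {input: output} of the transfers granted this round.
    Simplification: only output conflicts are modeled, not ring links.
    """
    grants = {}
    taken = set()
    for offset in range(NUM_PORTS):
        port = (token + offset) % NUM_PORTS
        want = requests[port]
        if want is not None and want not in taken:
            grants[port] = want
            taken.add(want)
    return grants


def next_token(token):
    """Pass the token to the next port on the ring."""
    return (token + 1) % NUM_PORTS


# Inputs 0 and 1 both want output 2; input 2 wants output 0; input 3 is idle.
requests = [2, 2, 0, None]
token = 0
for _ in range(NUM_PORTS):
    print(token, arbitrate(requests, token))
    token = next_token(token)
```

Over one full rotation every port holds the top priority exactly once, which is the sense in which a token scheme bounds how long any input can be passed over.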
Rotating Crossbar Illustrated
[Figure: the rotating crossbar with PORT 0 to PORT 3, each with its Ingress, Lookup, Crossbar, and Egress Processors]
Phases of the Algorithm
[Figure: timeline of the tile processor and the switch processor, with the phases headers_request, headers, send_prev_config, choose_new_config, route_body, update_token, confirm]
Distributed Scheduling Algorithm
• Let’s count the configurations:
  SPACE = |Hdr0| × … × |Hdr3| × |Token|,
  where |Hdr0| = … = |Hdr3| = 5 and |Token| = 4,
  therefore SPACE = 5^4 × 4 = 2,500 distinct configurations
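The count can be checked by brute-force enumeration. In this sketch the five header values are assumed to mean "no packet" or one of four output ports; the slide only states that each header takes 5 values, so the labels are illustrative.

```python
from itertools import product

# Assumed meaning of the five header values: idle, or one of four outputs.
HEADER_VALUES = ("empty", "out0", "out1", "out2", "out3")
TOKEN_POSITIONS = range(4)

# One global configuration = four headers plus the token location.
header_combinations = list(product(HEADER_VALUES, repeat=4))
space = [(hdrs, tok) for hdrs in header_combinations for tok in TOKEN_POSITIONS]

print(len(header_combinations))  # 5^4 = 625
print(len(space))                # 625 * 4 = 2500
```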
So What?...
• Each tile has 8,192 words of instruction memory; the same holds for the switch
• 8,192 / 2,500 ≈ 3.3 instructions per configuration: not enough!
• Would need to use off-chip memory: slow!
• Need to minimize SPACE
Minimization
[Figure: PORT 0 with its Ingress, Egress, and Crossbar Processors and a neighboring Crossbar Processor; links labeled in, out, cwnext, ccwprev, ccwnext]
Clients and Servers of a Crossbar Processor
• Servers: out, cwnext, ccwnext
• Clients: in, cwprev, ccwprev
Minimization and Scheduling
• We cut the number of configurations by a factor of 78! Now there are only 32 entries, so the program fits in the local instruction memory
• Code is generated by an automatic compile-time scheduler
• In addition, the assembly code of the crossbar switch processors is software-pipelined and loop-unrolled to avoid deadlock
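The arithmetic behind the minimization can be reproduced in a few lines. The figures (8,192 instruction words, 2,500 configurations, 32 entries) come from the slides; the per-entry budget after minimization is a derived number, assuming the whole instruction memory is available for the configuration table.

```python
IMEM_WORDS = 8192      # instruction words per tile (same for the switch)
FULL_SPACE = 2500      # configurations before minimization
MINIMIZED_SPACE = 32   # entries after minimization


def words_per_configuration(imem_words, num_configs):
    """Instruction words available per configuration if memory is split evenly."""
    return imem_words / num_configs


print(round(words_per_configuration(IMEM_WORDS, FULL_SPACE), 1))  # 3.3
print(words_per_configuration(IMEM_WORDS, MINIMIZED_SPACE))       # 256.0
print(round(FULL_SPACE / MINIMIZED_SPACE))                        # 78
```

Going from about 3 words to 256 words per entry is what lets each configuration's code live in local instruction memory.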
Scheduler Output

/* AUTOGENERATED SCHEDULE FOR PORT 0 */
/* Tile Processor */
/* …*/
conf_1_0303:
    mtsri SW_PC, %lo(sw_conf_1000)
    j conf_done
conf_1_0304:
    mtsri SW_PC, %lo(sw_conf_1000)
    j conf_done
conf_1_0310:
    mtsri SW_PC, %lo(sw_conf_2001)
    j conf_done
conf_1_0311:
    mtsri SW_PC, %lo(sw_conf_1210)
    j conf_done
/* …*/

/* HAND-CODED SCHEDULE FOR PORT 0 */
/* Switch Processor */
/* …*/
/* in->out, prev->next, dist=1 */
sw_conf_1210:
    nop route $IN->$OUT, $PREV->$NEXT
    nop route $IN->$OUT, $PREV->$NEXT
    route $IN->$OUT, $PREV-
Results
Implementation
• Raw Router was tested in a cycle-accurate simulator of the Raw processor
• Raw prototype clock speed is assumed to be 250 MHz
• The focus of the research is on the switch fabric, NOT on route lookup, etc.
• Over 75,000 lines of assembly code, many of them hand-coded
Raw Router Results
• Features
  – 4-port edge router
  – 3.3 Mpps
  – 26.9 Gbps
  – Uses Raw static networks to stream data
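Packet rate and bit rate are linked by packet size, which makes the headline numbers easy to sanity-check. The sketch below assumes both Raw figures describe the same workload, which the slide does not state, and it counts payload bits only. The function names are hypothetical.

```python
def throughput_gbps(packets_per_second, packet_bytes):
    """Bit rate in Gb/s for a given packet rate and packet size."""
    return packets_per_second * packet_bytes * 8 / 1e9


def implied_packet_bytes(gbps, mpps):
    """Packet size in bytes that reconciles a bit rate with a packet rate."""
    return gbps * 1e9 / (mpps * 1e6) / 8


# Raw router: 26.9 Gbps at 3.3 Mpps implies packets of roughly 1 KB.
print(round(implied_packet_bytes(26.9, 3.3)))  # 1019

# Click on a 700 MHz Pentium III: 435,000 64-byte packets per second.
print(round(throughput_gbps(435_000, 64), 3))  # 0.223
```

The Click figure is for minimum-size packets, so the two bit rates are not a like-for-like comparison.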
Conclusion
Conclusion
• Implemented a gigabit switch on Raw
• Mapped dynamic communication to a static interconnect
• Can intermix switch fabric with computation
• High-bandwidth I/O allows performance comparable to custom ASIC processors
Future Work + Critique
• Take advantage of dynamic networks
• Implement IP route lookup
• Add computation on data (encryption)
• Add support for multicast traffic
• Implement Quality of Service
• Add virtual output queueing
• Explore larger router configurations
End of the “official” part!
Current Research
• Probabilistic Robotics with Prof. John Leonard
• Robust Feature-Relative Navigation for Autonomous Underwater Vehicles
Robotic Kayaks
Questions?