Introduction
  • Linus Svensson
      ◦ D5, linus@sm.luth.se
  • Åke Östmark
      ◦ D5, ake@sm.luth.se

Why We Are Here
  • The architecture of a Network Processor Unit (NPU)
  • Master's thesis - a joint operation between Luleå University of Technology and SwitchCore AB

Today's Topics
  • Background
      ◦ Ethernet and internetworks
      ◦ Switches and routers
  • NPU (Network Processor Unit)
      ◦ Why an NPU?
      ◦ Pros and cons of NPUs
  • The architecture of our NPU
      ◦ Design difficulties and design choices
      ◦ The architecture, strengths and weaknesses
  • The big picture
      ◦ From idea to silicon

Ethernet
  • Most widespread network technology used in LANs (Local Area Networks)
      ◦ 10 Mb/s (Ethernet)
      ◦ 100 Mb/s (Fast Ethernet)
      ◦ 1000 Mb/s (Gigabit Ethernet)
  • Packet-switched network
      ◦ Host-to-host delivery on the same network
      ◦ Switches forward packets from one section to another using the datagram paradigm

Ethernet
  • Datagram paradigm
      ◦ A packet contains enough information for a switch to forward it correctly
      ◦ I.e. the packet contains the complete destination address
  • Ethernet packets = frames
      ◦ In Ethernet the packets are referred to as frames
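
A minimal sketch of datagram forwarding in a switch, assuming a plain dictionary as the address table. The function name `forward`, the table layout, and the flooding of frames with an unknown destination are illustrative assumptions and are not taken from the slides.

```python
from typing import Dict, List


def forward(table: Dict[str, int], dst_mac: str, in_port: int, num_ports: int) -> List[int]:
    """Return the output port(s) for a frame, using only its destination address."""
    out_port = table.get(dst_mac)
    if out_port is None:
        # Assumption: an unknown destination is flooded to every port
        # except the one the frame arrived on.
        return [p for p in range(num_ports) if p != in_port]
    return [out_port]


# Hypothetical table: one known host on port 3.
table = {"00:11:22:33:44:55": 3}
print(forward(table, "00:11:22:33:44:55", in_port=0, num_ports=4))  # [3]
print(forward(table, "66:77:88:99:aa:bb", in_port=0, num_ports=4))  # [1, 2, 3]
```

The point of the datagram paradigm is visible in the signature: the decision needs nothing beyond the destination address carried in the frame itself.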

Ethernet Frame Format
  • Preamble
      ◦ 64 bits used for synchronisation
  • Header
      ◦ 48-bit globally unique destination address
      ◦ 48-bit globally unique source address
      ◦ 16-bit type field used for classification

Ethernet Frame Format
  • Body
      ◦ 46-1500 bytes of data
  • CRC
      ◦ 32-bit CRC (Cyclic Redundancy Check) for error detection
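
The frame layout above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a driver: the preamble is left out because hardware normally strips it, short bodies are zero-padded to 46 bytes, and the CRC is computed with `zlib.crc32` and stored little-endian, which is an assumption of this sketch. The names `build_frame` and `parse_frame` are hypothetical.

```python
import struct
import zlib

# 48-bit destination, 48-bit source, 16-bit type, in network byte order.
HEADER = struct.Struct("!6s6sH")


def build_frame(dst: bytes, src: bytes, eth_type: int, body: bytes) -> bytes:
    """Assemble header + body + 32-bit CRC (preamble not included)."""
    if len(body) > 1500:
        raise ValueError("body exceeds 1500 bytes")
    if len(body) < 46:
        body = body + bytes(46 - len(body))  # assumption: zero padding
    data = HEADER.pack(dst, src, eth_type) + body
    return data + struct.pack("<I", zlib.crc32(data))


def parse_frame(frame: bytes):
    """Split a frame into its fields and check the trailing CRC."""
    dst, src, eth_type = HEADER.unpack_from(frame, 0)
    body = frame[HEADER.size:-4]
    (crc,) = struct.unpack("<I", frame[-4:])
    crc_ok = zlib.crc32(frame[:-4]) == crc
    return dst, src, eth_type, body, crc_ok


frame = build_frame(b"\xff" * 6, b"\x00\x11\x22\x33\x44\x55", 0x0800, b"hello")
print(len(frame))             # 64 (14-byte header + 46-byte body + 4-byte CRC)
print(parse_frame(frame)[4])  # True
```

Flipping any single bit of the frame makes `crc_ok` come back `False`, which is the error detection the CRC field exists for.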

Internetworks
  • Internetwork
      ◦ Several physical networks combined into one logical internetwork
          ▪ Also called internet (with lowercase "i")
          ▪ Most famous is the world-spanning Internet (with capital "I")
      ◦ Host-to-host delivery between different networks

Internet Protocol (IP)
  • Most widespread protocol used in internetworks
  • Routers forward packets from one network to another using the datagram paradigm

IP Packet Format
  • 12 bytes of status fields, e.g. version, length etc.
  • 32-bit globally unique source address
  • 32-bit globally unique destination address
  • Optional fields of variable length
  • Body
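
The field order above maps directly onto byte offsets. The sketch below pulls out the pieces named on the slide, assuming an IPv4 packet; the function name `parse_ipv4_header` and the returned dictionary keys are illustrative. Only two of the fields inside the first 12 bytes are decoded (version/header length and total length), since they are needed to find the options and the body.

```python
import ipaddress
import struct


def parse_ipv4_header(packet: bytes) -> dict:
    """Split an IPv4 packet into the parts listed on the slide."""
    version = packet[0] >> 4
    header_len = (packet[0] & 0x0F) * 4           # header length field counts 32-bit words
    (total_len,) = struct.unpack("!H", packet[2:4])
    return {
        "version": version,
        "src": ipaddress.IPv4Address(packet[12:16]),  # bytes 12..15
        "dst": ipaddress.IPv4Address(packet[16:20]),  # bytes 16..19
        "options": packet[20:header_len],             # empty when header_len == 20
        "body": packet[header_len:total_len],
    }


# Hypothetical packet: 20-byte header, no options, 4-byte body.
packet = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 24, 0, 0, 64, 17, 0,
    bytes([10, 0, 0, 1]), bytes([192, 168, 1, 2]),
) + b"data"
info = parse_ipv4_header(packet)
print(info["src"], info["dst"], info["body"])  # 10.0.0.1 192.168.1.2 b'data'
```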

IP Over Ethernet
  • IP packets are encapsulated in Ethernet frames

Host-To-Host Communication

Devices
  • SwitchCore CXE-2010
      ◦ A 16-port Gigabit Ethernet switch-on-a-chip
      ◦ Full 4K VLAN support
      ◦ Includes support of IEEE 802.1p
  • Cisco 1710
      ◦ Security Access Router
      ◦ Secure Internet, intranet, and extranet access with VPN and firewall
      ◦ Advanced QoS features

Features
  • What if we want:
      ◦ Load balancing
          ▪ Distributing client requests across multiple servers
      ◦ Multi-Protocol Label Switching (MPLS)
          ▪ Next hop chosen based on the label
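
As one possible illustration of the load-balancing feature, the sketch below spreads client requests over servers by hashing the client address and port. The hashing scheme, the function name `pick_server`, and the server names are assumptions made for this example; the slides do not say which policy was intended.

```python
import zlib
from typing import List


def pick_server(client_ip: str, client_port: int, servers: List[str]) -> str:
    """Map a client flow to one server; the same flow always gets the same server."""
    key = f"{client_ip}:{client_port}".encode()
    return servers[zlib.crc32(key) % len(servers)]


servers = ["server-a", "server-b", "server-c"]  # hypothetical pool
for port in (40000, 40001, 40002, 40003):
    print(port, pick_server("10.0.0.7", port, servers))
```

A hash keeps all packets of one flow on one server without per-flow state, which is why it is a common fit for packet-rate hardware; round-robin would need a table of active flows.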

Features
  • What if we don't want:
      ◦ QoS
      ◦ Security features
  • The Network Processor Unit (NPU)
      ◦ A programmable CPU chip that is optimized for networking and communications functions
      ◦ Quick adaptation to new standards/features

Conditions For the Work
  • 1 GE (1000 Mbit/s) port
  • 8 FE (100 Mbit/s) ports
  • Scalable
      ◦ Add more ports
      ◦ Remove ports
          ▪ Feasible to make an ASIC prototype

NPU components:
  • Processor core
  • Embedded software
  • Network interface
  • Packet buffers
  • Queues
  • Tables
  • Switch fabric

Design Choices
  • Processor core
      ◦ RISC based
      ◦ Network specific
  • Network interface
      ◦ FE
          ▪ MII (Media Independent Interface)
          ▪ RMII (Reduced MII)
      ◦ GE
          ▪ GMII (Gigabit MII)
          ▪ RGMII (Reduced GMII)

Design Choices
  • Queues
      ◦ Hold packets that are ready for transmission
  • Tables
      ◦ Data structures for IP and MAC addresses
  • Switch fabric
      ◦ The internal interconnect architecture: how is a packet transported from in-port to out-port?

Design Choices
  • Packet buffers
      ◦ Internal and/or external
      ◦ How many times do we need to access a (buffer) memory?
          ▪ Write the packet when it is received from the network
          ▪ Read the packet for processing
          ▪ Write the modified packet for transmission
          ▪ Read the packet when transmitting
      ◦ For N ports the memory needs to run at 4N times the port speed
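
The 4N rule can be checked with a short calculation. The sketch below generalises it to ports of different speeds by summing the port rates, which is an assumption of this example (the slide states the rule for N equal ports); the function name is hypothetical.

```python
from typing import List


def buffer_bandwidth_mbps(port_speeds_mbps: List[int], accesses_per_packet: int = 4) -> int:
    """Memory bandwidth needed when every packet is written twice and read twice."""
    return accesses_per_packet * sum(port_speeds_mbps)


# N equal ports: 4 * N * port speed, as on the slide.
print(buffer_bandwidth_mbps([100] * 8))           # 3200
# The configuration from "Conditions For the Work": 8 FE ports and 1 GE port.
print(buffer_bandwidth_mbps([100] * 8 + [1000]))  # 7200
```

A single shared buffer for the full 8 FE + 1 GE configuration would therefore need about 7.2 Gbit/s of memory bandwidth under these assumptions.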

Design Choices
  • 8 FE ports
  • 1 GE port
  • Inter-arrival time:
      ◦ 1.5×10^6 + 8 × 1.5×10^5 = 2.7×10^6 packets/s
      ◦ -> New packet every 370 ns
  • Cycle budget example:
      ◦ 100 MHz -> 37 cycles to process every packet
      ◦ 200 MHz -> 74 cycles to process every packet
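
The arithmetic on this slide can be reproduced as follows. The per-port packet rates are the rounded worst-case figures from the slide (1.5×10^6 packets/s for GE, 1.5×10^5 packets/s for FE); the constant and function names are illustrative.

```python
GE_PACKETS_PER_S = 1.5e6  # rounded worst case for one GE port (from the slide)
FE_PACKETS_PER_S = 1.5e5  # rounded worst case for one FE port (from the slide)


def aggregate_rate(ge_ports: int, fe_ports: int) -> float:
    """Worst-case packets per second over all ports."""
    return ge_ports * GE_PACKETS_PER_S + fe_ports * FE_PACKETS_PER_S


def cycles_per_packet(clock_hz: float, packets_per_s: float) -> int:
    """Whole clock cycles available before the next packet arrives."""
    return int(clock_hz / packets_per_s)


rate = aggregate_rate(ge_ports=1, fe_ports=8)
print(rate)                            # 2700000.0
print(round(1e9 / rate))               # 370 (ns between packets)
print(cycles_per_packet(100e6, rate))  # 37
print(cycles_per_packet(200e6, rate))  # 74
```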

Design Choices
  • Model of operation
      ◦ Route processing
      ◦ Packet forwarding ~200 cycles
      ◦ Special services
  • Target technology
      ◦ ~150 MHz

Design Decisions
  • Parallel processor architecture
      ◦ 2 FE ports, 125 MHz, 1 Integer Unit
      ◦ 1 GE port, 125 MHz, 5 Integer Units
  • -> Cycle budget of 420 cycles for each packet
      ◦ Interactive voice can tolerate somewhere between 100 and 200 milliseconds of end-to-end delay without people noticing it
      ◦ 420 cycles -> 0.00336 ms
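
The 420-cycle figure can be derived from line rate. The sketch assumes worst-case traffic of minimum-size frames: the 64-bit preamble, a 64-byte frame (14-byte header, 46-byte body, 4-byte CRC), and a 12-byte inter-frame gap. The gap is not mentioned on the slides and is an assumption taken from standard Ethernet; with it, both configurations come out at exactly 420 cycles. The function names are illustrative.

```python
# Bits occupied on the wire by one minimum-size frame:
# 8-byte preamble + 64-byte frame + 12-byte inter-frame gap (assumed).
WIRE_BITS_MIN_FRAME = (8 + 64 + 12) * 8  # 672


def cycle_budget(line_rate_bps: int, ports: int, clock_hz: int, integer_units: int) -> int:
    """Cycles each packet may use when packets are spread over the integer units."""
    return integer_units * clock_hz * WIRE_BITS_MIN_FRAME // (ports * line_rate_bps)


def budget_ms(cycles: int, clock_hz: int) -> float:
    """Processing time in milliseconds for a given number of cycles."""
    return cycles / clock_hz * 1000


fe = cycle_budget(line_rate_bps=100_000_000, ports=2, clock_hz=125_000_000, integer_units=1)
ge = cycle_budget(line_rate_bps=1_000_000_000, ports=1, clock_hz=125_000_000, integer_units=5)
print(fe, ge)  # 420 420
print(budget_ms(420, 125_000_000))
```

At 125 MHz a cycle lasts 8 ns, so 420 cycles take 3360 ns, which is several orders of magnitude below the 100 to 200 ms that interactive voice tolerates.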

Design Decisions
  • Tables
      ◦ MAC address lookup, fixed length: CAM (Content Addressable Memory)
          ▪ Pros: Fast
          ▪ Cons: Expensive
          ▪ Like a cache
      ◦ IP address lookup, longest match:
          ▪ Possibly large table
          ▪ External SRAM
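
Longest-match lookup differs from the fixed-length MAC lookup in that several table entries can match one address, and the most specific one wins. The sketch below shows the rule with a linear scan over a small hypothetical routing table; it illustrates the matching rule only and is not how a table in external SRAM would be organised.

```python
import ipaddress
from typing import Dict, Optional


def longest_prefix_match(routes: Dict[str, str], dst: str) -> Optional[str]:
    """Return the next hop of the matching route with the longest prefix."""
    addr = ipaddress.ip_address(dst)
    best_len = -1
    best_hop = None
    for prefix, next_hop in routes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and net.prefixlen > best_len:
            best_len, best_hop = net.prefixlen, next_hop
    return best_hop


# Hypothetical routing table: prefix -> output port.
routes = {"0.0.0.0/0": "port 0", "10.0.0.0/8": "port 1", "10.1.0.0/16": "port 2"}
print(longest_prefix_match(routes, "10.1.2.3"))   # port 2
print(longest_prefix_match(routes, "10.2.0.1"))   # port 1
print(longest_prefix_match(routes, "192.0.2.1"))  # port 0
```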

Internal packet buffers:
  • Pros: fast, lower pin count
  • Cons: limited size of memory
2 FE ports / 1 buffer
  • Pros: reduces contention, reduces the 4N problem
  • Cons: less effective use of memory

Virtual output queues:
  • Pros: no Head Of Line (HOL) blocking, possible to select any packet from buffer memory
  • Cons: expensive in hardware
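
A minimal sketch of virtual output queues, assuming one queue per (input port, output port) pair. The class name, the method names, and the fixed-priority choice among inputs are illustrative assumptions; the slides do not describe the scheduler.

```python
from collections import deque


class VirtualOutputQueues:
    """One FIFO per (input port, output port) pair."""

    def __init__(self, num_inputs: int, num_outputs: int):
        self.queues = [[deque() for _ in range(num_outputs)] for _ in range(num_inputs)]

    def enqueue(self, in_port: int, out_port: int, packet: str) -> None:
        self.queues[in_port][out_port].append(packet)

    def dequeue_for(self, out_port: int):
        """Next packet for out_port, taken from the lowest-numbered input
        that has one waiting (assumption: fixed priority). None if empty."""
        for per_output in self.queues:
            if per_output[out_port]:
                return per_output[out_port].popleft()
        return None


voq = VirtualOutputQueues(num_inputs=2, num_outputs=3)
voq.enqueue(0, 1, "A")  # arrives first, destined for output 1
voq.enqueue(0, 2, "B")  # arrives second, destined for output 2
print(voq.dequeue_for(2))  # B
```

With a single FIFO per input, packet B would wait behind A whenever output 1 is busy. Here output 2 can take B immediately, at the cost of one queue per input/output pair, which is the hardware expense noted above.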

NPU Architecture

Performance

Strengths in the Architecture
  • More bandwidth
      ◦ More RU and TU
      ◦ New types of RU and TU
  • More processing power
      ◦ More PU per RU/TU
      ◦ More IU per PU
      ◦ New types of IU

Strengths in the Architecture
  • New functionality
      ◦ New types of shared resources
          ▪ Semaphores
          ▪ Multipurpose CPU
      ◦ New software
          ▪ All IUs can run different software

Weaknesses in the Architecture
  • Not everything scales well
      ◦ Shared resources
      ◦ Number of IUs in a PU

From Idea to Silicon
  • ASIC design flow

Layout

-- Combinational ALU: the operation is selected by the OP field of In_Ctrl_Ex.
ALU : process(alu_RegA, alu_RegB, In_Ctrl_Ex)
begin
  case In_Ctrl_Ex.OP is
    when ALU_ADD => alu_Result <= alu_RegA + alu_RegB;
    when ALU_SUB => alu_Result <= alu_RegA - alu_RegB;
    when ALU_AND => alu_Result <= alu_RegA and alu_RegB;
    when ALU_OR  => alu_Result <= alu_RegA or alu_RegB;
    when ALU_XOR => alu_Result <= alu_RegA xor alu_RegB;
    when ALU_NOR => alu_Result <= alu_RegA nor alu_RegB;
    -- Any other opcode: drive don't-care so synthesis is free to optimise.
    when others  => alu_Result <= (others => '-');
  end case;
end process;