Lecture on High Performance Processor Architecture (CS 05162)
TLP Architecture Case Study: Network Processors
An Hong, han@ustc.edu.cn
Fall 2007
Department of Computer Science and Technology, University of Science and Technology of China

Outline
• NP Overview
  - What
  - NP Functions, Objectives, Evolution, Speeds
• Network Processor Applications, Workload, and Benchmark
  - Categorization: control and data planes
  - Characteristics
  - Requirements
  - Benchmarks
• NP Architecture Modeling and Simulation
2021/12/27 CS of USTC AN Hong

Outline (cont.)
• NP Architecture Case Study
  - Overview of current products
  - Special Purpose Hardware Comparison
  - Pipelining Model Architecture
  - Multiprocessing Model Architecture
• NP Architecture Characteristics and Core Technologies
  - Key characteristics of the NP architecture
  - Architectural approaches
  - ISA
  - Parallel
  - Memory
  - Programming Model

NP Overview

NP Overview
• What: a Network Processor (NP) is a programmable device that has been designed and highly optimized to perform networking functions.
• NP Functions: specialized for network applications
  - Pattern matching (address lookup, bit-wise matching)
  - Data manipulation (TTL, CRC, SAR)
  - Queue and buffer management (QoS, rate, priority, ToS)
  - Statistics gathering
• NP Objectives
  - Replace expensive ASICs in network device construction
  - Provide platform solutions through programmability
  - Extend product lifetime through software updates

NP Overview
• NP evolution
  - GPP (General-Purpose Processor)
    ◦ Programmable, not optimized for networking applications
  - ASIC (Application-Specific Integrated Circuit)
    ◦ High processing capacity, high design complexity, long development time, lack of flexibility
  - NP (Network Processor)
    ◦ ASIC's performance + GPP's flexibility
    ◦ Cheaper than GPP-based designs
    ◦ ~30 companies offering network processors; 350 design wins
[Figure: flexibility versus performance, positioning NP-based designs relative to ASIC-based designs]

History of Packet Processing
• The Classic Router
  - Centralized CPU router architecture
[Figure: router interface card (PHY, MAC/framer, fabric I/O) connected over a backplane/bus to a host processor card (fabric I/O, memory, RISC CPU acting as both host processor and packet processor)]

History of Packet Processing
• Emergence of Fast and Slow Path Processing
  - Distributed CPU router architecture
[Figure: router interface card (PHY, MAC/framer, memory, fabric I/O, RISC CPU as host processor) connected over a backplane/bus to a host processor card (fabric I/O, RISC CPU as host processor)]

History of Packet Processing
• Hybridization of Routers and Switches
  - Layer-2 switch based on distributed packet processing using ASICs
[Figure: switch interface card (PHY, MAC/framer, memory, fabric I/O, ASIC as host processor) connected over a backplane/bus to a host processor card (fabric I/O, RISC CPU as host processor)]

NP Overview
• Why can't GPPs keep up?
  - Moore's law cannot keep up with the network processing speed requirement
• NP speeds
  - 1994-1996: OC-3 (155 Mbps, ns)
  - 1997-1999: OC-12 (625 Mbps, 640 ns)
  - 2000-2001: OC-48 (2.5 Gbps, 160 ns)
  - 2002-2003: OC-192 (10 Gbps, 40 ns)
  - 2003-2005: OC-768 (40 Gbps, 10 ns)
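The per-packet times in the speed list are consistent with a fixed 400-bit (50-byte) packet at each line rate. That packet size is inferred from the listed numbers, not stated on the slide. A minimal Python sketch of the arithmetic, where `packet_time_ns` and `RATES` are illustrative names:

```python
def packet_time_ns(line_rate_bps: float, packet_bytes: int = 50) -> float:
    """Time budget for one packet at full line rate, in nanoseconds."""
    return packet_bytes * 8 / line_rate_bps * 1e9

# Line rates as rounded in the list above (bits per second).
RATES = {"OC-12": 625e6, "OC-48": 2.5e9, "OC-192": 10e9, "OC-768": 40e9}

for name, rate in RATES.items():
    print(f"{name}: {packet_time_ns(rate):.0f} ns per packet")
```

Every fourfold increase in line rate cuts the budget by four, which is why a processor running at a few hundred MHz has only a handful of cycles per packet at OC-192 and above.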

NP Overview
• Why are ASICs not the answer?
• Four factors preventing ASIC-centered designs
  - IP-based protocols are still evolving
  - Layer-2 protocols are in a greater degree of flux than ever
  - Increasing packet processing complexity
  - Time-to-market pressures
• NPs address this need
  - Time to market (TTM)
  - Time in market (TIM)
  - Expanded functionality
  - Leverage third-party development of applications

NP Overview
• Where do NPs fit in a system?
• A networking device can be broken down into four overall functions:
  - Host processing
  - PHY (physical) layer processing
  - Switching
  - Packet processing
    ◦ Framing
    ◦ Parsing/Classification
    ◦ Modification
    ◦ Encryption/compression
    ◦ Queuing

NP Overview
• Packet processing architecture
[Figure: host processing (slow path and/or control functions) sits above the packet processing path, which runs from the PHY layer through framing, classification, modification, encryption/compression, and queuing to switching]

Components of a Generic Router

NP Overview
• NP in a router application
[Figure: line card containing line interface, conditioning, and framing; the NP with its memories, CAMs, and special functions; and a host control processor; the line card connects through a switch to other line cards]

Packet Processing in an IP Router
1. Accept packet arriving on an incoming link.
2. Look up packet destination address in the forwarding table to identify outgoing port(s).
3. Edit packet header, e.g. decrement TTL, update header checksum.
4. Send packet to the outgoing port(s).
5. Buffer packet in the queue.
6. Transmit packet onto outgoing link.
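Step 3 (decrement TTL, update header checksum) can be sketched in Python as follows. This is an illustrative full recomputation of the IPv4 header checksum, not how an NP does it in hardware; the function names are assumptions, and the sample header is a commonly used 20-byte example with TTL 0x40.

```python
import struct

def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum of the header's 16-bit words, complemented."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:                      # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def forward_edit(header: bytes) -> bytes:
    """Decrement TTL (byte 8) and rewrite the checksum (bytes 10-11)."""
    hdr = bytearray(header)
    if hdr[8] <= 1:
        raise ValueError("TTL expired; packet should be dropped")
    hdr[8] -= 1
    hdr[10:12] = b"\x00\x00"                # checksum field is zero while summing
    hdr[10:12] = struct.pack("!H", ipv4_checksum(bytes(hdr)))
    return bytes(hdr)

sample = bytes.fromhex("450000730000400040 11b861c0a80001c0a800c7".replace(" ", ""))
edited = forward_edit(sample)
print(edited.hex())
```

A header with a correct checksum sums to zero under `ipv4_checksum`, which gives a quick validity check on both the input and the edited header.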

Another View of an IP Router
[Figure: control plane (routing protocols, routing table) above the datapath for per-packet processing (forwarding table, switching)]

Packet Forwarding Engine
[Figure: a packet (header, payload) enters the router; the destination address is passed to the routing lookup data structure, which returns the outgoing port]
Forwarding Table
Dest-network     Port
65.0.0.0/8       3
128.9.0.0/16     1
149.12.0.0/19    7
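A minimal Python sketch of the lookup this forwarding table implies: find the longest prefix that contains the destination address and return its port. The linear scan and the name `lookup_port` are assumptions for illustration; real forwarding engines use tries, hashing, or TCAMs.

```python
import ipaddress

# Entries copied from the forwarding table above.
FORWARDING_TABLE = [
    (ipaddress.ip_network("65.0.0.0/8"), 3),
    (ipaddress.ip_network("128.9.0.0/16"), 1),
    (ipaddress.ip_network("149.12.0.0/19"), 7),
]

def lookup_port(dest: str):
    """Return the port of the longest matching prefix, or None if no match."""
    addr = ipaddress.ip_address(dest)
    best = None
    for net, port in FORWARDING_TABLE:
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, port)
    return best[1] if best else None

print(lookup_port("128.9.16.14"))
```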

Size of the Forwarding Table
[Figure: number of prefixes versus year, 1995 to 2000, growing by about 10,000 per year]
Source: http://www.telstra.net/ops/bgptable.html

Lookup Rate Required
• Performance that applications require of network processors (average packet size set to the typical value of 64 bytes)

Performance Estimation
• 10 Gbps core router
  - Function: transport packets at OC-192
  - Running at 200 MHz = 200 MIPS
  - Assumption: 1 MIPS per 1 Mbit/s of I/O and per 1 MByte of memory
• Estimation:
  - Number of microprocessors = 10 G / 200 M = 50
  - Memory: 10 GBytes
• Solutions:
  - Coprocessors: IP forwarding, classification, CRC and checksum
  - Multithreading
  - Memory hierarchy
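The estimate can be reproduced with a few lines of Python. This sketch reads the rule of thumb as "1 MIPS per Mbit/s of I/O and 1 MByte of memory per MIPS", which is an interpretation of the slide's wording; all constant names are illustrative.

```python
LINE_RATE_MBPS = 10_000      # 10 Gbps core router (OC-192, rounded)
MIPS_PER_PROCESSOR = 200     # 200 MHz core, assumed one instruction per cycle
MIPS_PER_MBPS = 1            # rule of thumb: 1 MIPS per 1 Mbit/s of I/O
MBYTES_PER_MIPS = 1          # rule of thumb: 1 MByte of memory per MIPS

required_mips = LINE_RATE_MBPS * MIPS_PER_MBPS
processors = required_mips // MIPS_PER_PROCESSOR
memory_gbytes = required_mips * MBYTES_PER_MIPS / 1000

print(f"{required_mips} MIPS -> {processors} processors, {memory_gbytes:.0f} GBytes")
```

Both results are impractical for a single line card, which motivates the coprocessor, multithreading, and memory hierarchy solutions listed on the slide.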

NP Design Challenges
• Shared with GPPs and ASICs
  - External memory bandwidth
  - Power dissipation
  - Pin limitations
  - Packaging
  - Verification
• NP-specific
  - Line speed
    ◦ Real-time, link-rate processing
  - Application complexity
    ◦ Applications that operate on individual packet headers (e.g., routing and forwarding)
    ◦ Applications that operate principally on individual packet payloads (e.g., transcoding)
    ◦ Applications that operate across multiple packets within a single flow (e.g., certain encryption algorithms) or across multiple flows (e.g., QoS and traffic shaping). A "flow" is considered to be a single source-destination session

NP Design Challenges
• Other NP-specific challenges
  - Port density
  - High level of device integration (on-chip interfaces and controllers for external memories, switch fabrics, co-processors, network interfaces, etc.)
  - Management of critical shared resources in a chip-multiprocessor environment (e.g., shared program state, memory interfaces)
  - Compiler and software design for high-performance, real-time, parallel, and heterogeneous systems
  - Real-time system verification

NP Design Techniques
• Application-specific architectures
  - Extending the RISC instruction set
  - Use of customized on-chip or off-chip hardware assists
• Parallelism
  - Thread-level parallelism
  - Instruction-level parallelism
• Microarchitectures
  - Multiple processors
  - Pipelined processors

NP Application, Workload, and Benchmark

Application
• Need to understand applications before understanding "application-specific" devices
• Kernels
  - Control processing: encompasses a large number of different tasks that usually do not need to be performed at wire speed
  - Pattern matching: header parsing
  - Packet classification: identification of the packet type and attributes
  - Lookup: find a specific entry in a table based on a key
  - Data manipulation: modifies the packet header
  - Field computation: checksum, CRC, time-to-live field decrement, data encryption
  - Queue management: scheduling and storage of incoming and outgoing packet data units

Application Categorization
• NP applications
  - Carrier-class metro and core
  - Multi-service edge and access network
  - Enterprise and Ethernet edge
  - Storage networks
  - Network security
• NP application systems
  - Routers
  - Switches
  - Firewalls
  - ...

Application Categorization
• Tasks and services
  - Routing table lookup
    ◦ Determine the next hop for incoming packets
  - Packet classification
    ◦ Classify packets using header fields against a set of rules
  - URL-based switching
    ◦ Distribute HTTP requests based on URLs
  - Transcoding
    ◦ Encryption/decryption, intrusion detection, firewall, access control checking, denial-of-service

Application Categorization: Processing Tasks
Control Plane (policy applications, network management, signaling, topology management)
• All tasks required for control and management of the NPU, for example:
  - Table maintenance (classification tables, routing tables, QoS tables, ...)
  - Port state
  - Timing and signaling to all components: PEs, switch fabric, queues, ...
Data Plane
• Queuing / Scheduling: traffic management (queuing, scheduling, and policing)
• Data Transformation: transformation of packet data between layers (protocols)
• Classification: identify packets against a criterion (flow, QoS, ...)
• Data Parsing: parse packet headers to extract protocol information
• Media Access Control: low-level protocol implementation (Ethernet, ATM, ...)
• Physical Layer

Application Categorization
• Control-plane tasks
  - Less time-critical
  - Control and management of device operation
    ◦ Table maintenance, port states, etc.
• Data-plane tasks
  - Operations occurring in real time on the "packet path"
  - Core device operations
    ◦ Receive, process, and transmit packets

Data Plane Tasks
• Media Access Control
  - Low-level protocol implementation
    ◦ Ethernet, SONET framing, ATM cell processing, etc.
• Data Parsing
  - Parsing cell or packet headers for address or protocol information
• Classification
  - Identify a packet against a criterion (filtering / forwarding decision, QoS, accounting, etc.)
• Data Transformation
  - Transformation of packet data between protocols
• Traffic Management
  - Queuing, scheduling, and policing packet data

Data Plane Operations: Examples
• Priority-based QoS mechanism
  - Supports different levels of QoS for each output port
  - Contains a QoS policy table for prioritizing packets
  - Ingress operations
    ◦ Applies the QoS policy to the received packet
    ◦ Gets the packet priority from its header content
    ◦ Places the packet in the appropriate output queue
  - Egress operations
    ◦ Identifies and schedules the highest-priority packet for transmission
    ◦ Transmits the identified packet on the output port
  - Security
    ◦ Encryption/decryption, intrusion detection, access control checking, denial-of-service
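The ingress and egress operations of the priority-based QoS mechanism can be sketched as a strict-priority output port. This is a simplified Python illustration: the class name, the `tos` header field, and the policy dictionary are all hypothetical, and strict priority is only one possible scheduling discipline.

```python
from collections import deque

class PriorityPort:
    """One output port with a queue per priority level (0 = highest)."""

    def __init__(self, levels: int = 4):
        self.queues = [deque() for _ in range(levels)]

    def ingress(self, packet: dict, policy: dict) -> None:
        # Map a header field to a priority via the QoS policy table;
        # unknown values fall into the lowest-priority queue.
        prio = policy.get(packet["tos"], len(self.queues) - 1)
        self.queues[prio].append(packet)

    def egress(self):
        # Strict priority: always serve the highest non-empty queue.
        for queue in self.queues:
            if queue:
                return queue.popleft()
        return None

policy = {"voice": 0, "video": 1, "data": 2}   # hypothetical policy table
port = PriorityPort()
for pkt in [{"id": 1, "tos": "data"}, {"id": 2, "tos": "voice"}, {"id": 3, "tos": "bulk"}]:
    port.ingress(pkt, policy)
order = []
while (pkt := port.egress()) is not None:
    order.append(pkt["id"])
print(order)
```

Strict priority can starve low-priority queues under sustained high-priority load; weighted schemes trade some latency for fairness.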

Data Plane Operations: Examples
• Monitoring
  - Capturing usage patterns, time information
• Load balancing
  - Distribution of traffic among servers according to the server load, content, and client credentials

Protocol Processing Characteristics
• Protocol processing requires intensive memory operations; memory speed determines system performance.
• Protocol processing requires powerful bit manipulation.
• Layer 2-4 protocols require error detection (computation), e.g. CRC and checksum.
• Multiple services (multiple protocols) coexist.

Packet Application Characteristics
• Packet coverage
  - Header only, or header + payload
• Packet parsing
  - Is the data location known/static?
• QoS
  - Classification and queuing
• State maintained between packets
  - Stateful analysis

IPv4 Routing Table Lookup
[Figure: packets (P) travelling between hosts A, B, and C through a router]
• Routers determine the next hop and forward packets

Routing Table Lookup is a Search-Intensive Task
• The search operation is not an exact match
  - A direct lookup table would need 4 G entries (32-bit IP address)
  - Longest prefix match
    ◦ Tries
    ◦ Hash tables
    ◦ Balanced trees

Trie-based Routing Table Lookup
Prefix              Netmask   Next hop
0010                0xF       1
0001 0011           0xFF      2
1110 1110           0xFF      2
0001 0011 0000      0xFFF     3
0001 0011 111       0xFFFE    4
• A trie block keeps pointers to route entries (rt_ptr) and to other trie blocks (trie_ptr)
• Destination IP address bits are examined group by group (4 bits per group)
[Figure: trie blocks with slots 0 to 15, each holding an rt_ptr and a trie_ptr, pointing to next hops 1 to 4]

Example
• Packet destination IP address = 0x13fe2233 (0001, 0011, 1110, …)
[Figure: the same trie blocks and prefix table as on the previous slide, with the lookup path for this address]
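A minimal Python sketch of a 4-bit-stride trie in the spirit of these two slides. Prefix bit strings are taken as printed in the prefix table (the netmask column is not used), a prefix that ends inside a 4-bit group is expanded over all slots it covers, and the `rt_len` field is an added helper so that a longer prefix wins when expansions overlap. The names `rt_ptr` and `trie_ptr` follow the slide; everything else is an assumption.

```python
STRIDE = 4  # address bits examined per trie level

class TrieBlock:
    def __init__(self):
        self.rt_ptr = [None] * (1 << STRIDE)    # next hop per 4-bit value
        self.rt_len = [0] * (1 << STRIDE)       # prefix bits that set rt_ptr
        self.trie_ptr = [None] * (1 << STRIDE)  # child block per 4-bit value

def insert(root, prefix_bits, next_hop):
    node, bits = root, prefix_bits
    while len(bits) > STRIDE:                   # walk whole 4-bit groups
        idx = int(bits[:STRIDE], 2)
        if node.trie_ptr[idx] is None:
            node.trie_ptr[idx] = TrieBlock()
        node, bits = node.trie_ptr[idx], bits[STRIDE:]
    pad = STRIDE - len(bits)                    # expand a short final group
    base = int(bits, 2) << pad
    for idx in range(base, base + (1 << pad)):
        if len(bits) >= node.rt_len[idx]:       # longer prefix wins
            node.rt_ptr[idx], node.rt_len[idx] = next_hop, len(bits)

def lookup(root, addr, width=32):
    node, best = root, None
    for shift in range(width - STRIDE, -1, -STRIDE):
        idx = (addr >> shift) & 0xF
        if node.rt_ptr[idx] is not None:
            best = node.rt_ptr[idx]             # deeper match = longer prefix
        node = node.trie_ptr[idx]
        if node is None:
            break
    return best

root = TrieBlock()
for bits, hop in [("0010", 1), ("00010011", 2), ("11101110", 2),
                  ("000100110000", 3), ("00010011111", 4)]:
    insert(root, bits, hop)
print(lookup(root, 0x13FE2233))
```

For the example address 0x13fe2233 the walk visits groups 1, 3, and f, remembers next hop 2 at the second level, and then overrides it with next hop 4 at the third level.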

IP Lookups using Multi-way Multi-column Search
Illustration of the idea with 6-bit addresses
• Prefixes: 1*, 101*, 10101*
• Binary search does not work with variable-length strings:
  1) the search can end up far away from the matching prefix
  2) multiple addresses that match different prefixes end up in the same region
• Encoding prefixes as ranges
  - 1* -> [100000, 111111]
  - 101* -> [101000, 101111]
  - 10101* -> [101010, 101011]
• A lookup for address A returns the narrowest enclosing range containing A
[Figure: sorted list of range endpoints, each marked L or H, with example addresses placed between them]

Multi-way Multi-column Search
• P1: 1* -> [100000, 111111]
• P2: 101* -> [101000, 101111]
• P3: 10101* -> [101010, 101011]
• Any region in the binary search between two consecutive numbers corresponds to a unique prefix
• Multi-column search is used for wide addresses such as IPv6
[Figure: sorted range endpoints with the precomputed prefix for the ">" and "=" outcomes of each comparison; a 3-way tree for 8 keys with per-node info; a multi-column probe sequence over W/M words]
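The range encoding can be sketched in Python as a plain binary search over sorted endpoints, with a precomputed answer for the "=" case (address equals an endpoint) and the ">" case (address falls strictly between two endpoints). This is a simplified two-way search, not the multi-way, multi-column variant; function names are illustrative and the precomputation is done by brute force for clarity.

```python
from bisect import bisect_right

def prefix_to_range(prefix, width=6):
    """Pad a prefix with 0s and 1s to get its [low, high] address range."""
    bits = prefix.rstrip("*")
    return int(bits.ljust(width, "0"), 2), int(bits.ljust(width, "1"), 2)

def narrowest(ranges, lo, hi):
    """Prefix with the narrowest range enclosing [lo, hi], or None."""
    best = None
    for name, (rlo, rhi) in ranges.items():
        if rlo <= lo and hi <= rhi:
            if best is None or rhi - rlo < ranges[best][1] - ranges[best][0]:
                best = name
    return best

def build(prefixes, width=6):
    ranges = {p: prefix_to_range(p, width) for p in prefixes}
    points = sorted({e for r in ranges.values() for e in r})
    eq = [narrowest(ranges, p, p) for p in points]
    gt = []
    for i, p in enumerate(points):
        nxt = points[i + 1] if i + 1 < len(points) else 1 << width
        gt.append(narrowest(ranges, p + 1, nxt - 1) if p + 1 < nxt else None)
    return points, eq, gt

def lookup(table, addr):
    points, eq, gt = table
    i = bisect_right(points, addr) - 1      # last endpoint <= addr
    if i < 0:
        return None
    return eq[i] if points[i] == addr else gt[i]

table = build(["1*", "101*", "10101*"])
print(lookup(table, 0b101110))
```

Because every gap between consecutive endpoints maps to exactly one prefix, the search needs no backtracking once the position of the address is found.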

Packet Forwarding Tasks
• Header parsing
  - Pattern matching of bits in the header fields
• Packet classification
  - Identification of the packet type (e.g. IP, MPLS, ATM) and attributes (e.g. quality-of-service requirement, encryption type)
• Lookup
  - Looking up data based on a key; mostly used in conjunction with pattern matching to find a specific entry in a table
• Computation
  - Varies widely by application; examples include checksum, CRC, time-to-live field decrement, and data encryption

Packet Forwarding Tasks
• Data manipulation
  - Any function that modifies the packet header
• Queue management
  - Scheduling and storage of incoming and outgoing packet data units
• Control processing
  - Encompasses a large number of different tasks that usually do not need to be performed at wire speed. These are usually performed on a standard RISC processor linked to the NPU.

Packet Classification
• Routers are required to distinguish packets for
  - Flow identification
  - Fair sharing of bandwidth
  - QoS
  - Security
  - Accounting, billing, etc.
• Packets are classified by rules
  - Source IP, destination IP, source port number, destination port number, etc.
• Classification algorithm metrics
  - Search speed
  - Storage cost
  - Scalability
  - Updates
  - Etc.

Classification: Hierarchical Tries
Rule   F1    F2
R1     00*
R2     0*    01*
R3     1*    0*
R4     00*   0*
R5     0*    1*
R6     *     1*
• Extension of the one-dimensional radix trie
• Construct the trie recursively:
  - Construct the F1-trie on the set of prefixes {Rj1}
  - For each prefix p in the F1-trie, recursively construct a (d-1)-dimensional hierarchical trie on the rules {Rj : Rj1 = p}
Pankaj Gupta and Nick McKeown, "Algorithms for Packet Classification", IEEE Network Special Issue, March/April 2001, vol. 15, no. 2, pp. 24-32.

Classification: Bitmap-intersection
• The set of rules S that a packet matches is the intersection of d sets Si, where Si is the set of rules that match the packet in the i-th dimension alone.
Pankaj Gupta and Nick McKeown, "Algorithms for Packet Classification", IEEE Network Special Issue, March/April 2001, vol. 15, no. 2, pp. 24-32.
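The intersection idea can be sketched in Python with one bitmap per dimension and a bitwise AND. This is a simplification: a real bitmap-intersection classifier precomputes the bitmaps per elementary interval, whereas this sketch builds each one by a linear scan. The rule set is a hypothetical two-field example in the style of the cited paper, and the function names are illustrative.

```python
# Hypothetical two-field rule set: (name, F1 prefix, F2 prefix).
RULES = [("R2", "0*", "01*"), ("R3", "1*", "0*"), ("R4", "00*", "0*"),
         ("R5", "0*", "1*"), ("R6", "*", "1*")]

def dimension_bitmap(dim, field_bits):
    """Bit i is set when rule i matches the packet in dimension `dim` alone."""
    bitmap = 0
    for i, rule in enumerate(RULES):
        if field_bits.startswith(rule[1 + dim].rstrip("*")):
            bitmap |= 1 << i
    return bitmap

def classify(f1_bits, f2_bits):
    """Names of all rules matching in every dimension (intersection of S_i)."""
    matched = dimension_bitmap(0, f1_bits) & dimension_bitmap(1, f2_bits)
    return [RULES[i][0] for i in range(len(RULES)) if matched >> i & 1]

print(classify("000", "010"))
```

With rules ordered by priority, the lowest set bit of the intersected bitmap identifies the best match, which maps well onto find-first-bit-set hardware.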

URL-based Switching
[Figure: a request from the Internet for www.yahoo.com (IP | TCP | APP. DATA, e.g. "GET /cgi-bin/form HTTP/1.1 Host: www.yahoo.com…") arrives at a switch that forwards it to an image server, an application server, or an HTML server]
• Increases efficiency
• Tasks
  - Traverse the packet data (request) for each arriving packet and classify it:
    ◦ Contains '.jpg' -> to image server
    ◦ Contains 'cgi-bin/' -> to application server
Source: Network Processor Tutorial in Micro 34 - Mangione-Smith & Memik
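The classification rule on this slide can be sketched in Python as a scan of the request URL. Sending everything else to the HTML server is an assumption based on the three servers in the figure, and `pick_server` is an illustrative name.

```python
def pick_server(http_request: str) -> str:
    """Choose a back-end server from the URL in the HTTP request line."""
    request_line = http_request.split("\r\n", 1)[0]
    parts = request_line.split(" ")
    url = parts[1] if len(parts) >= 2 else ""
    if ".jpg" in url:
        return "image"
    if "cgi-bin/" in url:
        return "application"
    return "html"          # assumed default for all other requests

request = "GET /cgi-bin/form HTTP/1.1\r\nHost: www.yahoo.com\r\n\r\n"
print(pick_server(request))
```

Unlike header-only forwarding, this task reads into the payload, so its cost grows with request length and with the number of patterns checked.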

Transcoders
• Two important uses
  - When the receiver is not capable of interpreting the stored data (multimedia transcoders)
    ◦ Wireless receivers, hand-held devices, etc.
  - Compression for bandwidth and storage efficiency
[Figure: MPEG encoder and video-on-demand server on a corporate network, a transcoder, the Internet, and a media player]
Source: Network Processor Tutorial in Micro 34 - Mangione-Smith & Memik

NP Workloads and Benchmarks
• Available:
  - NPBench
    ◦ 10 applications
  - CommBench [4]
    ◦ 8 applications
    ◦ http://ccrc.wustl.edu/~wolf/cb/
  - NetBench [3]
    ◦ 10 applications
    ◦ http://cares.icsl.ucla.edu/NetBench
  - MiBench [2]
  - EEMBC
    ◦ http://www.eembc.org/benchmark
  - MediaBench
    ◦ Transcoders
    ◦ Some communications applications

Three Major Benchmarks

Comparison of the Applications in Three Typical Benchmarks

Benchmarking Hierarchy
• System level
  - ???
• Function level
  - ???
• Micro level
  - ???
• Hardware level
  - ???

NP Workloads and Benchmarks
• Several weak points:
  - No consideration for interfaces
  - Assume a traditional programming model
• Metrics:
  - Processing time
  - Throughput
  - Memory latency
  - ...

NP Architecture Modeling and Simulation

Architectural Comparisons
• High-level organizations
  - Aggressive superscalar (SS)
  - Fine-grained multithreaded (FGMT)
  - Chip multiprocessor (CMP)
  - Simultaneous multithreaded (SMT)

Architectural Comparisons (cont.)
[Figure: issue slots over time (processor cycles) for superscalar, fine-grained multithreading, coarse-grained multithreading, multiprocessing, and simultaneous multithreading; legend: threads 1 to 5 and idle slots]

Performance Evaluation
[Figure: performance of SS, FGMT, CMP, and SMT on three workloads: forwarding (IP Forward), authentication (MD5), encryption (3DES)]
• Workloads have little ILP
• Need to exploit packet-level parallelism
• CMP and SMT do just that
• Systems must support some form of concurrent packet-level parallelism
• SMT and CMP are nearly equivalent, with SMT always coming out ahead

NP Architecture: SS / FGMT / CMP / SMT [10]

NP Architecture: SoC: CMP + SMT + cluster [13]

NP Architecture: SoC: CMP + SMT + cluster [13]
• Goal: maximize IPS/area within the pin count limit
• Performance models:
  - IPS : m, n, clkp, p
  - area : m, n, Smchl, Sp, Sci, Scd
  - p : t, pmiss, mem (mchl)
  - Smchl : bwmchl
  - bwmchl : p, clkp, , linesize, pmiss
  - n : widthmchl, clkmchl, bwmchl
  - bwIO : IPS, compl., IO

NP Architecture Case Study

Network Processor Companies
• Agere
• Alchemy
• Allayer
• Applied Micro Circuits (MMC Networks)
• Bay Microsystems
• Brecis Communications
• Broadcom (SiByte)
• Cisco
• ClearSpeed
• Clearwater Networks
• Cognigine
• Conexant/Mindspeed (Maker)
• EZchip
• Entridia Corporation
• IBM
• IP Semiconductors A/S
• Intel
• ishoni Networks
• Lexra
• Motorola (C-Port)
• Navarro Networks
• Onex Communications
• PMC-Sierra (QED)
• Vitesse (Sitera)

Overview of Current Products

Map of NP Market

Map of NP Market (cont.)

Architectural Diversity

Performance Diversity

Special Purpose Hardware Comparison

Pipelining model: EZchip NP-1
• TOP: Task Optimized Processing cores

Agere's PayloadPlus System

Pipelining model (Cisco)

Pipelining model: Toaster
System: Cisco 10000
• Almost all data plane operations execute on the programmable XMC
• Pipeline stages are assigned tasks, e.g. classification, routing, firewall, MPLS
  - Classic SW load balancing problem
• External SDRAM shared by common pipe stages

Multiprocessor model: IXP1200 [11]
Block Diagram
• StrongARM processing core
• Microengines introduce a new ISA
• I/O
  - PCI
  - SDRAM
  - SRAM
  - IX: PCI-like packet bus
• On-chip FIFOs
  - 16 entries, 64 B each

IXP1200 Microengine
• 4 hardware contexts
  - Single-issue processor
  - Explicit optional context switch on SRAM access
• Registers
  - All are single ported
  - Separate GPR
  - 256*6 = 1536 registers total
• 32-bit ALU
  - Can access GPR or XFER registers
• Shared hash unit
  - 1/2/3 values: 48 b/64 b
  - For IP routing hashing
• Standard 5-stage pipeline
• 4 KB SRAM instruction store (not a cache)
• Barrel shifter

Inside the IXP1200 Microengine [12]

IXP2400 Block Diagram
• XScale core replaces StrongARM
• Microengines
  - Faster
  - More: 2 clusters of 4 microengines each
• Local memory
• Next-neighbor routes added between microengines
• Hardware to accelerate CRC operations and random number generation
• 16-entry CAM
[Figure: block diagram with microengines ME0 to ME7 in two clusters, XScale core, DDR DRAM controller, QDR SRAM controller, Scratch/Hash/CSR unit, MSF unit, and PCI]

Intel IXP2800

Multiprocessor model: IBM PowerNP
• 16 pico-processors and 1 PowerPC
• Each pico-processor
  - Supports 2 hardware threads
  - 3-stage pipeline: fetch/decode/execute
• Dyadic Processing Unit
  - Two pico-processors
  - 2 KB shared memory
  - Tree search engine
• Focus is layers 2-4
• PowerPC 405 for control plane operations
  - 16 K I and D caches
• Target is OC-48
Ref: [NPT]

Multiprocessor model: C-Port C-5 Chip Architecture

Multiprocessor model (C-5e)

Vitesse PRISM IQ2000

Agere's Fast Pattern Processor

Agere's Routing Switch Processor

BRECIS Communications

Cognigine

Cognigine's Variable Instruction Set Computing

Xelerated Packet Devices

Application System Architecture Design
• A router using the IXP1200 [7]

Organizing Processor Resources
• Design decisions:
  - High-level organization
  - ISA and microarchitecture
  - Memory and I/O integration
• Today's commercial NPs:
  - Chip multiprocessors
  - Most are multithreaded
  - Exploit little ILP (Cisco does)
  - No cache
  - Micro-programmed

NP Architecture Characteristics and Core Technologies

Block Diagram of NP
[Figure: data flows from Data Input to Data Output through two groups of functions]
- Data manipulation
- CRC computation
- Data buffering
- SAR
- Connection management
- Packet classification
- Scheduling
- Statistics

Typical NP Architecture
[Figure: network processor with input ports, output ports, multi-threaded processing elements, and a co-processor connected by a bus; SDRAM (packet buffer) and SRAM (routing table) attached over memory buses]

NP Chipset
[Figure: network processor (input ports, output ports, multithreaded processing elements) with fabric interface and network interface, surrounded by:]
• General-purpose processor (control plane processor)
• Packet buffer: SDRAM (PC), DDR DRAM, RLDRAM, FCDRAM, RDRAM
• Routing table: SSRAM, DDR SRAM, QDR SRAM
• Co-processor (classification): TCAM
• Co-processor (deep packet analysis)
• Co-processor (policing and statistics): DDR DRAM, FCRAM

Different Types of Memory

Type of memory    Bytes   Size in bytes   Approx. latency   Special operations
Local to ME       4       2560            3                 Indexed addressing, post incr/decr
On-chip scratch   4       16 K            60                Atomic & read_and_modify
SRAM              4       256 M           150               Atomic & queues
DRAM              4       2 G             300               Payload

Architectural approaches
Two major (orthogonal) approaches:
• Processing element level
  − Pipelined: each PE is designed for a particular task (classification, queueing, data modification)
    · Aggressive application pipelining (>100)
  − Parallel: each PE performs the same task
    · Fine-grained multithreading
    · Simultaneous multithreading in a processor (SMT)
    · Multiprocessor on a chip (CMP)
• Functional unit level
  − VLIW (Agere RSP, Cisco PXF)
  − Superscalar (Broadcom, Clearwater Networks)
  − Parallel execution units (Cognigine)
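A rough way to compare the two PE-level organizations is a cycles-per-packet model. The sketch below is only illustrative: it assumes fixed per-stage costs, no memory stalls, and perfect load balancing, and the function names and stage costs are hypothetical rather than taken from any product.

```c
#include <assert.h>
#include <stddef.h>

/* Pipelined organization: one PE per stage, so steady-state throughput
 * is limited by the slowest stage. Returns packets per cycle. */
double pipelined_throughput(const unsigned *stage_cycles, size_t n_stages) {
    unsigned slowest = 0;
    for (size_t i = 0; i < n_stages; i++) {
        if (stage_cycles[i] > slowest) slowest = stage_cycles[i];
    }
    assert(slowest > 0);  /* at least one stage must cost something */
    return 1.0 / slowest;
}

/* Parallel organization: each of n_pes PEs runs every stage on its own
 * packet, so a PE finishes one packet per sum of stage costs.
 * Returns packets per cycle. */
double parallel_throughput(const unsigned *stage_cycles, size_t n_stages,
                           unsigned n_pes) {
    unsigned total = 0;
    for (size_t i = 0; i < n_stages; i++) {
        total += stage_cycles[i];
    }
    assert(total > 0);
    return (double)n_pes / total;
}
```

With hypothetical stage costs of 10, 30, and 20 cycles and three PEs, the pipelined model gives 1/30 packets per cycle and the parallel model 3/60 = 1/20. Under these assumptions, unbalanced stages penalize the pipeline, while the parallel form requires every PE to hold the code for every stage.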

Special Purpose Hardware
• Co-processors
  − Tightly integrated
    · Entire network processing kernel: PayloadPlus’ checksum/CRC engine, C-5 DCP’s queue management, Xelerated’s counters & meters
    · Algorithm based: IXP1200’s hash engine
  − 3rd party: more co-processor companies as NP companies
    · Security: encryption, authentication, IPSec
    · Classification
    · Lookup: memories mixed with computation, ternary CAMs
• Functional units
  − Single-cycle operations: find first bit set, extract/write byte/word, barrel shift
• Interfaces
  − Memory mapped
  − Special instructions
  − Configuration bits
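To make the single-cycle functional-unit operations concrete, here is a minimal software sketch of what each one computes. An NP would do each in one instruction; the function names are hypothetical, and counting "first bit set" from the least-significant end is an assumption, since the slide does not say which end is meant.

```c
#include <stdint.h>

/* Index of the least-significant set bit, or -1 if no bit is set.
 * (Assumption: "first" means lowest-order.) */
static inline int find_first_set(uint32_t x) {
    if (x == 0) return -1;
    int i = 0;
    while ((x & 1u) == 0) {
        x >>= 1;
        i++;
    }
    return i;
}

/* Extract byte n (0 = least significant) from a 32-bit word. */
static inline uint8_t extract_byte(uint32_t w, unsigned n) {
    return (uint8_t)(w >> (8u * (n & 3u)));
}

/* Return w with byte n replaced by b. */
static inline uint32_t write_byte(uint32_t w, unsigned n, uint8_t b) {
    unsigned sh = 8u * (n & 3u);
    return (w & ~((uint32_t)0xFFu << sh)) | ((uint32_t)b << sh);
}

/* Rotate left by r bits, the operation a barrel shifter performs
 * in a single step regardless of r. */
static inline uint32_t rotl32(uint32_t x, unsigned r) {
    r &= 31u;
    return r ? (x << r) | (x >> (32u - r)) : x;
}
```

In software, `find_first_set` loops up to 31 times, which is why header parsing and bitmap scanning benefit from a dedicated unit.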

Key characteristics of the NP architecture
• Programmable to perform the work they are best suited for
• Specially optimized instruction set for bit manipulation
• Highly parallel architecture to hide memory latency and obtain wire-speed performance
• Modular and scalable, so network vendors can build devices ranging from small and inexpensive to large and high-performance
• Special high-speed functional units for CRC, checksum, hashing, and table lookup algorithms
• High-bandwidth memory hierarchy to store packets and various tables

NP ISA Design
• ISA categorization
  − RISC (Intel, IBM, Motorola, Internet Machine)
  − VLIW (Agere, Motorola)
• ISA functions
  − Thread creation and synchronization
  − Inter-processor communication
  − Bit manipulation
  − CRC and checksum
  − Hashing
  − Read-and-modify (atomic)
  − Queuing
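As an example of the kind of kernel the checksum support targets, the following is a minimal software sketch of the 16-bit one's-complement Internet checksum (the algorithm described in RFC 1071). The function name is hypothetical, and the 32-bit accumulator assumes inputs well under 128 KB, which holds for packet headers.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of big-endian 16-bit words, complemented.
 * A trailing odd byte is padded with a zero low byte. */
uint16_t internet_checksum(const uint8_t *data, size_t len) {
    uint32_t sum = 0;
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len == 1) {
        sum += (uint32_t)data[0] << 8;
    }
    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16) {
        sum = (sum & 0xFFFFu) + (sum >> 16);
    }
    return (uint16_t)~sum;
}
```

Running it over an IPv4 header whose checksum field is zeroed yields the value to store in that field; running it over a header with a correct checksum yields 0. The per-word add-and-fold loop is what a dedicated checksum instruction or engine replaces.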

NP ISA Design
• Pros:
  − Closely matches the computation patterns of typical application kernels
  − Reduces the length of the critical path: good performance
• Cons:
  − Multiple ISAs, each targeting a different application domain
  − Needs compiler assistance
• Example: CryptoManiac [14], [15]

Parallelism Hierarchy
• NPs suit the full range of parallelism (PLP, TLP, FLP, MLP, ILP, BLP/DLP)
  − HW
    · Instruction-level parallelism (ILP)
    · Thread-level parallelism in a PE (TLP)
    · Multiprocessors on a chip (task-level parallelism)
    · Examples: pattern matching (ILP & DLP), data flows (TLP)
  − SW
    · Parallel instructions
    · Concurrent threads
    · Parallel and concurrent tasks
    · Parallel algorithms

Thread Granularity
• Fine-grain threads:
  − A few to a few tens of instructions
  − Thread switching within one cycle
• Mid-grain threads:
  − A few tens to hundreds of instructions
  − Lightweight threads (Pthreads and Java threads)
• Coarse-grain threads:
  − Thousands to millions of instructions
  − Multiprogramming (processes)

Packet-Level Parallelism

Layer      Parallelism   TLP
Ethernet   Packet        yes
IP         Packet        yes
TCP        Flow          yes
HTTP       Session       yes
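One common way to exploit flow-level parallelism without reordering packets inside a flow is to hash the flow identifier and use the result to pick a processing element. The sketch below assumes a 5-tuple flow key and an FNV-1a hash; the struct, the function names, and the choice of hash are illustrative assumptions, not something the slides prescribe.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 5-tuple identifying a flow. */
struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t proto;
};

/* Mix the low nbytes bytes of v into an FNV-1a hash state. */
static uint32_t fnv1a_mix(uint32_t h, uint32_t v, int nbytes) {
    for (int i = 0; i < nbytes; i++) {
        h ^= (v >> (8 * i)) & 0xFFu;
        h *= 16777619u;  /* 32-bit FNV prime */
    }
    return h;
}

/* Hash the key field by field so struct padding never enters the hash. */
uint32_t flow_hash(const struct flow_key *k) {
    uint32_t h = 2166136261u;  /* 32-bit FNV offset basis */
    h = fnv1a_mix(h, k->src_ip, 4);
    h = fnv1a_mix(h, k->dst_ip, 4);
    h = fnv1a_mix(h, k->src_port, 2);
    h = fnv1a_mix(h, k->dst_port, 2);
    h = fnv1a_mix(h, k->proto, 1);
    return h;
}

/* Packets of the same flow always map to the same PE, so their order
 * is preserved; different flows spread across the PEs. */
unsigned select_pe(const struct flow_key *k, unsigned num_pes) {
    assert(num_pes > 0);
    return flow_hash(k) % num_pes;
}
```

The trade-off in this scheme is load balance: a single heavy flow still lands on one PE, whereas pure packet-level dispatch balances better but needs a separate reordering step.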

Multithreaded Execution Resources
• PC (program counter)
• Registers
• Local buffers
• Stack
• Non-split transactions (DMA)
• Fast thread switching mechanism (single-cycle switch?)
• Heap?

Memory hierarchy
• CAM (Content Addressable Memory) locates data quickly by content rather than by address, so hashing becomes possible.
  − Stored data is no longer contiguous in memory space, which reduces sequential operations.
• Hashing
  − Traditionally, data are stored sequentially in memory.
  − Hashing the data is the key to breaking sequential access and achieving parallel operations.
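The contrast between sequential storage and hashed placement can be sketched with a small exact-match table. This is a software illustration under stated assumptions: a fixed 64-slot table, a multiplicative hash, and linear probing, with all names hypothetical. It is not a model of a hardware CAM and does not do longest-prefix matching.

```c
#include <stdbool.h>
#include <stdint.h>

#define TABLE_SLOTS 64u  /* must stay a power of two */

struct entry {
    bool used;
    uint32_t key;    /* e.g. a destination address */
    uint32_t value;  /* e.g. an output port */
};

struct table {
    struct entry slots[TABLE_SLOTS];
};

/* Multiplicative hash: keep the top 6 bits to index 64 slots. */
static uint32_t hash32(uint32_t k) {
    return (k * 2654435761u) >> 26;
}

/* Insert or update; returns false only when the table is full. */
bool table_insert(struct table *t, uint32_t key, uint32_t value) {
    uint32_t start = hash32(key);
    for (uint32_t n = 0; n < TABLE_SLOTS; n++) {
        struct entry *e = &t->slots[(start + n) & (TABLE_SLOTS - 1u)];
        if (!e->used || e->key == key) {
            e->used = true;
            e->key = key;
            e->value = value;
            return true;
        }
    }
    return false;
}

/* Jump straight to the hashed slot instead of scanning from slot 0. */
bool table_lookup(const struct table *t, uint32_t key, uint32_t *value) {
    uint32_t start = hash32(key);
    for (uint32_t n = 0; n < TABLE_SLOTS; n++) {
        const struct entry *e = &t->slots[(start + n) & (TABLE_SLOTS - 1u)];
        if (!e->used) return false;  /* empty slot ends the probe chain */
        if (e->key == key) {
            *value = e->value;
            return true;
        }
    }
    return false;
}
```

Because each key's probe starts at an independent slot, lookups for different keys touch different memory locations and can be issued by different threads or PEs concurrently, which is the parallelism the slide points at.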

Hiding Latency
• Multithreading
  − Hardware for zero-overhead context switching (2 to 64 threads)
  − Hardware for thread scheduling
• Split-transaction buses
• Memory
  − Small, specialized blocks
  − Prefetching
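A back-of-the-envelope model shows why the thread count matters. Assume each thread computes for C cycles, then stalls for L cycles on a memory reference, and that context switches cost nothing. The other N − 1 threads must then cover the stall, so (N − 1)·C ≥ L. The sketch below encodes that model; the function names are hypothetical, and treating the latency figures in the memory table as cycles is an assumption.

```c
#include <assert.h>

/* Smallest thread count N with (N - 1) * compute_cycles >= latency_cycles,
 * assuming zero-overhead context switches. */
unsigned threads_to_hide_latency(unsigned compute_cycles,
                                 unsigned latency_cycles) {
    assert(compute_cycles > 0);
    return 1u + (latency_cycles + compute_cycles - 1u) / compute_cycles;
}

/* Fraction of cycles the PE spends computing with the given number
 * of threads under the same model, capped at 1.0. */
double pe_utilization(unsigned threads, unsigned compute_cycles,
                      unsigned latency_cycles) {
    assert(compute_cycles > 0);
    double u = (double)threads * compute_cycles /
               (double)(compute_cycles + latency_cycles);
    return u > 1.0 ? 1.0 : u;
}
```

For example, with 50 compute cycles between references and a 150-cycle SRAM-class latency, four threads keep the PE fully busy, while two threads leave it idle half the time. A 300-cycle DRAM-class latency with only 20 compute cycles would need 16 threads, which is consistent with hardware supporting tens of contexts.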

Packet Processing with Threading

while (1) {
    read_packet();
    if (not_ready) switch_thread();   /* yield while the packet read is pending */
    parsing_packet();
    classification();
    if (not_ready) switch_thread();
    table_lookup();
    if (not_ready) switch_thread();   /* yield while the lookup is in flight */
    editing();
    if (not_ready) switch_thread();
    queuing();
    if (not_ready) switch_thread();
    scheduling();
    if (not_ready) switch_thread();
}

Integrated bus
• Data buses: many new memory technologies are applied to NPs to enhance throughput.
  − SDRAM
  − Double data rate DRAM (DDR)
  − Rambus
  − EZchip uses a 512-bit data bus
• A typical NP architecture has some type of integrated bus.
• This bus integrates the processor cores, the memory systems, the interfaces to the physical adapters, and the host system bus.
• This integration reduces part count and system complexity while improving performance.

On-chip Communication
• On-chip networks
  − ClearSpeed’s ClearConnect: switching fabric, distributed arbitration & clocking, 100 Gbps
  − Cognigine’s RSU Switch Fabric
• Buses
  − Intel’s IX bus interface
  − Applied Micro’s network-optimized co-processor interface
  − Brecis’ Multi-Service Bus: dynamic priority switching, queues for different types of traffic

Interconnection
• Choices:
  − Shared bus
  − Crossbar
  − Other: octagon [8]
• Issues:
  − Wiring complexity
  − Performance
  − Scalability

VLSI Implementation
• CPU pipelining for single-cycle thread switching
• Large register sets to support multithreading
• SRAM/FCRAM/DRAM memory controller design
• High-speed interface (SPI-3/4)
• Multiple buses on a system-on-chip

Core Technologies
• Router/switch system expertise
• Multiprocessor on a single chip
• Multithreading in a processor
• Efficient memory hierarchy
• Parallel programming environment
• VLSI implementation

Looking Forward
• Increasing data rate requirements → scalable architectures
  − Parallel processing
  − On-chip communication architectures
• Rise of co-processors
  − More co-processor companies as NP companies
  − Impact on programmability/flexibility
• Increasing importance of hiding latency
  − Diverging memory and processor speeds
  − More PEs → larger communication architectures
• Focus on mapping applications onto architectures

Some Challenges
• Intelligent design
  − Given a selection of programs and a target network link speed, find the ‘best’ design for the processor
    · Least area
    · Least power
    · Most performance
• Writing efficient multithreaded programs
  − NPs have
    · Heterogeneous compute resources
    · Non-uniform memory
    · Multiple interacting threads of execution
    · Real-time constraints
  − Make use of resources
    · How to use special instructions and hardware assists: compilers or hand-coding
  − Multithreaded programs
    · Manage access to shared state
    · Synchronization between threads
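Managing access to shared state often comes down to atomic read-modify-write operations, the same class of operation the memory table lists for scratch and SRAM. As a minimal sketch, the C11 code below keeps per-port statistics that several threads could update without a lock. The struct and function names are hypothetical, and C11 atomics stand in for whatever atomic instructions a particular NP provides.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-port counters shared by all packet-processing threads. */
struct port_stats {
    atomic_uint_fast64_t packets;
    atomic_uint_fast64_t bytes;
};

void stats_init(struct port_stats *s) {
    atomic_init(&s->packets, 0);
    atomic_init(&s->bytes, 0);
}

/* Each update is a single atomic add, so concurrent callers never lose
 * an increment. Relaxed ordering is enough for plain counters. */
void stats_record(struct port_stats *s, uint32_t packet_len) {
    atomic_fetch_add_explicit(&s->packets, 1, memory_order_relaxed);
    atomic_fetch_add_explicit(&s->bytes, packet_len, memory_order_relaxed);
}

uint64_t stats_packets(struct port_stats *s) {
    return (uint64_t)atomic_load_explicit(&s->packets, memory_order_relaxed);
}

uint64_t stats_bytes(struct port_stats *s) {
    return (uint64_t)atomic_load_explicit(&s->bytes, memory_order_relaxed);
}
```

The two counters are updated independently, so a reader may briefly see a packet counted before its bytes. State that must change as a unit, such as a queue head and tail, needs stronger synchronization than this.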

Summary
• NPs are developing rapidly and are an active research area
• Multithreaded NP architectures provide tremendous packet processing capability
• NPs can be applied at various network layers and in various applications
  − Traditional apps: forwarding, classification
  − Advanced apps: transcoding, URL-based switching, security, etc.
  − New apps