ONL Stats Block
David M. Zar
Applied Research Laboratory
Computer Science and Engineering Department

Stats Engine
• The Stats Engine is a single microengine (ME) devoted to accepting messages from a scratch ring and performing increment and add operations on counters.
  » All MEs that need to update counters use the Stats Engine.
  » Supported operations:
    - Atomic increment (+1)
    - Atomic add (+data)
  » Command format (one 32-bit word): Opcode (4 b) | Data (12 b) | Index (16 b)
David M. Zar - 5/26/2021
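
The command format above can be sketched as a pack/unpack pair. This is a minimal illustration, not the router's code: the function names are hypothetical, and the bit placement (Opcode in the top 4 bits, Data in the next 12, Index in the low 16) is an assumption read off the left-to-right order on the slide.

```python
# Field widths taken from the slide; the MSB-first placement is an assumption.
OPCODE_BITS, DATA_BITS, INDEX_BITS = 4, 12, 16


def pack_command(opcode, data, index):
    """Pack opcode, data and index into a single 32-bit command word."""
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode must fit in 4 bits")
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data must fit in 12 bits")
    if not 0 <= index < (1 << INDEX_BITS):
        raise ValueError("index must fit in 16 bits")
    return (opcode << (DATA_BITS + INDEX_BITS)) | (data << INDEX_BITS) | index


def unpack_command(word):
    """Split a 32-bit command word back into (opcode, data, index)."""
    opcode = (word >> (DATA_BITS + INDEX_BITS)) & 0xF
    data = (word >> INDEX_BITS) & 0xFFF
    index = word & 0xFFFF
    return opcode, data, index
```

A 12-bit Data field limits a single +data operation to values from 0 to 4095.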

ONL NP Router
[Figure: ONL NP Router block diagram. Blocks shown: Rx (2 ME), Mux (1 ME), Parse, Lookup, Copy (3 MEs), QM (1 ME), Hdr Fmt (1 ME), Tx (1 ME), Stats (1 ME), FreeList Mgr (1 ME), Plugins 0-4, XScale, TCAM, Assoc. Data ZBT-SRAM, and SRAM. Blocks are connected by NN rings, scratch rings, and SRAM rings (64 KW each).]

MEs -> Stats Block
[Figure: MEs send command words, Opcode (4 b) | Data (12 b) | Index (16 b), to the Stats block.]

Opcodes
Command word: Opcode (4 b) | Data (12 b) | Index (16 b)

Opcode   Operation    Target
0011     +1, +data    pre-Q counter specified in Index
0111     +1, +data    post-Q counter specified in Index
0010     +1           pre-Q counter specified in Index
0110     +1           post-Q counter specified in Index
0001     +data        pre-Q counter specified in Index
0101     +data        post-Q counter specified in Index
1011     +1, +data    global register specified in Index
1010     +1           global register specified in Index
1001     +data        global register specified in Index (not implemented - 4/23/07)
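
Read bit by bit, the opcode values listed above follow a regular pattern: bit 3 selects a global register, bit 2 selects post-Q rather than pre-Q, bit 1 requests +1, and bit 0 requests +data. The sketch below decodes an opcode on that reading. The names `DecodedOp` and `decode_opcode` are hypothetical, and rejecting opcodes outside the listed set is an assumption rather than documented behavior.

```python
from collections import namedtuple

# target is "pre-q", "post-q" or "global"; add_one/add_data are booleans.
DecodedOp = namedtuple("DecodedOp", "target add_one add_data")


def decode_opcode(opcode):
    """Decode a 4-bit opcode using the bit pattern visible in the opcode list."""
    is_global = bool(opcode & 0b1000)
    is_post_q = bool(opcode & 0b0100)
    add_one = bool(opcode & 0b0010)
    add_data = bool(opcode & 0b0001)

    if is_global and is_post_q:
        # 11xx does not appear in the opcode list; treated as invalid here.
        raise ValueError("opcode 11xx is not defined")
    if not (add_one or add_data):
        # xx00 requests neither +1 nor +data.
        raise ValueError("opcode requests no operation")

    if is_global:
        target = "global"
    else:
        target = "post-q" if is_post_q else "pre-q"
    return DecodedOp(target, add_one, add_data)
```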

Stats Counters
• Each Index specifies a group of four counters:
  » Pre-Q packet count
  » Pre-Q byte count
  » Post-Q packet count
  » Post-Q byte count
• The packet counters are updated when the +1 instructions are specified (opcodes 0-1-).
• The byte counters are updated when the +data instructions are specified (opcodes 0--1).
• For plug-ins, the use of each counter can be redefined, but the opcodes do not change (i.e., each stats index corresponds to two incrementers and two adders).
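
As an illustration of the two-incrementers, two-adders view, the sketch below applies stats-counter opcodes to one four-counter group. The function name is hypothetical, and the in-group order (pre-Q packet, pre-Q byte, post-Q packet, post-Q byte) is an assumption that simply follows the order of the list above. Counter width and wraparound are not modeled.

```python
# Assumed positions within a group, following the order listed on the slide.
PRE_Q_PKT, PRE_Q_BYTE, POST_Q_PKT, POST_Q_BYTE = range(4)


def update_group(group, opcode, data):
    """Apply a stats-counter opcode (0xxx) to a 4-counter group in place."""
    if opcode & 0b1000:
        raise ValueError("global-register opcode, not a stats-counter opcode")
    # Bit 2 selects the post-Q pair; otherwise the pre-Q pair is used.
    base = POST_Q_PKT if opcode & 0b0100 else PRE_Q_PKT
    if opcode & 0b0010:          # +1: packet counter (the incrementer)
        group[base] += 1
    if opcode & 0b0001:          # +data: byte counter (the adder)
        group[base + 1] += data
    return group


group = [0, 0, 0, 0]
update_group(group, 0b0011, 76)    # pre-Q: +1 and +76 bytes
update_group(group, 0b0110, 0)     # post-Q: +1 only
update_group(group, 0b0101, 1500)  # post-Q: +1500 bytes only
```

A plug-in that redefines the counters would still see the same behavior: positions 0 and 2 only ever increment by one, and positions 1 and 3 only ever add the Data field.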

Global Registers
• For system-wide counters, we define a separate set of global registers:
  » RX (packet and byte, 5 ports = 10 words)
  » TX (packet and byte, 5 ports = 10 words)
  » Drop counts (10 words)
  » Plug-in use (four per plug-in = 20 words)
  » Per-ME error counters (8 words)
  » 10 + 10 + 10 + 20 + 8 = 58, so reserve 64 words for these
• The register is incremented when the +1 instructions are specified (opcodes 101-).
• The register is added to when the +data instructions are specified (opcodes 10-1).
• The RX and TX counters are assigned on even-word boundaries (lsb = 0), so the packet and byte counters are paired and the +1, +data instruction can be applied to both in one command (opcode 1011).
• For plug-ins, the use of each register is under the control of the plug-in:
  » Four independent counters
  » Two sets of two counters
  » One set of two and two independent
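
The word budget and the even-word pairing rule can be checked with a few lines. This is a sketch only: the dictionary keys and the helper `can_pair` are hypothetical names, and the check captures only the stated alignment condition (lsb = 0), not how the Stats ME actually lays out the pair.

```python
# Word counts per category, as listed on the slide.
GLOBAL_REGISTER_BUDGET = {
    "rx_per_port": 5 * 2,    # packet + byte counter for each of 5 ports
    "tx_per_port": 5 * 2,    # packet + byte counter for each of 5 ports
    "drop_counts": 10,
    "plugin_use": 5 * 4,     # four registers for each of 5 plug-ins
    "per_me_error": 8,
}
RESERVED_WORDS = 64


def words_used(budget):
    """Total number of global-register words the budget consumes."""
    return sum(budget.values())


def can_pair(index):
    """True if a register index sits on an even-word boundary (lsb = 0)."""
    return (index & 1) == 0


spare_words = RESERVED_WORDS - words_used(GLOBAL_REGISTER_BUDGET)
```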

ONL Router Counter Registers (in dl_system.h)

// RX Per Port registers: (Updated by MUX)
ONL_ROUTER_RX_PORT0_PKT_CNTR
ONL_ROUTER_RX_PORT0_BYTE_CNTR
ONL_ROUTER_RX_PORT1_PKT_CNTR
ONL_ROUTER_RX_PORT1_BYTE_CNTR
ONL_ROUTER_RX_PORT2_PKT_CNTR
ONL_ROUTER_RX_PORT2_BYTE_CNTR
ONL_ROUTER_RX_PORT3_PKT_CNTR
ONL_ROUTER_RX_PORT3_BYTE_CNTR
ONL_ROUTER_RX_PORT4_PKT_CNTR
ONL_ROUTER_RX_PORT4_BYTE_CNTR

// TX Per Port registers: (Updated by HF)
ONL_ROUTER_TX_PORT0_PKT_CNTR
ONL_ROUTER_TX_PORT0_BYTE_CNTR
ONL_ROUTER_TX_PORT1_PKT_CNTR
ONL_ROUTER_TX_PORT1_BYTE_CNTR
ONL_ROUTER_TX_PORT2_PKT_CNTR
ONL_ROUTER_TX_PORT2_BYTE_CNTR
ONL_ROUTER_TX_PORT3_PKT_CNTR
ONL_ROUTER_TX_PORT3_BYTE_CNTR
ONL_ROUTER_TX_PORT4_PKT_CNTR
ONL_ROUTER_TX_PORT4_BYTE_CNTR

// IP Drop registers (Updated by PLC)
ONL_ROUTER_IP_HEC_DROP_CNTR
ONL_ROUTER_IP_LENGTH_ERR_DROP_CNTR
ONL_ROUTER_IP_HDR_LENGTH_ERR_DROP_CNTR
ONL_ROUTER_IP_VERSION_ERR_DROP_CNTR

ONL Router Counter Registers (cont.)

// PLC Drop registers (Updated by Parse, Lookup or Copy)
ONL_ROUTER_PLC_TO_PLUGIN_DROP_CNTR
ONL_ROUTER_PLC_TO_XSCALE_DROP_CNTR

// QM Drop registers (Updated by QM)
ONL_ROUTER_QUEUE_OVERFLOW_DROP_CNTR

// XScale Drop registers (Updated by XScale)
ONL_ROUTER_XSCALE_DROP_CNTR

// Rx Drop registers (Updated by Rx)
ONL_ROUTER_RX__DROP_CNTR

// Tx Drop registers (Updated by Tx)
ONL_ROUTER_TX_DROP_CNTR

// Per Block Generic Error Counters
ONL_ROUTER_RX_GENERIC_ERROR_CNTR
ONL_ROUTER_MUX_GENERIC_ERROR_CNTR
ONL_ROUTER_PLC_GENERIC_ERROR_CNTR
ONL_ROUTER_QM_GENERIC_ERROR_CNTR
ONL_ROUTER_HF_GENERIC_ERROR_CNTR
ONL_ROUTER_TX_GENERIC_ERROR_CNTR
ONL_ROUTER_STATS_GENERIC_ERROR_CNTR
ONL_ROUTER_FREELISTMGR_GENERIC_ERROR_CNTR

ONL Router Counter Registers (cont.)

// Plugin 0 Counters (for use however Plugin writer wants to use them)
ONL_ROUTER_PLUGIN_0_CNTR_0
ONL_ROUTER_PLUGIN_0_CNTR_1
ONL_ROUTER_PLUGIN_0_CNTR_2
ONL_ROUTER_PLUGIN_0_CNTR_3

// Plugin 1 Counters (for use however Plugin writer wants to use them)
ONL_ROUTER_PLUGIN_1_CNTR_0
ONL_ROUTER_PLUGIN_1_CNTR_1
ONL_ROUTER_PLUGIN_1_CNTR_2
ONL_ROUTER_PLUGIN_1_CNTR_3

// Plugin 2 Counters (for use however Plugin writer wants to use them)
ONL_ROUTER_PLUGIN_2_CNTR_0
ONL_ROUTER_PLUGIN_2_CNTR_1
ONL_ROUTER_PLUGIN_2_CNTR_2
ONL_ROUTER_PLUGIN_2_CNTR_3

// Plugin 3 Counters (for use however Plugin writer wants to use them)
ONL_ROUTER_PLUGIN_3_CNTR_0
ONL_ROUTER_PLUGIN_3_CNTR_1
ONL_ROUTER_PLUGIN_3_CNTR_2
ONL_ROUTER_PLUGIN_3_CNTR_3

// Plugin 4 Counters (for use however Plugin writer wants to use them)
ONL_ROUTER_PLUGIN_4_CNTR_0
ONL_ROUTER_PLUGIN_4_CNTR_1
ONL_ROUTER_PLUGIN_4_CNTR_2
ONL_ROUTER_PLUGIN_4_CNTR_3

Stats Counter Priority
• There are two levels of priority for Stats Counters:
  » High-priority (high-speed) counters are kept in local memory. There are 64 sets of counters for the router and 64 for the plug-ins.
  » Low-priority (low-speed) counters are kept in SRAM. There are 2^16 - 128 = 65408 of these.
• Stats Counters 0-127 point to the high-priority counters, while 128-65535 are low-priority counters.
• Using low-priority Stats Counters to count events that happen at high speed may degrade system performance (a pre-Q counter on a high-priority queue, for example).
• Plug-ins need to be aware of the priority segmentation so they can choose counters of the proper priority for their needs.
• Global Registers are always high-priority.
• Eight threads are used:
  » Seven threads process messages from the input scratch ring.
  » One thread writes 8 W chunks of the local-memory counters/registers to SRAM, so that each counter/register is updated in SRAM several times a second.
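
A plug-in writer choosing an index might use a helper like the one below. It is a sketch with a hypothetical name: the 0-127 / 128-65535 boundary comes from the slide, while the split of the high-priority range into router (0-63) and plug-in (64-127) halves is an assumption based on the order in which the two 64-set blocks appear in the local memory map.

```python
HIGH_PRIORITY_SETS = 128      # 64 router sets + 64 plug-in sets
INDEX_SPACE = 1 << 16         # the Index field is 16 bits wide


def classify(index):
    """Return (priority, owner/location) for a Stats Counter index."""
    if not 0 <= index < INDEX_SPACE:
        raise ValueError("index must fit in 16 bits")
    if index < 64:
        return ("high", "router")     # assumed: router sets come first
    if index < HIGH_PRIORITY_SETS:
        return ("high", "plug-in")    # assumed: plug-in sets follow
    return ("low", "sram")


low_priority_count = INDEX_SPACE - HIGH_PRIORITY_SETS
```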

Stats ME Local Memory Map

Words      Contents
0-63       Global Registers
64-127     Reserved
128-383    Stats Counters (router), 64*4 W = 256 W
384-639    Stats Counters (plug-ins), 64*4 W = 256 W
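
The map can be expressed as a small lookup table, which makes the region sizes easy to verify. The table contents come from the slide; the names `LOCAL_MEMORY_MAP`, `region_of`, and `region_sizes` are hypothetical.

```python
# (first word, last word, contents), word addresses as shown in the map.
LOCAL_MEMORY_MAP = [
    (0, 63, "global registers"),
    (64, 127, "reserved"),
    (128, 383, "stats counters (router)"),
    (384, 639, "stats counters (plug-ins)"),
]


def region_of(word_addr):
    """Return the name of the region containing a local-memory word address."""
    for first, last, name in LOCAL_MEMORY_MAP:
        if first <= word_addr <= last:
            return name
    raise ValueError("word address outside the mapped range")


def region_sizes():
    """Return the size of each region in words."""
    return {name: last - first + 1 for first, last, name in LOCAL_MEMORY_MAP}


total_words = sum(region_sizes().values())
```

On these figures the Stats ME uses 640 words of local memory, of which 512 hold the 128 high-priority counter groups.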

Stats Pseudocode

// Count threads: contexts 0 through 6
While (true and ctx={0:6}) {
    dl_source_scr_1word()
    decode_opcode()
    case (opcode) {
        Global Register:
            lm_addr = index << 2;
            do opcode;
        Stats Index:
            if (index > 127) {
                do slow_opcode;      // low-priority counter in SRAM
            } else {
                lm_addr = (128*4) + (index << 4);
                do fast_opcode;      // high-priority counter in local memory
            }
    }
}

// Write-back thread: context 7
While (true and ctx=7) {
    offset = 0;
    for (l_mem=0; l_mem<(64*4); l_mem=l_mem+8) {
        sram_write(GLOBAL_REGS_BASE, offset, l_mem, 8);
        offset = offset + 32;
    }
    offset = 0;
    for (l_mem=(128*4); l_mem<(128*16); l_mem=l_mem+8) {
        sram_write(ONL_STATS_BASE, offset, l_mem, 8);
        offset = offset + 32;
    }
}
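
The path selection in the count loop can be modeled in a few lines. This is a sketch under stated assumptions, not the microengine code: `dispatch` is a hypothetical name, `lm_addr` is taken to be a byte address (which is what `index << 2` for one word and `128*4` for word 128 suggest), and a global-register opcode is recognized by its top bit, as in the opcode list.

```python
WORD_BYTES = 4
STATS_BASE_BYTES = 128 * WORD_BYTES   # byte address of word 128, i.e. 512
FAST_INDEX_MAX = 127                  # indices above this take the slow path


def dispatch(opcode, index):
    """Return (path, lm_addr) for one command, mirroring the count loop."""
    if opcode & 0b1000:
        # Global register: one word per index.
        return ("global", index << 2)
    if index > FAST_INDEX_MAX:
        # Low-priority counter: handled in SRAM, no local-memory address.
        return ("slow", None)
    # High-priority counter group: four words (16 bytes) per index.
    return ("fast", STATS_BASE_BYTES + (index << 4))
```

Under the byte-address assumption, index 127 maps to byte 2544, which is word 636, the first word of the last four-word group in the local memory map.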

Stats Function Calls
• Defined in counter_util.uc:
  » _WU_preq_update(reg_num, tx_reg, data, update_sig, error_addr)             // +1 & +data
  » _WU_preq_register_add(reg_num, tx_reg, update_sig, error_addr)             // +1
  » _WU_preq_register_add(reg_num, tx_reg, data, update_sig, error_addr)       // +data
  » _WU_postq_update(reg_num, tx_reg, data, update_sig, error_addr)            // +1 & +data
  » _WU_postq_register_add(reg_num, tx_reg, update_sig, error_addr)            // +1
  » _WU_postq_register_add(reg_num, tx_reg, data, update_sig, error_addr)      // +data
  » _WU_global_register_add(reg_num, tx_reg, update_sig, error_addr)           // +1
  » _WU_global_register_add(reg_num, tx_reg, data, update_sig, error_addr)     // +data
  » _WU_global_register_update(reg_num, tx_reg, data, update_sig, error_addr)  // +1 & +data

Performance Targets
• How many packets are processed per second?
  » To hit a 5 Gb/sec rate:
    - 76 B per minimum IPv4 packet (64 B minimum Ethernet frame + 12 B IFS)
    - 1.4 GHz clock rate
    - 5 Gb/sec * 1 B/8 b * 1 packet/76 B = 8.23 Mp/sec
    - 1.4 Gcycle/sec * 1 sec/8.23 Mp = 170 cycles per packet
    - Compute budget: 170 cycles
    - Latency budget: (threads * 170); with 7 threads: 1190 cycles
• How many count requests per packet (typical packet)?
  » RX per-port count
  » TX per-port count
  » Pre-Q stats index
  » Post-Q stats index
• Total counts = 8.23 Mp/sec * 4 counts/packet = 32.92 Mcounts/sec
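
The budget arithmetic can be reproduced directly from the stated inputs. The variable names below are illustrative only. Computed without intermediate rounding, the packet rate comes out at roughly 8.22 Mp/sec and the count rate at roughly 32.9 Mcounts/sec, in line with the rounded figures on the slide.

```python
LINE_RATE_BPS = 5e9           # 5 Gb/sec target
MIN_PACKET_BYTES = 64 + 12    # minimum Ethernet frame + inter-frame spacing
CLOCK_HZ = 1.4e9              # 1.4 GHz microengine clock
COUNT_THREADS = 7             # threads servicing the scratch ring
COUNTS_PER_PACKET = 4         # RX, TX, pre-Q and post-Q counts

packets_per_sec = LINE_RATE_BPS / 8 / MIN_PACKET_BYTES
cycles_per_packet = int(CLOCK_HZ / packets_per_sec)     # compute budget
latency_budget = COUNT_THREADS * cycles_per_packet      # cycles across threads
counts_per_sec = packets_per_sec * COUNTS_PER_PACKET

print(f"{packets_per_sec / 1e6:.2f} Mp/sec")
print(f"{cycles_per_packet} cycles per packet")
print(f"{latency_budget} cycles latency budget")
print(f"{counts_per_sec / 1e6:.2f} Mcounts/sec")
```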

Stats Block Diagram
[Figure: flowchart of the count path with per-step costs (C/CLK = clock cycles, L = latency cycles)]
• Read Scratch Ring (SCR READ: 60 L + 2 C)
• Decode Opcode (3 C)
• Global Register? (4 CLK)
  » Y: LM_ADDR = (index << 2)
  » N: Index > 127? (3 CLK)
    - Y: Slow Counter
    - N: LM_ADDR = 512 + (index << 4)
• +data? (3 C)
  » Y: LM_ADDR++ = *LM_ADDR + data
• +1? (3 C)
  » Y: LM_ADDR = *LM_ADDR + 1
• Worst case (fast) is for Stats Counters: 20 Clocks + 60 Cycles Latency

Performance Results
• Total fast counts:
  » Count time is, effectively, 20 cycles (all 60 cycles of latency are hidden).
  » 1400 Mcycles/sec / 20 cycles/count = 70 Mcounts/sec*
  » Target is 32.92 Mcounts/sec.
• Slow counts:
  » Count time is about 150 - 60 = 90 cycles (the SRAM latency is not completely hidden).
  » 1400 Mcycles/sec / 90 cycles/count = 15.6 Mcounts/sec
• SRAM write-back:
  » After each count thread has had the chance to run, the write-back thread writes one 8-word block of local memory to SRAM.
  » Measured performance is 20 ms for a full write-back (50 updates per second).
  » This slows down the counting, but only by 19 cycles every 7th count (when the counter is fully loaded), or less than 3 instructions per count thread.
* In simulation, only 17 cycles were measured, for >82 Mcounts/sec.
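
The throughput figures follow from dividing the clock rate by the effective cycles per count. The sketch below redoes that arithmetic with illustrative names, taking the slow-path count time as the effective 90 cycles (150 - 60) and the write-back overhead as 19 cycles spread over 7 counts.

```python
CLOCK_MCYCLES_PER_SEC = 1400    # 1.4 GHz expressed in Mcycles/sec
WRITEBACK_MS = 20               # measured time for one full write-back


def mcounts_per_sec(cycles_per_count):
    """Throughput in Mcounts/sec for a given effective cycles per count."""
    return CLOCK_MCYCLES_PER_SEC / cycles_per_count


fast_rate = mcounts_per_sec(20)             # design estimate, fast counters
fast_rate_simulated = mcounts_per_sec(17)   # cycles measured in simulation
slow_rate = mcounts_per_sec(90)             # SRAM-backed counters
writebacks_per_sec = 1000 / WRITEBACK_MS    # full write-backs per second
writeback_cycles_per_count = 19 / 7         # amortized write-back overhead
```

Both fast-path figures are comfortably above the roughly 33 Mcounts/sec target, while the slow path is not, which is why high-rate events belong on high-priority counters.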

Lookup File locations
• Code
  » src/applications/ONL_Router/src/freelistMgr.uc
  » src/library/dataplane/counter_util.uc
• Include Paths
  » src/applications/ONL_Router/src/dispatch_loop/ONL/
    - dl_source.h and dl_source.uc: dl_source() and dl_sink() functions
  » Other, standard, include paths (Intel SDK provided)