Network State Awareness and Troubleshooting Faraz Shamim Aamer Akhter Technical Leader Director Product Management Cisco Systems Inc
Agenda
• Troubleshooting Methodology
• Packet Forwarding Review
• Control Plane
  • Active Monitoring
  • Logging
  • Routing Protocol Stability
• Data Plane
  • Active Monitoring
  • Passive Flow Monitoring
  • QoS
• Getting Started
Keeping Focused: What This Session is About
• This session is about basic network troubleshooting, focusing on fault detection & isolation
• Mostly vendor neutral
• For context, we will cover some basic methodologies and functional elements of network behavior
• This session is NOT about
  • Architectures of specific platforms
  • Data Center technologies
  • Routing Protocols Troubleshooting
• This is a 90-minute tour ;-)
The Big Picture
[Figure: a client ("Not happy") reaches a server across the network; an application operator and a network operator each look at the problem. Speech bubbles: "It's not the network", "It's the network", "Internet's down.", "Somebody's downloading something. (?)", "Can't ping it.", "Pings fine!", "Is it Monday?"]
Some More (network) Detail
• A lot of stuff going on
• Multiple networks
• Multiple applications
• Multiple layered services
• Mis-information / inconsistency
[Figure labels: Client ("Not happy"), LAN, DHCP, DNS, 802.1x, Enterprise WAN, Enterprise DC, Server A, Server B, ISP A, Internet, DNS]
… and it keeps on going
• Redundant paths / ECMP / LAG
• Overlays
• Load balancers
• Firewalls
• NATs
[Figure labels: Client ("Not happy"), LAN, DHCP, DNS, 802.1x, Enterprise WAN, Enterprise DC, Server A, Server B, ISP A, ISP B, Internet, DNS]
Why network state awareness?
• What is it:
  • View of network, what it is doing, and why
  • Monitoring of data network performance, in comparison with previous working states
• Quick detection of hard failures
• Early warning for soft failures
  • performance issues
  • and tomorrow's problems
• Faster problem resolution
• Greater confidence in network by users and application operators
Think Like a Network Detective
[Figure: cycle of steps: Find the Suspects, Question Suspects, Improve, Be Prepared]
Control Plane & Data Plane
• Control Plane
  • Processes variety of information sources and policies, creates routing information base (RIB)
  • Best known intention w/o actual packet in hand
  • Inputs shown: gossip from other routers, admin edict, routing protocol(s), APIs, statics, PfR
• Data Plane
  • The actual forwarding process (might be SW or HW based)
  • Granted some decision flexibility
  • Driven by arriving packet details, traffic conditions etc.
[Figure: a packet arrives on Int A and may leave on Int B or Int C. Control plane checks: check L3 routing, check routes, check policy. Data plane checks: check forwarding, check interface, check policy-map int…, check flow monitor. Data plane passive measurements: ifmib, CbQoS, *Flow.]
Data Plane Decision Flexibility
• Control plane: condenses options driven by policies and (relatively) slower-moving, aggregated information, e.g. prefix reachability, interface state
• Data plane responds to packet conditions
  • Destination prefix to egress interface matching
  • Multi-path (ECMP / LAG) member selection
  • Interface congestion
  • QoS class state
  • Access lists
  • Packet processing fields (TTL expire, etc.)
  • IPv4 fragmentation, etc.
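The multi-path member selection mentioned above can be sketched in a few lines. This is only an illustration: it assumes a hash over the flow 5-tuple reduced modulo the number of members. The function name `pick_member`, the use of CRC32, and the sample addresses and next hops are assumptions made for the example, not any platform's actual algorithm.

```python
import zlib

def pick_member(src_ip, dst_ip, proto, src_port, dst_port, members):
    """Pick one ECMP/LAG member for a flow (hypothetical hash, for illustration)."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    # Same 5-tuple always hashes to the same member, so a flow stays on one path.
    return members[zlib.crc32(key) % len(members)]

members = ["R3", "R4"]  # hypothetical next hops
first = pick_member("10.0.0.1", "10.0.1.1", 17, 40000, 33434, members)
again = pick_member("10.0.0.1", "10.0.1.1", 17, 40000, 33434, members)
print(first, again)  # identical: the choice is deterministic per flow
```

Because the choice depends on packet fields, a probe that changes its ports may or may not follow the same path as the user traffic being investigated.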
Network as a System: Independent Decisions
• Each network device makes an independent forwarding decision
  • Explicit local / domain policies
  • Device perspective might not be symmetric
  • Data plane flexibility
• Asymmetric routing
  • Generally happens at WAN-edge and admin boundaries (traffic engineering)
[Figure: hosts A and B connected through routers R1 to R6, with a region labeled "your network" and a region labeled "You don't control". Callouts: "Congested link", "R5 is doing ECMP hash".]
Data Plane and Control Plane Changes
• Change is normal, but some changes are more interesting:
  • Single change that causes loss of reachability or suboptimal performance
  • Instability: high rate of change
• 3 Ws: when, where, and what
Control Plane
3 Ws: when, where, and what
What do I have?
• Establish inventory baseline
  • Device names, IPs, configuration
  • Modular HW configuration
  • Serial # (for support & replacement)
  • History (where has it been placed)
• Clearly label devices, ownership and contact info
  • Establish standards for location, device/port names
• Check for changes periodically (tooling)
Example device label: <owner/dept> <device-name> <IP address> <Contact>
Example cable label: <current-location> to <destination-location> <circuit src/dst id>
BRKARC-2025
How is it wired together?
• Establish network topology baseline
• Be prepared to be surprised!
• L2 protocol for discovery (LLDP?)
  • Cisco, Foundry, Nortel
• Visual inspection
[Figure: R1, SW1, R2]
Tools for Topology & Inventory Management
• Most NMS tools have some element of inventory and topology awareness
• NetBrain
• (open source) Netdisco http://www.netdisco.org
• (open source) Netdot https://osl.uoregon.edu/redmine/projects/netdot
Logging
• Centrally: for ease of analysis and search
  • Moogsoft: automates early detection of service failures, collaboration & knowledge base
  • syslog-ng: preprocessing, relay and store (file/db)
  • Logstash (ELK), fluentd: multisource collection, storage and analysis
• Locally: in case logs can't get home
State of the Routing Table
• Be familiar with normal behavior of important service prefixes
• Establish quickly if problem is control plane or data plane
• Check routing table / ipRouteTable MIB / check ip traffic (drop stats)
• Track objects

#show ip route 192.168.2.2
Routing entry for 192.168.2.2/32
  Known via "ospf 1", distance 110, metric 11, type intra area
  Last update from 10.0.0.2 on FastEthernet0/0, 00:13 ago
  Routing Descriptor Blocks:
  * 10.0.0.2, from 2.2, 00:13 ago, via FastEthernet0/0
      Route metric is 11, traffic share count is 1
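To check whether an important service prefix is still covered by the control plane, a routing table snapshot can be queried with a longest-prefix match. The sketch below is a minimal illustration using Python's standard `ipaddress` module. The `rib` contents and the function name `best_route` are hypothetical, and a real snapshot would come from the router (CLI or MIB), which is not shown here.

```python
import ipaddress

def best_route(rib, destination):
    """Return (prefix, next_hop) of the longest matching prefix, or None."""
    dst = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop in rib.items():
        net = ipaddress.ip_network(prefix)
        if dst in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return None if best is None else (str(best[0]), best[1])

# Hypothetical snapshot: prefix -> next hop
rib = {
    "0.0.0.0/0": "10.0.0.1",
    "192.168.2.0/24": "10.0.0.3",
    "192.168.2.2/32": "10.0.0.2",
}
print(best_route(rib, "192.168.2.2"))  # the /32 wins over the /24 and the default
```

If the expected specific prefix is missing and only a default route matches, the problem is more likely in the control plane than in forwarding.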
OSPF Area / AS-Wide
• Remember that OSPF data in area should be consistent
• Understand 'normal' rate of changes
  • LSA refresh every 30 min unless a change
  • Track SPF runs over time
  • Number of LSAs expected
  • OSPF-MIB: ospfSpfRuns, ospfAreaLsaCount
• Route missing?
  • Where is the network supposed to be attached? Is it still?
  • Check interface (on advertising router)
  • Check ospf database …

# show ip ospf
 Routing Process "ospf 1" with ID 192.168.0.1
 Start time: 00:01:46.195, Time elapsed: 00:48:27.308
 Supports only single TOS(TOS0) routes
 Supports opaque LSA
 Supports Link-local Signaling (LLS)
 Supports area transit capability
 Supports NSSA (compatible with RFC 3101)
 Supports Database Exchange Summary List Optimization (RFC 5243)
 Event-log enabled, Maximum number of events: 1000, Mode: cyclic
 Router is not originating router-LSAs with maximum metric
 Initial SPF schedule delay 5000 msecs
 Minimum hold time between two consecutive SPFs 10000 msecs
 Maximum wait time between two consecutive SPFs 10000 msecs
 Incremental-SPF disabled
 Minimum LSA interval 5 secs
 Minimum LSA arrival 1000 msecs
 LSA group pacing timer 240 secs
 Interface flood pacing timer 33 msecs
 Retransmission pacing timer 66 msecs
 Number of external LSA 0. Checksum Sum 0x000000
 Number of opaque AS LSA 0. Checksum Sum 0x000000
 Number of DCbitless external and opaque AS LSA 0
 Number of DoNotAge external and opaque AS LSA 0
 Number of areas in this router is 1. 1 normal 0 stub 0 nssa
 Number of areas transit capable is 0
 External flood list length 0
 IETF NSF helper support enabled
 Cisco NSF helper support enabled
 Reference bandwidth unit is 100 mbps
    Area BACKBONE(0)
        Number of interfaces in this area is 4 (1 loopback)
        Area has no authentication
        SPF algorithm last executed 00:47:05.379 ago
        SPF algorithm executed 4 times
        Area ranges are
        Number of LSA 16. Checksum Sum 0x078460
        Number of opaque link LSA 0. Checksum Sum 0x000000
        Number of DCbitless LSA 0
        Number of indication LSA 0
        Number of DoNotAge LSA 0
        Flood list length 0
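Tracking SPF runs over time amounts to polling a counter (for example ospfSpfRuns, or the "SPF algorithm executed N times" line) and looking at the change per interval. The following is a minimal sketch of that arithmetic. The sample values, the five-minute polling interval, the baseline, and the threshold factor are all made up for illustration, and the polling itself is not shown.

```python
def spf_rates(samples):
    """samples: list of (timestamp_seconds, spf_run_counter).
    Returns (timestamp, runs_per_hour) for each polling interval."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0 or c1 < c0:  # skip bad intervals and counter resets
            continue
        rates.append((t1, (c1 - c0) * 3600 / elapsed))
    return rates

def unusual(rates, baseline_per_hour, factor=3.0):
    """Intervals whose rate exceeds the known-normal baseline by `factor`."""
    return [(t, r) for t, r in rates if r > baseline_per_hour * factor]

# Hypothetical polls every 300 s: counter sits at 4, then jumps
samples = [(0, 4), (300, 4), (600, 5), (900, 25)]
rates = spf_rates(samples)
print(rates)
print(unusual(rates, baseline_per_hour=12))
```

The useful part is the baseline: a rate is only "unusual" relative to what this area normally does.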
OSPF Neighborships
• Neighbor adjacencies
  • Check ospf neighbor detail (OSPF-MIB: ospfNbrState, ospfNbrEvents, ospfNbrLsRetransQLen)
  • How many state changes occur?
  • What is the current state?
  • Any retransmission happening?
  • Check the interface queue

# show ip ospf neighbor detail
 Neighbor 192.168.0.7, interface address 10.0.0.3
    In the area 0 via interface GigabitEthernet0/1
    Neighbor priority is 1, State is FULL, 6 state changes
    DR is 10.0.0.3 BDR is 10.0.0.4
    Options is 0x12 in Hello (E-bit, L-bit)
    Options is 0x52 in DBD (E-bit, L-bit, O-bit)
    LLS Options is 0x1 (LR)
    Dead timer due in 00:39
    Neighbor is up for 00:33:10
    Index 2/2/2, retransmission queue length 0, number of retransmission 0
    First 0x0(0)/0x0(0) Next 0x0(0)/0x0(0)
    Last retransmission scan length is 0, maximum is 0
    Last retransmission scan time is 0 msec, maximum is 0 msec
BGP Monitoring Protocol (BMP) Overview: Collecting Pre-Policy BGP Messages
[Figure: a BMP client (router) receives iBGP updates from an internal BGP peer and eBGP updates from external BGP peers into Adj-RIB-in (pre-inbound-filter); inbound filtering/policing produces Loc-RIB (post-inbound-filter). The client sends BMP messages (BGP Monitor Protocol updates) to a BMP collector.]
BGP Monitoring Protocol
• IETF RFC 7854
• BMP client (router) provides pre-policy view of the ADJ-RIB-IN of a peer
• Update messages from peer sent to BMP receiver
• Example uses:
  • Realtime visualizer of BGP state
  • Traffic engineering analytics
  • BGP policy exploration
OpenBMP http://www.openbmp.org
• Historical record of prefix withdraws
• Current route views and peer status
Data Plane
3 Ws: when, where, and what
User / Agent Checks
• Treat network as a black box: are your beacon services working?
• Synthetic service check (HTTP, DNS, etc.)
• Ping (not all remotes will respond)
• Data plane is exercised and tested
• Variety = better coverage (multiple IP addresses / L4 ports per location)
• Validate similar treatment (QoS) as real user traffic
• Uptime and performance (loss, latency) metrics
• Look for patterns, changes from normal. All down vs some down.
• Capture and validate real user (human) incidents. What got missed?
• Use wisely: network and server resources consumed
[Figure: hosts A and B connected through routers R1, R2, R3, R5, R6]
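The "all down vs some down" distinction can be made mechanical once check results are collected per target. The sketch below only shows that classification step; the target names and results are invented for the example, and how each check is actually run (HTTP, DNS, ping) is left out.

```python
def classify(results):
    """results: dict mapping target name -> True if its check succeeded."""
    if not results:
        return "no data"
    up = sum(1 for ok in results.values() if ok)
    if up == len(results):
        return "all up"
    if up == 0:
        return "all down"  # suspect something shared: the probe host, its uplink, DNS
    return "some down"     # suspect something specific to the failing targets

# Hypothetical beacon results from one probe location
checks = {"dns-site-a": True, "http-site-a": True, "http-site-b": False}
print(classify(checks))
```

Running the same classification from several probe locations helps separate a problem near one probe from a problem near the service.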
IP SLA* (RFC 6812): Synthetic Traffic Measurements
• Uses: Network Performance Monitoring, Availability, VoIP Monitoring, Service Level Agreement (SLA) Monitoring, Multiprotocol Label Switching (MPLS) Monitoring, Network Assessment, Troubleshooting
• Measurement metrics: Packet Loss, Latency, Network Jitter, Dist. of Stats, Connectivity
• Operations: Jitter, FTP, DNS, DHCP, DLSW, ICMP, UDP, TCP, HTTP, LDP, H.323, SIP, RTP
[Figure: an IP SLA source sends active generated traffic to measure the network toward a destination or IP SLA responder; results are available as MIB data]
• IPSLA on router/switch (shadow router?)
• User end-system based agent software
• Dedicated agent
*IP SLA can be replaced with other monitoring tools used by other vendors, such as RPM of Juniper etc.
traceroute
Internet: aka the TCP/80 network
• Understand the limitations
• Sends 3 packets (default) at each TTL
• Implementations
  • Linux/Cisco: UDP (ICMP and TCP-SYN are Linux optional)
    • UDP DST port # used to keep track of packets, increments per packet. Initial = 33434 (default)
    • SRC port #: randomized (Linux), incrementing per packet (Cisco IOS)
    • Widest dispersion against possibilities. Difficult to understand though.
  • Linux (GNU inetutils-traceroute)
    • UDP DST port # increments per TTL (not per packet)
    • SRC port is random but fixed per entire run
    • Narrower dispersion. Story might be misleading.
  • Windows: ICMP Echo request
    • ICMP blocked frequently
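The two UDP destination-port schemes above differ only in when the port is incremented, which is easy to see side by side. This sketch just computes the port numbers; it sends no packets. The function names are made up, and the exact starting offset used by a given implementation is an assumption here (both start at the default 33434).

```python
BASE_PORT = 33434      # default initial UDP destination port
PROBES_PER_TTL = 3     # default number of probes per TTL

def dst_port_per_packet(ttl, probe):
    """Port increments with every probe (the Linux/Cisco behavior described above)."""
    return BASE_PORT + (ttl - 1) * PROBES_PER_TTL + probe

def dst_port_per_ttl(ttl, probe):
    """Port increments once per TTL (the inetutils behavior described above)."""
    return BASE_PORT + (ttl - 1)

for ttl in (1, 2):
    per_packet = [dst_port_per_packet(ttl, p) for p in range(PROBES_PER_TTL)]
    per_ttl = [dst_port_per_ttl(ttl, p) for p in range(PROBES_PER_TTL)]
    print(ttl, per_packet, per_ttl)
```

With per-packet increments, every probe presents a different 5-tuple to ECMP hashing, so the three probes at one TTL can land on different paths. With per-TTL increments, the three probes at a hop share a 5-tuple and tend to show a single responder.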
Reference
Unix traceroute
• Multiple path options
• Topology 'shortcuts' (same router seen at diff hop)
• Ultimately all paths result in similar e2e delay

Responders per hop (one letter per probe):
1 AAA, 2 BBB, 3 CCC, 4 DDD, 5 EEE, 6 FGF, 7 HII, 8 JKK, 9 JLJ, 10 LLM, 11 NNM, 12 NNO, 13 PPP, 14 QQQ, 15 ***, 16 RRR

$ traceroute 62.2.88.172
traceroute to 62.2.88.172 (62.2.88.172), 30 hops max, 60 byte packets
 1 152.242.65 (152.242.65) 1.044 ms 1.371 ms 1.585 ms
 2 152.240.8 (152.240.8) 0.219 ms 0.328 ms 0.327 ms
 3 128.109.70.9 (128.109.70.9) 1.066 ms 1.059 ms 1.168 ms
 4 rtp7600-gw-to-dep7600-gw2.ncren.net (128.109.70.137) 1.634 ms 1.628 ms 1.736 ms
 5 rlasr-gw-link1-to-rtp7600-gw.ncren.net (128.109.9.17) 5.354 ms 5.446 ms 5.557 ms
 6 128.109.9.117 (128.109.9.117) 5.671 ms 128.109.9.170 (128.109.9.170) 7.141 ms 128.109.9.117 (128.109.9.117) 5.433 ms
 7 wscrs-gw-to-ws-a1a-ip-asr-gw-sec.ncren.net (128.109.1.105) 9.174 ms 128.109.1.209 (128.109.1.209) 8.256 ms 6.397 ms
 8 dcp-brdr-03.inet.qwest.net (205.171.251.110) 18.414 ms chr-edge-03.inet.qwest.net (65.114.0.205) 27.353 ms 27.438 ms
 9 dcp-brdr-03.inet.qwest.net (205.171.251.110) 21.739 ms 63-235-40-106.dia.static.qwest.net (63.235.40.106) 17.750 ms dcp-brdr-03.inet.qwest.net (205.171.251.110) 22.450 ms
10 63-235-40-106.dia.static.qwest.net (63.235.40.106) 22.531 ms 22.516 ms 84-116-130-173.aorta.net (84.116.130.173) 140.738 ms
11 nl-ams02a-rd1-te0-2-0-2.aorta.net (84.116.130.65) 140.831 ms 140.816 ms 84-116-130-173.aorta.net (84.116.130.173) 144.819 ms
12 nl-ams02a-rd1-te0-2-0-2.aorta.net (84.116.130.65) 144.074 ms 144.761 ms 84-116-130-58.aorta.net (84.116.130.58) 138.455 ms
13 84-116-130-58.aorta.net (84.116.130.58) 141.844 ms 141.924 ms 142.459 ms
14 84.116.204.234 (84.116.204.234) 145.603 ms 145.891 ms 145.987 ms
15 * * *
16 62-2-88-172.static.cablecom.ch (62.2.88.172) 268.281 ms 268.245 ms 268.176 ms

Callouts:
• Hop 8: +10 ms (unsustained)
• Hop 10: +120 ms (sustained), Atlantic crossing
• Hops 15-16: filter, then +>100 ms delay; ~268 ms (all three)
Reference
Unix inetutils traceroute
• Narrower view (no alternate paths directly seen)
• Repeating nodes suggest multipath, or (unlikely) a routing issue

$ inetutils-traceroute --resolve-hostname 62.2.88.172
traceroute to 62.2.88.172 (62.2.88.172), 64 hops max
 1 152.242.65 (152.242.65) 0.783 ms 0.727 ms 0.798 ms
 2 152.240.8 (152.240.8) 0.226 ms 0.228 ms 0.221 ms
 3 128.109.70.9 (128.109.70.9) 0.967 ms 0.980 ms 0.962 ms
 4 128.109.70.137 (rtp7600-gw-to-dep7600-gw2.ncren.net) 1.576 ms 1.598 ms 1.567 ms
 5 128.109.9.17 (rlasr-gw-link1-to-rtp7600-gw.ncren.net) 5.149 ms 5.140 ms 5.126 ms
 6 128.109.9.166 (128.109.9.166) 7.113 ms 7.098 ms 7.306 ms
 7 128.109.1.209 (128.109.1.209) 7.835 ms 8.326 ms 7.958 ms
 8 65.114.0.205 (chr-edge-03.inet.qwest.net) 19.944 ms 9.299 ms 40.372 ms
 9 63.235.40.106 (63-235-40-106.dia.static.qwest.net) 18.442 ms 18.412 ms 18.432 ms
10 63.235.40.106 (63-235-40-106.dia.static.qwest.net) 22.424 ms 22.391 ms 75.960 ms
11 84.116.130.173 (84-116-130-173.aorta.net) 145.434 ms 146.301 ms 145.445 ms
12 84.116.130.58 (84-116-130-58.aorta.net) 137.583 ms 137.556 ms 137.661 ms
13 84.116.130.58 (84-116-130-58.aorta.net) 142.476 ms 141.886 ms 141.819 ms
14 84.116.204.234 (84.116.204.234) 144.841 ms 145.034 ms 144.964 ms
15 * * *
16 62.2.88.172 (62-2-88-172.static.cablecom.ch) 287.318 ms 176.670 ms 254.237 ms

Callout: packets for hops 9 and 12 took a 'shortcut', and packets for hops 10 and 13 went the long way.
Reference
LFT
• lft 'layer 4 traceroute' dynamically adjusts to responses
• Firewall detection, whois and AS lookup integrated
• Narrower packet changes, so narrower multi-path

$ sudo lft -ENA 62.2.88.172
Tracing ________________________________.
TTL LFT trace to 62-2-88-172.static.cablecom.ch (62.2.88.172):80/tcp
 1 [AS81] [NCREN-B22] 152.242.65 20.1/17.2 ms
 2 [AS81] [NCREN-B22] 152.240.8 20.1/20.1 ms
 3 [AS81] [CONCERT] 128.109.70.9 20.1/20.1 ms
 4 [AS81] [CONCERT] rtp7600-gw-to-dep7600-gw2.ncren.net (128.109.70.137) 20.1/20.1 ms
 5 [AS81] [CONCERT] rlasr-gw-link1-to-rtp7600-gw.ncren.net (128.109.9.17) 20.1/20.1 ms
 6 [AS81] [CONCERT] 128.109.9.117 20.1/20.1 ms
 7 [AS209] [unknown] chr-edge-03.inet.qwest.net (65.121.156.209) 20.1/19.5 ms
 8 [AS209] [QWEST-INET-35] dcp-brdr-03.inet.qwest.net (205.171.251.110) 20.1/18.4 ms
 9 [AS209] [QWEST-INET-17] 63-235-40-106.dia.static.qwest.net (63.235.40.106) 20.1/60.3 ms
10 [AS6830] [84-RIPE/LGI-Infrastructure] 84-116-130-173.aorta.net (84.116.130.173) 160.7/160.7 ms
11 [AS6830] [84-RIPE/LGI-Infrastructure] nl-ams02a-rd1-te0-2-0-2.aorta.net (84.116.130.65) 160.7/160.7 ms
12 [AS6830] [84-RIPE/LGI-Infrastructure] 84-116-130-58.aorta.net (84.116.130.58) 140.6/140.6 ms
** [firewall] the next gateway may statefully inspect packets
13 [AS6830] [84-RIPE/LGI-Infrastructure] 84.116.204.234 160.7/160.6 ms
** [neglected] no reply packets received from TTL 14
15 * [AS6830] [RIPE-C3/CC-HO841-NET] [target] 62-2-88-172.static.cablecom.ch (62.2.88.172):80 160.7 ms

Callout: used tcp/80 SYN.
Reference
MTR
• Interactive combined traceroute and ping
• Gives a sense of health of path (loss, delay standard deviation)
• Narrow path view

$ mtr 62.2.88.172
aakhter-nlr-ubuntu-01 (0.0) Sat May 30 18:57:09 2015
Keys: Help Display mode Restart statistics Order of fields quit
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
 1. 152.242.65 0.0% 145 0.8 0.9 0.7 10.0 0.8
 2. 152.240.8 0.0% 145 0.3 0.2 0.3 0.0
 3. 128.109.70.9 0.0% 145 1.0 3.3 1.0 182.3 17.2
 4. rtp7600-gw-to-dep7600-gw2.ncren.net 1.0% 145 9.2 4.1 1.6 203.4 18.6
 5. rlasr-gw-link1-to-rtp7600-gw.ncren.net 0.0% 145 5.3 5.1 6.8 0.2
 6. 128.109.9.166 0.0% 145 7.1 7.3 7.1 16.1 0.8
 7. wscrs-gw-to-ws-a1a-ip-asr-gw-sec.ncren.net 0.0% 145 6.8 8.3 6.2 10.6 1.0
 8. chr-edge-03.inet.qwest.net 0.0% 145 9.4 12.3 9.3 62.1 9.5
 9. dcp-brdr-03.inet.qwest.net 0.0% 145 21.8 22.8 21.7 70.7 5.5
10. 63-235-40-106.dia.static.qwest.net 0.0% 145 21.8 24.5 21.7 86.1 10.6
11. 84-116-130-173.aorta.net 0.0% 145 144.8 145.0 144.7 152.9 1.0
12. nl-ams02a-rd1-te0-2-0-2.aorta.net 0.0% 145 144.1 145.5 144.0 165.4 3.7
13. 84-116-130-58.aorta.net 5.0% 144 142.9 142.3 142.0 145.6 0.4
14. 84.116.204.234 5.0% 144 145.1 144.9 145.3 0.0
15. 217-168-62-150.static.cablecom.ch 5.0% 144 145.9 146.1 145.2 164.3 1.9
16. 62-2-88-172.static.cablecom.ch 5.0% 144 313.0 260.3 152.6 508.0 80.0

Callouts:
• Early hops: just local noise, no carry over to later hops
• Hops 13-16: sustained loss. Likely something wrong 12 -> 13, or on the way back
• Hop 16: note variability, probably just the end system
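The per-hop columns in an MTR-style report are simple summaries of the probe results for that hop. The sketch below shows one way to compute them from a list of round-trip times, with `None` standing for a lost probe. The function name and the sample values are invented, and the choice of population standard deviation is an assumption for the example rather than a statement about how mtr computes it.

```python
import statistics

def summarize(rtts_ms):
    """rtts_ms: RTTs in ms for one hop, None where the probe got no reply."""
    sent = len(rtts_ms)
    replies = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (sent - len(replies)) / sent if sent else 0.0
    if not replies:
        return {"loss_pct": loss_pct, "snt": sent}
    return {
        "loss_pct": loss_pct,
        "snt": sent,
        "last": replies[-1],
        "avg": statistics.fmean(replies),
        "best": min(replies),
        "wrst": max(replies),
        "stdev": statistics.pstdev(replies),  # assumption: population std dev
    }

hop = summarize([10.0, None, 20.0, 30.0])  # hypothetical samples
print(hop)
```

Loss that appears at one hop and persists on every later hop is the pattern worth chasing; loss or jitter at a single middle hop that does not carry forward is often just that router deprioritizing replies.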
Check interface
• Classic command
• Check interface 'up' status
• Stability: check log event or check routing table stability
• Monitor in/out bit/packet changes

# show interface
GigabitEthernet1 is up, line protocol is up
  Hardware is CSR vNIC, address is 000c.291a.7f97 (bia 000c.291a.7f97)
  Internet address is 192.168.225.130/24
  MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
     reliability 255/255, txload 1/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full Duplex, 1000Mbps, link type is auto, media type is RJ45
  output flow-control is unsupported, input flow-control is unsupported
  ARP type: ARPA, ARP Timeout 04:00
  Last input 00:05:35, output 00:09:58, output hang never
  Last clearing of "show interface" counters never
  Input queue: 0/375/0/0 (size/max/drops/flushes); Total output drops: 0
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  5 minute input rate 0 bits/sec, 0 packets/sec
  5 minute output rate 0 bits/sec, 0 packets/sec
     25349 packets input, 2381158 bytes, 0 no buffer
     Received 0 broadcasts (0 IP multicasts)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 0 multicast, 0 pause input
     3958 packets output, 312408 bytes, 0 underruns
     0 output errors, 0 collisions, 0 interface resets
     56 unknown protocol drops
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 pause output
     0 output buffer failures, 0 output buffers swapped out
Follow the Flow with NetFlow (RFC 3954)
• Per-node: data plane observations and decisions captured
  • Src/dst MAC/IP/port #s, DSCP values, in/out interfaces, etc.
• Network view: flows centrally analyzed by a NetFlow collector/analyzer
• Biggest value: strategically placed partial views (e.g. WAN edge)
[Figure: routers R1 to R6 between hosts A and B export flows to a NetFlow collector (LiveAction)]
NetFlow (RFC 3954): What Is It?
• Developed and patented at Cisco Systems in 1996
• NetFlow is the de facto standard for acquiring IP operational data
• Standardized in IETF via IPFIX (RFC 7011)
• Provides network and security monitoring, network planning, traffic analysis, and IP accounting
• Packet capture is like a wire tap
• NetFlow is like a phone bill
Network World article: NetFlow Adoption on the Rise
http://www.networkworld.com/newsletters/nsm/2005/0314nsm1.html
Flexible NetFlow: Multiple Monitors with Unique Key Fields
[Figure: one traffic stream feeds Flow Monitor 1 and Flow Monitor 2, each with its own key fields, non-key fields, and cache.]
• Traffic analysis monitor
  • Key fields (Packet 1): Source IP 3.3, Destination IP 2.2, Source Port 23, Destination Port 22078, Layer 3 Protocol TCP - 6, TOS Byte 0, Input Interface Ethernet 0
  • Traffic Analysis Cache entry: Src. IP 3.3, Dest. IP 2.2, Source Port 23, Dest. Port 22078, Protocol 6, TOS 0, Input I/F E0, …, Pkts 1100
• Security analysis monitor
  • Key fields (Packet 1): Source IP 3.3, Dest IP 2.2, Input Interface Ethernet 0, SYN Flag 0
  • Security Analysis Cache entry: Source IP 3.3, Dest. IP 2.2, Input I/F E0, Flag 0, …, Pkts 11000
• Non-key fields shown: Packets, Bytes, Timestamps, Next Hop Address
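The idea of several monitors with different key fields can be sketched as two caches fed by the same packets, each aggregating on its own key tuple. This is an illustrative model only: the class name `FlowMonitor`, the field names, and the sample packets are assumptions, and it does not reflect how any router implements its flow cache.

```python
from collections import defaultdict

class FlowMonitor:
    """Aggregates packets into flows keyed by the chosen key fields."""
    def __init__(self, key_fields):
        self.key_fields = tuple(key_fields)
        self.cache = defaultdict(lambda: {"packets": 0, "bytes": 0})

    def observe(self, packet):
        key = tuple(packet[f] for f in self.key_fields)
        entry = self.cache[key]          # non-key fields are just counters here
        entry["packets"] += 1
        entry["bytes"] += packet["length"]

traffic = FlowMonitor(["src_ip", "dst_ip", "src_port", "dst_port", "protocol"])
security = FlowMonitor(["src_ip", "dst_ip"])

# Hypothetical packets: same host pair, two different source ports
packets = [
    {"src_ip": "10.0.0.3", "dst_ip": "10.0.0.2", "src_port": 23,
     "dst_port": 22078, "protocol": 6, "length": 100},
    {"src_ip": "10.0.0.3", "dst_ip": "10.0.0.2", "src_port": 24,
     "dst_port": 22078, "protocol": 6, "length": 60},
]
for pkt in packets:
    traffic.observe(pkt)
    security.observe(pkt)

print(len(traffic.cache), len(security.cache))
```

Fewer key fields give fewer, coarser flows (cheaper to keep, good for spotting a noisy host pair); more key fields give a finer view at a higher cache cost.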
NetFlow Forwarding Status & Drop Count Fields (RFC 7270)
• Flexible NetFlow Forwarding Status field captures forwarding (and drop reason) for flow
• Drop Count increments on any explicit drop by router
Network Performance Monitor
• Network nodes are able to discover & validate RTP, TCP and IP-CBR traffic on a hop-by-hop basis
• À la carte metric (loss, latency, jitter etc.) selections, applied on operator-selected sets of traffic
• Allows for fault isolation and network span validation
• Per-application thresholds and alerting
Performance Monitor Information Elements
Media Monitoring
• RTP SSRC
• RTP Jitter (min/max/mean)
• Transport Counter (expected/loss)
• Media Counter (bytes/packets/rate)
• Media Event
• Collection interval
• TCP MSS
• TCP round-trip time
Application Response Time
• CND - Client Network Delay (min/max/sum)
• SND - Server Network Delay (min/max/sum)
• ND - Network Delay (min/max/sum)
• AD - Application Delay (min/max/sum)
• Total Response Time (min/max/sum)
• Total Transaction Time (min/max/sum)
• Number of New Connections
• Number of Late Responses
• Number of Responses by Response Time (7 bucket histogram)
• Number of Retransmissions
• Number of Transactions
• Client/Server Bytes
• Client/Server Packets
Other Metrics
• L3 counter (bytes/packets)
• Flow event
• Flow direction
• Client and server address
• Source and destination address
• Transport information
• Input and output interfaces
• L3 information (TTL, DSCP, TOS, etc.)
• Application information (from deep packet inspection tool)
• Monitoring class hierarchy
NetFlow QoS Analysis
• How is my flow being classified?
• Did this QoS class drop traffic?
[Figure: a flow (5-tuple, DPI/NBAR, DSCP) passes through QoS processing; tools shown: Cisco Prime Infra, LiveAction]
Dedicated Protocol Analyzers
• Wireshark and other protocol analyzers are great
• Detailed analysis for variety of protocols at deep level
• Dedicated probes are expensive to deploy pervasively
• Operator has to make difficult judgment calls on where the problem is going to be, before it happens
• Can be challenging after the fact: need on-site trained personnel
Embedded Packet Capture & Analyze
• Capture packets locally to buffer on router
• Store to flash, USB, FTP, TFTP for analysis in protocol analyzer
• Capture does not add traffic to network

LY-2851-8#monitor capture buffer pcap-buffer1 size 10000 max-size 1550
LY-2851-8#monitor capture point ip cef pcap-point1 g0/0 both
LY-2851-8#monitor capture point associate pcap-point1 pcap-buffer1
LY-2851-8#monitor capture point start pcap-point1
LY-2851-8#monitor capture point stop pcap-point1
LY-2851-8#monitor capture buffer pcap-buffer1 export ftp://10.17.0.252/images/test.cap
[Figure: capture point on Gig0/0]
In-band OAM for IPv6 (iOAM6)
• New IPv6 extension header defined on user packets vs. IPv4 record-route option header
  • RFC 2460 does not define an option to record the route
• Minimal performance hit (handled in data plane)
• Packets continue on regular path
• Instrumentation
  • Packet sequence numbers => detect packet loss
  • Time stamps => one way delay
  • Node and ingress/egress interface names => path recording
• draft-brockners-inband-oam-requirements-03
[Figure: apps/controller uses: v6 traffic matrix, live flow tracing, delay distribution, bicasting control, loss matrix/monitor, app data monitoring, enhanced telemetry. Per hop and end-to-end data added to (selected) data traffic into the packet by each network element: Node-ID, ingress i/f, egress i/f, sequence #, timestamp, app data.]
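The instrumentation items above map directly onto simple calculations at the receiving end: gaps in sequence numbers indicate loss, and the difference between the first and last recorded timestamps gives one-way delay. The sketch below illustrates only that arithmetic. The record layout, node names, and values are hypothetical, and the delay figure assumes the nodes' clocks are synchronized.

```python
def lost_sequence_numbers(received):
    """Sequence numbers missing between the lowest and highest one seen."""
    if not received:
        return []
    seen = set(received)
    return [s for s in range(min(seen), max(seen) + 1) if s not in seen]

def one_way_delay(hop_records):
    """hop_records: (node_id, timestamp_seconds) in path order.
    Assumes synchronized clocks on all recording nodes."""
    return hop_records[-1][1] - hop_records[0][1]

def recorded_path(hop_records):
    """Node IDs in the order the packet traversed them."""
    return [node for node, _ in hop_records]

# Hypothetical per-hop data carried in one packet
hops = [("R1", 10.000), ("R4", 10.012), ("R3", 10.035)]
print(lost_sequence_numbers([1, 2, 4, 5, 7]))
print(recorded_path(hops), one_way_delay(hops))
```

Unlike traceroute, these values come from the user packets themselves, so they describe the path and treatment the real traffic received.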
iOAM6 Path Trace
[Figure: hosts H1 and H3 (::A:1:1:0:1D) connected through routers R1, R2, R3, R4. Callouts: "V6 extension header applied/decapped", "End system ICMP stack iOAM6 enabled".]
• Extended Ping

H1#ping
Protocol [ip]: ipv6
Target IPv6 address: ::A:1:1:0:1D
Repeat count [5]: 1
Datagram size [100]: 300
Timeout in seconds [2]:
Extended commands? [no]: yes
Source address or interface: gig0/1
UDP protocol? [no]:
Verbose? [no]: yes
Precedence [0]:
DSCP [0]:
Include hop by hop Path Record option? [no]: yes
Sweep range of sizes? [no]:
Type escape sequence to abort.
Sending 1, 300-byte ICMP Echos to ::A:1:1:0:1D, timeout is 2 seconds:
(Gi0/1)R1(Gi0/2)----(Gi0/1)R4(Gi0/2)----(Gi0/2)R3(Gi0/3)----H3----(Gi0/3)R3(Gi0/2)----(Gi0/2)R4(Gi0/1)----(Gi0/2)R1(Gi0/1)
Reply to request 0 (35 ms)
Success rate is 100 percent (1/1), round-trip min/avg/max = 35/35/35 ms
Getting Started
Be Prepared!
• Be prepared and have data collection systems enabled
  • Enable passive monitoring on endpoints and network
  • Enable active tests
• Helpdesk
  • Interview script => establish & maintain checklists
  • Multi-group access to tools, logs, etc.
• Firefighters run drills, so should your teams!
  • Be familiar with the tools and how they respond on your network
  • Red phone: cross-domain teams (applications, UC, security, servers)
Expanding your Toolbox and Knowledge
• Commercial and open source tools to look at
  • Network topology & IP address management: netdot, GestióIP
  • Performance tests: iperf3, netperf
  • Service checks: Nagios Core, Zenoss Core
  • NetFlow / log analysis / monitoring: logstash, fluentd, splunk
  • Template-driven config generation: ansible
• Control Plane Troubleshooting (Troubleshooting IP Routing Protocols)
Network Documentation Tool (netdot)
• Open source
• Started in 2002
• Network interfaces discovery via SNMP
• Discovery of L2 & L3 devices
• Dynamically draws a diagram/topology of your network
• Management of IPv4 and IPv6 addresses via IPAM
GestióIP
• Web-based IP management software
• Concurrent users support
• Better search and filter capabilities than a traditional spreadsheet
• Better statistical data
• Less chance of human error compared to a spreadsheet
• Migration assistance from managed IPv4 to IPv6 via tool
iperf3
• Active measurement tool to discover available path capacity
  • Worst link and worst host configurations
• Test can be in either direction (only static NAT works)
• TCP (retransmissions, rate, cwnd), SCTP and UDP (loss, jitter, out of order) tests
[Figure: sender and receiver, TCP/5201; test traffic: TCP, SCTP, UDP]
Iperf3 examples

$ bwctl -T iperf3 -t 30 -O 4 -s "56m-ps-4x10.sox.net:4823"
bwctl: Using tool: iperf3
bwctl: 40 seconds until test results available
SENDER START
Connecting to host 152.242.103, port 5160
[ 15] local 143.215.194.123 port 45609 connected to 152.242.103 port 5160
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 15] 0.00-1.00 sec 107 MBytes 898 Mbits/sec 0 3.06 MBytes (omitted)
[ 15] 1.00-2.00 sec 112 MBytes 944 Mbits/sec 0 3.06 MBytes (omitted)
…
[ 15] 29.00-30.00 sec 112 MBytes 944 Mbits/sec 0 3.06 MBytes
- - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 15] 0.00-30.00 sec 3.29 GBytes 942 Mbits/sec 0 sender
[ 15] 0.00-30.00 sec 3.29 GBytes 943 Mbits/sec receiver
iperf Done.
SENDER END

$ bwctl -T iperf3 -t 30 -O 4 -c "56m-ps-4x10.sox.net:4823"
bwctl: Using tool: iperf3
bwctl: 39 seconds until test results available
SENDER START
Connecting to host 143.215.194.123, port 5327
[ 15] local 152.242.103 port 44855 connected to 143.215.194.123 port 5327
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 15] 0.00-1.00 sec 5.14 MBytes 43.1 Mbits/sec 411 25.5 KBytes (omitted)
[ 15] 1.00-2.00 sec 2.26 MBytes 19.0 Mbits/sec 15 19.8 KBytes (omitted)
…
[ 15] 28.00-29.00 sec 2.26 MBytes 18.9 Mbits/sec 16 25.5 KBytes
- - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 15] 0.00-30.00 sec 59.8 MBytes 16.7 Mbits/sec 539 sender
[ 15] 0.00-30.00 sec 60.7 MBytes 17.0 Mbits/sec receiver
iperf Done.
SENDER END

Callouts:
• Client to server (local to remote)
• -O 4: throw away stats from first 4 sec
• -t 30: run for 30 sec
• Use -P for parallel streams
• First test: ~940 mbps (remote to local)
• Second test: retransmissions, ~19 mbps (local to remote)
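The summary lines in the output above can be sanity-checked by hand: the rate is just bytes transferred times eight, divided by the interval. The sketch below does that arithmetic for both tests, on the understanding that iperf3 reports transfer sizes in binary units (1 MByte = 1024 × 1024 bytes) and rates in decimal megabits. The function name is made up for the example.

```python
def mbits_per_sec(transfer_bytes, seconds):
    """Average rate in Mbits/sec (10^6 bits) for a transfer over an interval."""
    return transfer_bytes * 8 / seconds / 1e6

GBYTE = 1024 ** 3
MBYTE = 1024 ** 2

remote_to_local = mbits_per_sec(3.29 * GBYTE, 30)   # first test, sender line
local_to_remote = mbits_per_sec(59.8 * MBYTE, 30)   # second test, sender line
print(round(remote_to_local), round(local_to_remote, 1))
```

Both results agree with the reported sender lines (942 and 16.7 Mbits/sec), so the large difference between directions is in the measurements themselves rather than in unit confusion; the retransmission counts in the second test point the same way.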
netperf
• Similar to iperf3 but:
  • Works bidirectionally in a NAT environment
  • Additional connection/per second and transaction/per second tests
  • Statistical confidence intervals (-I)

> netperf -t TCP_STREAM -H 162.209.79.211 -i 30,10 -I 95,5 -j -l 60
MIGRATED TCP STREAM TEST from 0.0 (0.0) port 0 AF_INET to 162.209.79.211 () port 0 AF_INET : +/-2.500% @ 95% conf. : demo
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput : 8.965%
!!! Local CPU util : 0.000%
!!! Remote CPU util : 0.000%
Recv Send Socket Message Elapsed Size Time Throughput
bytes secs. 10^6 bits/sec
87380 16384 60.52 13.91

Callout: download
Nagios Core
• Monitoring and alerting engine
• Open source, written in C
• Flexible and scalable architecture
• Event scheduler, event processor and alert manager
• APIs can be used to extend the capabilities to perform additional tasks
• Designed for Unix & Linux systems
Zenoss Core
• Network Management System
• Open source
• Written in 90% Python, 10% Java
• GNU based license
• Web-based interface for monitoring
• Used by government, retail, financial institutions, SP and tech industry
logstash
• Open source tool for managing events and logs
• Collects data/logs from many sources, filters and displays logs
• Scalable data processing
• Analysis, archiving, monitoring & alerting
• Elasticsearch API is used for storage
• Kibana (browser-based analytics) is used to view Logstash data
fluentd
• Open source data collector
• Written in C and Ruby
• Requires very little system resource
• Simple and flexible/extensible
• 5000+ companies are using fluentd for data collection
• Provides unified logging layer
• Decouples data sources from backend systems
ansible
• Automates software provisioning and configuration management
• Used for application deployment/migration
• Reduces complexity and repetition
• Comes with Fedora distribution of Linux
• Used for Linux, Unix and Windows
• Written in Python and PowerShell
• Can also be used in cloud environments such as AWS, Azure etc.
Splunk
• Accessible via standard browser or via mobile app
• Collects and indexes data
• Powerful statistical search of the data
• Correlate and investigate between events and activities
• Display reports in a customized dashboard
• Turn searches into real-time alerts
• Notifies via email or RSS
Alerting & Collaboration
• Routing of alerts / interesting events
  • Is this noise or signal?
  • Which team(s) to alert?
  • Who is on duty?
  • How to contact: SMS, IM, phone call…
  • PagerDuty, Openduty
• Coordinating response
  • IM tools (Spark, HipChat etc.)
  • Email
  • Ticketing tools (OTRS, Jira, ServiceNow, Moogsoft…)
Thank You