Lessons from a SIP Wireless Deployment
Jonathan Rosenberg
Chief Scientist

Background
- Market: 2.5G Wireless Network
- Initial Applications: Instant Messaging (IM) and Presence
- Subscriber Sizing: 500k Initially, Scaling up to Several Million
- PoPs: 12 Regional PoPs, 2 Centralized Data Centers
- Servers: Between 100 and 200 Separate Server Processes

SIP 2003 Lessons 2

Lessons Summary
- Data Distribution and Management Is Hard
- Network-wide Diagnostics Are Essential
- UDP Non-INVITEs and Failover Interactions

The Data Distribution Problem
- SIP Applications Depend on Many Pieces of Data
  - Provisioned data
    - Buddy lists
    - White/black lists
    - Call forwarding numbers
  - Soft-state data
    - Presence
    - Registrations
- Many Parties Interested in Writing the Data
  - Wireless handset updates its buddy list
  - Web application updates a buddy list
  - Customer care updates a buddy list
- Many Parties Interested in Reading the Data
  - Wireless handset, to get its current buddy list
  - Web application, to display the current buddy list
  - Customer care, to tell a customer who is on their buddy list
  - Presence server, to support subscriptions to the buddy list
- Many Parties Interested in Finding Out Changes to the Data
  - Handset: for buddy list synchronization
  - Presence server: to send a SUBSCRIBE request to a new participant
  - Other applications

Requirements for Data Distribution
- Network Element Requirements for Data
  - "Close" to the element, for performance reasons
  - Replicated and consistent across all elements in a cluster within a PoP
  - Replicated to other PoPs to provide PoP failover
  - Soft-state data replicated to backup servers for failover support
- Operator Requirements for Data
  - Data survives crashes of any or all network elements
  - Data can be read/written by provisioning and customer support systems
  - Data can be accessed by provisioning and customer support from a single access point, independent of network scale and size
  - Data writes are validated before being propagated
  - Data propagation to elements survives network faults (IP router goes down), element failures, etc.
  - Distribution of provisioned data has minimal to no impact on element performance (i.e., a bulk load cannot take down a running system)
  - Recovery from data distribution failures needs to be possible

Key Lessons
- The Requirements for "Closeness" and "Performance" Conflict with Consistency Requirements
  - Ultimately, the data gets replicated across a potentially large number of elements
  - Large-scale replication with transactional integrity is very costly in terms of performance
  - Seek compromise data distribution methods that provide good performance with reduced consistency
- The Data Distribution Piece Is at Least As Hard As, If Not Harder Than, Getting the SIP Pieces Right
- Try to Solve This Problem Generally, Not Separately for Each Application
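Illustration (not from the deployment): one possible "reduced consistency" compromise is to accept writes at a single point and push versioned updates to replicas asynchronously, so replicas may lag but converge. The sketch below assumes a last-writer-wins rule keyed on a monotonically increasing version; the class names `Replica` and `Master` and the `propagate` step are hypothetical.

```python
class Replica:
    """Holds key -> (version, value); accepts only strictly newer versions."""

    def __init__(self):
        self.data = {}

    def apply(self, key, version, value):
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)
            return True
        return False  # stale or duplicate update is ignored


class Master:
    """Single write point; replicas are updated later, not transactionally."""

    def __init__(self, replicas):
        self.store = Replica()
        self.replicas = replicas
        self.pending = []   # updates not yet pushed to replicas
        self._version = 0

    def write(self, key, value):
        self._version += 1
        self.store.apply(key, self._version, value)
        self.pending.append((key, self._version, value))

    def propagate(self):
        # Stand-in for an asynchronous push; replicas are stale until this runs.
        for key, version, value in self.pending:
            for replica in self.replicas:
                replica.apply(key, version, value)
        self.pending.clear()
```

Readers near a replica get fast local reads at the cost of briefly seeing old data, which is the trade-off the slide describes.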

Network-Wide Diagnostics
- Problem Statement
  Joe calls customer service. He says his phone doesn't work. When asked what the problem was, he reports that his IM never reached its intended target. He sent it yesterday or perhaps the day before.
- The Challenge
  Find the element which failed and identify the specific problem in the deployed production network, without affecting performance of the network.

Why Is This Challenging?
- There Are a Multitude of Elements at the "SIP Layer"
  - A variety of proxies
  - A variety of databases
  - A variety of gateways
- There Are a Multitude of Elements at Other Layers
  - A variety of routers
  - A variety of GGSNs (Gateway GPRS Support Nodes)/PDSNs (Packet Data Serving Nodes)
  - A variety of base stations
  - A variety of Ethernet switches
- Continuous Logging Is Not Possible
  - Performance implications
- You Cannot Replicate the State of the Network When the Failure Occurred
  - Too many users and other variables

What Is the Solution?
- Design for Diagnostics
- Stimulate Your System
- Engineer for Evolution
- Know Your Network

Design for Diagnostics
- Extensive "Triggered" Logging
  - Look for conditions that may indicate an error
    - SIP transaction timeout
    - SIP request failure
    - Database timeout
    - Corrupted database data
  - On those conditions, produce mass amounts of trace data
    - Execution stacks
    - Message contents
- May Need to Store Trace Data in Memory in a Sliding Window
  - Sometimes an error in one place is caused by an earlier error in another
- Careful Draining of Trace Data
  - Cannot affect runtime performance
- Centralized Repository for Trace Data
  - Don't want to have to go to each of the machines
  - Push it to a single place with well-identified correlation identifiers
- Don't Forget the Handset!
  - The handset is part of the network
  - It should generate trace data too upon failure!
- Related To, but Not the Same As, Fault Management
  - This is something the network operations guys can't fix
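Illustration (a minimal sketch, not the deployed design): triggered logging with an in-memory sliding window can be approximated with a bounded buffer that is drained to a central sink only when an error condition fires. The class name `TriggeredTracer`, the `sink` callable, and the record layout are assumptions made for this example.

```python
from collections import deque


class TriggeredTracer:
    """Buffers recent trace records; emits them only when a trigger fires."""

    def __init__(self, window_size, sink):
        # A bounded deque drops the oldest record once the window is full.
        self._window = deque(maxlen=window_size)
        # sink is any callable that ships a dump to a central repository.
        self._sink = sink

    def trace(self, correlation_id, detail):
        # Cheap on the normal path: append to memory, no I/O.
        self._window.append((correlation_id, detail))

    def trigger(self, correlation_id, condition):
        # Called on a suspicious condition, e.g. a SIP transaction timeout.
        snapshot = list(self._window)
        self._window.clear()
        self._sink({
            "condition": condition,
            "correlation_id": correlation_id,
            "records": snapshot,
        })
```

Because the window holds records from before the trigger, a dump can show an earlier error elsewhere that led to the visible failure. A real system would drain asynchronously so the dump itself does not affect runtime performance.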

Stimulate Your System
- The Best Problems Are the Ones You Find Before Your Customers Do!
- Look for Problems Through Active "Probing" of the Network
  - A usage which triggers the logging of data about how it was processed in each element
  - Usage must be a normal one
- Continuously Send Probes
  - For each use case of your network
  - For each PoP or site
  - Vary the transmission times and contents wherever possible
- Security Issues
  - Must carefully authenticate the sender of a probe message
  - Otherwise, a great source of DoS and other attacks
- SIP "Probe" Extensions
  - Headers that ask proxies and user agents to generate tracing information about message handling
  - May also designate a destination for sending the data
  - Alternatively, attach it to the message
    - What if it's lost?
- IETF Work Just Begun
  - Develop requirements for such probes
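Illustration (an assumption, not a SIP-defined mechanism): one way to authenticate the sender of a probe is to attach a keyed MAC over the probe identifier and a timestamp, and to reject probes that are stale or whose tag does not verify. The function names, the shared `secret`, and the 30-second freshness limit are hypothetical choices for this sketch.

```python
import hashlib
import hmac


def sign_probe(secret: bytes, probe_id: str, timestamp: int) -> str:
    """Return a hex HMAC-SHA256 tag over the probe id and send time."""
    message = f"{probe_id}:{timestamp}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def is_authentic_probe(secret: bytes, probe_id: str, timestamp: int,
                       tag: str, now: int, max_age: int = 30) -> bool:
    """Accept a probe only if it is fresh and its tag verifies."""
    if abs(now - timestamp) > max_age:
        return False  # stale probes are refused to limit replay
    expected = sign_probe(secret, probe_id, timestamp)
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected, tag)
```

An element would generate trace data only for probes that pass this check, so an attacker cannot use probes to force expensive tracing.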

Engineer for Evolution
- Once You Find the Bug and Prepare a Fix, What Then?
- Need to Upgrade the Affected Servers
  - Cannot affect runtime performance
  - Must be easy to do (so you can do it often!)
  - Must be easy to undo
- Solution: Automated Software Upgrade
- Basic Process
  - Vendor sends operator a new version
  - Operator types "install version" at the centralized management console
  - Console determines which servers are affected
  - For each server, gracefully terminates it, one at a time
  - Remotely installs upgrade
    - Old one not removed
  - Updates configuration files if needed
  - Remotely verifies upgrade
  - Restarts server, and goes to the next one
- Old Server Versions and Configurations Are Kept, Rollback Is Allowed
- Process Must Be Automated and Easy
  - Model: Quicken
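Illustration (a sketch of the basic process above, with placeholders for the remote steps): servers are upgraded one at a time, the old version is retained, and a failed verification restores the old version and stops the rollout. The dictionary fields and the `verify` callable are hypothetical; real stop, install, and restart actions would replace the in-memory assignments.

```python
def rolling_upgrade(servers, new_version, verify):
    """Upgrade servers one at a time; return the names upgraded successfully.

    servers: list of dicts with "name", "version", and "running" keys.
    verify:  callable(server) -> bool, run after the install step.
    """
    upgraded = []
    for server in servers:
        server["running"] = False               # graceful termination (placeholder)
        server["previous"] = server["version"]  # old version is kept, not removed
        server["version"] = new_version         # remote install (placeholder)
        ok = verify(server)
        if not ok:
            server["version"] = server["previous"]  # easy undo on this server
        server["running"] = True                # restart in either case
        if not ok:
            break                               # stop the rollout on first failure
        upgraded.append(server["name"])
    return upgraded
```

Stopping at the first failure keeps the remaining servers on the known-good version; whether to also roll back the servers already upgraded is left to the operator in this sketch.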

Know Your Network
- Experience Is Ultimately the Only Way to Find Problems
- The People Who Design Elements Are Usually Not the Ones Who Have Experience Running Networks of Them
- Put Processes in Place to Feed Back Experience to the Developers and Architects

Non-INVITE UDP Failover
- Problem
  - A SIP non-INVITE request is sent through a chain of proxies
  - The final proxy has failed
  - Upon transaction timeout, each of them generates a 408
    - The "winning" 408 depends on relative timing
  - Would like to mark the server as failed so it is not tried again
  - How does each proxy know if the failure was its own next hop, or some other server downstream?
    - Timeout can occur first anywhere in the chain
    - Downstream 408s are discarded because the transaction has timed out

[Figure: chain of proxies (P) showing MSG forwarded downstream, a Timeout, and 408 responses returned upstream]

Solutions
- Use TCP
  - TCP will provide a hop-by-hop acknowledgement for the data
  - If next hop fails, your TCP connection reports errors
- Bring Back 100 Responses for Non-INVITE
  - Tells a proxy that the next hop got the request
  - Means proxy was alive at the beginning of the transaction
  - Next hop considered dead if no 100 is received
- Extended Transactions
  - Two transaction timeouts
    - Currently defined one
    - Longer one used to wait for 408 responses from downstream nodes
  - If 408 is received before second timeout, but after first, failure is not the next hop
  - If no 408 is received before second timeout, downstream element has failed
  - Con: additional memory requirements for holding on to state of the transaction
  - Con: extra message traffic
- Conclusion: Needs to Be Worked in IETF
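Illustration (a sketch of the "Extended Transactions" idea, not a standardized mechanism): a proxy keeps transaction state until a second, longer timeout and classifies the failure by whether a downstream 408 arrived in time. The function name and return labels are hypothetical, and reading "downstream element has failed" as "the immediate next hop has failed" is an assumption of this example.

```python
def classify_failure(first_timeout, second_timeout, received_408_at):
    """Decide whether the next hop should be marked as failed.

    first_timeout:   the currently defined transaction timeout, in seconds
    second_timeout:  the longer, extended timeout, in seconds
    received_408_at: arrival time of a downstream 408 in seconds since the
                     request was sent, or None if no 408 arrived
    """
    if second_timeout <= first_timeout:
        raise ValueError("second timeout must be longer than the first")
    if received_408_at is not None and received_408_at < second_timeout:
        # The next hop was alive long enough to report its own timeout,
        # so the failure lies further downstream.
        return "not_next_hop"
    # Nothing came back before the extended timeout: blame the next hop.
    return "next_hop_failed"


# Example: 32 s is 64*T1 with the RFC 3261 default T1 of 500 ms;
# the 40 s extended timeout is an arbitrary value chosen for illustration.
print(classify_failure(32.0, 40.0, 35.0))
print(classify_failure(32.0, 40.0, None))
```

The extra state held between the two timeouts is the memory cost the slide lists as a con.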

Summary
- Building a Large-Scale Distributed SIP Network Is Hard
- Many of the Problems Are Not Specific to SIP, and Show up in Any Similar System
  - IP networks
  - Email networks
- Key General Lessons
  - Data distribution is hard
  - Worry about diagnostics
- SIP Lesson
  - Non-INVITE failover problem

Information Resource
Jonathan Rosenberg
Chief Scientist
+1 973.952.5000
jdrosen@dynamicsoft.com