Web Servers: Implementation and Performance Erich Nahum IBM T. J. Watson Research Center www.research.ibm.com/people/n/nahum@us.ibm.com Web Servers: Implementation and Performance Erich Nahum 1
Contents of This Tutorial • Introduction to HTTP • HTTP Servers: – Outline of an HTTP Server Transaction – Server Models: Processes, Threads, Events – Event Notification: Asynchronous I/O • HTTP Server Workloads: – Workload Characteristics – Workload Generation • Server TCP Issues – Introduction to TCP – Server TCP Dynamics – Server TCP Implementation Issues • Other Issues (time permitting): – Large Site Studies – Clusters – Running Experiments – Brief Overview of Other Topics Web Servers: Implementation and Performance Erich Nahum 2
Things Not Covered in Tutorial • Client-side issues: DNS, HTML rendering • Proxies: some similarities, many differences • Dynamic Content: CGI, PHP, ASP, etc. • QoS for Web Servers • SSL/TLS and HTTPS • Content Distribution Networks (CDNs) • Security and Denial of Service If time is available, may cover briefly at the end Web Servers: Implementation and Performance Erich Nahum 3
Assumptions and Expectations • Some familiarity with WWW as a user (Has anyone here not used a browser? ) • Some familiarity with networking concepts (e. g. , unreliability, reordering, race conditions) • Familiarity with systems programming (e. g. , know what sockets, hashing, caching are) • Examples will be based on C & Unix taken from BSD, Linux, AIX, and real servers (sorry, Java and Windows fans) Web Servers: Implementation and Performance Erich Nahum 4
Objectives and Takeaways After this tutorial, hopefully we will all know: • • • Basics of server implementation & performance Pros and cons of various server architectures Difficulties in workload generation Interactions between HTTP and TCP Design loop of implement, measure, profile, debug, and fix Many lessons should be applicable to any networked server, e. g. , files, mail, news, DNS, LDAP, etc. Web Servers: Implementation and Performance Erich Nahum 5
Timeline • • • Intro, HTTP, server transaction: 40 min Server models, event notification: 40 min Workload characterization & generation: 40 min Intro to TCP, dynamics, implementation: 40 min Clusters, large site studies, experiments: 30 min Other topics: time permitting Web Servers: Implementation and Performance Erich Nahum 6
Acknowledgements Many people contributed comments and suggestions to this tutorial, including: Abhishek Chandra Mark Crovella Suresh Chari Peter Druschel Jim Kurose Balachander Krishnamurthy Vivek Pai Jennifer Rexford Anees Shaikh Errors are all mine, of course. Web Servers: Implementation and Performance Erich Nahum 7
Chapter 1: Introduction to HTTP Web Servers: Implementation and Performance Erich Nahum 8
Introduction to HTTP Laptop w/ Netscape http request http response Desktop w/ Explorer Server w/ Apache • HTTP: Hypertext Transfer Protocol – Communication protocol between clients and servers – Application layer protocol for WWW • Client/Server model: – Client: browser that requests, receives, displays object – Server: receives requests and responds to them • Protocol consists of various operations – Few for HTTP/1.0 (RFC 1945, 1996) – Many more in HTTP/1.1 (RFC 2616, 1999) Web Servers: Implementation and Performance Erich Nahum 9
How are Requests Generated? • User clicks on something • Uniform Resource Locator (URL): – http://www.nytimes.com – https://www.paymybills.com – ftp://ftp.kernel.org – news://news.deja.com – telnet://gaia.cs.umass.edu – mailto:nahum@us.ibm.com • Different URL schemes map to different services • Hostname is converted from a name to a 32-bit IP address (DNS resolve) • Connection is established to server Most browser requests are HTTP requests. Web Servers: Implementation and Performance Erich Nahum 10
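The resolve-and-connect step can be sketched in a few lines of C. This is an illustrative fragment (the helper name is made up, and the port is fixed at 80), not code from any particular browser:

    #include <netdb.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Resolve a host name via DNS and open a TCP connection to port 80. */
    int connect_to_server(const char *hostname)
    {
        struct addrinfo hints, *res, *rp;
        int s = -1;

        memset(&hints, 0, sizeof(hints));
        hints.ai_family   = AF_UNSPEC;     /* IPv4 or IPv6 */
        hints.ai_socktype = SOCK_STREAM;   /* TCP */

        if (getaddrinfo(hostname, "80", &hints, &res) != 0)
            return -1;                     /* DNS resolution failed */

        for (rp = res; rp != NULL; rp = rp->ai_next) {
            s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
            if (s < 0)
                continue;
            if (connect(s, rp->ai_addr, rp->ai_addrlen) == 0)
                break;                     /* connected */
            close(s);
            s = -1;
        }
        freeaddrinfo(res);
        return s;                          /* -1 if every address failed */
    }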
What Happens Then? • Client downloads HTML document – Sometimes called “container page” – Typically in text format (ASCII) – Contains instructions for rendering (e.g., background color, frames) – Links to other pages • Many have embedded objects: – Images: GIF, JPG (logos, banner ads) – Usually automatically retrieved • i.e., without user involvement • can control sometimes (e.g., browser options, junkbusters) Web Servers: Implementation and Performance Erich Nahum <html> <head> <meta name=“Author” content=“Erich Nahum”> <title> Linux Web Server Performance </title> </head> <body text=“#000000”> <img width=31 height=11 src=“ibmlogo.gif”> <img src=“images/new.gif”> <h1>Hi There!</h1> Here’s lots of cool linux stuff! <a href=“more.html”> Click here</a> for more! </body> </html> sample html file 11
So What’s a Web Server Do? • Respond to client requests, typically a browser – Can be a proxy, which aggregates client requests (e. g. , AOL) – Could be search engine spider or custom (e. g. , Keynote) • May have work to do on client’s behalf: – – Is the client’s cached copy still good? Is client authorized to get this document? Is client a proxy on someone else’s behalf? Run an arbitrary program (e. g. , stock trade) • Hundreds or thousands of simultaneous clients • Hard to predict how many will show up on some day • Many requests are in progress concurrently Server capacity planning is non-trivial. Web Servers: Implementation and Performance Erich Nahum 12
What do HTTP Requests Look Like?
GET /images/penguin.gif HTTP/1.0
User-Agent: Mozilla/0.9.4 (Linux 2.2.19)
Host: www.kernel.org
Accept: text/html, image/gif, image/jpeg
Accept-Encoding: gzip
Accept-Language: en
Accept-Charset: iso-8859-1, *, utf-8
Cookie: B=xh203jfsf; Y=3sdkfjej
<cr><lf>
• Messages are in ASCII (human-readable) • Carriage-return and line-feed indicate end of headers • Headers may communicate private information (browser, OS, cookie information, etc.) Web Servers: Implementation and Performance Erich Nahum 13
What Kind of Requests are there? Called Methods: • GET: retrieve a file (95% of requests) • HEAD: just get meta-data (e.g., mod time) • POST: submitting a form to a server • PUT: store enclosed document as URI • DELETE: remove named resource • LINK/UNLINK: in 1.0, gone in 1.1 • TRACE: http “echo” for debugging (added in 1.1) • CONNECT: used by proxies for tunneling (1.1) • OPTIONS: request for server/proxy options (1.1) Web Servers: Implementation and Performance Erich Nahum 14
What Do Responses Look Like?
HTTP/1.0 200 OK
Server: Tux 2.0
Content-Type: image/gif
Content-Length: 43
Last-Modified: Fri, 15 Apr 1994 02:36:21 GMT
Expires: Wed, 20 Feb 2002 18:54:46 GMT
Date: Mon, 12 Nov 2001 14:29:48 GMT
Cache-Control: no-cache
Pragma: no-cache
Connection: close
Set-Cookie: PA=wefj2we0-jfjf
<cr><lf>
<data follows…>
• Similar format to requests (i.e., ASCII) Web Servers: Implementation and Performance Erich Nahum 15
What Responses are There? • 1XX: Informational (def’d in 1.0, used in 1.1) 100 Continue, 101 Switching Protocols • 2XX: Success 200 OK, 206 Partial Content • 3XX: Redirection 301 Moved Permanently, 304 Not Modified • 4XX: Client error 400 Bad Request, 403 Forbidden, 404 Not Found • 5XX: Server error 500 Internal Server Error, 503 Service Unavailable, 505 HTTP Version Not Supported Web Servers: Implementation and Performance Erich Nahum 16
What are all these Headers? Specify capabilities and properties: • General: Connection, Date • Request: Accept-Encoding, User-Agent • Response: Location, Server type • Entity: Content-Encoding, Last-Modified • Hop-by-hop: Proxy-Authenticate, Transfer-Encoding Server must pay attention to respond properly. Web Servers: Implementation and Performance Erich Nahum 17
Summary: Introduction to HTTP • The major application on the Internet – Majority of traffic is HTTP (or HTTP-related) • Client/server model: – Clients make requests, servers respond to them – Done mostly in ASCII text (helps debugging!) • Various headers and commands – Too many to go into detail here – We’ll focus on common server ones – Many web books/tutorials exist (e. g. , Krishnamurthy & Rexford 2001) Web Servers: Implementation and Performance Erich Nahum 18
Chapter 2: Outline of a Typical HTTP Transaction Web Servers: Implementation and Performance Erich Nahum 19
Outline of an HTTP Transaction • In this section we go over the basics of servicing an HTTP GET request from user space • For this example, we'll assume a single process running in user space, similar to Apache 1. 3 • At each stage see what the costs/problems can be • Also try to think of where costs can be optimized • We’ll describe relevant socket operations as we go Web Servers: Implementation and Performance Erich Nahum initialize; forever do { get request; process; send response; log request; } server in a nutshell 20
Readying a Server
s = socket();            /* allocate listen socket */
bind(s, 80);             /* bind to TCP port 80 */
listen(s);               /* indicate willingness to accept */
while (1) {
    newconn = accept(s); /* accept new connection */
• First thing a server does is notify the OS it is interested in WWW server requests; these are typically on TCP port 80. Other services use different ports (e.g., SSL is on 443) • Allocate a socket and bind() it to the address (port 80) • Server calls listen() on the socket to indicate willingness to receive requests • Calls accept() to wait for a request to come in (and blocks) • When the accept() returns, we have a new socket which represents a new connection to a client Web Servers: Implementation and Performance Erich Nahum 21
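For concreteness, here is a compilable sketch of the same setup, assuming the BSD sockets API (the helper name, backlog value, and SO_REUSEADDR are illustrative additions; real servers also check every return value):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Create a listening socket on TCP port 80 and return it.
       Minimal error handling; binding to port 80 requires privileges. */
    int make_listen_socket(void)
    {
        int s, on = 1;
        struct sockaddr_in addr;

        s = socket(AF_INET, SOCK_STREAM, 0);              /* allocate listen socket */
        setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);         /* any local interface */
        addr.sin_port = htons(80);                        /* WWW port (SSL uses 443) */

        bind(s, (struct sockaddr *)&addr, sizeof(addr));  /* bind to TCP port 80 */
        listen(s, 128);                                   /* willingness to accept */
        return s;
    }

The accept() loop then looks just like the pseudo-code above: newconn = accept(make_listen_socket(), NULL, NULL);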
Processing a Request
remoteIP = getsockname(newconn);
remoteHost = gethostbyname(remoteIP);
gettimeofday(currentTime);
read(newconn, reqBuffer, sizeof(reqBuffer));
reqInfo = serverParse(reqBuffer);
• getsockname() called to get the remote host name – for logging purposes (optional, but done by most) • gethostbyname() called to get name of other end – again for logging purposes • gettimeofday() is called to get time of request – both for Date header and for logging • read() is called on new socket to retrieve request • request is determined by parsing the data – “GET /images/jul4/flag.gif” Web Servers: Implementation and Performance Erich Nahum 22
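As a concrete illustration of the read-and-parse step, a server might loop on read() until it has seen the blank line that ends the headers, then pull out the request line. This is only a sketch (buffer size and helper name are made up; a real server must handle requests that span buffers and malformed input):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Read from newconn until the CRLF CRLF that ends the HTTP headers,
       then copy the request line ("GET /images/penguin.gif HTTP/1.0")
       into reqline.  Assumes the whole request fits in one buffer. */
    int read_request_line(int newconn, char *reqline, size_t len)
    {
        char buf[8192];
        ssize_t n, total = 0;

        while (total < (ssize_t)sizeof(buf) - 1) {
            n = read(newconn, buf + total, sizeof(buf) - 1 - total);
            if (n <= 0)
                return -1;                  /* EOF or error before headers ended */
            total += n;
            buf[total] = '\0';
            if (strstr(buf, "\r\n\r\n"))    /* end of headers seen */
                break;
        }
        char *eol = strstr(buf, "\r\n");    /* request line ends at first CRLF */
        if (!eol)
            return -1;
        snprintf(reqline, len, "%.*s", (int)(eol - buf), buf);
        return 0;
    }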
Processing a Request (cont)
fileName = parseOutFileName(requestBuffer);
fileAttr = stat(fileName);
serverCheckFileStuff(fileName, fileAttr);
open(fileName);
• stat() called to test file path – to see if file exists/is accessible – may not be there, may only be available to certain people – "/microsoft/top-secret/plans-for-world-domination.html" • stat() also used for file meta-data – e.g., size of file, last modified time – "Have plans changed since last time I checked?" • might have to stat() multiple files just to get to end – e.g., 4 stats in bill g example above • assuming all is OK, open() called to open the file Web Servers: Implementation and Performance Erich Nahum 23
Responding to a Request
read(fileName, fileBuffer);
headerBuffer = serverFigureHeaders(fileName, reqInfo);
write(newSock, headerBuffer);
write(newSock, fileBuffer);
close(newSock);
close(fileName);
write(logFile, requestInfo);
• read() called to read the file into user space • write() is called to send HTTP headers on socket (early servers called write() for each header!) • write() is called to write the file on the socket • close() is called to close the open file descriptor • write() is called on the log file Web Servers: Implementation and Performance Erich Nahum 24
Optimizing the Basic Structure • As we will see, a great deal of locality exists in web requests and web traffic. • Much of the work described above doesn't really need to be performed each time. • Optimizations fall under 2 categories: caching and custom OS primitives. Web Servers: Implementation and Performance Erich Nahum 25
Optimizations: Caching Idea is to exploit locality in client requests. Many files are requested over and over (e.g., index.html). • Why open and close files over and over again? Instead, cache open file FD’s, manage them LRU.
fileDescriptor = lookInFDCache(fileName);
metaInfo = lookInMetaInfoCache(fileName);
headerBuffer = lookInHTTPHeaderCache(fileName);
• Why stat them again and again? Cache path name and access characteristics. • Again, cache HTTP header info on a per-URL basis, rather than re-generating it over and over. Web Servers: Implementation and Performance Erich Nahum 26
Optimizations: Caching (cont) • Instead of reading and writing the data, cache data, as well as meta-data, in user space • Even better, mmap() the file so that two copies don't exist in both user and kernel space
fileData = lookInFileDataCache(fileName);
fileData = lookInMMapCache(fileName);
remoteHostName = lookRemoteHostCache(remoteIP);
• Since we see the same clients over and over, cache the reverse name lookups (or better yet, don't do resolves at all, log only IP addresses) Web Servers: Implementation and Performance Erich Nahum 27
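A sketch of the mmap() idea in C (illustrative names, minimal error handling; a real server would keep the mapping in its cache rather than unmapping after one use):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map a file and send it from the mapping, avoiding a separate
       user-space copy of the file data. */
    ssize_t send_mapped_file(int sock, const char *path)
    {
        struct stat st;
        void *data;
        ssize_t sent;

        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }
        data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);                      /* mapping remains valid after close */
        if (data == MAP_FAILED)
            return -1;

        sent = write(sock, data, st.st_size);
        munmap(data, st.st_size);       /* a cache would keep this mapping instead */
        return sent;
    }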
Optimizations: OS Primitives • Rather than call accept(), getsockname() & read(), add a new primitive, acceptExtended(), which combines the 3 primitives • Instead of calling gettimeofday(), use a memory-mapped counter that is cheap to access (a few instructions rather than a system call)
acceptExtended(listenSock, &newSock, readBuffer, &remoteInfo);
currentTime = *mappedTimePointer;
buffer[0] = firstHTTPHeader;
buffer[1] = secondHTTPHeader;
buffer[2] = fileDataBuffer;
writev(newSock, buffer, 3);
• Instead of calling write() many times, use writev() Web Servers: Implementation and Performance Erich Nahum 28
OS Primitives (cont) • Rather than calling read() & write(), or write() with an mmap()'ed file, use a new primitive called sendfile() (or transmitfile()). Bytes stay in the kernel. • While we're at it, add a header option to sendfile() so that we don't have to call write() at all.
httpInfo = cacheLookup(reqBuffer);
sendfile(newConn, httpInfo->headers, httpInfo->fileDescriptor, OPT_CLOSE_WHEN_DONE);
• Also add an option to close the connection so that we don't have to call close() explicitly. All this assumes proper OS support. Most have it these days. Web Servers: Implementation and Performance Erich Nahum 29
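The sendfile() shown above is idealized: Linux's sendfile(out_fd, in_fd, offset, count) has no header or close-when-done arguments, so headers go out with a separate write() or writev() (FreeBSD's sendfile() can attach headers via struct sf_hdtr). A sketch assuming the Linux signature, with made-up cache-entry fields:

    #include <sys/sendfile.h>
    #include <sys/types.h>
    #include <unistd.h>

    struct cache_entry {               /* illustrative, not a real server's layout */
        char   *headers;               /* precomputed response headers */
        size_t  header_len;
        int     fd;                    /* open file descriptor from the FD cache */
        off_t   file_size;
    };

    int send_response(int sock, struct cache_entry *e)
    {
        off_t off = 0;

        if (write(sock, e->headers, e->header_len) < 0)
            return -1;
        while (off < e->file_size) {   /* file bytes never enter user space */
            ssize_t n = sendfile(sock, e->fd, &off, e->file_size - off);
            if (n <= 0)
                return -1;             /* real code: handle EAGAIN on non-blocking sockets */
        }
        return 0;
    }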
An Accelerated Server Example
acceptex(socket, newConn, reqBuffer, remoteHostInfo);
httpInfo = cacheLookup(reqBuffer);
sendfile(newConn, httpInfo->headers, httpInfo->fileDescriptor, OPT_CLOSE_WHEN_DONE);
write(logFile, requestInfo);
• acceptex() is called – gets new socket, request, remote host IP address • string match in hash table is done to parse request – hash table entry contains relevant meta-data, including modification times, file descriptors, permissions, etc. • sendfile() is called – pre-computed header, file descriptor, and close option • log written back asynchronously (buffered write()). That’s it! Web Servers: Implementation and Performance Erich Nahum 30
Complications • Much of this assumes sharing is easy: – but, this is dependent on the server architectural model – if multiple processes are being used, as in Apache, it is difficult to share data structures. • Take, for example, mmap(): – mmap() maps a file into the address space of a process. – a file mmap'ed in one address space can’t be re-used for a request for the same file served by another process. – Apache 1. 3 does use mmap() instead of read(). – in this case, mmap() eliminates one data copy versus a separate read() & write() combination, but process will still need to open() and close() the file. Web Servers: Implementation and Performance Erich Nahum 31
Complications (cont) • Similarly, meta-data info needs to be shared: – e. g. , file size, access permissions, last modified time, etc. • While locality is high, cache misses can and do happen sometimes: – if previously unseen file requested, process can block waiting for disk. • OS can impose other restrictions: – e. g. , limits on number of open file descriptors. – e. g. , sockets typically allow buffering about 64 KB of data. If a process tries to write() a 1 MB file, it will block until other end receives the data. • Need to be able to cope with the misses without slowing down the hits Web Servers: Implementation and Performance Erich Nahum 32
Summary: Outline of a Typical HTTP Transaction • A server can perform many steps in the process of servicing a request • Different actions depending on many factors: – e. g. , 304 not modified if client's cached copy is good – e. g. , 404 not found, 401 unauthorized • Most requests are for small subset of data: – we’ll see more about this in the Workload section – we can leverage that fact for performance • Architectural model affects possible optimizations – we’ll go into this in more detail in the next section Web Servers: Implementation and Performance Erich Nahum 33
Chapter 3: Server Architectural Models Web Servers: Implementation and Performance Erich Nahum 34
Server Architectural Models Several approaches to server structure: • Process based: Apache, NCSA • Thread-based: JAWS, IIS • Event-based: Flash, Zeus • Kernel-based: Tux, AFPA, Exo. Kernel We will describe the advantages and disadvantages of each. Fundamental tradeoffs exist between performance, protection, sharing, robustness, extensibility, etc. Web Servers: Implementation and Performance Erich Nahum 35
Process Model (ex: Apache) • Process created to handle each new request: – Process can block on appropriate actions, (e. g. , socket read, file read, socket write) – Concurrency handled via multiple processes • Quickly becomes unwieldy: – Process creation is expensive. – Instead, pre-forked pool is created. – Upper limit on # of processes is enforced • First by the server, eventually by the operating system. • Concurrency is limited by upper bound Web Servers: Implementation and Performance Erich Nahum 36
Process Model: Pros and Cons • Advantages: – Most importantly, consistent with programmer's way of thinking. Most programmers think in terms of linear series of steps to accomplish task. – Processes are protected from one another; can't nuke data in some other address space. Similarly, if one crashes, others unaffected. • Disadvantages: – Slow. Forking is expensive, allocating stack, VM data structures for each process adds up and puts pressure on the memory system. – Difficulty in sharing info across processes. – Have to use locking. – No control over scheduling decisions. Web Servers: Implementation and Performance Erich Nahum 37
Thread Model (Ex: JAWS) • Use threads instead of processes. Threads consume fewer resources than processes (e. g. , stack, VM allocation). • Forking and deleting threads is cheaper than processes. • Similarly, pre-forked thread pool is created. May be limits to numbers but hopefully less of an issue than with processes since fewer resources required. Web Servers: Implementation and Performance Erich Nahum 38
Thread Model: Pros and Cons • Advantages: – Faster than processes. Creating/destroying cheaper. – Maintains programmer's way of thinking. – Sharing is enabled by default. • Disadvantages: – Less robust. Threads not protected from each other. – Requires proper OS support, otherwise, if one thread blocks on a file read, will block all the address space. – Can still run out of threads if servicing many clients concurrently. – Can exhaust certain per-process limits not encountered with processes (e. g. , number of open file descriptors). – Limited or no control over scheduling decisions. Web Servers: Implementation and Performance Erich Nahum 39
Event Model (Ex: Flash) while (1) { accept new connections until none remaining; call select() on all active file descriptors; for each FD: if (fd ready for reading) call read(); if (fd ready for writing) call write(); } • Use a single process and deal with requests in a event-driven manner, like a giant switchboard. • Use non-blocking option (O_NDELAY) on sockets, do everything asynchronously, never block on anything, and have OS notify us when something is ready. Web Servers: Implementation and Performance Erich Nahum 40
Event-Driven: Pros and Cons • Advantages: – – – Very fast. Sharing is inherent, since there’s only one process. Don't even need locks as in thread models. Can maximize concurrency in request stream easily. No context-switch costs or extra memory consumption. Complete control over scheduling decisions. • Disadvantages: – Less robust. Failure can halt whole server. – Pushes per-process resource limits (like file descriptors). – Not every OS has full asynchronous I/O, so can still block on a file read. Flash uses helper processes to deal with this (AMPED architecture). Web Servers: Implementation and Performance Erich Nahum 41
In-Kernel Model (Ex: Tux) [Figure: user-space server (HTTP above the socket layer and the user/kernel boundary, TCP/IP/ETH below it) vs. kernel-space server (HTTP below the user/kernel boundary, alongside TCP/IP)] • Dedicated kernel thread for HTTP requests: – One option: put whole server in kernel. – More likely, just deal with static GET requests in kernel to capture majority of requests. – Punt dynamic requests to full-scale server in user space, such as Apache. Web Servers: Implementation and Performance Erich Nahum 42
In-Kernel Model: Pros and Cons • In-kernel event model: – Avoids transitions to user space, copies across u-k boundary, etc. – Leverages already existing asynchronous primitives in the kernel (kernel doesn't block on a file read, etc. ) • Advantages: – Extremely fast. Tight integration with kernel. – Small component without full server optimizes common case. • Disadvantages: – Less robust. Bugs can crash whole machine, not just server. – Harder to debug and extend, since kernel programming required, which is not as well-known as sockets. – Similarly, harder to deploy. APIs are OS-specific (Linux, BSD, NT), whereas sockets & threads are (mostly) standardized. – HTTP evolving over time, have to modify kernel code in response. Web Servers: Implementation and Performance Erich Nahum 43
So What’s the Performance? • Graph shows server throughput for Tux, Flash, and Apache. • Experiments done on 400 MHz P/II, gigabit Ethernet, Linux 2.4.9-ac10, 8 client machines, WaspClient workload generator • Tux is fastest, but Flash close behind Web Servers: Implementation and Performance Erich Nahum 44
Summary: Server Architectures • Many ways to code up a server – Tradeoffs in speed, safety, robustness, ease of programming and extensibility, etc. • Multiple servers exist for each kind of model – Not clear that a consensus exists. • Better case for in-kernel servers as devices, e.g., reverse proxy accelerator, Akamai CDN node • User-space servers have a role: – OS should provide proper primitives for efficiency – Leave HTTP-protocol related actions in user-space – In this case, event-driven model is attractive • Key pieces to a fast event-driven server: – Minimize copying – Efficient event notification mechanism Web Servers: Implementation and Performance Erich Nahum 45
Chapter 4: Event Notification Web Servers: Implementation and Performance Erich Nahum 46
Event Notification Mechanisms • Recall how Flash works: – One process, many FD's, calling select() on all active socket descriptors. – All sockets are set using O_NDELAY flag (non-blocking) – Single address space aids sharing for performance – File reads and writes don't have non-blocking support, thus helper processes (AMPED architecture) • Point is to exploit concurrency/parallelism: – Can read one socket while waiting to write on another • Event notification: – Mechanism for kernel and application to notify each other of interesting/important events – E. g. , connection arrivals, socket closes, data available to read, space available for writing Web Servers: Implementation and Performance Erich Nahum 47
State-Based: Select & Poll • select() and poll(): – State-based: Is socket ready for reading/writing? – select() interface has FD_SET bitmasks turned on/off based on interest – poll() is simple array, larger structure but simpler implementation • Performance costs: – Kernel scans O(N) descriptors to set bits – User application scans O(N) descriptors – select() bit manipulation can be expensive • Problems: – Traffic is bursty, connections not active all at once • # (active connections) << # (open connections). • Costs are O(total connections), not O(active connections) – Application keeps specifying interest set repeatedly Web Servers: Implementation and Performance Erich Nahum 48
Event-Based Notification Banga, Mogul & Druschel (USENIX 99) • Propose an event based approach, rather than state-based: – Something just happened on socket X, rather than socket X is ready for reading or writing – Server takes event as indication socket might be ready – Multiple events can happen on a single socket (e. g. , packets draining (implying writeable) or accumulating (readable)) • API has following: – Application notifies kernel by calling declare_interest() once per file descriptor (e. g. , after accept()), rather than multiple times like in select()/poll() – Kernel queues events internally – Application calls get_next_event() to see changes Web Servers: Implementation and Performance Erich Nahum 49
Event-Based Notification (cont) • Problems: – Kernel has to allocate storage for event queue. Little's law says it needs to be proportional to the event rate – Bursty applications could overflow queue – Can address multiple events by coalescing based on FD – Results in storage O(total connections). • Application has to change the way it thinks: – Respond to events, instead of checking state. – If events are missed, connections might get stuck. • Evaluation shows it scales nicely: – cost is O(active) not O(total) • Windows NT has something similar: – called IO completion ports Web Servers: Implementation and Performance Erich Nahum 50
Notification in the Real World POSIX Real-Time Signals: – Different concept: Unix signals are invoked when something is ready on a file descriptor. – Signals are expensive and difficult to control (e.g., no ordering), so applications can suppress signals and then retrieve them via sigwaitinfo() – If signal queue fills up, events will be dropped. A separate signal is raised to notify application about signal queue overflow. Problems: – If signal queue overflows, then app must fall back on state-based approach. Chandra and Mosberger propose signal-per-fd (coalescing events per file descriptor). – Only one event is retrieved at a time: Provos and Lever propose sigtimedwait4() to retrieve multiple signals at once Web Servers: Implementation and Performance Erich Nahum 51
Notification in the Real World • Sun's /dev/poll: – App notifies kernel by writing to special file /dev/poll to express interest – App does IOCTL on /dev/poll for list of ready FD's – App and kernel are still both state based – Kernel still pays O(total connections) to create FD list • Libenzi’s /dev/epoll (patch for Linux 2. 4): – Uses /dev/epoll as interface, rather than /dev/poll – Application writes interest to /dev/epoll and IOCTL's to get events – Events are coalesced on a per-FD basis – Semantically identical to RT signals with sig-per-fd & sigtimedwait 4(). Web Servers: Implementation and Performance Erich Nahum 52
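The /dev/epoll patch described above later evolved into the epoll_create()/epoll_ctl()/epoll_wait() system calls merged into mainline Linux; a minimal sketch of that interface (handle_event() is a placeholder for the HTTP state machine), showing the O(active) behavior these schemes aim for:

    #include <sys/epoll.h>
    #include <sys/socket.h>

    void handle_event(int fd, unsigned events);   /* assumed elsewhere */

    /* Event loop using epoll: interest is declared once per descriptor,
       and epoll_wait() returns only the descriptors with pending events. */
    void epoll_loop(int listen_sock)
    {
        struct epoll_event ev, events[1024];
        int epfd = epoll_create(1024);

        ev.events = EPOLLIN;
        ev.data.fd = listen_sock;
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_sock, &ev);

        for (;;) {
            int i, n = epoll_wait(epfd, events, 1024, -1);
            for (i = 0; i < n; i++) {
                if (events[i].data.fd == listen_sock) {
                    int c = accept(listen_sock, NULL, NULL);
                    ev.events = EPOLLIN | EPOLLOUT;
                    ev.data.fd = c;
                    epoll_ctl(epfd, EPOLL_CTL_ADD, c, &ev);   /* declare interest once */
                } else {
                    handle_event(events[i].data.fd, events[i].events);
                }
            }
        }
    }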
Real File Asynchronous I/O • Like setting O_NDELAY (non-blocking) on file descriptors: – Application can queue reads and writes on FDs and pick them up later (like dry cleaning) – Requires support in the file system (e.g., callbacks) • Currently doesn't exist on many OS's: – POSIX specification exists – Solaris has non-standard version – Linux has it slated for 2.5 kernel • Two current candidates on Linux: – SGI's /dev/kaio and Ben LaHaise's /dev/aio • Proper implementation would allow Flash to eliminate helpers Web Servers: Implementation and Performance Erich Nahum 53
Summary: Event Notification • Goal is to exploit concurrency – Concurrency in user workloads means host CPU can overlap multiple events to maximize parallelism – Keep network, disk busy; never block • Event notification changes applications: – state-based to event-based – requires a change in thinking • Goal is to minimize costs: – user/kernel crossings and testing idle socket descriptors • Event-based notification not yet fully deployed: – Most mechanisms only support network I/O, not file I/O – Full deployment of Asynchronous I/O spec should fix this Web Servers: Implementation and Performance Erich Nahum 54
Chapter 5: Workload Characterization Web Servers: Implementation and Performance Erich Nahum 55
Workload Characterization • Why Characterize Workloads? – Gives an idea about traffic behavior ("Which documents are users interested in? ") – Aids in capacity planning ("Is the number of clients increasing over time? ") – Aids in implementation ("Does caching help? ") • How do we capture them ? – Through server logs (typically enabled) – Through packet traces (harder to obtain and to process) Web Servers: Implementation and Performance Erich Nahum 56
Factors to Consider client? proxy? server? • Where do I get logs from? – Client logs give us an idea, but not necessarily the same – Same for proxy logs – What we care about is the workload at the server • Is trace representative? – Corporate POP vs. News vs. Shopping site • What kind of time resolution? – e. g. , second, millisecond, microsecond • Does trace/log capture all the traffic? – e. g. , incoming link only, or one node out of a cluster Web Servers: Implementation and Performance Erich Nahum 57
Probability Refresher • Lots of variability in workloads – Use probability distributions to express – Want to consider many factors • Some terminology/jargon: – Mean: average of samples – Median: half are bigger, half are smaller – Percentiles: dump samples into N bins (median is 50th percentile number) • Heavy-tailed: P[X > x] ~ x^(-a) as x -> infinity Web Servers: Implementation and Performance Erich Nahum 58
Important Distributions Some Frequently-Seen Distributions: • Normal: (avg. mu, variance sigma^2) • Lognormal: (x >= 0; sigma > 0) • Exponential: • Pareto: (x >= k, shape a, scale k) Web Servers: Implementation and Performance Erich Nahum 59
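The density formulas themselves did not survive the slide-to-text conversion; for reference, the standard forms matching the parameter annotations above are (a reconstruction, not copied from the original slide):

    \begin{align*}
    \text{Normal:}      \quad & f(x) = \tfrac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}
                          && \text{mean } \mu,\ \text{variance } \sigma^2 \\
    \text{Lognormal:}   \quad & f(x) = \tfrac{1}{x\sigma\sqrt{2\pi}}\, e^{-(\ln x-\mu)^2/2\sigma^2}
                          && x \ge 0,\ \sigma > 0 \\
    \text{Exponential:} \quad & f(x) = \lambda e^{-\lambda x}
                          && x \ge 0 \\
    \text{Pareto:}      \quad & f(x) = a k^a x^{-(a+1)}, \quad P[X > x] = (k/x)^a
                          && x \ge k,\ \text{shape } a,\ \text{scale } k
    \end{align*}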
More Probability • Graph shows 3 distributions with average = 2. • Note average ≠ median in all cases! • Different distributions have different “weight” in tail. Web Servers: Implementation and Performance Erich Nahum 60
What Info is Useful? • Request methods – GET, POST, HEAD, etc. • Response codes • • • – success, failure, not-modified, etc. Size of requested files Size of transferred objects Popularity of requested files Numbers of embedded objects Inter-arrival time between requests Protocol support (1. 0 vs. 1. 1) Web Servers: Implementation and Performance Erich Nahum 61
Sample Logs for Illustration
Name:        Chess 1997 | Olympics 1998 | IBM 1998 | IBM 2001
Description: Kasparov-Deep Blue Event Site | Nagano 1998 Olympics Event Site | — | Corporate Presence
Period:      2 weeks in May 1997 | 2 days in Feb 1998 | 1 day in June 1998 | 1 day in Feb 2001
Hits:        1,586,667 | 5,800,000 | 11,485,600 | 12,445,739
Bytes:       14,171,711 | 10,515,507 | 54,697,108 | 28,804,852
Clients:     256,382 | 80,921 | 860,211 | 319,698
URLs:        2,293 | 30,465 | 15,788 | 42,874
We’ll use statistics generated from these logs as examples. Web Servers: Implementation and Performance Erich Nahum 62
Request Methods
        | Chess 1997 | Olympics 1998 | IBM 1998 | IBM 2001
GET     | 96%        | 99.6%         | 99.3%    | 97%
HEAD    | 04%        | 00.3%         | 00.08%   | 02%
POST    | 00.007%    | 00.04%        | —        | 00.2%
Others: noise
• KR01: "overwhelming majority" are GETs, few POSTs • IBM 2001 trace starts seeing a few 1.1 methods (CONNECT, OPTIONS, LINK), but still very small (1/10^5 %) Web Servers: Implementation and Performance Erich Nahum 63
Response Codes Code Meaning Chess 1997 Olympics 1998 IBM 2001 200 204 206 301 302 304 400 401 403 404 407 500 501 503 ? ? ? OK NO_CONTENT PARTIAL_CONTENT MOVED_PERMANENTLY MOVED_TEMPORARILY NOT_MODIFIED BAD_REQUEST UNAUTHORIZED FORBIDDEN NOT_FOUND PROXY_AUTH SERVER_ERROR NOT_IMPLEMENTED SERVICE_UNAVAIL UNKNOWN 85. 32 --. -00. 25 00. 05 13. 73 00. 001 --. — 00. 01 00. 55 --. ---. -00. 0003 76. 02 --. ---. -00. 05 23. 24 00. 0001 00. 02 00. 64 --. -00. 003 00. 0001 --. -00. 00004 75. 28 00. 00001 --. -01. 18 22. 84 00. 003 00. 0001 00. 65 --. -00. 006 00. 0005 00. 0001 00. 005 67. 72 --. ---. -15. 11 16. 26 00. 001 00. 009 00. 79 00. 002 00. 07 00. 006 00. 0003 00. 0004 • Table shows percentage of responses. • Majority are OK and NOT_MODIFIED. • Consistent with numbers from AW 96, KR 01. Web Servers: Implementation and Performance Erich Nahum 64
Resource (File) Sizes • Shows file/memory usage (not weighted by frequency!) • Lognormal body, consistent with results from AW96, CB96, KR01. • AW96, CB96: sizes have Pareto tail; Downey01: sizes are lognormal. Web Servers: Implementation and Performance Erich Nahum 65
Tails from the File Size • Shows the complementary CDF (CCDF) of file sizes. • Haven’t done the curve fitting but looks Pareto-ish. Web Servers: Implementation and Performance Erich Nahum 66
Response (Transfer) Sizes • Shows network usage (weighted by frequency of requests) • Lognormal body, pareto tail, consistent with CBC 95, AW 96, CB 96, KR 01 Web Servers: Implementation and Performance Erich Nahum 67
Tails of Transfer Size • Shows the complementary CDF (CCDF) of file sizes. • Looks somewhat Pareto-like; certainly some big transfers. Web Servers: Implementation and Performance Erich Nahum 68
Resource Popularity • Follows a Zipf model: p(r) = r^{-alpha} • • Consistent with CBC 95, AW 96, CB 96, PQ 00, KR 01 Shows that caching popular documents is very effective (alpha = 1 true Zipf; others “Zipf-like") Web Servers: Implementation and Performance Erich Nahum 69
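Written out with its normalization constant (a standard form, not taken from the original slide), the Zipf-like model for the probability of a request hitting the document of popularity rank r out of N distinct documents is:

    p(r) \;=\; \frac{r^{-\alpha}}{\sum_{i=1}^{N} i^{-\alpha}}, \qquad r = 1, 2, \ldots, N

with alpha = 1 giving “true” Zipf and other values the “Zipf-like” distributions mentioned above.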
Number of Embedded Objects • Mah97: avg 3, 90% are 5 or less • BC98: pareto distr, median 0.8, mean 1.7 • Arlitt 98 World Cup study: median 15 objects, 90% are 20 or less • MW00: median 7-17, mean 11-18, 90% 40 or less • STA00: median 5, 30 (2 traces), 90% 50 or less • Mah97, BC98, SCJO01: embedded objects tend to be smaller than container objects • KR01: median is 8-20, pareto distribution Trend seems to be that number is increasing over time. Web Servers: Implementation and Performance Erich Nahum 70
Session Inter-Arrivals • Inter-arrival time between successive requests – “Think time" – difference between user requests vs. ALL requests – partly depends on definition of boundary • CB 96: variability across multiple timescales, "selfsimilarity", average load very different from peak or heavy load • SCJO 01: log-normal, 90% less than 1 minute. • AW 96: independent and exponentially distributed • KR 01: pareto with a=1. 5, session arrivals follow poisson distribution, but requests follow pareto Web Servers: Implementation and Performance Erich Nahum 71
Protocol Support • IBM.com 2001 logs: – Show roughly 53% of client requests are 1.1 • KA01 study: – 92% of servers claim to support 1.1 (as of Sep 00) – Only 31% actually do; most fail to comply with spec • SCJO01 show: – Avg 6.5 requests per persistent connection – 65% have 2 connections per page, rest more. – 40-50% of objects downloaded by persistent connections Appears that we are in the middle of a slow transition to 1.1 Web Servers: Implementation and Performance Erich Nahum 72
Summary: Workload Characterization • Traffic is variable: – Responses vary across multiple orders of magnitude • Traffic is bursty: – Peak loads much larger than average loads • Certain files more popular than others – Zipf-like distribution captures this well • Two-sided aspect of transfers: – Most responses are small (zero pretty common) – Most of the bytes are from large transfers • Controversy over Pareto/log-normal distribution • Non-trivial for workload generators to replicate Web Servers: Implementation and Performance Erich Nahum 73
Chapter 6: Workload Generators Web Servers: Implementation and Performance Erich Nahum 74
Why Workload Generators? • Allows stress-testing and bug -finding • Gives us some idea of server capacity • Allows us a scientific process to compare approaches – e. g. , server models, gigabit adaptors, OS implementations • Assumption is that difference in testbed translates to some difference in real-world • Allows the performance debugging cycle Web Servers: Implementation and Performance Measure Fix and/or improve Reproduce Find Problem The Performance Debugging Cycle Erich Nahum 75
Problems with Workload Generators • Only as good as our understanding of the traffic • Traffic may change over time – generators must too • May not be representative – e. g. , are file size distributions from IBM. com similar to mine? • May be ignoring important factors – e. g. , browser behavior, WAN conditions, modem connectivity • Still, useful for diagnosing and treating problems Web Servers: Implementation and Performance Erich Nahum 76
How does W. Generation Work? • Many clients, one server – match asymmetry of Internet • Server is populated with some kind of synthetic content • Simulated clients produce requests for server • Master process to control clients, aggregate results • Goal is to measure server – not the client or network Requests Responses • Must be robust to conditions – e. g. , if server keeps sending 404 not found, will clients notice? Web Servers: Implementation and Performance Erich Nahum 77
Evolution: WebStone • The original workload generator from SGI in 1995 • Process based workload generator, implemented in C • Clients talk to master via sockets • Configurable: # client machines, # client processes, run time • Measured several metrics: avg + max connect time, response time, throughput rate (bits/sec), # pages, # files • 1.0 only does GETs, CGI support added in 2.0 • Static requests, 5 different file sizes:
Percentage  Size
35.00       500 B
50.00       5 KB
14.00       50 KB
0.90        500 KB
0.10        5 MB
www.mindcraft.com/webstone Web Servers: Implementation and Performance Erich Nahum 78
Evolution: SPECWeb96 • Developed by SPEC – Systems Performance Evaluation Consortium – Non-profit group with many benchmarks (CPU, FS) • Attempt to get more representative – Based on logs from NCSA, HP, Hal Computers • 4 classes of files:
Percentage  Size
35.00       0-1 KB
50.00       1-10 KB
14.00       10-100 KB
1.00        100 KB - 1 MB
• Poisson distribution between each class Web Servers: Implementation and Performance Erich Nahum 79
SPECWeb96 (cont) • Notion of scaling versus load: – number of directories in data set size doubles as expected throughput quadruples (sqrt(throughput/5)*10) – requests spread evenly across all application directories • Process based WG • Clients talk to master via RPC's (less robust) • Still only does GETs, no keep-alive www.spec.org/osg/web96 Web Servers: Implementation and Performance Erich Nahum 80
Evolution: SURGE • Scalable URL Reference GEnerator – Barford & Crovella at Boston University CS Dept. • Much more worried about representativeness, captures: – – – server file size distributions, request size distribution, relative file popularity embedded file references temporal locality of reference idle periods ("think times") of users • Process/thread based WG Web Servers: Implementation and Performance Erich Nahum 81
SURGE (cont) • Notion of “user-equivalent”: – statistical model of a user – active “off” time (between URLS), – inactive “off” time (between pages) • Captures various levels of burstiness • Not validated, shows that load generated is different than Spec. Web 96 and has more burstiness in terms of CPU and # active connections www. cs. wisc. edu/~pb Web Servers: Implementation and Performance Erich Nahum 82
Evolution: S-client • Almost all workload generators are closed-loop: – client submits a request, waits for server, maybe thinks for some time, repeat as necessary • Problem with the closed-loop approach: – client can't generate requests faster than the server can respond – limits the generated load to the capacity of the server – in the real world, arrivals don’t depend on server state • i. e. , real users have no idea about load on the server when they click on a site, although successive clicks may have this property – in particular, can't overload the server • s-client tries to be open-loop: – by generating connections at a particular rate – independent of server load/capacity Web Servers: Implementation and Performance Erich Nahum 83
S-Client (cont) • How is s-client open-loop? – connecting asynchronously at a particular rate – using non-blocking connect() socket call • Connect complete within a particular time? – if yes, continue normally. – if not, socket is closed and new connect initiated. • Other details: – uses single-address space event-driven model like Flash – calls select() on large numbers of file descriptors – can generate large loads • Problems: – client capacity is still limited by active FD's – “arrival” is a TCP connect, not an HTTP request www.cs.rice.edu/CS/Systems/Web-measurement Web Servers: Implementation and Performance Erich Nahum 84
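The core trick is the non-blocking connect(). A sketch of how a generator might issue one attempt per tick of a rate timer (illustrative only, not the actual s-client code):

    #include <fcntl.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Start one connection attempt without blocking.  The event loop later
       uses select() for writability to see whether it completed within the
       deadline, and closes and replaces it if not. */
    int start_connect(const struct sockaddr_in *server)
    {
        int s = socket(AF_INET, SOCK_STREAM, 0);
        if (s < 0)
            return -1;
        fcntl(s, F_SETFL, O_NONBLOCK);

        /* With O_NONBLOCK, connect() returns immediately (EINPROGRESS);
           the three-way handshake proceeds in the background. */
        connect(s, (const struct sockaddr *)server, sizeof(*server));
        return s;
    }

Because start_connect() is called every 1/rate seconds regardless of how many earlier attempts have completed, the offered load is independent of how fast the server responds, which is what makes the generator open-loop.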
Evolution: SPECWeb 99 • In response to people "gaming" benchmark, now includes rules: – IP maximum segment lifetime (MSL) must be at least 60 seconds (more on this later!) – Link-layer maximum transmission unit (MTU) must not be larger than 1460 bytes (Ethernet frame size) – Dynamic content may not be cached • not clear that this is followed – Servers must log requests. • W 3 C common log format is sufficient but not mandatory. – Resulting workload must be within 10% of target. – Error rate must be below 1%. • Metric has changed: – now "number of simultaneous conforming connections“: rate of a connection must be greater than 320 Kbps Web Servers: Implementation and Performance Erich Nahum 85
SPECWeb99 (cont) • Directory size has changed: ((25 + (400000/122000) * simultaneous conns) / 5.0) • Improved HTTP 1.0/1.1 support: – Keep-alive requests (client closes after N requests) – Cookies • Back-end notion of user demographics – Used for ad rotation – Request includes user_id and last_ad • Request breakdown: – 70.00% static GET – 12.45% dynamic GET – 12.60% dynamic GET with custom ad rotation – 04.80% dynamic POST – 00.15% dynamic GET calling CGI code Web Servers: Implementation and Performance Erich Nahum 86
SPECWeb99 (cont) • Other breakdowns: – 30% HTTP 1.0 with no keep-alive or persistence – 70% HTTP 1.0 with keep-alive to "model" persistence – still has 4 classes of file size with Poisson distribution – supports Zipf popularity • Client implementation details: – Master-client communication now uses sockets – Code includes sample Perl code for CGI – Client configurable to use threads or processes • Much more info on setup, debugging, tuning • All results posted to web page, – including configuration & back end code www.spec.org/osg/web99 Web Servers: Implementation and Performance Erich Nahum 87
So how realistic is SPECWeb 99? • We’ll compare a few characteristics: – – – File size distribution (body) File size distribution (tail) Transfer size distribution (body) Transfer size distribution (tail) Document popularity • Visual comparison only – No curve-fitting, r-squared plots, etc. – Point is to give a feel for accuracy Web Servers: Implementation and Performance Erich Nahum 88
Spec. Web 99 vs. File Sizes • Spec. Web 99: In the ballpark, but not very smooth Web Servers: Implementation and Performance Erich Nahum 89
Spec. Web 99 vs. File Size Tail • Spec. Web 99 tail isn’t as long as real logs (900 KB max) Web Servers: Implementation and Performance Erich Nahum 90
Spec. Web 99 vs. Transfer Sizes • Doesn’t capture 304 (not modified) responses • Coarser distribution than real logs (i. e. , not smooth) Web Servers: Implementation and Performance Erich Nahum 91
Spec 99 vs. Transfer Size Tails • Spec. Web 99 does OK, although tail drops off rapidly (and in fact, no file is greater than 1 MB in Spec. Web 99!). Web Servers: Implementation and Performance Erich Nahum 92
Spec 99 vs. Resource Popularity • Spec. Web 99 seems to do a good job, although tail isn’t long enough Web Servers: Implementation and Performance Erich Nahum 93
Evolution: TPC-W • Transaction Processing Council (TPC-W) – – More known for database workloads like TPC-D Metrics include dollars/transaction (unlike SPEC) Provides specification, not source Meant to capture a large e-commerce site – – – web serving, searching, browsing, shopping carts online transaction processing (OLTP) decision support (DSS) secure purchasing (SSL), best sellers, new products customer registration, administrative updates • Models online bookstore • Has notion of scaling per user – 5 MB of DB tables per user – 1 KB per shopping item, 25 KB per item in static images Web Servers: Implementation and Performance Erich Nahum 94
TPC-W (cont) • Remote browser emulator (RBE) – emulates a single user – send HTTP request, parse, wait for thinking, repeat • Metrics: – WIPS: shopping – WIPSb: browsing – WIPSo: ordering • Setups tend to be very large: – multiple image servers, application servers, load balancer – DB back end (typically SMP) – Example: IBM 12-way SMP w/DB2, 9 PCs w/IIS: $1M www.tpc.org/tpcw Web Servers: Implementation and Performance Erich Nahum 95
Summary: Workload Generators • Only the beginning. Many other workload generators: – – httperf from HP WAGON from IBM Wasp. Client from IBM Others? • Both workloads and generators change over time: – Both started simple, got more complex – As workload changes, so must generators • No one single "good" generator – Spec. Web 99 seems the favorite (2002 rumored in the works) • Implementation issues similar to servers: – They are networked-based request producers (i. e. , produce GET's instead of 200 OK's). – Implementation affects capacity planning of clients! (want to make sure clients are not bottleneck) Web Servers: Implementation and Performance Erich Nahum 96
Chapter 7: Introduction to TCP Web Servers: Implementation and Performance Erich Nahum 97
Introduction to TCP • Layering is a common principle in network protocol design • TCP is the major transport protocol in the Internet • Since HTTP runs on top of TCP, much interaction between the two • Asymmetry in client-server model puts strain on server-side TCP implementations • Thus, major issue in web servers is TCP implementation and behavior Web Servers: Implementation and Performance Erich Nahum application transport network link physical 98
The TCP Protocol • Connection-oriented, point-to-point protocol: – Connection establishment and teardown phases – ‘Phone-like’ circuit abstraction – One sender, one receiver • Originally optimized for certain kinds of transfer: – Telnet (interactive remote login) – FTP (long, slow transfers) – Web is like neither of these • Lots of work on TCP, beyond scope of this tutorial – e. g. , know of 3 separate TCP tutorials! Web Servers: Implementation and Performance Erich Nahum 99
TCP Protocol (cont) socket layer application writes data TCP send buffer application reads data TCP receive buffer data segment socket layer ACK segment • Provides a reliable, in-order, byte stream abstraction: – – Recover lost packets and detect/drop duplicates Detect and drop bad packets Preserve order in byte stream, no “message boundaries” Full-duplex: bi-directional data flow in same connection – – Flow control: sender will not overwhelm receiver Congestion control: sender will not overwhelm network! Send and receive buffers Congestion and flow control windows • Flow and congestion controlled: Web Servers: Implementation and Performance Erich Nahum 100
The TCP Header Fields enable the following: • Uniquely identifying a connection (4-tuple of client/server IP address and port numbers) • Identifying a byte range within that connection • Checksum value to detect corruption • Identifying protocol transitions (SYN, FIN) • Informing other side of your state (ACK) [Header diagram, 32 bits wide: source port #, dest port #, sequence number, acknowledgement number, header length, unused bits, flags (URG, ACK, PSH, RST, SYN, FIN), receiver window size, checksum, urgent data pointer, options (variable length), application data (variable length)] Web Servers: Implementation and Performance Erich Nahum 101
Establishing a TCP Connection • Client sends SYN with initial sequence number (ISN) • Server responds with its own SYN w/seq number and ACK of client (ISN+1) (next expected byte) • Client ACKs server's ISN+1 • The ‘3-way handshake’ • All modulo 32-bit arithmetic [Figure: client calls connect() and sends SYN(X); server, listening on port 80, replies SYN(Y) + ACK(X+1) and accept() returns; client sends ACK(Y+1); server then calls read()] Web Servers: Implementation and Performance Erich Nahum 102
Sending Data socket layer application writes data TCP send buffer application reads data segment ACK segment TCP receive buffer socket layer • Sender puts data on the wire: – Holds copy in case of loss – Sender must observed receiver flow control window – Sender can discard data when ACK is received • Receiver sends acknowledgments (ACKs) – ACKs can be piggybacked on data going the other way – Protocol says receiver should ACK every other packet in attempt to reduce ACK traffic (delayed ACKs) – Delay should not be more than 500 ms. (typically 200) – We’ll see how this causes problems later Web Servers: Implementation and Performance Erich Nahum 103
Preventing Congestion • Sender may not only overrun receiver, but may also overrun intermediate routers: – No way to explicitly know router buffer occupancy, so we need to infer it from packet losses – Assumption is that losses stem from congestion, namely, that intermediate routers have no available buffers • Sender maintains a congestion window: – Never have more than CW of un-acknowledged data outstanding (or RWIN data; min of the two) – Successive ACKs from receiver cause CW to grow. • How CW grows based on which of 2 phases: – Slow-start: initial state. – Congestion avoidance: steady-state. – Switch between the two when CW > slow-start threshold Web Servers: Implementation and Performance Erich Nahum 104
Congestion Control Principles • Lack of congestion control would lead to congestion collapse (Jacobson 88). • Idea is to be a “good network citizen”. • Would like to transmit as fast as possible without loss. • Probe network to find available bandwidth. • In steady-state: linear increase in CW per RTT. • After loss event: CW is halved. • This is called additive increase /multiplicative decrease (AIMD). • Various papers on why AIMD leads to network stability. Web Servers: Implementation and Performance Erich Nahum 105
Slow Start • Initial CW = 1. • After each ACK, CW += 1; • Continue until: – Loss occurs OR – CW > slow start threshold • Then switch to congestion avoidance • If we detect loss, cut CW in half • Exponential increase in window size per RTT [Figure: sender/receiver timeline showing one segment in the first RTT, two segments in the next, then four segments] Web Servers: Implementation and Performance Erich Nahum 106
Congestion Avoidance
Until (loss) {
    after CW packets ACKed:
        CW += 1;
}
ssthresh = CW/2;
Depending on loss type:
    SACK/Fast Retransmit: CW /= 2; continue;
    Coarse-grained timeout: CW = 1; go to slow start.
(This is for TCP Reno/SACK: TCP Tahoe always sets CW=1 after a loss) Web Servers: Implementation and Performance Erich Nahum 107
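The same rules can be written as a toy model in C (purely illustrative: windows in whole segments, no timers, no receiver window, not a real TCP implementation):

    /* Toy model of Reno-style congestion window dynamics. */
    typedef struct {
        double cwnd;        /* congestion window, in segments */
        double ssthresh;    /* slow-start threshold */
    } tcp_cc;

    void on_ack(tcp_cc *c)
    {
        if (c->cwnd < c->ssthresh)
            c->cwnd += 1.0;               /* slow start: +1 segment per ACK */
        else
            c->cwnd += 1.0 / c->cwnd;     /* congestion avoidance: ~+1 segment per RTT */
    }

    void on_loss(tcp_cc *c, int coarse_timeout)
    {
        c->ssthresh = c->cwnd / 2.0;
        if (coarse_timeout)
            c->cwnd = 1.0;                /* timeout: back to slow start (Tahoe does this for any loss) */
        else
            c->cwnd = c->ssthresh;        /* SACK/fast retransmit: halve and continue */
    }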
How are losses recovered? Say packet is lost (data or ACK!) • Coarse-grained Timeout: – Sender does not receive ACK after some period of time – Event is called a retransmission time-out (RTO) – RTO value is based on estimated round-trip time (RTT) – RTT is adjusted over time using exponential weighted moving average: RTT = (1-x)*RTT + x*sample (x is typically 0.1) First done in TCP Tahoe [Figure: lost-ACK scenario – sender sends Seq=92, 8 bytes of data; the receiver's ACK 100 is lost; the sender times out and retransmits Seq=92, 8 bytes] Web Servers: Implementation and Performance Erich Nahum 108
Fast Retransmit • Receiver expects N, gets N+1: – Immediately sends ACK(N) – This is called a duplicate ACK – Does NOT delay ACKs here! – Continue sending dup ACKs for each subsequent packet (not N) • Sender gets 3 duplicate ACKs: – Infers N is lost and resends – 3 chosen so out-of-order packets don’t trigger Fast Retransmit accidentally – Called “fast” since we don’t need to wait for a full RTT Introduced in TCP Reno [Figure: SEQ=3000 (size=1000) is lost; SEQ=4000, 5000, 6000 each trigger a duplicate ACK 3000; after 3 dup ACKs the sender retransmits SEQ=3000, size=1000] Web Servers: Implementation and Performance Erich Nahum 109
Other loss recovery methods • Selective Acknowledgements (SACK): – Returned ACKs contain option w/SACK block – Block says, "got up to N-1 AND got N+1 through N+3" – A single ACK can generate a retransmission • New Reno partial ACKs: – New ACK during fast retransmit may not ACK all outstanding data. Ex: • Have ACK of 1, waiting for 2-6, get 3 dup acks of 1 • Retransmit 2, get ACK of 3, can now infer 4 lost as well • Other schemes exist (e.g., Vegas) • Reno has been prevalent; SACK now catching on Web Servers: Implementation and Performance Erich Nahum 110
How about Connection Teardown? • Either side may terminate a connection. (In fact, connection can stay half-closed.) Let's say the server closes (typical in WWW) • Server sends FIN with seq number (SN+1) (i.e., FIN is a byte in sequence) • Client ACKs the FIN with SN+2 ("next expected") • Client sends its own FIN when ready • Server ACKs client FIN as well with SN+1. [Figure: server calls close() and sends FIN(X); client replies ACK(X+1), later calls close() and sends FIN(Y); server replies ACK(Y+1); client waits in timed wait, then both sides are closed] Web Servers: Implementation and Performance Erich Nahum 111
The TCP State Machine • TCP uses a Finite State Machine, kept by each side of a connection, to keep track of what state a connection is in. • State transitions reflect inherent races that can happen in the network, e. g. , two FIN's passing each other in the network. • Certain things can go wrong along the way, i. e. , packets can be dropped or corrupted. In fact, machine is not perfect; certain problems can arise not anticipated in the original RFC. • This is where timers will come in, which we will discuss more later. Web Servers: Implementation and Performance Erich Nahum 112
TCP State Machine: Connection Establishment • CLOSED: more implied than actual, i. e. , no connection • LISTEN: willing to receive connections (accept call) • SYN-SENT: sent a SYN, waiting for SYN-ACK • SYN-RECEIVED: received a SYN, waiting for an ACK of our SYN • ESTABLISHED: connection ready for data transfer CLOSED server application calls listen() client application calls connect() send SYN LISTEN SYN_SENT receive SYN send SYN + ACK SYN_RCVD receive SYN send ACK receive SYN & ACK send ACK receive ACK ESTABLISHED Web Servers: Implementation and Performance Erich Nahum 113
TCP State Machine: Connection Teardown • FIN-WAIT-1: we closed first, waiting for ACK of our FIN (active close) • FIN-WAIT-2: we closed first, other side has ACKED our FIN, but not yet FIN'ed • CLOSING: other side closed before it received our FIN • TIME-WAIT: we closed, other side closed, got ACK of our FIN • CLOSE-WAIT: other side sent FIN first, not us (passive close) • LAST-ACK: other side sent FIN, then we did, now waiting for ACK Web Servers: Implementation and Performance ESTABLISHED close() called send FIN receive FIN send ACK FIN_WAIT_1 receive ACK of FIN receive FIN send ACK FIN_WAIT_2 receive FIN send ACK CLOSE_WAIT close() called send FIN CLOSING receive ACK of FIN LAST_ACK TIME_WAIT receive ACK wait 2*MSL (240 seconds) CLOSED Erich Nahum 114
Summary: TCP Protocol • Protocol provides reliability in face of complex network behavior • Tries to trade off efficiency with being "good network citizen" • Vast majority of bytes transferred on Internet today are TCP-based: – Web – Mail – News – Peer-to-peer (Napster, Gnutella, FreeNet, KaZaA) Web Servers: Implementation and Performance Erich Nahum 115
Chapter 8: TCP Dynamics Web Servers: Implementation and Performance Erich Nahum 116
TCP Dynamics • In this section we'll describe some of the problems you can run into as a WWW server interacting with TCP. • Most of these affect the response as seen by the client, not the throughput generated by the server. • Ideally, a server developer shouldn't have to worry about this stuff, but in practice, we'll see that's not the case. • Examples we'll look at include: – – The initial window size The delayed ACK problem Nagle and its interaction with delayed ack Small receive windows interfering with loss recovery Web Servers: Implementation and Performance Erich Nahum 117
TCP’s Initial Window Problem • Recall congestion control: – sender's initial congestion window is set to 1 segment • Recall delayed ACKs: – ack every other packet – set 200 ms. delayed ack timer • Short-term deadlock: – sender is waiting for ACK since it sent 1 segment – receiver is waiting for 2nd segment before ACKing 1st • Problem worse than it seems: – multiple objects per web page – IE does not do pipelining! [Figure: client sends GET /index.html; server sends the 1st segment and waits; the receiver's 200 ms delayed-ACK timer expires before ACK of 1st goes out; only then are the 2nd and 3rd segments sent, costing an extra RTT plus the delayed-ACK wait] Web Servers: Implementation and Performance Erich Nahum 118
Solving the IW Problem • Solution: set IW = 2–4 segments – RFC 2414 – Didn't affect many BSD systems, since they (incorrectly) counted the connection setup in the congestion window calculation – Delayed ACK still happens, but is now out of the critical path of response time for the download [Figure: timeline — client sends GET /index.html; server sends the 1st and 2nd segments back to back, receives the ACK of the 1st and 2nd after one RTT, then sends the 3rd segment; the 200 ms delayed-ACK timer now only delays the ACK of the 3rd segment.] Web Servers: Implementation and Performance Erich Nahum 119
Receive Window Size Problem • Recall Fast Retransmit • Amount of data in flight: – MIN(cong win, recv win) – can't ever have more than that outstanding • In order for FR to work: – enough data has to be in flight – after the lost packet, 3 more segments must arrive – 4.5 KB of receive-side buffer space must be available – note many web documents are less than 4.5 KB! [Figure: sender transmits SEQ=3000 (size=1000, lost), then SEQ=4000, SEQ=5000, SEQ=6000; each later segment triggers a duplicate ACK 3000, and the 3rd duplicate triggers the fast retransmit of SEQ=3000, size=1000.] Web Servers: Implementation and Performance Erich Nahum 120
Receive Window Size (cont) • Previous discussion assumes large enough receive windows! – Early versions of MS Windows had a 16 KB default receive window • Balakrishnan et al. 1998: – studied server TCP traces from the 1996 Olympic Web Server – show over 50% of clients have a receive window < 10 KB – many suffer coarse-grained retransmission timeouts (RTOs) – even SACK would not have helped! [Figure: with ACK 3000, RWIN = 2000 the sender may have only two segments outstanding; SEQ=3000 (size=1000) is lost, SEQ=4000 produces a single duplicate ACK 3000 with RWIN = 1000 (illegal for the sender to send more), so fast retransmit never triggers and the sender must wait for the RTO timeout before resending SEQ=3000.] Web Servers: Implementation and Performance Erich Nahum 121
Fixing the Receive Window Problem • Balakrishnan et al. 98: – "right-edge recovery" – also proposed by Lin & Kung 98 – now an RFC (3042) • How does it work? (a rough sketch follows below) – arrival of a dup ACK means a segment has left the network – when a dup ACK is received, send the next segment (not a retransmission) – continue with the 2nd and 3rd dup ACKs – idea is to "keep the ACK clock flowing" by forcing more duplicate ACKs to be generated – claim is that it would have avoided 25% of coarse-grained timeouts in the 96 Olympics trace [Figure: even though the receive window limits outstanding data, each duplicate ACK 3000 opens the right edge of the window and lets the sender transmit one new segment (SEQ=4000, 5000, 6000); the 3rd dup ACK then triggers the retransmission of SEQ=3000, size=1000.] Web Servers: Implementation and Performance Erich Nahum 122
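The bullets above amount to a small change in the sender's ACK-processing path. The following is a hypothetical C-style sketch of that idea (RFC 3042), not the code of any real TCP stack; the struct fields and helper functions (send_next_segment(), retransmit(), etc.) are invented for illustration.

```c
#include <stdint.h>

struct tcp_conn {
    uint32_t snd_una;            /* oldest unacknowledged sequence number */
    uint32_t snd_nxt;            /* next sequence number to send */
    uint32_t mss;                /* maximum segment size */
    uint32_t bytes_outstanding;  /* data currently in flight */
    int      dup_acks;           /* consecutive duplicate-ACK count */
};

void send_next_segment(struct tcp_conn *c);          /* hypothetical helpers */
void retransmit(struct tcp_conn *c, uint32_t seq);

void on_ack(struct tcp_conn *c, uint32_t ack_seq, uint32_t rwin)
{
    if (ack_seq == c->snd_una && c->bytes_outstanding > 0) {
        c->dup_acks++;
        if (c->dup_acks < 3) {
            /* 1st/2nd dup ACK: a segment has left the network, so send one
             * *new* segment if the advertised window allows, instead of
             * sitting idle; cwnd is not changed. */
            if (c->snd_nxt + c->mss <= ack_seq + rwin)
                send_next_segment(c);
        } else {
            /* 3rd dup ACK: classic fast retransmit of the missing segment. */
            retransmit(c, c->snd_una);
            c->dup_acks = 0;
        }
    } else if (ack_seq > c->snd_una) {
        c->snd_una  = ack_seq;   /* new data ACKed */
        c->dup_acks = 0;
    }
}
```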
The Nagle Algorithm • Different types of TCP traffic exist: – Some apps (e.g., telnet) send one byte of data, then wait for the ACK – Others (e.g., FTP) use full-size segments • Recall a server can write() to a socket at any time – Once written, should the host stack send? Or should it wait and hope to get more data? • May send many small packets, which is bad for 2 reasons: – Uses more network bandwidth (raises the ratio of headers to content) – Uses more CPU (many costs are per-packet, not per-byte) Web Servers: Implementation and Performance Erich Nahum 123
The Nagle Algorithm Solution is the Nagle algorithm: • If a full-size segment of data is available, just send • If a small segment is available, and there is no unacknowledged data outstanding, send • Otherwise, wait until either: – More data arrives from above (can coalesce the packet), or – An ACK arrives acknowledging the outstanding data • Idea is to have at most one small packet outstanding (sketched in code below) Web Servers: Implementation and Performance Erich Nahum 124
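As a minimal sketch of the decision the sender makes at each output opportunity, assuming invented helpers (pending_bytes(), unacked_data(), send_segment()) rather than any real stack's internals:

```c
#include <stddef.h>

struct tcp_conn;                                   /* opaque for this sketch */
size_t pending_bytes(struct tcp_conn *c);          /* hypothetical helpers */
int    unacked_data(struct tcp_conn *c);
void   send_segment(struct tcp_conn *c, size_t len);
size_t conn_mss(struct tcp_conn *c);

void nagle_output(struct tcp_conn *c)
{
    size_t len;
    while ((len = pending_bytes(c)) > 0) {
        if (len >= conn_mss(c)) {
            send_segment(c, conn_mss(c));   /* full-size segment: always send */
        } else if (!unacked_data(c)) {
            send_segment(c, len);           /* small, but nothing in flight */
        } else {
            /* Small segment AND unACKed data outstanding: hold it until more
             * data arrives (coalesce) or an ACK drains the pipe, so at most
             * one small packet is ever in flight. */
            break;
        }
    }
}
```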
Interaction of Nagle & Delayed ACK • Nagle and delayed ACKs cause a (temporary) deadlock: – sender wants to send 1.5 segments, sends the first full one – Nagle prevents the second from being sent (since it is not full size, and we now have unACKed data outstanding) – sender waits for the delayed ACK from the receiver – receiver is waiting for a 2nd segment before sending the ACK – similar to the IW=1 problem earlier • Result: many disable Nagle – via a setsockopt() call [Figure: client sends GET /index.html; the server write()s a full-size 1st segment, then write()s a half-size 2nd segment that Nagle forbids sending; only after the receiver's 200 ms delayed-ACK timer fires and the ACK of the 1st segment arrives does the 2nd segment go out.] Web Servers: Implementation and Performance Erich Nahum 125
Interaction of Nagle & Delayed ACK • For example, in WWW servers: – original NCSA server issued a write() for every header – Apache does its own buffering to do a single write() call – other servers use writev() (e.g., Flash) – if not careful you can flood the network with packets • More of an issue when using persistent connections: – closing the connection forces data out with the FIN bit – but persistent connections or 1.0 "keep-alives" are affected • Mogul and Minshall 2001 evaluate a number of modifications to Nagle to deal with this • Linux has a similar "TCP_CORK" option: – suppresses any non-full segment – application has to remember to disable TCP_CORK when finished Web Servers: Implementation and Performance Erich Nahum 126
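A minimal sketch (no error handling) of the two socket-level knobs mentioned above, on an already-connected socket fd: TCP_NODELAY disables Nagle outright, while the Linux-specific TCP_CORK holds back partial segments while the header and body are queued with a single writev(). The function names are just for this example.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Option 1: disable Nagle entirely on this connection. */
void disable_nagle(int fd)
{
    int one = 1;
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}

/* Option 2 (Linux): cork while queueing the response, then uncork so the
 * data goes out in as few full-size segments as possible. */
void send_response_corked(int fd, const char *hdr, size_t hlen,
                          const char *body, size_t blen)
{
    int on = 1, off = 0;
    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));

    struct iovec iov[2] = {
        { (void *)hdr,  hlen },   /* writev() gathers header + body */
        { (void *)body, blen },   /* into one call, as Flash does   */
    };
    writev(fd, iov, 2);

    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));  /* don't forget */
}
```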
Summary: TCP Dynamics • Many ways in which an HTTP transfer can interact with TCP • Interaction of factors can cause delays in response time as seen by clients • Hard to shield server developers from having to understand these issues • Mistakes can cause problems such as flood of small packets Web Servers: Implementation and Performance Erich Nahum 127
Chapter 9: TCP Implementation Web Servers: Implementation and Performance Erich Nahum 128
Server TCP Implementation • In this section we look at ways in which the host TCP implementation is stressed under large web server workloads. Most of these techniques deal with large numbers of connections: – Looking up arriving TCP segments when there are large numbers of connections – Dealing with the TIME-WAIT state caused by closing large numbers of connections – Managing large numbers of timers to support those connections – Dealing with the memory consumption of connection state • Removing data-touching operations – byte copying and checksums Web Servers: Implementation and Performance Erich Nahum 129
In the beginning… BSD 4.3 • Recall how demultiplexing works: – given a packet, want to find the connection state (PCB in BSD) – 4-tuple of source and destination port & IP addresses • Original BSD: – used a one-behind cache with linear search to match the 4-tuple – assumption was "the next segment is very likely from the same connection" – assumed a solitary, long-lived, FTP-like transfer – average miss time is O(N/2) (N = length of PCB list) [Figure: an arriving packet for IP 10.1.1.2, port 5194 is checked first against the one-behind cache, then against the head of the linear PCB list (entries such as 192.123.168.40:23, 1.2.3.4:45981, 9.2.16.1:873, 118.23.48.3:65383) until 10.1.1.2:5194 is found.] Web Servers: Implementation and Performance Erich Nahum 130
PCB Hash Tables • McKenney & Dove, SIGCOMM 92: – linear search with a one-behind cache doesn't work well for transaction workloads – hashing does much better – hash based on the 4-tuple – cost: O(1) (constant time) • BSD adds a hash table in the 90's – other BSD Unixes (such as AIX) quickly followed • Algorithmic work on hash tables: – e.g., CLR book, "perfect" hash tables – none specific to web workloads – hash table sizing problematic [Figure: an arriving packet for IP 10.1.1.2, port 5194 hashes directly to the bucket (0..N-1) whose chain holds the matching PCB.] Web Servers: Implementation and Performance Erich Nahum 131
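To illustrate the idea (not any particular BSD's code), here is a hypothetical PCB structure and 4-tuple hash lookup in C; the struct layout, hash function, and table size are all invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

struct pcb {
    uint32_t    laddr, faddr;     /* local / foreign IP address */
    uint16_t    lport, fport;     /* local / foreign port */
    int         state;            /* ESTABLISHED, TIME_WAIT, ... */
    struct pcb *next;             /* hash-bucket chain */
};

#define PCB_HASH_SIZE 4096        /* sizing is workload-dependent (see slide) */
static struct pcb *pcb_hash[PCB_HASH_SIZE];

static unsigned pcb_hashfn(uint32_t laddr, uint16_t lport,
                           uint32_t faddr, uint16_t fport)
{
    /* Mix all four tuple fields so busy servers don't pile into one bucket. */
    uint32_t h = laddr ^ faddr ^ ((uint32_t)lport << 16) ^ fport;
    h ^= h >> 16;
    return h & (PCB_HASH_SIZE - 1);
}

struct pcb *pcb_lookup(uint32_t laddr, uint16_t lport,
                       uint32_t faddr, uint16_t fport)
{
    unsigned slot = pcb_hashfn(laddr, lport, faddr, fport);
    for (struct pcb *p = pcb_hash[slot]; p != NULL; p = p->next)
        if (p->laddr == laddr && p->lport == lport &&
            p->faddr == faddr && p->fport == fport)
            return p;             /* connection found */
    return NULL;                  /* no PCB: maybe a SYN for LISTEN, or an old dup */
}
```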
Problem of Old Duplicates • Recall in the Internet: – packets may be arbitrarily duplicated, delayed, and reordered – while rare, this case must be accounted for • Consider the following: – two hosts connect, transfer data, close – client starts a new connection using the same 4-tuple – a duplicate packet arrives from the first connection – the connection has been closed, the state is gone – how can you distinguish? [Figure: client and server complete the handshake (SYN(X); SYN(Y), ACK(X+1); ACK), exchange GET /index.html, content + FIN, ACK + FIN, ACK; later a delayed duplicate SYN(X) from the old conversation arrives at the server — which connection does it belong to?] Web Servers: Implementation and Performance Erich Nahum 132
Role of the TIME-WAIT State • Solution: don't do that! – prevent the same 4-tuple from being reused too soon – one side must remember the 4-tuple for a period of time to reject old packets – spec says whoever closes the connection must do this (in the TW state) – period is 2 times the maximum segment lifetime (MSL), after which it is assumed no packet from the previous conversation is still alive – MSL is defined as 2 minutes in RFC 1122 [Figure: same exchange as before, but the closing side holds the connection in TIME-WAIT for 2 * MSL; a SYN(Z) arriving on the same 4-tuple during that period is rejected.] Web Servers: Implementation and Performance Erich Nahum 133
TIME-WAIT Problem in Servers • Recall that in a WWW server, the server closes the connection! – asymmetry of the client/server model means many clients – the PCB sticks around for 2*MSL units of time • Mogul 1995 CA Election server study: – shows large numbers (90%) of PCBs in TIME-WAIT – would have been 97% if it had followed the proper MSL! • Example: doing 1000 connections/sec – Assume MSL is 120 seconds, a request takes 1 second – Have 1000 connections in ESTABLISHED state – 240,000 connections in TIME-WAIT state! • FTY 99 propose & evaluate 3 schemes: – require the client to close (requires changing HTTP) – have the client use a new TCP option (client close) (changes TCP) – do a client reset (browser; MS did this for a while) – claim 50% improvement in throughput, 85% in memory use Web Servers: Implementation and Performance Erich Nahum 134
Dealing with TIME-WAIT • Sorting hash table entries (Aron & Druschel 99): – Demultiplexing requires that all PCBs in a hash bucket be examined before you can give up and say the PCB was not found. – Since most lookups are for existing connections, most connections looked up will be in ESTABLISHED state rather than TIME-WAIT. – Can sort each PCB chain so that TW entries are at the end; thus, ESTABLISHED entries are at the front of the chain. [Figure: a hash bucket chain with 192.123.168.40: ESTABLISHED at the head, followed by 128.119.82.37, 9.2.16.145, 178.23.48.3, and 10.1.1.2, all in TIME_WAIT.] Web Servers: Implementation and Performance Erich Nahum 135
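Continuing the hypothetical struct pcb / pcb_hash from the earlier sketch, moving a PCB to the tail of its chain when it enters TIME-WAIT might look roughly like this (again an illustration, not real kernel code):

```c
enum { TCP_ESTABLISHED, TCP_TIME_WAIT };      /* minimal states for the sketch */

void pcb_enter_time_wait(struct pcb *p, unsigned slot)
{
    /* Unlink p from wherever it sits in the bucket chain... */
    struct pcb **pp = &pcb_hash[slot];
    while (*pp != p)
        pp = &(*pp)->next;
    *pp = p->next;

    /* ...and append it at the tail, behind every ESTABLISHED entry, so
     * lookups for live connections terminate before reaching TW entries. */
    struct pcb **tail = &pcb_hash[slot];
    while (*tail != NULL)
        tail = &(*tail)->next;
    p->state = TCP_TIME_WAIT;
    p->next  = NULL;
    *tail    = p;
}
```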
Server Timer Management • Each TCP connection can have up to 5 timers associated with it: – delayed ACK, retransmission, persist, keep-alive, time-wait • Original BSD: – linear linked list of PCBs – fast timer (200 ms): walk all PCBs for the delayed ACK timer – slow timer (500 ms): walk all PCBs for all other timers – time is kept in relative form, so each pass has to subtract the tick (500 ms) from the 4 larger timers in every PCB – costs: O(#PCBs), not O(#active timers) [Figure: the head of the PCB list followed by entries such as 192.123.168.40: RTO in 2 secs, 1.2.3.4: TIME-WAIT in 30 secs, 9.2.16.1: delayed ACK in 100 ms, 118.23.48.3: keep-alive in 1 sec, 10.1.1.2: persist in 10 secs.] Web Servers: Implementation and Performance Erich Nahum 136
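A rough sketch of what the BSD-style slow-timer walk looks like, assuming a hypothetical PCB with an array of relative timers (kept in milliseconds here for clarity, with invented handler names). The point is only that the cost per tick is proportional to the number of PCBs, not to the number of armed timers.

```c
#define SLOW_TICK_MS 500
#define NTIMERS      4                  /* retransmit, persist, keep-alive, 2MSL */

struct pcb_t {
    unsigned      timer[NTIMERS];       /* 0 = not armed, else ms remaining */
    struct pcb_t *next;
};

void timer_fire(struct pcb_t *p, int which);    /* hypothetical handler */

void tcp_slowtimo(struct pcb_t *head)           /* called every 500 ms */
{
    for (struct pcb_t *p = head; p != 0; p = p->next) {
        for (int t = 0; t < NTIMERS; t++) {
            if (p->timer[t] == 0)
                continue;                       /* timer not armed */
            if (p->timer[t] > SLOW_TICK_MS)
                p->timer[t] -= SLOW_TICK_MS;    /* still counting down */
            else {
                p->timer[t] = 0;
                timer_fire(p, t);               /* e.g., retransmit a segment */
            }
        }
    }
}
```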
Server Timer Management • Can again exploit the semantics of the TIME-WAIT state: – If PCBs are sorted by state, the delayed ACK timer walk can stop when it encounters a PCB in TIME-WAIT, since ACKs are not delayed for connections in the TIME-WAIT state – Aron and Druschel show a 25 percent HTTP throughput improvement using this technique – They attribute most of the win to reduced timer processing, but it probably helps PCB lookup as well [Figure: the same sorted hash bucket as before — ESTABLISHED entries first, TIME_WAIT entries at the tail.] Web Servers: Implementation and Performance Erich Nahum 137
Customized PCB Tables • Maintain 2 sets of PCBs: normal and TIME-WAIT – first done in BSDI in 96 – still must search both tables • Aron & Druschel 99: – can compress TW PCBs, since only the port and sequence numbers are needed – normal table still has full PCB state – show you can save a lot of pinned kernel RAM (from 31 MB to 5 MB, an 82% reduction) – results in more RAM available for the disk cache, which leads to better performance [Figure: a regular PCB hash table holding, e.g., 192.123.168.40: ESTABLISHED, alongside a separate TIME-WAIT PCB hash table holding compressed entries such as 9.2.16.145, 128.119.72.4, 10.1.1.2.] Web Servers: Implementation and Performance Erich Nahum 138
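As a purely illustrative (invented) layout, a compressed TIME-WAIT entry only has to recognize the connection, sanity-check sequence numbers, and expire after 2*MSL, which is why it can be an order of magnitude smaller than a full PCB:

```c
#include <stdint.h>

/* Hypothetical compressed TIME-WAIT entry: a few dozen bytes versus the
 * hundreds of bytes of a full PCB (windows, timers, SACK state, buffers...).
 * Field names are invented for illustration. */
struct tw_pcb {
    uint32_t laddr, faddr;        /* 4-tuple, so stray segments can be matched */
    uint16_t lport, fport;
    uint32_t rcv_nxt;             /* sequence-number check for old packets */
    uint32_t expire;              /* absolute time when 2*MSL has elapsed */
    struct tw_pcb *next;          /* chain in the separate TW hash table */
};
```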
Scalable Timers: Timing Wheels • Varghese SOSP 1987: – use a hash-table-like structure called a timing wheel – events are ordered by relative time in the future – given an event at future time T, put it in slot (T mod N) – each slot's list is sorted by time (scheme 5) • Each clock tick: – the wheel "turns" one slot (mod N) – look at the first item in the chain: if ready, fire it and check the next; if empty or not ready to fire, all done – continue until a non-ready item is encountered (or the end of the list) [Figure: a timing wheel with N = 10 slots and current time = 12; the wheel pointer sits at slot 2, whose chain holds events expiring at times 12, 22, and 42.] Web Servers: Implementation and Performance Erich Nahum 139
Timing Wheels (cont) • Variant (scheme 6 in the paper): – just insert into the wheel slot, don't bother to sort – check all timers in the slot on each tick • Original SOSP 1987 paper: – premise was more for large-scale simulations – which have lots of events happening "in the future" • Algorithmic costs (assuming a good hash function): – O(1) average time for basic dictionary operations: insertion, cancellation, per-tick bookkeeping – O(N) (N = number of timers) worst case for scheme 6 – O(log(N)) worst case for scheme 5 • Deployment: – used in FreeBSD as of release 3.4 (scheme 6) – variant in Linux 2.4 (hierarchy of timers with cascade) – Aron claims "about the same perf" as his approach Web Servers: Implementation and Performance Erich Nahum 140
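A minimal scheme-6 sketch in C (unsorted per-slot chains), just to make the insert/tick costs concrete; the slot count, struct layout, and function names are all invented for this example.

```c
#define WHEEL_SLOTS 256                       /* example size; one slot per tick */

struct wtimer {
    unsigned long expire;                     /* absolute tick when it fires */
    void (*fn)(void *arg);
    void *arg;
    struct wtimer *next;
};

static struct wtimer *wheel[WHEEL_SLOTS];
static unsigned long now_tick;

void timer_add(struct wtimer *t)              /* O(1): push into slot (T mod N) */
{
    unsigned slot = t->expire % WHEEL_SLOTS;
    t->next = wheel[slot];
    wheel[slot] = t;
}

void wheel_tick(void)                         /* called once per clock tick */
{
    unsigned slot = ++now_tick % WHEEL_SLOTS;
    struct wtimer **pp = &wheel[slot];

    while (*pp) {                             /* scheme 6: examine every entry */
        struct wtimer *t = *pp;
        if (t->expire == now_tick) {          /* due now: unlink and fire */
            *pp = t->next;
            t->fn(t->arg);
        } else {
            pp = &t->next;                    /* not due yet: fires N, 2N... ticks later */
        }
    }
}
```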
Data-Touching Operations • Lots of research in the high-speed networking community about how touching data is bad – especially as CPU speeds increase relative to memory • Several ways to avoid data copying: – Use mmap() as described earlier to cut down to one copy – Use IO-Lite primitives (a new API) to move buffers around – Use the sendfile() API combined with an integrated zero-copy I/O system in the kernel • There is also a cost to reading the data for checksums: – Jacobson showed how the checksum can be folded into the copy for free, with some complexity on the receive side – IO-Lite / exokernel use checksum caches – Advanced network cards do the checksum for you: originally on the SGI FDDI card (1995), now on all gigabit adaptors and some 100baseT adaptors Web Servers: Implementation and Performance Erich Nahum 141
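A minimal sketch (Linux-flavored, no error handling) of serving a file over an already-connected socket with sendfile(), which avoids the user-space copy entirely; the function name and the assumption of a connected conn_fd are just for this example, and the mmap()/write() alternative mentioned above would instead cut the path down to one copy.

```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

ssize_t send_file_response(int conn_fd, const char *path)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    off_t offset = 0;             /* sendfile() advances this for us */
    ssize_t sent = 0;
    while (offset < st.st_size) {
        ssize_t n = sendfile(conn_fd, fd, &offset, st.st_size - offset);
        if (n <= 0)
            break;                /* real code would handle EINTR / EAGAIN */
        sent += n;
    }
    close(fd);
    return sent;
}
```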
Summary: Implementation Issues • Scaling problems happen in large WWW servers: – Asymmetry of the client/server model – Large numbers of connections – Large amounts of data transferred • Approaches fall into one or more categories: – Hashing – Caching – Exploiting common-case behavior – Exploiting semantic information – Not touching the data • Most OSs have added support for these techniques over the last 3 years Web Servers: Implementation and Performance Erich Nahum 142