DBI Representation and Management of Data on the Internet

HTTP: HyperText Transfer Protocol

In the Beginning… • The Internet – FTP: File Transfer Protocol – SMTP: Simple Mail Transfer Protocol – NNTP: Network News Transfer Protocol • Let there be a Web – HTTP: HyperText Transfer Protocol (Tim Berners-Lee)

The Creation of the Web • Tim Berners-Lee implemented the HTTP protocol in 1990-91 at CERN, the European Center for High-Energy Physics in Geneva, Switzerland • The World-Wide Web is based upon – Information representation in HTML (HyperText Markup Language) documents – Resource transmission in HTTP (HyperText Transfer Protocol)

Previous HTTP Versions • HTTP/0.9, used by the WWW since 1990 • HTTP/1.0 [RFC 1945] – Supports MIME (Multipurpose Internet Mail Extensions) messages [RFC 1341] • MIME transmits non-textual files by encoding them – Content negotiation • HTTP/1.1 [RFC 2068] – Persistent connections – Caching

General Features • Lightness and speed (response time of 100 ms in a hypertext jump) • Client-Server protocol • Stateless object-oriented protocol • Open-ended set of methods and headers • Typing and negotiation of data representation

Terminology • User agent: client which initiates a request (browser, editor, web robot, …) • Origin server: the server on which a given resource resides • Proxy: acts as both a server and a client • Gateway: server which acts as intermediary for other servers • Tunnel: acts as a blind relay between two connections

Client-Server Protocol • The browser is the client • The client sends requests to an HTTP Server

Client-Server Sessions • The HTTP protocol supports a short conversation between browser and server • The entire conversation is conducted using ASCII characters (transmitted as 8-bit bytes) • The standard (and default) port for HTTP servers to listen on is 80, though they can use any port

HTTP Session • A basic HTTP session has four phases: 1. Client opens the connection (a TCP connection) 2. Client makes the request 3. Server sends a response 4. Server closes the connection
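The four phases can be sketched with a raw TCP socket. This is only an illustration: the `HelloHandler` server is a hypothetical stand-in for an origin server, started on the local machine so that the example does not depend on any real site, and `fetch` is a name made up for this sketch.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in origin server; it answers with HTTP/1.0."""

    def do_GET(self):
        body = b"<html><body>Hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def fetch(host, port, path):
    # Phase 1: the client opens a TCP connection.
    with socket.create_connection((host, port)) as sock:
        # Phase 2: the client makes the request.
        sock.sendall(f"GET {path} HTTP/1.0\r\n\r\n".encode("ascii"))
        # Phase 3: the server sends the response.
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                # Phase 4: the server has closed the connection.
                break
            chunks.append(data)
    return b"".join(chunks)


server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
response = fetch("127.0.0.1", server.server_port, "/index.html")
server.shutdown()
server.server_close()
print(response.split(b"\r\n", 1)[0].decode("ascii"))
```

Because the server speaks HTTP/1.0 here, the end of the response is signalled simply by the connection closing, which is why the client reads until `recv` returns no data.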

Nested Objects • Suppose a client accesses a page containing 10 inline images; to display the page completely would require 11 HTTP sessions • Some browsers/servers support a feature called keep-alive which can keep the connection open until it is explicitly closed

[Figure: index.html and its nested objects: left frame, right frame, jumping fish, fairy icon, HUJI icon]

Stateless Protocol • HTTP is a stateless protocol, which means that once a server has delivered the requested data to a client, the connection is broken, and the server retains no memory of what has just taken place

Resources • A resource is a chunk of information that can be identified by a URL (Uniform Resource Locator) – The most common kind of resource is a file, but a resource may also be • A dynamically-generated query result • The output of a CGI script, or • An active server page

URL • Uniform Resource Identifiers (URIs) [RFC 2396] are used to specify the object of a method – as an address (URL) – as a name (URN) • URL = “http://” host [ “:” port ] [ path ] • IP addresses in URLs should be avoided [RFC 1900]

Different URLs • There are different types of URLs – http://<host>:<port>/<path>?<searchpart> – mailto:<account@site> – news:<newsgroup-name>

In a URL • Spaces are represented by “+” • Characters such as &, +, % are encoded in the form “%xx”, where xx is the ASCII value in hexadecimal; for example, “&” = “%26” • The inputs to the parameters are given as a list of the form var1=value1&var2=value2&var3=value3

Search input: war&peace Tolstoy

Resulting URL: http://www.google.com/search?lr=&safe=off&q=war%26peace+Tolstoy
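The encoding rules can be tried out with Python's standard `urllib.parse` module. This is a small sketch; the parameter names are taken from the example URL, and the base URL is used only to build a string (no request is sent).

```python
from urllib.parse import parse_qs, urlencode

# Form inputs as name/value pairs, mirroring the example URL.
params = {"lr": "", "safe": "off", "q": "war&peace Tolstoy"}

# urlencode() joins the pairs with "&", turns spaces into "+",
# and escapes characters such as "&" in the %xx form.
query = urlencode(params)
url = "http://www.google.com/search?" + query
print(url)

# The receiving side reverses the encoding.
decoded = parse_qs(query, keep_blank_values=True)
print(decoded["q"][0])
```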

Format of Request and Response • An initial line • Zero or more header lines • A blank line (i.e., a CRLF by itself), and • An optional message body (e.g., a file, query data, or query output) • Note: CRLF = “\r\n” (usually ASCII 13 followed by ASCII 10)

Request • A request consists of: – Initial line – Headers – Blank line – Message body

Initial Line of a Request • The initial line consists of – Method – Path – HTTP Version

Request Format

Request Example
    GET /courses/dbi/index.html HTTP/1.0    <- initial line: method, path, version
    From: yarok@cs.huji.ac.il               <- headers
    User-Agent: HTTPTool/1.0
    [blank line here]

Do Not Forget CRLF
    GET /courses/dbi/index.html HTTP/1.0 [CRLF]
    From: yarok@cs.huji.ac.il [CRLF]
    User-Agent: HTTPTool/1.0 [CRLF]
    [CRLF]
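A minimal sketch of assembling such a request in Python, with every line terminated by CRLF and one extra CRLF for the blank line. The helper name `build_request` is made up for this illustration.

```python
CRLF = "\r\n"


def build_request(method, path, headers, version="HTTP/1.0"):
    """Return the request as bytes: initial line, headers, blank line."""
    lines = [f"{method} {path} {version}"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # Each line ends with CRLF; the final extra CRLF is the blank line.
    return (CRLF.join(lines) + CRLF + CRLF).encode("ascii")


request = build_request(
    "GET",
    "/courses/dbi/index.html",
    {"From": "yarok@cs.huji.ac.il", "User-Agent": "HTTPTool/1.0"},
)
print(request)
```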

Request Methods • GET returns the contents of the indicated document – The most frequently used command • HEAD returns the header information for the indicated document – Useful for finding out info about a resource without retrieving it • POST treats the document as a script and sends some data to it

More Methods • PUT replaces the contents of the document with some data • DELETE deletes the indicated document • TRACE invokes a remote loop-back of the request. The final recipient SHOULD reflect the message back to the client • Usually these methods are not allowed

GET Method • GET is the most common HTTP method • It says “give me this resource”

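GET Requests With a Proxy • Without a proxy: the client sends the path only, /~dbi/index.html, directly to the web server www.cs.huji.ac.il • With a proxy: the client sends the full URL, http://www.cs.huji.ac.il/~dbi/index.html, to the proxy server, and the proxy requests /~dbi/index.html from the web server www.cs.huji.ac.il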

HEAD Request • A HEAD request asks the server to return the response headers only, and not the actual resource (i.e., no message body) • Same as GET but without the message body • This is useful for checking characteristics of a resource without actually downloading it, thus saving bandwidth • Used for testing hypertext links for validity, accessibility and recent modification

Post • A POST request sends data to the server • POST is mostly used for form filling – The contents of a form are translated by the browser into some special format and sent to a script on the server using the POST command

Post (cont.) • There is a block of data sent with the request, in the message body • There are usually extra headers to describe this message body, like Content-Type: and Content-Length: • The request URI is a program to handle the sent data, not a resource to retrieve • The HTTP response is normally the output of a program, not a static file

Post Example • Here's a typical form submission, using POST:
    POST /path/script.cgi HTTP/1.0
    From: frog@cs.huji.ac.il
    User-Agent: HTTPTool/1.0
    Content-Type: application/x-www-form-urlencoded
    Content-Length: 35
    [blank line here]
    home=Ross+109&favorite+flavor=flies    <- 35 characters
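The same submission can be assembled programmatically. In this sketch the helper `build_post` is a hypothetical name; the point it illustrates is that Content-Length is computed from the encoded body rather than typed by hand.

```python
from urllib.parse import urlencode


def build_post(path, form, extra_headers=None):
    """Return a form-submission POST request as bytes."""
    body = urlencode(form).encode("ascii")
    headers = dict(extra_headers or {})
    headers["Content-Type"] = "application/x-www-form-urlencoded"
    headers["Content-Length"] = str(len(body))  # number of bytes in the body
    head = [f"POST {path} HTTP/1.0"]
    head += [f"{name}: {value}" for name, value in headers.items()]
    return ("\r\n".join(head) + "\r\n\r\n").encode("ascii") + body


post = build_post(
    "/path/script.cgi",
    {"home": "Ross 109", "favorite flavor": "flies"},
    {"From": "frog@cs.huji.ac.il", "User-Agent": "HTTPTool/1.0"},
)
print(post.decode("ascii"))
```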

Headers • HTTP 1.0 defines 16 headers – none are required • HTTP 1.1 defines 46 headers – one header (Host:) is required in requests

Headers • From: – gives the email address of whoever is making the request or running the program doing so • User-Agent: – identifies the program that's making the request, in the form "Program-name/x.xx", where x.xx is the (mostly) alphanumeric version of the program – For example, Netscape 3.0 sends the header "User-agent: Mozilla/3.0Gold"

Headers (cont.) • Server: – analogous to the User-Agent: header: – it identifies the server software in the form "Program-name/x.xx" – For example, one beta version of Apache's server returns "Server: Apache/1.2b3-dev"

Headers (cont.) • If an HTTP message includes a body, there are usually header lines in the message that describe the body. In particular, • Content-Type: – gives the MIME type of the data in the body, such as text/html or image/gif • Content-Length: – gives the number of bytes in the body

Headers (cont.) • Last-Modified: – Gives the modification date of the resource that's being returned – It's used in caching and other bandwidth-saving activities – Greenwich Mean Time should be used, and the format is Last-Modified: Fri, 31 Dec 1999 23:59:59 GMT

Initial Line of a Response • The initial line of a response is also called the status line. • The initial line consists of – HTTP version – response status code – reason phrase that describes the status code

Response Format

Response Example
    HTTP/1.0 200 OK                        <- initial line: version, status code, reason phrase
    Date: Fri, 31 Dec 1999 23:59:59 GMT    <- headers
    Content-Type: text/html
    Content-Length: 1354
    [blank line here]
    <html>                                 <- message body
    <body>
    <h1>Hello World</h1>
    (more file contents)
    ...
    </body>
    </html>
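A response in this format can be split into its parts with a few lines of Python. This is a simplified sketch: `parse_response` is a made-up helper, the sample body is shortened, and real parsers also handle repeated headers and folded lines.

```python
def parse_response(raw):
    """Split a raw HTTP response into status line parts, headers, body."""
    head, _, body = raw.partition(b"\r\n\r\n")  # the blank line ends the headers
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body


raw = (
    b"HTTP/1.0 200 OK\r\n"
    b"Date: Fri, 31 Dec 1999 23:59:59 GMT\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<html></html>"
)
version, code, reason, headers, body = parse_response(raw)
print(version, code, reason, headers["content-type"], len(body))
```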

Status Code • The status code is a three-digit integer, and the first digit identifies the general category of response: – 1xx indicates an informational message only – 2xx indicates success of some kind – 3xx redirects the client to another URL – 4xx indicates an error on the client's part • Yes, the system blames it on the client if a resource is not found (i.e., 404) – 5xx indicates an error on the server's part
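Since the category is determined by the first digit alone, a client can classify codes it has never seen before. A minimal sketch (the function and table names are illustrative):

```python
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_category(code):
    """Map a three-digit status code to its general category."""
    if not 100 <= code <= 599:
        raise ValueError(f"not an HTTP status code: {code}")
    return CATEGORIES[code // 100]  # integer division keeps the first digit


for code in (200, 304, 404, 500):
    print(code, status_category(code))
```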

Status Code 1xx • The 100 (Continue) status – Allows a client to determine if the server is willing to accept the request (based on the request headers) before the client sends the request body – The client's request must have the header Expect: 100-continue • The 101 status -- Switching Protocols

Status Code 2xx • Status codes 2xx -- Success • The action was successfully received, understood, and accepted – 200 OK – 201 Created (POST command successful) – 202 Request accepted – 203 Non-authoritative information (GET or HEAD request fulfilled) – 204 No content

Status Code 3xx • Status codes 3xx -- Redirection • Further action must be taken in order to complete the request – 300 Resource found at multiple locations – 301 Resource moved permanently – 302 Resource moved temporarily – 304 Resource has not been modified (since the given date)

Status Code 4xx • Status codes 4xx -- Client error • The request contains bad syntax or cannot be fulfilled – 400 Bad request from client – 401 Unauthorized request – 402 Payment required for request – 403 Resource access forbidden – 404 Resource not found – 405 Method not allowed for resource – 406 Resource type not acceptable

Status Code 5xx • Status codes 5xx -- Server error • The server failed to fulfill an apparently valid request – 500 Internal server error – 501 Method not implemented – 502 Bad gateway or server overload – 503 Service unavailable / gateway timeout – 504 Secondary gateway / server timeout

Response Information • Header: description of information – Server: type of server – Date: date and time – Content-Length: number of bytes – Content-Type: MIME type – Content-Language: English, for example – Content-Encoding: data compression – Last-Modified: date when last modified – Expires: date when file becomes invalid

Manually Experimenting with HTTP
    >host www.cs.huji.ac.il
    is a nickname for vafla.cs.huji.ac.il
    has address 132.65.80.39
    vafla.cs.huji.ac.il mail is handled (pri=10) by cs.huji.ac.il
    >telnet www.cs.huji.ac.il 80
    Trying 132.65.80.39…
    Connected to vafla.cs.huji.ac.il.
    Escape character is ‘^]’.

Sending a Request
    >GET /~dbi/index.html HTTP/1.0
    [blank line]

The Response
    HTTP/1.1 200 OK
    Date: Sun, 11 Mar 2001 21:42:15 GMT
    Server: Apache/1.3.9 (Unix)
    Last-Modified: Sun, 25 Feb 2001 21:42:15 GMT
    Content-Length: 479
    Content-Type: text/html
    [blank line]
    <html> (html code …) </html>

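Request: GET /~dbi/index.html HTTP/1.0 • Response: HTTP/1.1 200 OK, followed by the HTML code

Request: GET /~dbi/no-such-page.html HTTP/1.0 • Response: HTTP/1.1 404 Not Found, followed by HTML code

Request: GET /index.html HTTP/1.1 • Response: HTTP/1.1 400 Bad Request, followed by HTML code • Why is it a Bad Request? It is an HTTP/1.1 request without a Host header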

HTTP 1.1 • HTTP/1.1 is replacing/has replaced HTTP/1.0 as the new Web protocol

Improvements • Faster response – allowing multiple transactions to take place over a single persistent connection – adding cache support • Faster response for dynamically-generated pages – supporting chunked encoding, which allows a response to be sent before its total length is known • Efficient use of IP addresses – allowing multiple domains to be served from a single IP address

Improvements over HTTP 1.0 • HTTP/1.1 has a number of features/improvements over HTTP/1.0, including – Persistent TCP connections – Partial document transfers – Conditional fetch – Support for nonstandard HTTP/1.0 extensions – Better support for alternative character sets – More flexible authentication – Faster response and great bandwidth savings – Efficient use of IP addresses (virtual hosting)

Non-Persistent Connections 1. Browser opens a TCP connection to port 80 of the server (handshake) 2. Browser sends an HTTP request message 3. Server receives the request, locates the object, sends the response 4. Server closes the TCP connection 5. Client receives the response, parses the object 6. Repeat steps 1-5 for each embedded object

Persistent Connection 1. Browser opens a TCP connection to port 80 of the server (handshake) 2. Browser sends an HTTP request message 3. Server receives the request, locates the object, sends the response 4. Client receives the response, parses the object 5. Repeat steps 2-4 for each embedded object 6. TCP connection closes on demand or on timeout

Advantages of Persistent Connection • CPU time saved in routers and hosts • HTTP requests and responses can be pipelined on a connection • network congestion is reduced • latency on subsequent requests is reduced

Pipelines • 2 types of persistent connections – without pipelining • the client issues a new request only after the previous response has arrived – with pipelining • client sends the request as soon as it encounters a reference • multiple requests/responses – on the same IP packet, or – on back-to-back packets

Virtual Hosts • With HTTP 1.1, one server at one IP address can be multi-homed: – “www.cs.huji.ac.il” and “www.math.huji.ac.il” can live on the same server – These are called virtual hosts – Without this mechanism, we have to use 2 different IP addresses • It is like several people sharing one phone • An HTTP request must specify the host name (and possibly port) for which the request is intended

Example • The request specifies the host:
    GET /path/file.html HTTP/1.1
    Host: www.host1.com:80
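On the server side, virtual hosting amounts to choosing a document tree by the Host header. The following is a sketch only: the directory names, the default site for HTTP/1.0 clients, and the 404 for unknown hosts are all assumptions of this illustration, not behavior mandated by the protocol.

```python
# Hypothetical mapping from host name to document root.
VIRTUAL_HOSTS = {
    "www.cs.huji.ac.il": "/var/www/cs",
    "www.math.huji.ac.il": "/var/www/math",
}
DEFAULT_HOST = "www.cs.huji.ac.il"  # assumed fallback for HTTP/1.0 clients


def route(version, headers, path):
    """Return (status code, file path or None) for one request."""
    host = headers.get("Host")
    if host is None:
        if version == "HTTP/1.1":
            return 400, None  # HTTP/1.1 requests must carry Host
        host = DEFAULT_HOST
    name = host.split(":")[0].lower()  # drop an optional ":port"
    root = VIRTUAL_HOSTS.get(name)
    if root is None:
        return 404, None  # this sketch's choice for unknown hosts
    return 200, root + path


print(route("HTTP/1.1", {"Host": "www.math.huji.ac.il:80"}, "/index.html"))
print(route("HTTP/1.1", {}, "/index.html"))
```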

Virtual Hosting (cont.) • Virtual hosting – reduces hardware expenditures – extends the ability to support additional servers – makes load balancing and capacity planning much easier • Without it – each host name requires a unique IP address, and we are quickly running out of IP addresses with the explosion of new domains

The Date Header • In HTTP 1.1, servers must include the generation time of the response in the Date: header • Time values use Greenwich Mean Time (GMT) and have the format Date: Fri, 31 Dec 1999 23:59:59 GMT • Date is omitted only in a few cases, e.g., status code 100 (Continue) and some server errors • Servers must synchronize their clocks with a reliable external standard

Caching improves performance • Eliminates the need to send requests in many cases (reduces network round-trips), using an expiration mechanism • Eliminates the need to send full responses in other cases (reduces network bandwidth), using a validation mechanism

Client Caching • Client sends GET /fruit/apple.gif • Server responds with Last-Modified: ... • Client caches the object and its last-modified date • Later, the client sends GET /fruit/apple.gif … If-Modified-Since: … • Server returns either 304 Not Modified or the object

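Network Caches • Without a proxy: the client sends GET /fruit/apple.gif directly to the server • With a proxy: the client sends GET /fruit/apple.gif to the proxy server, which stands between the client and the server

Benefit of Caching [Figure: clients on a 10 Mbps LAN reach the Internet over a 1.5 Mbps link; they issue 15 req/sec at 100 Kbits/req; a proxy server on the LAN has a 40% hit rate]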
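The numbers in the figure can be turned into a rough estimate of the load on the 1.5 Mbps link. The calculation below is a back-of-the-envelope sketch under stated assumptions: 1 Kbit is taken as 1000 bits, request traffic is ignored, and every cache miss is assumed to cross the link exactly once.

```python
requests_per_sec = 15
bits_per_request = 100_000      # 100 Kbits per response (assumed 1 Kbit = 1000 bits)
link_bps = 1_500_000            # 1.5 Mbps link to the Internet
hit_rate = 0.40                 # fraction of requests served by the proxy

offered_bps = requests_per_sec * bits_per_request

# Without a proxy cache, every response crosses the link.
util_no_cache = offered_bps / link_bps

# With the cache, only the misses cross the link.
util_with_cache = offered_bps * (1 - hit_rate) / link_bps

print(f"without cache: {util_no_cache:.0%}")
print(f"with cache:    {util_with_cache:.0%}")
```

Under these assumptions the link is fully loaded without the cache, and the 40% hit rate brings the load down to about 60%, which is where the response-time benefit comes from.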

Expiration Model • Servers may provide an expiration time using the Expires header – By checking the expiration time, the cache can return a fresh response without contacting the server • If the expiration time is not specified, the cache can heuristically estimate the expiration time (e.g., using header values, such as the Last-Modified time)
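A cache's freshness check might look like the sketch below. The function name and the 10% heuristic (treat a response as fresh for one tenth of its age at the time it was stored) are assumptions of this illustration; the heuristic is a common choice, not something the slides prescribe.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_fresh(headers, stored_at, now):
    """Decide whether a cached response may be served without revalidation.

    stored_at and now are timezone-aware datetimes.
    """
    if "Expires" in headers:
        # Explicit expiration time supplied by the server.
        return now < parsedate_to_datetime(headers["Expires"])
    if "Last-Modified" in headers:
        # Heuristic (assumed): fresh for 10% of the age at storage time.
        age_when_stored = stored_at - parsedate_to_datetime(headers["Last-Modified"])
        return now - stored_at < 0.1 * age_when_stored
    return False  # nothing to go on: treat the copy as stale


utc = timezone.utc
explicit = {"Expires": "Fri, 31 Dec 1999 23:59:59 GMT"}
print(is_fresh(explicit, datetime(1999, 12, 30, tzinfo=utc), datetime(1999, 12, 31, 12, tzinfo=utc)))

heuristic = {"Last-Modified": "Wed, 01 Dec 1999 00:00:00 GMT"}
stored = datetime(1999, 12, 11, tzinfo=utc)  # 10 days old when stored: fresh for 1 day
print(is_fresh(heuristic, stored, datetime(1999, 12, 11, 12, tzinfo=utc)))
```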

The Risk in Caching • Response might not be “semantically transparent” – the response is different from what would have been returned by the origin server • The cache should verify that the copy is fresh (i.e., the expiration time has not passed) • The copy is stale if it is not fresh

Validators • A validator is any mechanism that may help in determining whether a copy is fresh or stale – A strong validator is, for example, a counter that is incremented whenever the resource is changed – A weak validator is, for example, a counter that is incremented only when a significant change is made

Using the Cache • To check whether a copy is fresh, the cache must either – Use the expiration model, or – Compare the Last-Modified time or some validator with the origin server • In the second case, the origin server either – Responds with the message 304 (Not Modified), or – Sends a full response with the entity body

Cache-Control Header • Cache-control headers specify directives to the cache – Can be included in either requests or responses • The server can specify “must-revalidate” – Cache must revalidate with the origin server that the copy is still fresh • The client can specify – the max-age of an unvalidated response – The max-stale time of a stale copy

Do Not Use a Cache • The Pragma: no-cache request header indicates that the request should not be satisfied from a cache • Same as the no-cache cache-directive • Should include both if the server is not HTTP/1.1 compliant • The directive applies to any recipient along the request/response chain

If-Modified-Since Header • The If-Modified-Since: header is used with a GET request • If the requested resource has been modified since the given date, the server returns the resource as it normally would (i.e., the header is ignored) • Otherwise, the server returns a 304 Not Modified response, including the Date: header, but with no message body:
    HTTP/1.1 304 Not Modified
    Date: Fri, 31 Dec 1999 23:59:59 GMT
    [blank line here]
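The server's side of this exchange can be sketched as a single comparison. `conditional_get` is a hypothetical helper; it assumes the resource's modification time is a timezone-aware UTC datetime with whole seconds, and it leaves out the Date: header for brevity.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_get(request_headers, last_modified, body):
    """Return (status, headers, body) for a GET on one resource."""
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    since = request_headers.get("If-Modified-Since")
    if since is not None and last_modified <= parsedate_to_datetime(since):
        return 304, headers, b""  # not modified: no message body
    return 200, headers, body     # modified, or no condition: full response


modified = datetime(2001, 2, 25, 21, 42, 15, tzinfo=timezone.utc)
page = b"<html>...</html>"

status, headers, body = conditional_get(
    {"If-Modified-Since": "Sun, 11 Mar 2001 21:42:15 GMT"}, modified, page
)
print(status, headers["Last-Modified"], len(body))
```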

If-Unmodified-Since Header • The If-Unmodified-Since: header can be used with any method • If the requested resource has not been modified since the given date, the server returns the resource as it normally would • Otherwise, the server returns a 412 Precondition Failed response:
    HTTP/1.1 412 Precondition Failed
    [blank line here]

Cooperative Caching

Cooperative Caching (cont.) • Higher-level cache (e.g., national cache) – larger user population – higher hit rates • Multiple Web caches which cooperate => improved overall performance • Cooperative caches are usually built from clusters – divide the traffic overhead – improve storage capacity

Cooperative Caching (cont.) • Which caches should be asked for a particular document? • Hash routing (of URLs) -- an object will not be present in more than one cache
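Hash routing can be sketched in a few lines: hash the URL and use the result to pick exactly one cache, so every client asks the same cache for the same document. The cache names and the use of SHA-256 with a simple modulo are assumptions of this illustration; deployed schemes often use hashing variants that cope better with caches joining or leaving.

```python
import hashlib

# Hypothetical names of the cooperating caches.
CACHES = ["cache-a.example", "cache-b.example", "cache-c.example"]


def pick_cache(url, caches):
    """Map a URL to one cache, deterministically."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(caches)
    return caches[index]


for url in ("http://www.cs.huji.ac.il/~dbi/index.html",
            "http://www.cs.huji.ac.il/fruit/apple.gif"):
    print(url, "->", pick_cache(url, CACHES))
```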

Hop by Hop • HTTP/1.1 introduces the concept of hop-by-hop headers: – Message headers that apply only to a given connection, and not to the entire path – They allow much more powerful use of proxies (caches)

Hop-by-Hop Headers • Connection – options that are desired for that particular connection (e.g., Connection: close) • Public – lists the set of methods supported by the server • Proxy-Authenticate – enables authentication methods between two hops • Transfer-Encoding – compression method between two hops • Upgrade – additional communication protocols

Chunked Encoding (Wake up, we are talking about movies on the Internet!) • Chunked encoding – Transmission of streaming multimedia • One frame varies in size and composition from the next – Streaming video • Entire image transmitted in first chunk and differences from the previous image are transmitted in the next chunk
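On the wire, each chunk is sent as its size in hexadecimal, a CRLF, the chunk data, and another CRLF; a chunk of size 0 ends the body. The sketch below encodes and decodes that framing. The function names are made up, and optional trailers after the last chunk are ignored.

```python
def encode_chunked(chunks):
    """Frame an iterable of byte strings using chunked transfer encoding."""
    out = b""
    for chunk in chunks:
        if chunk:  # a zero-length chunk would end the body early
            out += f"{len(chunk):x}\r\n".encode("ascii") + chunk + b"\r\n"
    return out + b"0\r\n\r\n"  # last chunk: size 0, then the final CRLF


def decode_chunked(data):
    """Reassemble the body from chunked framing (trailers are ignored)."""
    body = b""
    pos = 0
    while True:
        eol = data.index(b"\r\n", pos)
        # The size line may carry ";extension" parameters; drop them.
        size = int(data[pos:eol].split(b";")[0].decode("ascii"), 16)
        if size == 0:
            return body
        start = eol + 2
        body += data[start:start + size]
        pos = start + size + 2  # skip the chunk data and its trailing CRLF


wire = encode_chunked([b"Hello, ", b"world!"])
print(wire)
print(decode_chunked(wire))
```

The sender never needs to know the total length in advance, which is what makes this encoding suitable for dynamically generated or streamed responses.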

Compression • Most image formats (GIF, JPEG, MPEG) are precompressed • Many other data types used in the Web are not precompressed • Compression could save almost 40% of the bytes sent via HTTP • There is a need for negotiating the type of encoding of the compressed resource

Compression (cont.) • Client sends the header Accept-Encoding – The header indicates the content-encodings that the client can handle and the ones that the client prefers • Server sends – Content-Encoding header – for end-to-end encoding indication – Transfer-Encoding header – for hop-to-hop encoding indication (supported only in HTTP/1.1)

Content Negotiation • Content Negotiation: – the process of selecting the best representation for a given response when there are multiple representations available • HTTP supports two kinds of content negotiation: – Server-driven negotiation – Agent-driven negotiation

Server-Driven Negotiation • The selection is made by the server, based on: – header fields in the request (client preferences): Accept-Language / Accept-Encoding – available representations of the response – other information (e.g., the address of the client) • Disadvantages: – Impossible for the server to determine what is best for the user – Inefficiency (clients should describe their capabilities in every request) – Complicates implementation of servers
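A server might pick a language from the Accept-Language header roughly as follows. This is a simplified sketch: the function names are illustrative, and it matches language tags exactly, ignoring the wildcard "*" and prefix matching (such as "en" covering "en-gb") that a full implementation would handle.

```python
def parse_accept(value):
    """Parse 'da, en-gb;q=0.8, en;q=0.7' into [(token, quality), ...]."""
    prefs = []
    for item in value.split(","):
        token, *params = item.strip().split(";")
        quality = 1.0  # default when no q parameter is given
        for param in params:
            name, _, val = param.strip().partition("=")
            if name == "q":
                quality = float(val)
        prefs.append((token.strip().lower(), quality))
    return prefs


def choose_language(accept_language, available, default):
    """Return the client's most preferred language the server can provide."""
    ranked = sorted(parse_accept(accept_language), key=lambda p: p[1], reverse=True)
    for lang, quality in ranked:
        if quality > 0 and lang in available:
            return lang
    return default


print(choose_language("da, en-gb;q=0.8, en;q=0.7", {"en", "he"}, "he"))
```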

Agent-Driven Negotiation • Selection is made by the client after receiving an initial response from the server – Based on available representations specified in the initial response – Automatic or manual • Disadvantages: – needs a second request to obtain the best alternative representation

Protocol Switching • Protocol switching – Client can specify another protocol more suited to the data being transferred (e.g., a real-time synchronous protocol) • Client: “I hate HTTP/1.0” / “I want another protocol”

Authentication • Many sites require users to provide a username and password in order to access the documents housed on the server • This requirement provides a mechanism for keeping track of users (more than just a security mechanism)

Authentication [Figure: the client requests /~dbi/index.html from the web server www.cs.huji.ac.il; the server asks “Who are you?”; the client repeats the request with “I am Donald, my password is Duck”; the server sends the response]

Authentication • How does it work? – Client sends • an ordinary request message – Server responds with • a 401 Authorization Required status code • a WWW-Authenticate header, which specifies how to perform authentication – Client resends • the request message, but this time including the Authorization header (e.g., user name and password) – The client continues to add this header to each following request to that server
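As an illustration of the two headers, here is a sketch that assumes the Basic authentication scheme (the slides do not name a scheme). In Basic authentication the user name and password are joined with a colon and Base64-encoded, which hides them from casual view but is not encryption. The helper names and the realm string are hypothetical.

```python
import base64


def challenge(realm):
    """Server side: the 401 response headers that ask for credentials."""
    return 401, {"WWW-Authenticate": f'Basic realm="{realm}"'}


def basic_authorization(user, password):
    """Client side: the value of the Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def check(authorization, user, password):
    """Server side: verify the Authorization header on the repeated request."""
    scheme, _, token = authorization.partition(" ")
    if scheme != "Basic":
        return False
    return base64.b64decode(token).decode("utf-8") == f"{user}:{password}"


status, headers = challenge("dbi course pages")  # hypothetical realm name
auth = basic_authorization("Donald", "Duck")
print(status, headers["WWW-Authenticate"])
print("Authorization:", auth)
print(check(auth, "Donald", "Duck"))
```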

Cookies • Alternative way to identify browsers • Server response includes the Set-Cookie header, which has the attributes – name = VALUE – expires = DATE STRING – domain = DOMAIN NAME – path = PATH – secure • Client returns the cookie in requests to matching URLs

Cookies • Example: – Client contacts a web site for the first time – Server response includes the header: Set-Cookie: 1678453 – Client stores the cookie value and the server name in a special “cookie file” – For each further request to that server, the client will add the header Cookie: 1678453
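The client's bookkeeping can be sketched with the standard `http.cookies` module. This is a simplified illustration: the cookie is given a name (`id`, an assumption, since real Set-Cookie headers use the name=VALUE form), the "cookie file" is an in-memory dictionary, and the expires, domain, path, and secure attributes are parsed but not enforced.

```python
from http.cookies import SimpleCookie

cookie_file = {}  # server name -> {cookie name: value}


def remember(server, set_cookie_value):
    """Store the cookies from one Set-Cookie header value."""
    parsed = SimpleCookie()
    parsed.load(set_cookie_value)
    jar = cookie_file.setdefault(server, {})
    for name, morsel in parsed.items():
        jar[name] = morsel.value


def cookie_header(server):
    """Return the Cookie header value for a request, or None."""
    jar = cookie_file.get(server, {})
    return "; ".join(f"{name}={value}" for name, value in jar.items()) or None


# First response from the server sets the cookie...
remember("www.cs.huji.ac.il", "id=1678453; Path=/")
# ...and every further request to that server sends it back.
print("Cookie:", cookie_header("www.cs.huji.ac.il"))
```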

Cookies (cont.) • Usage: – Server requires authentication, but doesn't want to hassle a user with a user name and password – Remembering user's preferences for advertising – Cookies enable creating a virtual shopping cart • Problems – users who access the same site from different machines

Are you HTTP experts now? • Not yet • There are more headers, for example, that this talk did not cover • To know more, go to the specifications

Additional Information • For specifications and additional information: – http://www.w3.org/Protocols/Specs.html – http://www.jmarshall.com/easy/http/ – http://wdvl.com/Internet/Protocols/HTTP/article.html