LIS 901 N lecture 5: http, URI and apache
Thomas Krichel
2003-01-19

Structure
• http
• URI
• apache

http
• Stands for the Hypertext Transfer Protocol. It is the most important application layer protocol on the Internet today, because it provides the foundation for the World Wide Web.
• Defined in Fielding, Roy T., James Gettys, Jeffrey C. Mogul, Paul J. Leach, Tim Berners-Lee ``Hypertext Transfer Protocol -- HTTP/1.1'' (1999), RFC 2616

history
• 1990: version 0.9 allows for transfer of raw data.
• 1996: RFC 1945 defines version 1.0 by adding attribute: value headers.
• 1999: RFC 2616 (version 1.1) is more stringent and adds support for
– hierarchical proxies
– caching
– virtual hosts and some support for persistent connections

http resource identification
• Resources are identified through Uniform Resource Identifiers (URI).
• As far as http is concerned, URIs are strings.
• http can use ``absolute'' and ``relative'' URIs.
• A URL is a special case of a URI.

rfc about http
An application-level protocol for distributed, collaborative, hypermedia information systems. … HTTP is also used as a generic protocol for communication between user agents and proxies/gateways to other Internet systems, including those supported by the SMTP, NNTP, FTP, Gopher, and WAIS protocols. In this way, HTTP allows basic hypermedia access to resources available from diverse applications.

overall operation: client side
• Client sends request, required items are
– method
– request URI
– protocol version
• optional items are
– request modifiers
– client information

overall operation: server side
• Server sends response, required items are
– status line
– protocol version
– success or error code
• optional items are
– server information
– body

middleman
• intermediaries come in three flavors
– proxies, i.e. forwarding agents
– gateways, i.e. receiving agents
– tunnels, i.e. relay points that do not change the message, such as an encryption and decryption device

http assumes transport
• http assumes that there is a reliable way to transport data from one host on the Internet to another one.
• http requests and responses travel over TCP connections. In version 1.0 each request/response pair uses a separate connection; version 1.1 adds persistent connections that can carry several pairs.
• The default is TCP port 80, but other ports can be used.

Absolute http URL
• the absolute http URL is
http://host[:port][abs_path[?query]]
• optional components are in [ ]
• If abs_path is empty, it is /.
• The scheme name "http" and the host name are case-insensitive.
• Characters other than those in the ``reserved'' and ``unsafe'' sets of RFC 2396 are equivalent to their ``%HEX HEX'' encoding.
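
As a rough illustration of these rules, the following Python sketch splits an absolute http URL with the standard library function urllib.parse.urlsplit. The helper name describe_http_url and the example.org URLs are made up for this example; the default port and the empty-path rule are applied by hand.

```python
from urllib.parse import urlsplit


def describe_http_url(url):
    """Split an absolute http URL into host, port, path and query."""
    parts = urlsplit(url)
    # urlsplit lowercases the scheme, so "HTTP" and "http" compare equal.
    if parts.scheme != "http":
        raise ValueError("not an http URL: " + url)
    return {
        "host": parts.hostname,      # hostname is returned in lower case
        "port": parts.port or 80,    # the default is TCP port 80
        "path": parts.path or "/",   # an empty abs_path means /
        "query": parts.query,
    }


print(describe_http_url("HTTP://Example.ORG:8080/a/b?x=1"))
print(describe_http_url("http://example.org"))
```

The two calls show the case-insensitive scheme and host, and the defaults that fill in a missing port and path.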

character sets
• A character set is a method used with one or more tables to convert a sequence of binary digits into a sequence of characters.
• http shares its character set registry with MIME, the multimedia email extensions. It is based at the IANA, at http://www.isi.edu/in-notes/iana/assignments/media-types
• The default character set is ISO-8859-1.

http messages
• There are two types of messages.
– Requests are sent from the client to the server.
– Responses are sent from the server to the client.
• The generic format is the same as for email messages:
– start line
– message headers
– empty line
– body
• Empty lines before the start line are ignored.
• The request's start line is called the request-line.
• The response's start line is called the status-line.
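
The generic format can be sketched in a few lines of Python. The function names build_request and split_message are invented for this illustration, and the sketch ignores details such as folded header lines; it only shows the start line, headers, empty line and body layout.

```python
def build_request(method, request_uri, headers, body=""):
    """Assemble a request: start line, headers, empty line, body."""
    lines = ["%s %s HTTP/1.1" % (method, request_uri)]
    lines += ["%s: %s" % (name, value) for name, value in headers]
    return "\r\n".join(lines) + "\r\n\r\n" + body


def split_message(raw):
    """Split a message into (start line, list of header pairs, body)."""
    raw = raw.lstrip("\r\n")  # empty lines before the start line are ignored
    head, _, body = raw.partition("\r\n\r\n")
    start_line, *header_lines = head.split("\r\n")
    headers = []
    for line in header_lines:
        name, _, value = line.partition(":")
        headers.append((name.strip(), value.strip()))
    return start_line, headers, body


request = build_request("GET", "/home/krichel", [("Host", "openlib.org")])
print(repr(request))
print(split_message(request))
```

The same split_message helper works for responses, since only the start line differs between the two message types.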

The request headers
• Accept:
• Accept-Charset:
• Accept-Encoding:
• Accept-Language:
• Authorization:
• Expect:
• From:
• Host:
• If-Match:
• If-Modified-Since:
• If-None-Match:
• If-Range:
• If-Unmodified-Since:
• Max-Forwards:
• Proxy-Authorization:
• Range:
• Referer:
• TE:
• User-Agent:

The status line
• The status line is a single line of the form
HTTP-Version Status-Code Reason-Phrase
• The status code is a 3-digit number used by the computer.
• The reason phrase is a friendly note for a human to read.
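
A minimal Python sketch of taking a status line apart, assuming the three parts are separated by single spaces; parse_status_line is a name chosen for this example only.

```python
def parse_status_line(line):
    """Split 'HTTP-Version Status-Code Reason-Phrase' into its three parts."""
    # The reason phrase may itself contain spaces, so split at most twice.
    version, code, reason = line.rstrip("\r\n").split(" ", 2)
    if not (len(code) == 3 and code.isdigit()):
        raise ValueError("status code must be a 3-digit number: " + code)
    return version, int(code), reason


print(parse_status_line("HTTP/1.1 404 Not Found\r\n"))
```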

Status code classes
• 1xx Informational: Request received, continuing process
• 2xx Success: The action was successfully received, understood, and accepted
• 3xx Redirection: Further action must be taken in order to complete the request
• 4xx Client Error: The request contains bad syntax or cannot be understood
• 5xx Server Error: The request is valid but cannot be executed by the server
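
Since the class is given by the first digit, a client can classify any status code, even one it does not know. A small sketch in Python, where the table and the function name status_class are illustrative:

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code):
    """Return the class name of a 3-digit status code, from its first digit."""
    if not 100 <= code <= 599:
        raise ValueError("not a valid status code: %r" % code)
    return STATUS_CLASSES[code // 100]


print(status_class(206))
print(status_class(404))
```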

Status codes
• 100 Continue
• 101 Switching Protocols
• 200 OK
• 201 Created
• 202 Accepted
• 203 Non-Authoritative Information
• 204 No Content
• 205 Reset Content
• 206 Partial Content

Status codes II
• 300 Multiple Choices
• 301 Moved Permanently
• 302 Found
• 303 See Other
• 304 Not Modified
• 305 Use Proxy
• 307 Temporary Redirect

Status codes III
• 400 Bad Request
• 401 Unauthorized
• 402 Payment Required
• 403 Forbidden
• 404 Not Found
• 405 Method Not Allowed
• 406 Not Acceptable
• 407 Proxy Authentication Required
• 408 Request Time-out

Status codes IV
• 409 Conflict
• 410 Gone
• 411 Length Required
• 412 Precondition Failed
• 413 Request Entity Too Large
• 414 Request-URI Too Large
• 415 Unsupported Media Type
• 416 Requested range not satisfiable
• 417 Expectation Failed

Status codes V
• 500 Internal Server Error
• 501 Not Implemented
• 502 Bad Gateway
• 503 Service Unavailable
• 504 Gateway Time-out
• 505 HTTP Version not supported

Response headers
• Accept-Ranges:
• Age:
• ETag:
• Location:
• Proxy-Authenticate:
• Retry-After:
• Server:
• Vary:
• WWW-Authenticate:

Entity headers, common to response and request
• Allow:
• Content-Encoding:
• Content-Language:
• Content-Length:
• Content-Location:
• Content-MD5:
• Content-Range:
• Content-Type:
• Expires:
• Last-Modified:

The body
• The entity-body (if any) sent with an HTTP request or response is in a format and encoding defined by the entity-header fields.
• When an entity-body is included with a message, the data type of that body is determined via the header fields Content-Type and Content-Encoding.

GET and HEAD methods
• The GET method means retrieve whatever information (in the form of an entity) is identified by the Request-URI. If the Request-URI refers to a data-producing process, it is the produced data which shall be returned as the entity in the response and not the source text of the process, unless that text happens to be the output of the process.
• The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response.

Conditional & partial GET
• The semantics of the GET method change to a ``conditional GET'' if the request message includes an
– If-Modified-Since
– If-Unmodified-Since
– If-Match
– If-None-Match
– If-Range
header.
• The semantics of the GET method change to a ``partial GET'' if the request message includes a Range header field. A partial GET requests that only part of the entity be transferred.
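
To make this concrete, here is a small Python sketch that builds the header fields turning a plain GET into a conditional or partial one. The function name get_modifier_headers and its parameters are invented for this example; the date is formatted with the standard library's email.utils.formatdate, and the Range value assumes the byte range unit.

```python
from email.utils import formatdate


def get_modifier_headers(modified_since=None, etag=None, byte_range=None):
    """Return the header fields for a conditional and/or partial GET."""
    headers = {}
    if modified_since is not None:
        # modified_since is a Unix timestamp; http dates are given in GMT.
        headers["If-Modified-Since"] = formatdate(modified_since, usegmt=True)
    if etag is not None:
        headers["If-None-Match"] = etag
    if byte_range is not None:
        first, last = byte_range
        headers["Range"] = "bytes=%d-%d" % (first, last)
    return headers


print(get_modifier_headers(modified_since=0, byte_range=(0, 499)))
```

A client would add these fields to the message headers of an otherwise ordinary GET request.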

The POST method
• The POST method is used to request that the origin server accept the entity enclosed in the request as a new subordinate of the resource identified by the Request-URI in the Request-Line. POST is designed to allow a uniform method to cover the following functions:
– Annotation of existing resources;
– Posting a message to a bulletin board, newsgroup, mailing list, or similar group of articles;
– Providing a block of data, such as the result of submitting a form, to a data-handling process;
– Extending a database through an append operation.

PUT and DELETE methods
• The PUT method requests that the enclosed entity be stored under the supplied Request-URI. If the Request-URI refers to an already existing resource, the enclosed entity should be considered as a modified version of the one residing on the origin server.
• The DELETE method requests that the origin server delete the resource identified by the Request-URI.

URIs (background)
• URI: “uniform resource identifier”
• Originally, a generalization of:
– URL (uniform resource locator),
– URN (uniform resource name),
– URC (uniform resource citation),
– and potentially others,
• but mainly, URL and URN

The difference (in theory) between URL and URN:
• a URL is bound to a location
– when the resource moves, the URL changes
• a URN is a name
– thus location independent, and, in theory, persistent (whatever “persistent” means)

The Other View
• Distinction between URL and URN is artificial
• Both terms should be abolished and replaced by “URI”
• thus all identifier “schemes” would be URI schemes (even “http”) and no prefix would be necessary (URL, URN, or even URI).

Reasoning
• Original URI philosophy:
– URLs were a short-term solution and URNs long-term.
– URL would be a temporary identification mechanism until a location-independent, persistent identifier was developed, the URN.
• Now it seems:
– URNs won’t be any more persistent than URLs.
– persistence is a social problem, not a technical problem

URI vs URL
• The term ‘URL’ or “Uniform Resource Locator” is not used in standards anymore. It generally means a URI that contains a domain-name, but it is historical only.
• This presentation uses the term URI exclusively.
• The term ‘URL’ is still sufficient to convey the meaning but should not be used when precision is necessary.

What does a URI identify?
• A URI identifies a Resource.
• A URI only comes into existence when it is bound to a Resource.
• A Resource is defined as anything that is identified by a URI.
• A Resource only comes into existence when a URI is bound to it.
• A URI cannot exist without a Resource.
• A Resource cannot exist without a URI.

it all comes from Plato
• The “URI identifies an abstract Resource” formalism assumes the Platonic concept of “form”.
• A Resource, once bound to a URI and brought into existence, is only the abstract ‘essence’ of the ‘real world’ thing we perceive.
• Any physical or digital version of that Resource is only one of all possible physical representations of that Resource.
• For example, http://openlib.org/home/krichel is a URI for a homepage. Using language and content negotiation it is possible to request that page in many languages and formats. Which version is the Resource?
• Answer: none of them. Each is only a representation. It is possible to assign a URI even to the representations. But even then, each Resource is only the abstraction of the physical or digital thing, not the thing itself.

What is ‘resolution’?
• ‘Resolution’ means accessing some representation of the Resource that a URI identifies.
– For ‘http://foo.com/’ it means accessing the homepage of ‘foo.com’
– For ‘mailto:krichel@openlib.org’ it can mean sending an email message to that address.
• For URIs that contain network location information it is simply a matter of visiting that location and doing some function. I.e. ‘foo.com’ is the exact network host that can

The history
• Tim Berners-Lee came to the IETF in 1992 to develop the WorldWideWeb standards. At the time URIs were known as Universal Resource Locators.
• RFC 1738 “Uniform Resource Locators (URL)” was published in 1994.
• RFC 1738 was updated by RFC 1808, RFC 2368, RFC 2396.
• RFC 2396 “Uniform Resource Identifiers (URI): Generic Syntax” is the current standard.
• RFC 2396 may be updated to reflect developments in internationalization, terminology updates, and registration procedures.

Confusion…
• Due to misunderstandings and the formation of the W3C separately from the IETF, there was a long-term disagreement on certain aspects of URIs, especially when it came to Uniform Resource Names (URNs).
• A joint IETF/W3C URI Interest Group was formed in 2000 to investigate work that needed to be done with URIs in general.
• That group published “URIs, URLs, and URNs: Clarifications and Recommendations”, Report from the joint W3C/IETF URI Planning Interest Group (draft-mealling-uri-ig-01.txt), which begins to clarify the problems and proposes solutions.

URN
Uniform Resource Names are defined by RFC 2141 as a particular URI scheme with these characteristics:
1. Permanent – Once a URN is assigned to some Resource it can never be re-assigned to something else.
2. Location Independent – The actual URN should not contain any network location information such as domain-names, IP addresses, file path-names, etc.

RFC 2396
• Berners-Lee, Tim, Roy T. Fielding and Larry Masinter (1998) ``Uniform Resource Identifiers (URI): Generic Syntax'', RFC 2396
• A Uniform Resource Identifier (URI) is a compact string of characters for identifying an abstract or physical resource.
• URIs provide a simple and extensible means for identifying a resource.

operations on a URI
• There is a set of operations that can be applied to URIs. For a URL, for example, one operation is access to the resource.
• To understand if a given URI instance is valid, we have to study the operations applied to URIs.

benefits of uniformity
• It allows different types of resource identifiers to be used in the same context, even when the mechanisms used to access those resources may differ
• it allows uniform semantic interpretation of common syntactic conventions across different types of resource identifiers
• it allows introduction of new types of resource identifiers without interfering with the way that existing identifiers are used
• it allows the identifiers to be reused in many different contexts, thus permitting new applications or protocols to leverage a preexisting, large, and widely-used set of resource identifiers.

Resources and Identity in the RFC
• A resource can be anything that has identity. Not all resources are network ``retrievable''. The resource is the conceptual mapping to an entity or set of entities, not necessarily the entity which corresponds to that mapping at any particular instance in time.
• An identifier is an object that can act as a reference to something that has identity. In the case of URI, the object is a sequence of characters with a restricted syntax.

URI, URL, & URN in the RFC
• A URI can be further classified as a locator, a name, or both. The term ``Uniform Resource Locator'' (URL) refers to the subset of URI that identify resources via a representation of their primary access mechanism (e.g., their network “location”), rather than identifying the resource by name or by some other attribute(s) of that resource.
• The term ``Uniform Resource Name'' (URN) refers to the subset of URI that are required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable.

URN in the RFC
• A URN differs from a URL in that its primary purpose is persistent labeling of a resource with an identifier. That identifier is drawn from one of a set of defined namespaces, each of which has its own set name structure and assignment procedures. The “urn” scheme has been reserved to establish the requirements for a standardized URN namespace, as defined in “URN Syntax” RFC 2141 and its related specifications.

transcribability
• The URI syntax was designed with global transcribability as one of its main concerns. A URI is a sequence of characters from a very limited set, i.e. the letters of the basic Latin alphabet, digits, and a few special characters. A URI may be represented in a variety of ways.

consequences of transcribability
• A URI is a sequence of characters, which is not always represented as a sequence of octets.
• A URI may be transcribed from a non-network source, and thus should consist of characters that are most likely to be able to be typed into a computer, within the constraints imposed by keyboards (and related input devices) across languages and locales.
• A URI often needs to be remembered by people, and it is easier for people to remember a URI when it consists of meaningful components.

URI characters
• URIs consist of a restricted set of characters, not a sequence of octets. The allowable characters are primarily chosen to aid transcribability and usability both in computer systems and in non-computer communications. Characters used conventionally as delimiters around URIs are excluded.
• In the simplest case, the original character sequence contains only characters that are defined in US-ASCII, and the two levels of mapping are simple and easily invertible: each 'original character' is represented as the octet for the US-ASCII code for it, which is, in turn, represented as either the US-ASCII character, or else the "%" escape sequence for that octet.

reserved characters
• Many URIs include components consisting of, or delimited by, certain special characters. These characters are called ``reserved'', since their usage within the URI component is limited to their reserved purpose. If the data for a URI component would conflict with the reserved purpose, then the conflicting data must be escaped before forming the URI.
• they are ; / ? : @ & = + $ ,
• They are allowed within a URI, but may not be allowed within a particular component of the generic URI syntax.

unreserved & excluded characters
• The unreserved characters are those that are allowed and never take any special meaning. They are
– the upper and lowercase letters a to z and A to Z
– the decimal digits 0 to 9
– the following: - _ . ! ~ * ' ( )
• All characters that are neither reserved nor unreserved are excluded, for example
– < > # % " { } | ^ [ ] `
– and the blank.
They have to be escaped.

escaping
• When you want to use a character in a URI that is one of the excluded characters, you have to escape it. The way that this is done is to write a construction of the form
%hex hex
• where each hex is a digit or one of the letters a to f (uppercase or lowercase). The two hex characters give the code of the character's octet in hexadecimal. For example %7e is the character ~
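
The escaping rule can be sketched in Python as follows. The function name escape_component is made up for this example, and encoding non-ASCII characters as UTF-8 before escaping is an assumption of the sketch, not something RFC 2396 prescribes. The sketch escapes everything outside the unreserved set, reserved characters included, which is what is wanted for data inside a single component. Decoding uses the standard library function urllib.parse.unquote.

```python
from urllib.parse import unquote

# The unreserved set of RFC 2396: letters, digits and a few marks.
UNRESERVED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "-_.!~*'()"
)


def escape_component(text):
    """Escape every character outside the unreserved set as %hex hex."""
    out = []
    for octet in text.encode("utf-8"):  # assumption: UTF-8 for non-ASCII
        char = chr(octet)
        out.append(char if char in UNRESERVED else "%%%02X" % octet)
    return "".join(out)


print(escape_component("50% <off>"))
print(unquote("%7e"))
```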

The Semantic Web
• The W3C has been developing a new architecture that applies knowledge representation technology to the WWW.
• Using the Resource Description Framework (RDF), Statements are made using a Subject, Predicate and Object (very similar to Lisp and other predicate based languages).
• Each Subject, Predicate or Object is a Resource in the URI sense and is identified by a URI within an RDF Statement using XML Namespaces.

example
• This statement says that the Resource identified by the URI ‘http://openlib.org/home/krichel’ was created by the person ‘Thomas Krichel’:
<?xml version="1.0"?>
<RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Description about="http://openlib.org/home/krichel">
    <Creator xmlns="http://description.org/schema/">Thomas Krichel</Creator>
  </Description>
</RDF>

The Semantic Web
• The combination of Web Services and the Semantic Web should give the Web the ability to turn any existing Web Resource into a full node in a purposefully built knowledge representation system with a functional component that allows that knowledge to be acted on.
• And both are based on the simple Uniform Resource Identifier.

Apache
• Is a free, open-source web server that is produced by the Apache Software Foundation, see http://www.apache.org
• It has over 50% of the market share.
• It runs best on UN*X systems but can run on a Mickeysoft OS as well.
• I will cover it here because it is freely available.
• I am covering version 1.3

Apache in debian
• /etc/apache/httpd.conf is the main configuration file.
• /etc/init.d/apache action is used to fire the daemon up or down, where action is one of
– start
– stop
– restart
• The daemon runs as user www-data

Virtual host
• On a single installation of apache several web servers can be supported.
• That means the server can behave in a different way according to how it is being addressed.
• The easiest way to address a server in different ways is through DNS host names.

Directives in httpd.conf
• The configuration directives are grouped into three basic sections:
– Directives that control the operation of the Apache server process as a whole (the 'global environment').
– Directives that define the parameters of the 'main' or 'default' server, which responds to requests that aren't handled by a virtual host. These directives also provide default values for the settings of all virtual hosts.
– Settings for virtual hosts, which allow Web requests to be sent to different IP addresses or hostnames and have them handled by the same Apache server process.

Server type
• On a UN*X machine, the server can either be fired up on its own, or it can be run as part of the overall Internet daemon inetd.
• Usually “standalone” is used.

Server root
• Sets the directory where apache finds its own configuration files.
• If log file names are not given as absolute paths, they will be placed in the server root directory.

Timeout
• This sets the number of seconds that the server waits for the result of a request to be computed before sending a timeout.
• On wotan this is set to 300 seconds. This is rather a long time; the user will have gone for coffee by then.

Listen
• Tells the server which port and IP address to listen to. This can be used to have the server respond only to requests to a certain IP address or to listen to a non-standard port, i.e. not port 80.

LoadModule
• To extend apache, modules have been written. They have to be loaded explicitly:
LoadModule module file
• where module is the name of the module and file is the name of the file that contains the module
• Looking at these directives gives you vital information about what the server can do.

Server directives
• User
– Gives the user name apache runs under
• Group
– Gives the group name the server runs under
• ServerAdmin
– Email of a human who runs the default server
• ServerName
– The name of the default server
• DocumentRoot
– The top level directory of the default server

Directory options
• Many options for a directory can be set with
<directory name> instructions </directory>
• name is the name of a directory.
• instructions can be a whole lot of stuff.

Directory instructions
• Options sets global options for the directory, it can be
– None
– All
– or any of
  • Indexes (form directory indexes?)
  • Includes (allow server-side includes?)
  • FollowSymLinks (allow to follow symbolic links?)
  • ExecCGI (allow cgi-scripts?)
  • MultiViews

Access control
• Can be part of <directory> to set directory level access control
• Example
– Allow from friendly.com
– Deny from evil.com
• Sometimes you have to set the order, example
– Order allow,deny
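
Why the order matters can be seen in a small Python model. It follows the Order semantics as I understand them from the Apache 1.3 documentation: with allow,deny access is denied by default and a Deny match wins; with deny,allow access is allowed by default and an Allow match wins. The function names and the simple domain-suffix matching are simplifications made for this sketch; they are not Apache code.

```python
def host_matches(host, pattern):
    """'all' matches every host; otherwise match a (partial) domain name."""
    if pattern == "all":
        return True
    return host == pattern or host.endswith("." + pattern)


def access_allowed(host, order, allow=(), deny=()):
    """Model the effect of Order with Allow from / Deny from lists."""
    allowed = any(host_matches(host, p) for p in allow)
    denied = any(host_matches(host, p) for p in deny)
    if order == "allow,deny":
        return allowed and not denied   # denied by default
    if order == "deny,allow":
        return allowed or not denied    # allowed by default
    raise ValueError("unknown order: " + order)


rules = {"allow": ["friendly.com"], "deny": ["evil.com"]}
for host in ("www.friendly.com", "host.evil.com", "neutral.org"):
    print(host, access_allowed(host, "allow,deny", **rules),
          access_allowed(host, "deny,allow", **rules))
```

A host that matches neither list, such as neutral.org here, is the case where the two orders give different answers.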

Authentication
• This is used to enable password access. In that case the authentication is handled by a file .htaccess in the directory.
• The AllowOverride instruction is used to state what the user can do within the .htaccess file. Depending on its values, you can password protect a web site.
• We will not discuss this further here.

UserDir
• This sets the directory, created by the user in her home directory, that is accessed by requests to ~user.
• On wotan, we have
UserDir public_html
• That is the default, actually.

Set up permission for user home directories
<Directory /home/*/public_html>
    AllowOverride FileInfo AuthConfig Limit
    Options +Includes
    Options MultiViews Indexes SymLinksIfOwnerMatch IncludesNoExec
    <Limit GET POST OPTIONS PROPFIND>
        Order allow,deny
        Allow from all
    </Limit>
    <Limit PUT DELETE PATCH PROPPATCH MKCOL COPY MOVE LOCK UNLOCK>
        Order deny,allow
        Deny from all
    </Limit>
</Directory>

Logs
• The web server logs every transaction.
• There are several types of logs that used to be kept separately in the early days.
• Example of an access log entry:
209.73.164.50 - - [26/Jan/2003:09:19:51 -0500] "GET /~ramon/videos/ntsc175.html HTTP/1.1" 206 808
• Additional information may be kept in the referer and user agent logs.
• The referer log may have some interesting information on who links to your pages.
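
An access log entry of this shape can be taken apart with a regular expression. The Python sketch below assumes the common log format (host, identity, user, time, request, status, size); the pattern, the function name parse_log_line and the sample line with its documentation address 192.0.2.1 are all made up for this illustration.

```python
import re

# One named group per field of a common log format entry.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return the fields of one access log entry as a dictionary."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError("not in common log format: " + line)
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A size of "-" means that no body was sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


sample = ('192.0.2.1 - - [26/Jan/2003:09:19:51 -0500] '
          '"GET /index.html HTTP/1.1" 200 808')
print(parse_log_line(sample))
```

Counting requests per host or per status code is then a matter of looping over the log file and tallying the dictionary fields.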

Alias
• Is a directive to make links between things that are seen at the URL level and the file structure on the physical machine.
• Example
Alias /stuff /home/krichel/stuff
• will show the content of /home/krichel/stuff at the URL http://…/stuff.
• ScriptAlias works in the same way but allows for scripts to be executed.

Virtual hosts
• Most apache directives can be wrapped in a <virtualhost> </virtualhost> grouping.
• This implies that they only hold for the virtual host. Example, from wotan:
<VirtualHost *>
    ServerAdmin krichel@openlib.org
    DocumentRoot /home/connect/public_html
    ServerName connections2003.liu.edu
    ErrorLog /var/log/apache/connections2003-error.log
    CustomLog /var/log/apache/connectios2003-access.log common
</VirtualHost>

http://openlib.org/home/krichel
Thank you for your attention!