
UFCEKG-20-2 Data, Schemas & Applications
Lecture 3: Data Representation, XML & RSS



Last week:
o introduction to the web
o URI schemes & encoding
o HTTP protocol
o media types
o request / response cycle
o GET, POST, PUT and DELETE
o introduction to mashups
o simple mashup example with forms


WWW : definition
The World Wide Web (abbreviated as WWW or W3, commonly known as the Web) is a system of interlinked hypertext documents accessed via the Internet. With a web browser, one can view web pages that may contain text, images, videos, and other multimedia, and navigate between them via hyperlinks.
Wikipedia : World Wide Web
Concept originally proposed by Sir Tim Berners-Lee (1989), based on earlier hypertext systems. Berners-Lee and Belgian computer scientist Robert Cailliau proposed in 1990 to use hypertext "to link and access information of various kinds as a web of nodes in which the user can browse at will", and they publicly introduced the project in December of the same year.


Problem : How to encode data for communication
Competing constraints
o Data must be serialised into a character stream
o Communicate the meaning of the data as well as the data
o Error-free
o Minimal size
o Handle multi-lingual text
Bank of America Market Data Mirrors
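The tension between serialising to a stream and handling multi-lingual text can be sketched in Python. The sample strings are invented for illustration; the point is only that the chosen character encoding decides which text can be transmitted at all.

```python
# Invented sample text mixing Latin characters with Japanese.
text = "Zürich café, 東京"

# UTF-8 can serialise any Unicode text into a byte stream and back.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# ISO-8859-1 (Latin-1) covers Western European characters only.
latin1_ok = "Zürich".encode("iso-8859-1")
try:
    text.encode("iso-8859-1")
    latin1_handles_all = True
except UnicodeEncodeError:
    latin1_handles_all = False

print(round_trip == text, len(latin1_ok), latin1_handles_all)
```

Sender and receiver must agree on the encoding, which is why XML documents can declare it in their first line.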


Solutions
o Card file based
o csv
o xls - Excel file format
o XML
o SQL export
o JSON - JavaScript Object Notation
The Medabar in Asmara, Eritrea (Google Map)
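To compare three of these formats side by side, the sketch below serialises one record as CSV, JSON and XML using the Python standard library. The record, its field names and the element name "reading" are invented for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Invented example record; all values kept as strings for simplicity.
record = {"station": "Alveston", "date": "2008-10-14", "temp_c": "12.5"}

# CSV: meaning is carried by column position (plus an optional header line).
buf = io.StringIO(newline="")
writer = csv.writer(buf)
writer.writerow(record.keys())
writer.writerow(record.values())
as_csv = buf.getvalue()

# JSON: each value is labelled with its name.
as_json = json.dumps(record)

# XML: each value is wrapped in a named element under a single root.
root = ET.Element("reading")
for name, value in record.items():
    ET.SubElement(root, name).text = value
as_xml = ET.tostring(root, encoding="unicode")

print(as_csv)
print(as_json)
print(as_xml)
```

The CSV form is the smallest, while the JSON and XML forms repeat the field names with every record, which is the size-versus-meaning trade-off from the previous slide.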


Card-based
Examples
o ATCO-CIF for timetables
o IGES for Computer-Aided Design
Characteristics
o Based on old 80-column punched cards
o Multiple record types
o Fixed field widths
o No formal language to define the format
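Reading a card-style file comes down to slicing each line at fixed column positions, with the layout chosen by the record type. The layout below is invented for illustration and is not the real ATCO-CIF or IGES layout.

```python
# Hypothetical card layout (NOT real ATCO-CIF): fields are fixed by column.
#   cols 0-1   record type ("ST" = stop, "TM" = time)
#   ST record: cols 2-9 stop code, cols 10-29 stop name
#   TM record: cols 2-9 stop code, cols 10-13 departure time HHMM
LAYOUTS = {
    "ST": [("code", 2, 10), ("name", 10, 30)],
    "TM": [("code", 2, 10), ("time", 10, 14)],
}

def parse_card(line):
    """Slice one fixed-width record into named fields."""
    record_type = line[0:2]
    fields = {"type": record_type}
    for name, start, end in LAYOUTS[record_type]:
        fields[name] = line[start:end].strip()
    return fields

cards = [
    "ST0100BRP0Bristol Parkway     ",
    "TM0100BRP00915",
]
records = [parse_card(card) for card in cards]
print(records)
```

The layout table lives only in the program, which illustrates the last characteristic: nothing in the file itself describes the format.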


CSV
Examples
o Alveston (Bristol) weather data
o World Health Organization (WHO) - generated estimates of TB mortality, prevalence, incidence (including incidence of HIV+TB) and case detection rate
o 1000 Songs - Google Spreadsheet
Characteristics
o Data values separated by a common separator character - space, comma or tab
o Column position is significant
o Lines separated by newlines - coding depends on OS: line feed (x0A) on Unix; carriage return (x0D) followed by line feed on Windows; carriage return alone on old Macs
o Separator must not occur in data values, or some other convention is needed: quotes around the value, or an escape character
o Column headings may be the first line
o Only tables - all lines the same
o All columns required - a problem for space-separated data
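The quoting convention can be seen with Python's standard csv module. The rows are invented; one value deliberately contains both the separator and quote characters.

```python
import csv
import io

# Invented rows; the note contains a comma and double quotes.
rows = [
    ["station", "date", "note"],
    ["Alveston", "2008-10-14", 'Windy, with "heavy" rain'],
]

# newline="" leaves line endings to the csv module (it writes CR LF by default).
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()
print(text)

# Reading back recovers the original values, embedded separator included.
parsed = list(csv.reader(io.StringIO(text, newline="")))
print(parsed == rows)
```

The writer wraps the awkward value in quotes and doubles the quotes inside it, which is one common convention; other CSV dialects use an escape character instead.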


Tagged record structures
Data with optional and repeated values need more complex structures. Many have been developed for specific domains
o MARC library catalogue records
o EDIFACT for commercial Electronic Data Interchange (EDI)
o EDIF LISP-based nested data
o EXIF data embedded in a JPEG image


XML
A generic data format based on tagged elements in a tree structure. Developed from GML, via SGML. GML is a document markup language developed by Charles Goldfarb at IBM in 1969.
Examples
o Alveston WDL config file
o UWE news RSS feed
Tree with Buddhist prayer flags


XML domain vocabularies
XML defines only the rules for a well-formed document. The allowable tags, their structuring and order in a document, the range of allowable values and the meaning of those tags depend on the XML application, called a vocabulary. There are now hundreds of XML vocabularies designed for every sort of data
o XHTML - the version of HTML which conforms to XML
o SVG - graphics
o TransXChange for timetables
o RSS and Atom for news
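The difference between "well-formed" and "belongs to a vocabulary" can be sketched with the standard library parser. The tag set ALLOWED and the helper uses_only_allowed_tags are hypothetical stand-ins for a real schema, which would also check order and values.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """True if the text obeys XML's own rules, whatever tags it uses."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical vocabulary: just a set of permitted tag names.
ALLOWED = {"weather", "temp"}

def uses_only_allowed_tags(text, allowed=ALLOWED):
    """Crude vocabulary check: every element name must be in the set."""
    return all(element.tag in allowed for element in ET.fromstring(text).iter())

good = "<weather><temp unit='C'>12</temp></weather>"
badly_nested = "<weather><temp>12</weather></temp>"
other_vocabulary = "<weather><wind>5</wind></weather>"

print(is_well_formed(good), uses_only_allowed_tags(good))
print(is_well_formed(badly_nested))
print(is_well_formed(other_vocabulary), uses_only_allowed_tags(other_vocabulary))
```

The third document parses without complaint because XML itself places no restriction on tag names; only the vocabulary check rejects it.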


XML processing vocabularies
There are also vocabularies for languages for processing XML
o XSLT - for transforming XML documents
o XSL-FO - for transforming to PDF documents
o XML Schema - for defining XML vocabularies
o XProc - for defining XML pipelines


Problem: News dissemination
I want to disseminate news about my project/company, and allow interested people to read it, e.g. the university wants to spread the news about successful staff.
Solution 1 : HTML page
Publish a page of news on the website in HTML
Problems
o how do visitors know when it's changed?
o news from different universities cannot be easily combined (why?)


Solution : email
Encourage interested users to subscribe to your company newsletter.
Problems
o Subscription is a barrier
o Clutters up email boxes
o Can look like spam
o List management and emailing overhead


Solution : Create XML document for news
UWE makes up its own set of additional tags
<newsItem date='2007-10-2'>
  <newsTitle>UWE best in West</newsTitle>
  <newsBody>UWE wins tiddlywinks again</newsBody>
  <contact>press@uwe.ac.uk</contact>
</newsItem>
Problems
o someone has to design this language
o has to be translated to HTML to display
o a reader has to understand multiple new tags from different sources
o needs to be distinguished from standard HTML
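The "has to be translated to HTML" problem means every consumer of a home-made vocabulary needs code like the sketch below. The function news_to_html and the HTML layout it produces are invented for illustration; they are not part of any UWE system.

```python
import xml.etree.ElementTree as ET
from html import escape

NEWS = """<newsItem date="2007-10-2">
  <newsTitle>UWE best in West</newsTitle>
  <newsBody>UWE wins tiddlywinks again</newsBody>
  <contact>press@uwe.ac.uk</contact>
</newsItem>"""

def news_to_html(xml_text):
    """Translate one newsItem element into an HTML fragment (hypothetical layout)."""
    item = ET.fromstring(xml_text)
    # escape() stops characters such as < and & in the news text breaking the HTML.
    title = escape(item.findtext("newsTitle", default=""))
    body = escape(item.findtext("newsBody", default=""))
    contact = escape(item.findtext("contact", default=""))
    return (
        f"<h2>{title}</h2>\n"
        f"<p>{body}</p>\n"
        f'<p>Contact: <a href="mailto:{contact}">{contact}</a></p>'
    )

html_fragment = news_to_html(NEWS)
print(html_fragment)
```

A second university with different tag names would need a second translator, which is the motivation for agreeing on a shared vocabulary.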


Aside: Namespaces
Problem
How to distinguish in a document XML tags from different vocabularies?
Solution
o define a (global) unique URI for the vocabulary
o use an arbitrary prefix - news: for all tags in the same vocabulary - unique within a document
o link the prefix to the vocabulary in the document
<h1>UWE news</h1>
<p>
  <news:item xmlns:news="http://www.uwe.ac.uk/news" date="2007-10-2">
    <news:Title>UWE best in West</news:Title>
    <news:Body>UWE wins tiddlywinks again</news:Body>
    <news:Contact>press@uwe.ac.uk</news:Contact>
  </news:item>
</p>
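A namespace-aware parser identifies elements by the vocabulary URI, not by the prefix. The sketch below uses Python's ElementTree; the wrapping div element is an assumption, added only so that the fragment has a single root.

```python
import xml.etree.ElementTree as ET

# The <div> wrapper is added for this sketch so the document has one root.
DOC = """<div>
  <h1>UWE news</h1>
  <p>
    <news:item xmlns:news="http://www.uwe.ac.uk/news" date="2007-10-2">
      <news:Title>UWE best in West</news:Title>
    </news:item>
  </p>
</div>"""

root = ET.fromstring(DOC)

# The reading program may pick its own prefix ("n"); only the URI has to match.
NS = {"n": "http://www.uwe.ac.uk/news"}
item = root.find(".//n:item", NS)
title = item.findtext("n:Title", namespaces=NS)

print(item.tag)               # ElementTree expands the prefix into {URI}name
print(title)
print(root.find("h1").tag)    # elements outside the namespace keep plain names
```

Because matching is done on the URI, two documents can use different prefixes for the same vocabulary and still be processed by the same code.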


Solution : RSS
o Standardize on one (or several!) set of standard tags
o Tags are machine-readable to identify news items in a list of web sites
o RSS 2.0
  o Really Simple Syndication
  o Rich Site Summary
o Atom - a more recent format
o Differences - dates (RFC 822 v RFC 3339 timestamps), multi-lingual content
Characteristics
o Structure: rss / channel / item tree
o Items in reverse chronological order
o Few mandatory tags
o Namespaces allow additional vocabularies to be added
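The date difference between the two formats can be illustrated with the standard library: RSS 2.0 uses RFC 822 style dates, while Atom uses RFC 3339 timestamps. This is a minimal sketch using one sample date, not a full converter.

```python
from datetime import datetime, timedelta
from email.utils import parsedate_to_datetime

# RFC 822 style date, as found in RSS 2.0 pubDate and lastBuildDate elements.
rss_date = "Mon, 13 Oct 2008 14:28:30 GMT"
moment = parsedate_to_datetime(rss_date)

# ISO 8601 output with a numeric offset is also a valid RFC 3339 timestamp,
# the style Atom uses.
atom_date = moment.isoformat()
print(atom_date)

# Both strings denote the same instant.
same_instant = datetime.fromisoformat(atom_date) == moment
print(same_instant)
```

The RFC 3339 form sorts correctly as plain text and carries no language-specific day or month names, which is part of why Atom adopted it.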


Example RSS - UWE news
<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
  <channel>
    <title>UWE News</title>
    <link>http://www.uwe.ac.uk</link>
    <description>Latest UWE press releases</description>
    <image>
      <url>http://info.uwe.ac.uk/common/assets/2004Design/logoNoBorder.gif</url>
      <title>University of the West of England</title>
      <link>http://www.uwe.ac.uk</link>
    </image>
    <pubDate>Fri, 13 Oct 2008 15:44 GMT</pubDate>
    <item>
      <title>New research looks to transport users for solutions</title>
      <link>http://info.uwe.ac.uk/news/uwenews/article.asp?item=1363</link>
      <description>'Ideas in Transit' is a new initiative which will look to transport users' experiences and creativity as a source of innovation to tackle the UK's transport problems..</description>
    </item>
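Because every RSS 2.0 feed follows the rss / channel / item structure, one small function can read any of them. The sketch below works on a shortened, closed copy of the UWE feed held in a string; fetching a live feed over HTTP is left out.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the feed above, with the closing tags supplied.
FEED = """<rss version="2.0">
  <channel>
    <title>UWE News</title>
    <link>http://www.uwe.ac.uk</link>
    <description>Latest UWE press releases</description>
    <item>
      <title>New research looks to transport users for solutions</title>
      <link>http://info.uwe.ac.uk/news/uwenews/article.asp?item=1363</link>
    </item>
  </channel>
</rss>"""

def read_items(feed_text):
    """Return the channel title and a list of (item title, item link) pairs."""
    channel = ET.fromstring(feed_text).find("channel")
    items = [
        (item.findtext("title"), item.findtext("link"))
        for item in channel.findall("item")
    ]
    return channel.findtext("title"), items

channel_title, items = read_items(FEED)
print(channel_title)
for title, link in items:
    print("-", title, link)
```

The same function would work unchanged on the BBC feed, since it relies only on the element names that RSS 2.0 standardises.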


Example RSS - BBC Finance News
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet title="XSL_formatting" type="text/xsl" href="/shared/bsp/xsl/rss/nolsol.xsl"?>
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss">
  <channel>
    <title>BBC News | Business | UK Edition</title>
    <link>http://news.bbc.co.uk/go/rss/-/1/hi/business/default.stm</link>
    <description>Visit BBC News for up-to-the-minute news, breaking news, video, audio and feature stories. BBC News provides trusted World and UK news as well as local and regional perspectives. Also entertainment, business, science, technology and health news.</description>
    <language>en-gb</language>
    <lastBuildDate>Mon, 13 Oct 2008 14:28:30 GMT</lastBuildDate>
    <copyright>Copyright: (C) British Broadcasting Corporation, see http://news.bbc.co.uk/1/hi/help/rss/4498287.stm for terms and conditions of reuse</copyright>
    <docs>http://www.bbc.co.uk/syndication/</docs>
    <ttl>15</ttl>
    <image>
      <title>BBC News</title>
      <url>http://news.bbc.co.uk/nol/shared/img/bbc_news_120x60.gif</url>
      <link>http://news.bbc.co.uk/go/rss/-/1/hi/business/default.stm</link>
    </image>
    <item>
      <title>UK banks receive £37bn bail-out</title>
      <description>The UK government says it is to inject a total of up to £37bn into Royal …..
    </item>


RSS aggregation
Problem
How to keep track of multiple feeds
Solution
http://www.youtube.com/watch?v=0klgLsSxGsU&feature=player_embedded#t=0s
o Application needed which is stateful - remembers what items you have read
o Integrates multiple feeds into one 'magazine'
o Polls RSS providers on a regular basis
Feed integrators such as Bloglines and Google Reader reduce the load on the provider and provide some filtering
There is an RSS reader integrated into MyUWE
RSS Aggregation with Bloglines
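The two core duties of an aggregator, merging feeds and remembering what has been read, can be sketched as below. The Aggregator class, the two feeds and their example.org links are all invented; polling over HTTP is omitted and the feeds are plain strings.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

class Aggregator:
    """Toy stateful aggregator: merges feeds and remembers which links were read."""

    def __init__(self):
        self.read_links = set()   # the state kept between polls

    def unread(self, feed_texts):
        entries = []
        for text in feed_texts:
            for item in ET.fromstring(text).iter("item"):
                link = item.findtext("link")
                if link not in self.read_links:
                    when = parsedate_to_datetime(item.findtext("pubDate"))
                    entries.append((when, item.findtext("title"), link))
        # Newest first, as in a single feed.
        return sorted(entries, reverse=True)

    def mark_read(self, link):
        self.read_links.add(link)

# Invented feeds standing in for two polled providers.
FEED_A = """<rss version="2.0"><channel><title>A</title>
<item><title>Older story</title><link>http://example.org/a/1</link>
<pubDate>Mon, 13 Oct 2008 09:00:00 GMT</pubDate></item>
</channel></rss>"""
FEED_B = """<rss version="2.0"><channel><title>B</title>
<item><title>Newer story</title><link>http://example.org/b/1</link>
<pubDate>Mon, 13 Oct 2008 14:28:30 GMT</pubDate></item>
</channel></rss>"""

reader = Aggregator()
first = reader.unread([FEED_A, FEED_B])
reader.mark_read("http://example.org/b/1")
second = reader.unread([FEED_A, FEED_B])
print([title for _, title, _ in first])
print([title for _, title, _ in second])
```

Using the item link as the identity of a story is a simplification; RSS 2.0 also offers an optional guid element for this purpose.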


RSS as a tree structure
o UWE news
o BBC Finance news
o Earthquakes


XML Characteristics
o strings enclosed in tags which provide a human-readable name for the element - so-called self-describing
o elements may be nested to create hierarchical data structures
o element tags may be repeated
o element names can be relative to their parent
o element structure can be formally defined


Aside: Self-describing
o Element names provide a clue about the meaning of the data, but not enough
  o names are ambiguous
  o names may be misleading
  o what units?
  o what accuracy?
  o what origin?
- leads to need for meta-data
  o who created
  o when
  o what license to use
  o why


XML terminology
XML documents are tree structures, with each node bounded by an opening and a closing tag
o Element: the opening tag, attributes, the body of the element and the closing tag. Elements are not elemental!
o Tag name: the name in angle brackets - must conform to rules, may have a prefix
o Attribute: a name="value" pair attached to an element. Names follow the same rules as tag names.
o Parent: all elements except the root have one parent
o Child: an element nested in another parent element
o Root: every document has a single root element with no parent
o Mixed content: an element may contain a mixture of text and other elements
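These terms map directly onto what a parser returns. The sketch below uses Python's ElementTree on an invented one-line document with mixed content; the parent_of dictionary is a helper built here, since ElementTree elements do not store a link to their parent.

```python
import xml.etree.ElementTree as ET

# Invented document: a root with one attribute, text, and one child element.
root = ET.fromstring('<p lang="en">UWE wins <b>again</b> this year</p>')
child = root[0]

# Build a child -> parent map by walking the tree.
parent_of = {c: p for p in root.iter() for c in p}

print(root.tag, root.attrib)          # tag name and attributes of the root
print(parent_of[child].tag)           # the child's parent
print(root in parent_of)              # the root has no parent

# Mixed content: text before the child, inside it, and after it.
print(repr(root.text), repr(child.text), repr(child.tail))
```

In this model the text following a child element is stored on the child as its tail, which is how ElementTree represents mixed content.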


Basic XML rules
o A single root element
o Tags must be properly nested
o An element must be closed:
  o Open and closing tag <p>...</p>
  o Empty element, e.g. <hr size="3"/>
Other formatting rules
o XML names are case sensitive, no spaces, restricted character set
o Attribute values must be single or double-quoted
o Special characters coded as references: &#10; (a line feed), &gt; (>)
o Some characters have special meaning, e.g. < is the start of a tag and & is the first character of an entity reference. In XML data these have to be encoded as &lt; and &amp; or enclosed in <![CDATA[ ... ]]>
o Preferably use standard formats for representing values, e.g. 2008-10-14 for a date
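The encoding rules for special characters can be checked with the standard library. The element name offer and the sample text are invented; escape and quoteattr come from Python's xml.sax.saxutils module.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Invented text containing both characters with special meaning.
raw = "Fish & chips < 5 pounds"

# Option 1: encode & and < as entity references.
encoded = escape(raw)
doc = f"<offer note={quoteattr(raw)}>{encoded}</offer>"
parsed = ET.fromstring(doc)

# Option 2: wrap the raw text in a CDATA section instead.
cdata_doc = f"<offer><![CDATA[{raw}]]></offer>"
from_cdata = ET.fromstring(cdata_doc).text

# A numeric character reference: &#10; is a line feed.
with_newline = ET.fromstring("<a>line1&#10;line2</a>").text

print(encoded)
print(parsed.text == raw, parsed.get("note") == raw, from_cdata == raw)
print(repr(with_newline))
```

Inserting the raw string into the document without either option would make it ill-formed, and the parser would reject it.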