
Semi-Structured Data and XML By Krutika Thakur, ME 1st yr (521001)

Semi-Structured Data
• In some applications, data is collected in an ad hoc manner before it is known how it will be stored and managed.
• This data may have a certain structure, but not all of the information collected will have an identical structure. This type of data is known as semi-structured data.
• The schema information is mixed in with the data values, since each data object can have different attributes that are not known in advance. Hence, this type of data is sometimes referred to as self-describing data.

• Semi-structured data may be displayed as a directed graph: the labels on the edges carry the schema names, and the leaf nodes hold the data values.
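The directed-graph view can be sketched in a few lines of Python. This is only an illustration: the two project records, their attribute names, and the node ids (n0, n1, ...) are hypothetical and not taken from the slides. Note that the two records do not share the same attributes, which is what makes the data semi-structured.

```python
from itertools import count

# Hypothetical records: the second project has no location or workers,
# and the first has no budget. There is no fixed schema.
data = {
    "project": [
        {"name": "ProductX", "location": "Bellaire", "worker": ["Smith", "Joyce"]},
        {"name": "ProductY", "budget": 50000},
    ]
}

def to_edges(root_map):
    """Return (source_node, edge_label, target) triples for nested dicts."""
    ids = count(1)
    edges = []

    def visit(node_id, value_map):
        for label, value in value_map.items():
            items = value if isinstance(value, list) else [value]
            for item in items:
                if isinstance(item, dict):
                    # Internal node: a nested object gets its own node id.
                    child = f"n{next(ids)}"
                    edges.append((node_id, label, child))
                    visit(child, item)
                else:
                    # Leaf node: the target is the data value itself.
                    edges.append((node_id, label, item))

    visit("n0", root_map)
    return edges

edges = to_edges(data)
for edge in edges:
    print(edge)
```

Each edge label (project, name, budget, ...) is schema information stored alongside the values, which is why such data is called self-describing.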

Introduction to XML
• XML stands for Extensible Markup Language.
• A markup language is used to provide information about a document.
• Tags are added to the document to provide the extra information.
• HTML tags tell a browser how to display the document.
• XML tags give a reader some idea of what the data means.

XML Rules
• Tags are enclosed in angle brackets.
• Tags come in pairs with start-tags and end-tags.
• Tags must be properly nested.
  § <name><email>…</name></email> is not allowed.
  § <name><email>…</email></name> is.
• Tags that do not have end-tags must be terminated by a ‘/’.
  § <br /> is an HTML example.
• Tags are case sensitive.
  § <address> is not the same as <Address>.
• Tag names may not begin with the letters XML in any combination of cases; such names are reserved.
• Tags may not contain ‘<’ or ‘&’. In text content these characters are written as &lt; and &amp;.
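The rules above can be tried out with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the sample strings and the helper name is_well_formed are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical samples, one per rule.
samples = {
    "properly nested": "<name><email>a@example.com</email></name>",
    "overlapping tags": "<name><email>a@example.com</name></email>",
    "empty tag closed with /": "<note><br/></note>",
    "empty tag left open": "<note><br></note>",
    "case mismatch": "<address>x</Address>",
    "raw ampersand": "<company>A & B</company>",
    "escaped ampersand": "<company>A &amp; B</company>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'accepted' if ok else 'rejected'}")
```

Only the properly nested, self-closed, and escaped samples should be accepted; the other four break one rule each.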

XML Hierarchical Model
• The basic object in XML is the XML document.
• There are two main structuring concepts that are used to construct an XML document:
  § Attributes
  § Elements
• Attributes in XML provide additional information that describes elements.

• As in HTML, elements are identified in a document by their start tag and end tag.
• The tag names are enclosed between angle brackets <…>, and end tags are further identified by a forward slash </…>.
• Complex elements are constructed from other elements hierarchically, whereas simple elements contain data values.
FIGURE: A complex XML element called <projects>
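The difference between complex elements, simple elements, and attributes can be seen by walking a small tree. The document below is a hypothetical stand-in for the <projects> figure, which did not survive in these notes; its child elements and the number attribute are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical <projects> document (not the original figure).
doc = """
<projects>
  <project number="1">
    <name>ProductX</name>
    <location>Bellaire</location>
  </project>
  <project number="2">
    <name>ProductY</name>
  </project>
</projects>
"""

root = ET.fromstring(doc)

# An element with child elements is complex; one holding only a value is simple.
kinds = {el.tag: ("complex" if len(el) else "simple") for el in root.iter()}

# Attributes describe an element and are read separately from its content.
numbers = [project.get("number") for project in root.findall("project")]

print(kinds)
print(numbers)
```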

XML Hierarchical (Tree) Data Model
• It is possible to characterize three main types of XML documents:
  1. Data-centric XML documents: These documents have many small data items that follow a specific structure, and hence may be extracted from a structured database. They are formatted as XML documents in order to exchange them or display them over the Web.
  2. Document-centric XML documents: These are documents with large amounts of text, such as news articles or books. There are few or no structured data elements in these documents.
  3. Hybrid XML documents: These documents may have parts that contain structured data and other parts that are unstructured.

Well-Formed XML:
• It should start with an XML declaration to indicate the version of XML being used as well as any other relevant attributes (the declaration is mandatory in XML 1.1, and optional but recommended in XML 1.0).
• It must follow the syntactic guidelines of the tree model.
  § This means that there should be a single root element, and every element must include a matching pair of start tag and end tag within the start and end tags of the parent element.
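The single-root and matching-tag conditions can be checked with a non-validating parser. This is a minimal sketch using Python's standard library; the three sample documents and the helper name parse_error are hypothetical.

```python
import xml.etree.ElementTree as ET

def parse_error(text):
    """Return None if text is well-formed, else the parser's error message."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as exc:
        return str(exc)

# Hypothetical documents, each starting with an XML declaration.
one_root = '<?xml version="1.0"?><projects><project/></projects>'
two_roots = '<?xml version="1.0"?><project/><project/>'
unclosed_child = '<?xml version="1.0"?><projects><project></projects>'

for label, text in [("one root", one_root),
                    ("two roots", two_roots),
                    ("unclosed child", unclosed_child)]:
    error = parse_error(text)
    print(label, "->", "well-formed" if error is None else error)
```

The second document fails because it has two top-level elements, and the third because <project> is not closed inside its parent.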

Valid XML:
• A stronger criterion is for an XML document to be valid.
• In this case, the document must be well-formed, and in addition the element names used in the start and end tag pairs must follow the structure specified in a separate XML DTD (Document Type Definition) file or XML schema file.

XML Schemas
• “Schemas” is a general term; DTDs are one form of XML schema.
• According to the dictionary, a schema is “a structured framework or plan”.
• When we say “XML Schemas,” we usually mean the W3C XML Schema Language.
  Ø This is also known as the “XML Schema Definition” language, or XSD.
  Ø I’ll use “XSD” frequently, because it’s short.
• DTDs, XML Schemas, and RELAX NG are all XML schema languages.

• XSD is a flexible and powerful schema language
• Syntax is XML itself
• Variety of data types and the ability to extend the type system
• Variety of data “facets” and “patterns” to impose domain constraints
• Can define advanced constraints such as “primary key” and “referential integrity”
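To give a feel for how a “pattern” facet constrains a domain, the sketch below reads a pattern out of a small XSD fragment and applies it with Python's re module. Several things here are assumptions: the rollNumber type is hypothetical, and Python's regex dialect is not the XSD one, so this only works for simple patterns like the one shown. A real XSD validator should be used in practice.

```python
import re
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical simple type: a roll number made of exactly six digits.
xsd = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="rollNumber">
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9]{6}"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
"""

schema = ET.fromstring(xsd)
pattern = schema.find(".//xs:pattern", {"xs": XS}).get("value")

def satisfies_pattern(value):
    """Mimic the pattern facet: the pattern must match the whole value."""
    return re.fullmatch(pattern, value) is not None

for candidate in ["521001", "52100", "52100a"]:
    print(candidate, satisfies_pattern(candidate))
```

Because the schema is itself XML, it can be read with the same parser used for ordinary documents.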

• The <schema> element may have attributes:
  § xmlns:xs="http://www.w3.org/2001/XMLSchema"
    • This is necessary to specify where all our XSD tags are defined.
  § elementFormDefault="qualified"
    • This means that all XML elements must be qualified (use a namespace).
    • It is highly desirable to qualify all elements, or problems will arise when another schema is added.
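A small schema using both attributes might look like the one embedded below. The address element and its children are illustrative assumptions; the Python code only reads the schema as an XML document and does not perform XSD validation.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema: declares an <address> element with three string children.
xsd = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified">
  <xs:element name="address">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="company" type="xs:string"/>
        <xs:element name="phone" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

root = ET.fromstring(xsd)

# The xs: prefix is expanded to the namespace URI given by xmlns:xs.
print(root.tag)
print(root.get("elementFormDefault"))

# List every element declared in the schema, in document order.
declared = [el.get("name") for el in root.iter(f"{{{XS}}}element")]
print(declared)
```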

XML DTD and XML Schema
Document Type Definitions (DTDs):
Ø A Document Type Definition is a way to specify the structure of XML documents.
Ø A DTD adds syntactical requirements in addition to the well-formedness requirement.
Ø DTDs help in
  • Eliminating errors when creating or editing XML documents
  • Simplifying the processing of XML documents
Ø A DTD uses a “regular expression”-like syntax to specify a grammar for the XML document.
Ø Syntax: <!DOCTYPE element DTD identifier [ declaration1 declaration2 ... ]>

Parsers
• An XML parser is a program that reads the content of an XML document and exposes it through an API.
  § Currently popular APIs are DOM (Document Object Model) and SAX (Simple API for XML).
• A validating parser is an XML parser that compares the XML document to a DTD and reports any errors.
  § Most browsers don’t use validating parsers.
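The two API styles can be compared side by side with Python's standard library, which ships a DOM implementation (xml.dom.minidom) and a SAX one (xml.sax). Both are non-validating. The sample document and the TagCollector class are made up for this sketch.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document.
doc = "<address><name>Tanmay Patil</name><company>ABC</company></address>"

# DOM: the whole document is loaded into a tree that can then be navigated.
dom = minidom.parseString(doc)
dom_name = dom.getElementsByTagName("name")[0].firstChild.data

# SAX: the parser streams through the document and fires callbacks.
class TagCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        # Called once for every start tag, in document order.
        self.tags.append(name)

handler = TagCollector()
xml.sax.parseString(doc.encode("utf-8"), handler)

print(dom_name)
print(handler.tags)
```

DOM is convenient when the tree must be revisited or modified; SAX keeps memory use low because it never builds the tree.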

A DTD Example
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE address [
  <!ELEMENT address (name, company, phone)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT company (#PCDATA)>
  <!ELEMENT phone (#PCDATA)>
]>
<address>
  <name>Tanmay Patil</name>
  <company>ABC</company>
  <phone>(011) 123-4567</phone>
</address>
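Python's standard parsers do not validate against a DTD, so the sketch below checks by hand the one rule that the declaration <!ELEMENT address (name, company, phone)> expresses: the children of <address> must appear in exactly that order. The helper follows_address_model and the wrong_order document are hypothetical; a real validating parser would check the whole DTD.

```python
import xml.etree.ElementTree as ET

document = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE address [
  <!ELEMENT address (name, company, phone)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT company (#PCDATA)>
  <!ELEMENT phone (#PCDATA)>
]>
<address>
  <name>Tanmay Patil</name>
  <company>ABC</company>
  <phone>(011) 123-4567</phone>
</address>
"""

# Child order taken from <!ELEMENT address (name, company, phone)>.
EXPECTED_CHILDREN = ["name", "company", "phone"]

def follows_address_model(xml_bytes):
    """Hand-written stand-in for one check a validating parser would do."""
    root = ET.fromstring(xml_bytes)
    children = [child.tag for child in root]
    return root.tag == "address" and children == EXPECTED_CHILDREN

# Well-formed, but the children are in the wrong order for the DTD.
wrong_order = (b"<address><company>ABC</company><name>Tanmay Patil</name>"
               b"<phone>(011) 123-4567</phone></address>")

print(follows_address_model(document))
print(follows_address_model(wrong_order))
```

The second document still parses without error, which shows that a well-formed document is not necessarily a valid one.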

XML Namespace
• An XML namespace is a mechanism to avoid name conflicts by differentiating elements or attributes within an XML document that may have identical names but different definitions.
• A namespace is a mapping between an element prefix and a URI.
  § “cars” is the prefix in this example: <cars:part xmlns:cars="URI">
• The URI is not a pointer to information about the namespace; it is just a unique identifier. XML parsers do not resolve namespace URIs.
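A short sketch of how a parser tells apart two elements with the same local name follows. The inventory document, the bikes prefix, and both example.com URIs are invented for illustration; the URIs are never fetched.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two <part> elements with different meanings.
doc = """
<inventory xmlns:cars="http://example.com/cars"
           xmlns:bikes="http://example.com/bikes">
  <cars:part>brake disc</cars:part>
  <bikes:part>chain</bikes:part>
</inventory>
"""

root = ET.fromstring(doc)

# The parser replaces each prefix with its URI, so the names no longer clash.
tags = [child.tag for child in root]
print(tags)

# The prefix used in a query is arbitrary; only the URI has to match.
ns = {"c": "http://example.com/cars"}
car_parts = [part.text for part in root.findall("c:part", ns)]
print(car_parts)
```

Querying with the prefix c instead of cars still finds the element, which shows that the prefix is only shorthand for the URI.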


Thank You