Processing XML with Java
Representation and Management of Data on the Internet

XML
• XML is the eXtensible Markup Language
• It is a metalanguage:
  – A language used to describe other languages, using “markup” tags that describe properties of the data
• Designed to be structured
  – Strict rules about how data can be formatted
• Designed to be extensible
  – You can define your own terms and markup

XML Family
• XML is an official recommendation of the W3C
• Aims to accomplish what HTML cannot, and to be simpler to use and implement than SGML
[Figure: family diagram relating SGML, HTML, XML and XHTML]

The Essence of XML
• Syntax: the permitted arrangement or structure of letters and words in a language, as defined by a grammar (XML)
• Semantics: the meaning of letters or words in a language
• XML uses syntax to add semantics to documents

Using XML
• In XML, the content is separated from the display
• XML can be used for:
  – Data representation
  – Data exchange

Databases and XML
• Database content can be presented in XML
  – An XML processor can access the DBMS or file system and convert data to XML
  – A Web server can serve content as either XML or HTML

HTML vs. XML
• HTML: <B><I>improper nesting</B></I>
  XML: <B><I>proper nesting</I></B>
• HTML: allows start tags without end tags, like <BR>
  XML: empty tags must have a trailing slash, as in <BR/>
• HTML: <font color=blue>unquoted attribute values</font>
  XML: <font color="blue">quoted attribute values</font>
• HTML: <B>HTML is case insensitive</b>
  XML: <b>XML is case sensitive</b>
• HTML: whitespace is ignored
  XML: whitespace is important
• HTML: begins with <html>
  XML: begins with <?xml version='1.0'?>

HTML vs. XML (2)
• HTML: well-defined set of tags
  XML: can use any tag you like
• HTML: tags have a known meaning
  XML: tags have no known meaning

Some Things in Common
• Comments are allowed: <!-- … -->
• Special characters must be escaped (e.g., &gt; for >)

Processing XML – The Idea

Sample Document
    <transaction>
      <account>89-344</account>
      <buy shares="100">
        <ticker exch="NASDAQ">WEBM</ticker>
      </buy>
      <sell shares="30">
        <ticker exch="NYSE">GE</ticker>
      </sell>
    </transaction>

DOM Parser
• DOM = Document Object Model
• The parser creates a tree object out of the document
• The user accesses data by traversing the tree
• The API allows for constructing, accessing and manipulating the structure and content of XML documents

Document as Tree
[Figure: the sample document drawn as a tree. The root transaction has the children account (text 89-344), buy (attribute shares = 100) and sell (attribute shares = 30). buy and sell each have a ticker child: one with exch = NASDAQ and text WEBM, the other with exch = NYSE and text GE. The tree is navigated with methods like getRoot, getChildren, getAttributes, etc.]

Advantages and Disadvantages (DOM)
• Advantages:
  – Natural and relatively easy to use
  – Can repeatedly traverse the tree
• Disadvantages:
  – High memory requirements: the whole document is kept in memory
  – Must parse the whole document before use

SAX Parser
• SAX = Simple API for XML
• The parser creates “events” while traversing the document
• The parser calls methods (that you write) to deal with the events
• Similar to an I/O stream: it goes in one direction

Document as Events
[Figure: the sample document annotated with the events the parser reports as it reads, for example: Start tag: transaction; Start tag: account; Text: 89-344; End tag: account; Start tag: buy; Attribute: shares, Value: 100]

Advantages and Disadvantages (SAX)
• Advantages:
  – Requires little memory
  – Fast
• Disadvantages:
  – Cannot reread
  – Less natural for object-oriented programmers (perhaps)

Which should we use? DOM vs. SAX
• If your document is very large and you only need a few elements, use SAX
• If you need to manipulate (i.e., change) the XML, use DOM
• If you need to access the XML many times, use DOM

XML Parsers

XML Parsers
• There are several different ways to categorise parsers:
  – Validating versus non-validating parsers
  – DOM parsers versus SAX parsers
  – Parsers written in a particular language (Java, C++, Perl, etc.)

Validating Parsers
• A validating parser makes sure that the document conforms to the specified DTD
• This is time consuming, so a non-validating parser is faster

Using an XML Parser
• Three basic steps
  – Create a parser object
  – Pass the XML document to the parser
  – Process the results
• Generally, writing out XML is not in the scope of parsers (though some may implement proprietary mechanisms)
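
The three steps can be sketched as follows. This is a minimal illustration that assumes the JAXP classes bundled with the JDK (javax.xml.parsers) rather than the Xerces classes used later in these slides; the class name TickerLister and the inline copy of the sample document are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of the original slides.
class TickerLister {
    static final String SAMPLE =
        "<transaction><account>89-344</account>"
        + "<buy shares=\"100\"><ticker exch=\"NASDAQ\">WEBM</ticker></buy>"
        + "<sell shares=\"30\"><ticker exch=\"NYSE\">GE</ticker></sell>"
        + "</transaction>";

    // Returns the exch attribute of every <ticker> element, in document order.
    static List<String> listExchanges(String xml) throws Exception {
        // Step 1: create a parser object.
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();

        final List<String> exchanges = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("ticker".equals(qName)) {
                    exchanges.add(atts.getValue("exch"));
                }
            }
        };

        // Step 2: pass the XML document to the parser.
        parser.parse(new InputSource(new StringReader(xml)), handler);

        // Step 3: process the results.
        return exchanges;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(listExchanges(SAMPLE));
    }
}
```

With a DOM parser the same three steps apply; only step 3 changes, from collecting values in callbacks to walking the returned tree.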

SAX – Simple API for XML

The SAX Parser
• The SAX parser is an event-driven API
  – An XML document is sent to the SAX parser
  – The XML file is read sequentially
  – The parser notifies the class when events happen, including errors
  – The events are handled by callback methods that the programmer implements

[Figure: the SAX classes and interfaces, each labelled with its role]
• Used to create a SAX parser
• Handles document events: start tag, end tag, etc.
• Handles parser errors
• Handles DTDs and entities

Problem
• The SAX interface is an accepted standard
• There are many implementations
• We would like to be able to change the implementation used without changing any code in the program
• How is this done?

Factory Design Pattern
• Have a “Factory” class that creates the actual parsers
• The Factory checks the value of a system property that states which implementation should be used
• In order to change the implementation, simply change the system property
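
The idea can be sketched independently of SAX. In the example below every name (Greeter, GreeterFactory, the example.greeter property) is hypothetical and chosen only to show the mechanism: the factory reads a class name from a system property and instantiates it reflectively, so the calling code never names a concrete implementation.

```java
// Hypothetical names throughout; this is not the SAX factory itself.
class GreeterFactory {
    static final String PROPERTY = "example.greeter";

    // Reads the implementation class name from a system property and
    // instantiates it, falling back to PoliteGreeter when the property is unset.
    static Greeter createGreeter() throws ReflectiveOperationException {
        String className =
            System.getProperty(PROPERTY, PoliteGreeter.class.getName());
        return (Greeter) Class.forName(className)
                              .getDeclaredConstructor()
                              .newInstance();
    }

    public static void main(String[] args) throws Exception {
        // Switching implementation only requires changing the property.
        System.setProperty(PROPERTY, CasualGreeter.class.getName());
        System.out.println(createGreeter().greet("Ada"));
    }
}

interface Greeter {
    String greet(String name);
}

class PoliteGreeter implements Greeter {
    public String greet(String name) { return "Good day, " + name; }
}

class CasualGreeter implements Greeter {
    public String greet(String name) { return "Hi " + name; }
}
```

The property could equally be supplied on the command line with -Dexample.greeter=..., which is what makes the choice of implementation a deployment decision rather than a code change.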

Creating a SAX Parser
• Import the following packages:
  – org.xml.sax.*;
  – org.xml.sax.helpers.*;
• Set the following system property:
  – System.setProperty("org.xml.sax.driver", "org.apache.xerces.parsers.SAXParser");
• Create the instance from the Factory:
  – XMLReader reader = XMLReaderFactory.createXMLReader();

Receiving Parsing Information
• A SAX parser calls methods such as “startDocument”, “startElement”, etc., as it runs
• In order to react to such events we must:
  – implement the ContentHandler interface
  – set the parser’s content handler to an instance of our class

ContentHandler
    // Methods (partial list)
    public void startDocument();
    public void endDocument();
    public void characters(char[] ch, int start, int length);
    public void startElement(String namespaceURI, String localName,
                             String qName, Attributes atts);
    public void endElement(String namespaceURI, String localName,
                           String qName);

Namespaces and Element Names
    <?xml version='1.0' encoding='utf-8'?>
    <forsale date="12/2/03" xmlns:xhtml="urn:http://www.w3.org/1999/xhtml">
      <book>
        <title> <xhtml:em> DBI: </xhtml:em> The Course I Wish I never Took </title>
        <comment> My <xhtml:b> favorite </xhtml:b> book! </comment>
      </book>
    </forsale>

Namespaces and Element Names
[Figure: the same forsale document, annotated with the names reported for two of its elements]
• For <book>: namespaceURI = "", localName = book, qName = book
• For <xhtml:em>: namespaceURI = urn:http://www.w3.org/1999/xhtml, localName = em, qName = xhtml:em
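
The three names can be observed directly. The sketch below assumes the JDK's bundled JAXP parser with namespace awareness switched on; the class name NamespaceNames and the shortened document are illustrative. Note that SAX allows a parser to leave qName empty in some configurations, so treat the qName values as what this particular setup is expected to report.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of the original slides.
class NamespaceNames {
    static final String DOC =
        "<forsale xmlns:xhtml='urn:http://www.w3.org/1999/xhtml'>"
        + "<book><title><xhtml:em>DBI:</xhtml:em> The Course</title></book>"
        + "</forsale>";

    // Returns one line per start tag with the three names SAX reports.
    static List<String> describeElements(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
        final List<String> lines = new ArrayList<>();
        factory.newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    lines.add("uri=" + uri + " local=" + localName
                              + " qName=" + qName);
                }
            });
        return lines;
    }

    public static void main(String[] args) throws Exception {
        for (String line : describeElements(DOC)) {
            System.out.println(line);
        }
    }
}
```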

Receiving Parsing Information (cont.)
• An easy way to implement the ContentHandler interface is to extend DefaultHandler, which implements this interface (and a few others) in an empty fashion
• To actually parse a document, create an InputSource from the document and supply the input source to the parse method of the XMLReader

    import java.io.*;
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    public class InfoWithSax extends DefaultHandler {

        public static void main(String[] args) {
            // Select the Xerces SAX implementation
            System.setProperty("org.xml.sax.driver",
                               "org.apache.xerces.parsers.SAXParser");
            try {
                XMLReader reader = XMLReaderFactory.createXMLReader();
                reader.setContentHandler(new InfoWithSax());
                reader.parse(new InputSource(new FileReader(args[0])));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        // (the class continues on the following slides)

        public void startDocument() throws SAXException {
            System.out.println("START DOCUMENT");
        }

        public void endDocument() throws SAXException {
            System.out.println("END DOCUMENT");
        }

        int depth;              // current nesting depth
        String indent = "  ";   // printed once per nesting level

        private void println(String header, String value) {
            for (int i = 0; i < depth; i++)
                System.out.print(indent);
            System.out.println(header + ": " + value);
        }

        public void characters(char buf[], int offset, int len)
                throws SAXException {
            // Skip text that consists of whitespace only
            String s = (new String(buf, offset, len)).trim();
            if (!"".equals(s))
                println("CHARACTERS", s);
        }

        public void endElement(String namespaceURI, String localName,
                               String name) throws SAXException {
            depth--;
            String elementName = name;
            if (!"".equals(namespaceURI) && !"".equals(localName))
                elementName = namespaceURI + ":" + localName;
            println("END ELEMENT", elementName);
        }

        public void startElement(String namespaceURI, String localName,
                                 String name, Attributes attrs)
                throws SAXException {
            String elementName = name;
            if (!"".equals(namespaceURI) && !"".equals(localName))
                elementName = namespaceURI + ":" + localName;
            println("START ELEMENT", elementName);
            if (attrs != null && attrs.getLength() > 0) {
                for (int i = 0; i < attrs.getLength(); i++)
                    println("ATTRIBUTE",
                            attrs.getLocalName(i) + "=" + attrs.getValue(i));
            }
            depth++;
        }
    } // end of class InfoWithSax

Bachelor Tags
• What do you think happens when the parser parses a bachelor tag?
    <rating stars="five" />
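
One way to find out is to log the events. The sketch below assumes the JDK's bundled JAXP parser; the class name BachelorTagEvents is made up. SAX reports an empty element as a start-element event followed immediately by an end-element event, so the handler cannot tell <rating/> from <rating></rating>.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of the original slides.
class BachelorTagEvents {
    // Returns a log of the element events reported while parsing xml.
    static List<String> events(String xml) throws Exception {
        final List<String> log = new ArrayList<>();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    log.add("start " + qName + " stars=" + atts.getValue("stars"));
                }

                @Override
                public void endElement(String uri, String localName,
                                       String qName) {
                    log.add("end " + qName);
                }
            });
        return log;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(events("<rating stars=\"five\" />"));
    }
}
```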

Attributes Interface
• Elements may have attributes
• There is no distinction between attributes that are specified explicitly and those that are defined in the DTD (with a default value)

Attributes Interface (cont.)
• int getLength();
• String getQName(int i);
• String getType(int i);
• String getValue(int i);
• String getType(String qname);
• String getValue(String qname);
• etc.

Attribute Types
• The following are possible types for attributes:
  – "CDATA",
  – "ID", "IDREF", "IDREFS",
  – "NMTOKEN", "NMTOKENS",
  – "ENTITY", "ENTITIES",
  – "NOTATION"

Setting Features
• It is possible to set the features of a parser using the setFeature method.
• Examples:
  – reader.setFeature("http://xml.org/sax/features/namespaces", true)
  – reader.setFeature("http://xml.org/sax/features/validation", false)
• For a full list, see: http://www.saxproject.org/?selected=get-set
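
A small sketch of setting and reading back the two features above. It assumes an XMLReader obtained through the JDK's JAXP factory rather than through XMLReaderFactory; the class name FeatureDemo is hypothetical. A parser that does not know or support a feature signals this with SAXNotRecognizedException or SAXNotSupportedException, both subclasses of SAXException.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical example class; not part of the original slides.
class FeatureDemo {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String VALIDATION = "http://xml.org/sax/features/validation";

    // Creates a reader with namespace processing on and validation off.
    static XMLReader configuredReader()
            throws ParserConfigurationException, SAXException {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setFeature(NAMESPACES, true);
        reader.setFeature(VALIDATION, false);
        return reader;
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = configuredReader();
        System.out.println("namespaces: " + reader.getFeature(NAMESPACES));
        System.out.println("validation: " + reader.getFeature(VALIDATION));
    }
}
```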

ErrorHandler Interface
• We implement ErrorHandler to receive error events (similar to implementing ContentHandler)
• DefaultHandler implements ErrorHandler in an empty fashion, so we can extend it (as before)
• An ErrorHandler is registered with
  – reader.setErrorHandler(handler);
• Three methods:
  – void error(SAXParseException ex);
  – void fatalError(SAXParseException ex);
  – void warning(SAXParseException ex);

Extending the InfoWithSax Program
    public void warning(SAXParseException err) throws SAXException {
        System.out.println("Warning in line " + err.getLineNumber()
                           + " and column " + err.getColumnNumber());
    }

    public void error(SAXParseException err) throws SAXException {
        System.out.println("Oy va'avoi, an error!");
    }

    public void fatalError(SAXParseException err) throws SAXException {
        System.out.println("OY VA'AVOI, a fatal error!");
    }
• Will these methods be called in the case of a problem?
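
A hedged answer to the question: the methods are only called if the handler has been registered with setErrorHandler, which the main method shown earlier does not do. The sketch below uses a hypothetical ErrorCountingHandler class and the JDK's JAXP parser to show the registration and to count the callbacks for a document that is not well-formed. It also assumes, as is typical, that the parser stops with an exception after reporting a fatal error.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of the original slides.
class ErrorCountingHandler extends DefaultHandler {
    int warnings, errors, fatalErrors;

    @Override public void warning(SAXParseException e) { warnings++; }
    @Override public void error(SAXParseException e) { errors++; }
    @Override public void fatalError(SAXParseException e) { fatalErrors++; }

    // Parses xml with a fresh handler registered for both content and errors.
    static ErrorCountingHandler parse(String xml) throws Exception {
        ErrorCountingHandler handler = new ErrorCountingHandler();
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler); // without this line nothing is counted
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Parsing cannot continue after a fatal error; the counts remain.
        }
        return handler;
    }

    public static void main(String[] args) throws Exception {
        ErrorCountingHandler h = parse("<a><b></a>"); // mismatched tags
        System.out.println("fatal errors reported: " + h.fatalErrors);
    }
}
```

The error method is meant for recoverable problems such as validation failures, so it would normally only fire when validation is switched on.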

Lexical Events
• Lexical events have to do with the way that a document was written and not with its content
• Examples:
  – A comment is a lexical event (<!-- comment -->)
  – The use of an entity is a lexical event (&gt;)
• These can be dealt with by implementing the LexicalHandler interface, which is set on a parser by
  – reader.setProperty("http://xml.org/sax/properties/lexical-handler", mylexicalhandler);

LexicalHandler
    // Methods (partial list)
    public void startEntity(String name);
    public void endEntity(String name);
    public void comment(char[] ch, int start, int length);
    public void startCDATA();
    public void endCDATA();
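
A minimal sketch of receiving comment events. It assumes the JDK's JAXP parser and extends org.xml.sax.ext.DefaultHandler2, which provides empty implementations of LexicalHandler; the class name CommentCollector is made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical example class; not part of the original slides.
class CommentCollector extends DefaultHandler2 {
    final List<String> comments = new ArrayList<>();

    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length));
    }

    // Returns the text of every comment in xml, in document order.
    static List<String> collect(String xml) throws Exception {
        CommentCollector handler = new CommentCollector();
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        // Lexical handlers are registered as a property, not with a setter.
        reader.setProperty(
            "http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.comments;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collect("<a><!-- hi --><b/></a>"));
    }
}
```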

DOM – Document Object Model

Creating a DOM Tree
• How can we create a DOM Tree independently of the implementation chosen?
• Creating a DOM Tree using the Apache Xerces package:
  – Import: org.apache.xerces.parsers.DOMParser;
  – Import: org.w3c.dom.*;
  – Use the following lines of code:
        DOMParser dom = new DOMParser();
        dom.parse(fileName);
        Document doc = dom.getDocument();
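
One possible answer to the first question, offered as a sketch rather than as the approach these slides take: the JAXP DocumentBuilderFactory applies the same factory idea to DOM. The code names no concrete parser class, and the implementation can be swapped through the javax.xml.parsers.DocumentBuilderFactory system property. The class name DomWithJaxp is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class; not part of the original slides.
class DomWithJaxp {
    // Builds a DOM tree without referring to any parser implementation.
    static Document parse(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc =
            parse("<transaction><account>89-344</account></transaction>");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```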

Using a DOM Tree
[Figure: XML File → DOM Parser → DOM Tree → API → Application]

Nodes in a DOM Tree
[Figure as appears in “The XML Companion” by Neil Bradley. Interfaces shown: Node, DocumentFragment, Document, CharacterData, Text, CDATASection, Comment, Attr, Element, DocumentType, Notation, Entity, EntityReference, ProcessingInstruction, NodeList, NamedNodeMap]

DOM Tree
[Figure: an example DOM tree whose nodes are labelled Document, Document Type, Element, Attribute, Text, Comment and Entity Reference]

Normalizing a Tree
• Normalizing a DOM Tree has two effects:
  – Combine adjacent textual nodes
  – Eliminate empty textual nodes
• To normalize, apply the normalize() method to the document element
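
Both effects can be seen on a small hand-built element. The sketch assumes the JDK's JAXP DocumentBuilderFactory; the class name NormalizeDemo and the text content are illustrative.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class; not part of the original slides.
class NormalizeDemo {
    // Builds an element whose text is split over three text nodes,
    // one of which is empty.
    static Element buildFragmented() throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance()
                                             .newDocumentBuilder()
                                             .newDocument();
        Element comment = doc.createElement("comment");
        doc.appendChild(comment);
        comment.appendChild(doc.createTextNode("My "));
        comment.appendChild(doc.createTextNode(""));
        comment.appendChild(doc.createTextNode("favorite book!"));
        return comment;
    }

    public static void main(String[] args) throws Exception {
        Element comment = buildFragmented();
        System.out.println("before: " + comment.getChildNodes().getLength());
        comment.normalize(); // merges adjacent text nodes, drops empty ones
        System.out.println("after:  " + comment.getChildNodes().getLength());
    }
}
```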

Node Methods
• Three categories of methods
  – Node characteristics: name, type, value
  – Contextual location and access to relatives: parents, siblings, children, ancestors, descendants
  – Node modification: edit, delete, re-arrange child nodes

Node Methods (2)
• short getNodeType();
• String getNodeName();
• String getNodeValue() throws DOMException;
• void setNodeValue(String value) throws DOMException;
• boolean hasChildNodes();
• NamedNodeMap getAttributes();
• Document getOwnerDocument();

Node Types - getNodeType()
    ELEMENT_NODE = 1              PROCESSING_INSTRUCTION_NODE = 7
    ATTRIBUTE_NODE = 2            COMMENT_NODE = 8
    TEXT_NODE = 3                 DOCUMENT_NODE = 9
    CDATA_SECTION_NODE = 4        DOCUMENT_TYPE_NODE = 10
    ENTITY_REFERENCE_NODE = 5     DOCUMENT_FRAGMENT_NODE = 11
    ENTITY_NODE = 6               NOTATION_NODE = 12

    if (myNode.getNodeType() == Node.ELEMENT_NODE) {
        // process node
        …
    }


Node Navigation
• Every node has a specific location in the tree
• The Node interface specifies methods to find surrounding nodes
  – Node getFirstChild();
  – Node getLastChild();
  – Node getNextSibling();
  – Node getPreviousSibling();
  – Node getParentNode();
  – NodeList getChildNodes();
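
A short sketch of walking siblings with getFirstChild and getNextSibling. It assumes the JDK's JAXP DocumentBuilderFactory; the class name ChildElementNames and the trimmed-down sample document are illustrative. Whitespace between tags shows up as text nodes, so the loop filters on the node type.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical example class; not part of the original slides.
class ChildElementNames {
    // Returns the names of the element children of parent, in order.
    static List<String> childElementNames(Node parent) {
        List<String> names = new ArrayList<>();
        for (Node child = parent.getFirstChild();
             child != null;
             child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getNodeName());
            }
        }
        return names;
    }

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                                     .newDocumentBuilder()
                                     .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<transaction>\n  <account>89-344</account>\n"
            + "  <buy/>\n  <sell/>\n</transaction>");
        System.out.println(childElementNames(doc.getDocumentElement()));
    }
}
```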

Node Navigation (2)
[Figure as from “The XML Companion” by Neil Bradley: a node and its surroundings, with arrows labelled getPreviousSibling(), getParentNode(), getFirstChild(), getChildNodes(), getLastChild() and getNextSibling()]

    import org.apache.xerces.parsers.DOMParser;
    import org.w3c.dom.*;

    public class InfoWithDom {

        public static void main(String[] args) {
            try {
                DOMParser dom = new DOMParser();
                dom.parse(args[0]);
                Document doc = dom.getDocument();
                new InfoWithDom().echo(doc);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        // (the class continues on the following slides)

        private int depth = 0;
        private final String indent = "  ";

        // Indexed by the value of getNodeType(); DOCUMENT (9) must be present
        // so that the later entries line up with their constants
        private String[] NODE_TYPES = {
            "", "ELEMENT", "ATTRIBUTE", "TEXT", "CDATA", "ENTITY_REF",
            "ENTITY", "PROCESSING_INST", "COMMENT", "DOCUMENT",
            "DOCUMENT_TYPE", "DOCUMENT_FRAG", "NOTATION"};

        private void outputIndentation() {
            for (int i = 0; i < depth; i++)
                System.out.print(indent);
        }

        private void printlnCommon(Node n) {
            System.out.print(NODE_TYPES[n.getNodeType()] + ": ");
            System.out.print(" nodeName=" + n.getNodeName());
            String val;
            if ((val = n.getNamespaceURI()) != null)
                System.out.print(" uri=" + val);
            if ((val = n.getPrefix()) != null)
                System.out.print(" pre=" + val);
            if ((val = n.getLocalName()) != null)
                System.out.print(" local=" + val);
            if ((val = n.getNodeValue()) != null && !val.trim().equals(""))
                System.out.print(" nodeValue=" + val);
            System.out.println();
        }

        private void echo(Node n) {
            outputIndentation();
            printlnCommon(n);
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                // Attributes are not children, so print them separately
                NamedNodeMap atts = n.getAttributes();
                depth += 2;
                for (int i = 0; i < atts.getLength(); i++)
                    echo(atts.item(i));
                depth -= 2;
            }
            depth++;
            for (Node child = n.getFirstChild(); child != null;
                 child = child.getNextSibling())
                echo(child);
            depth--;
        }
    } // end of class InfoWithDom

Node Manipulation
• Children of a node in a DOM tree can be manipulated: added, edited, deleted, moved, copied, etc.
    Node removeChild(Node oldChild) throws DOMException;
    Node insertBefore(Node newChild, Node refChild) throws DOMException;
    Node appendChild(Node newChild) throws DOMException;
    Node replaceChild(Node newChild, Node oldChild) throws DOMException;
    Node cloneNode(boolean deep);
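
The methods above can be exercised on a small hand-built tree. The sketch assumes the JDK's JAXP DocumentBuilderFactory; the class name NodeEditing and the element names hold and account are chosen only for illustration, and the comments track the child list after each call.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical example class; not part of the original slides.
class NodeEditing {
    // Joins the names of the children of parent with commas.
    static String childNames(Node parent) {
        List<String> names = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            names.add(c.getNodeName());
        }
        return String.join(",", names);
    }

    static Element editedTransaction() throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance()
                                             .newDocumentBuilder()
                                             .newDocument();
        Element root = doc.createElement("transaction");
        doc.appendChild(root);

        Element buy = doc.createElement("buy");
        Element sell = doc.createElement("sell");
        root.appendChild(buy);                     // buy
        root.appendChild(sell);                    // buy, sell

        Element account = doc.createElement("account");
        root.insertBefore(account, buy);           // account, buy, sell

        Element hold = doc.createElement("hold");
        root.replaceChild(hold, sell);             // account, buy, hold
        root.removeChild(buy);                     // account, hold
        root.appendChild(hold.cloneNode(true));    // account, hold, hold
        return root;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(childNames(editedTransaction()));
    }
}
```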

Node Manipulation (2)
[Figure as appears in “The XML Companion” by Neil Bradley: insertBefore places New before Ref; replaceChild puts New in place of Old; cloneNode makes a shallow copy with 'false' and a deep copy with 'true']

Other Interfaces
• We have discussed methods of the Node interface
• Each of the "specific types of nodes" has additional methods
• See the API for details

Note about DOM Objects
• A DOM object is, in effect, compiled XML
• Can save time and effort if we send and receive DOM objects instead of XML source
  – Saves having to parse XML files into DOM at sender and receiver
  – But, a DOM object may be larger than the XML source