XML Transformations and Content-based Crawling Zachary G. Ives University of Pennsylvania CIS 455 / 555 – Internet and Web Systems
Reminders
§ Homework 2 “release” version is now on the Web site
  § Simple web crawling
  § XPath
  § XSLT
  § Storage (Berkeley DB)
§ Milestone 1 due March 1
§ Milestone 2 due March 8
More than XPath
§ XPath identifies or extracts subtrees from an XML document
§ … But there are lots of cases where we want to convert to or from XML, or something else
  § XML → text (document extraction)
  § XML → HTML
  § XML → SVG
  § etc.
§ Here we need something more – often XSLT
A Functional Language for XML
§ XSLT is based on a series of templates that match different parts of an XML document
§ There’s a policy for which rule or template is applied if more than one matches (it’s not what you’d think!)
§ XSLT templates can invoke other templates
§ XSLT templates can be non-terminating (beware!)
§ XSLT templates are based on XPath “match”es, and we can also apply other templates (potentially to “select”ed XPaths)
§ Within each template, we directly describe what should be output
An XSLT Template
§ An XML document itself
§ XML tags either create output or are XSL operations
§ All XSL tags are prefixed with the “xsl” namespace
§ All non-XSL tags are part of the XML output
§ Common XSL operations:
  § template with a match XPath
  § Recursive call to apply-templates, which may also select where it should be applied
§ Attach to an XML document with a processing instruction:
  <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="http://www.com/my.xsl"?>
An Example XSLT Stylesheet
<xsl:stylesheet version="1.1">
  <xsl:template match="/dblp">
    <html><head>This is DBLP</head>
    <body>
      <xsl:apply-templates/>
    </body>
    </html>
  </xsl:template>
  <xsl:template match="article">
    <h2><xsl:apply-templates select="title"/></h2>
    <p><xsl:apply-templates select="author"/></p>
  </xsl:template>
  …
</xsl:stylesheet>
XML Data
§ [Tree diagram of a sample DBLP document: the root carries an xml processing instruction and a dblp element; dblp contains an article (attributes key and mdate; element children such as editor, title, journal, volume, year, ee) and a mastersthesis (attributes key=ms/Brown92 and mdate; children author, title, year 1997, school, ee), with element, attribute, and text nodes distinguished]
XSLT Processing Model
§ List of source nodes → result tree fragment(s)
§ Start with root
§ Find all template rules with matching patterns from root
  § Find “best” match according to some heuristics
  § Set the current node list to be the set of things it matches
§ Iterate over each node in the current node list
  § Apply the operations of the template
  § “Append” the results of the matching template rule to the result tree structure
  § Repeat recursively if told to by apply-templates
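The recursive flow above can be sketched in Python. This is a toy illustration of the processing model, not a real XSLT engine; the template functions and the sample DBLP snippet are hypothetical stand-ins:

```python
import xml.etree.ElementTree as ET

def apply_templates(node, templates):
    """Find the template matching `node` (here: by tag name) and run it."""
    rule = templates.get(node.tag, default_template)
    return rule(node, templates)

def default_template(node, templates):
    # XSLT's built-in rule, roughly: recurse into children and
    # concatenate their output.
    return "".join(apply_templates(c, templates) for c in node)

def article_template(node, templates):
    # Hypothetical template for <article>: emit the title as an <h2>.
    title = node.findtext("title", "")
    return "<h2>%s</h2>" % title

doc = ET.fromstring("<dblp><article><title>XML 101</title></article></dblp>")
html = apply_templates(doc, {"article": article_template})
# html == "<h2>XML 101</h2>"
```

The key point the sketch shows: templates do not walk the tree themselves; each one produces output and decides whether to hand the children back to the dispatcher.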
What If There’s More than One Match?
§ Eliminate rules of lower precedence due to importing
§ Break a rule into any | branches and consider separately
§ Choose the rule with the highest computed or specified priority
§ Simple rules for computing priority based on “precision”:
  § QName preceded by XPath child/axis specifier: priority 0
  § NCName preceded by child/axis specifier: priority -0.25
  § NodeTest preceded by child/axis specifier: priority -0.5
  § else priority 0.5
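The priority rules above can be written down directly. This is a sketch that assumes the pattern is a single step; real XSLT patterns can be richer:

```python
def default_priority(pattern):
    """Approximate XSLT default priorities for a one-step pattern."""
    step = pattern.split("::")[-1]          # drop an explicit axis like child::
    if step in ("*", "node()", "text()", "comment()"):
        return -0.5                          # bare node test
    if step.endswith(":*"):
        return -0.25                         # NCName namespace wildcard, e.g. svg:*
    if step.replace(":", "").replace("-", "").isalnum() \
            and "/" not in pattern and "[" not in pattern:
        return 0.0                           # plain QName
    return 0.5                               # anything more specific

# More precise patterns get higher priority:
assert default_priority("article[@key]") > default_priority("child::article")
assert default_priority("child::article") > default_priority("svg:*")
assert default_priority("svg:*") > default_priority("text()")
```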
Other Common Operations
§ Iteration:
  <xsl:for-each select="path">
  </xsl:for-each>
§ Conditionals:
  <xsl:if test="./text() < 'abc'">
  </xsl:if>
§ Copying the current node and children to the result set:
  <xsl:copy>
    <xsl:apply-templates/>
  </xsl:copy>
Creating Output Nodes
§ Return text/attribute data (this is a default rule):
  <xsl:template match="text()|@*">
    <xsl:value-of select="."/>
  </xsl:template>
§ Create an element from text (attribute is similar):
  <xsl:element name="text()">
    <xsl:apply-templates/>
  </xsl:element>
§ Copy nodes matching a path:
  <xsl:copy-of select="*"/>
Embedding Stylesheets
§ You can “import” or “include” one stylesheet from another:
  <xsl:import href="http://www.com/my.xsl"/>
  <xsl:include href="http://www.com/my.xsl"/>
§ “Include”: the rules get the same precedence as in the including template
§ “Import”: the rules are given lower precedence
XSLT Summary
§ A very powerful, template-based transformation language: XML document → other structured document
§ Commonly used to convert XML → PDF, SVG, GraphViz DOT format, HTML, WML, …
§ Primarily useful for presentation of XML or for very simple conversions
§ What if we want to:
  § Manage and combine collections of XML documents?
  § Make Web service requests for XML?
  § “Glue together” different Web service requests?
  § Query for keywords within documents, with ranked answers?
§ This is where XQuery plays a role – see CIS 330 / 550 for details
Now… How Do We Crawl the Web and Get Data?
§ A few remarks on basic crawlers…
§ … Then an XML-specific crawler
Crawling the Web: The Basic Process
§ Start with some initial page P0
§ Collect all URLs from P0 and add them to the crawler queue
  § Consider the <base href> tag, anchor links, optionally image links, CSS, DTDs, scripts
§ Considerations:
  § What order to traverse (polite to do BFS – why?)
  § How deep to traverse
  § What to ignore (coverage)
  § How to escape “spider traps” and avoid cycles
  § How often to crawl
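The basic loop above can be sketched as a BFS over a URL queue. The `fetch` and `extract_links` callables are hypothetical stand-ins for a real HTTP client and HTML link extractor, and the in-memory "web" at the bottom exists only to exercise the loop:

```python
from collections import deque
from urllib.parse import urljoin, urldefrag

def crawl(start_url, fetch, extract_links, max_depth=2):
    seen = {start_url}                        # avoid cycles
    queue = deque([(start_url, 0)])           # FIFO queue => breadth-first order
    pages = {}
    while queue:
        url, depth = queue.popleft()
        page = fetch(url)
        pages[url] = page
        if depth == max_depth:
            continue                          # bound how deep we traverse
        for link in extract_links(page):
            absolute, _frag = urldefrag(urljoin(url, link))  # resolve + drop #frag
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return pages

# Toy three-page "web": each page is just its list of relative links.
web = {"http://a/": ["/b", "/c"], "http://a/b": ["/"], "http://a/c": []}
result = crawl("http://a/", lambda u: web[u], lambda p: p)
```

BFS matters for politeness: a FIFO queue interleaves URLs from many sites instead of hammering one host with a deep depth-first dive.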
Essential Crawler Etiquette
§ Robot exclusion protocols
§ First, ignore pages with:
  <META NAME="ROBOTS" CONTENT="NOINDEX">
§ Second, look for robots.txt at the root of the web server
  § See http://www.robotstxt.org/wc/robots.html
§ To exclude all robots from a server:
  User-agent: *
  Disallow: /
§ To exclude one robot from two directories:
  User-agent: BobsCrawler
  Disallow: /news/
  Disallow: /tmp/
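Python's standard library can parse robots.txt rules directly, so a crawler need not hand-roll this. Here we feed it the slide's second example; the `example.com` URLs are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Parse the robots.txt rules that exclude BobsCrawler from two directories.
rp = RobotFileParser()
rp.parse("""\
User-agent: BobsCrawler
Disallow: /news/
Disallow: /tmp/
""".splitlines())

blocked = rp.can_fetch("BobsCrawler", "http://example.com/news/today.html")
allowed = rp.can_fetch("BobsCrawler", "http://example.com/index.html")
other   = rp.can_fetch("SomeOtherBot", "http://example.com/news/today.html")
# blocked is False; allowed and other are True (the rules name only BobsCrawler)
```

In a real crawler you would call `rp.set_url(...)` / `rp.read()` to fetch the live robots.txt before crawling a host.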
Suppose We Want to Crawl XML Documents Based on User Interests
§ We need several parts:
  § A list of “interests” – expressed in an executable form, perhaps XPath queries
  § A crawler – goes out and fetches XML content
  § A filter / routing engine – matches XML content against users’ interests, and sends them the content if it matches
XML-Based Information Dissemination
§ Basic model (XFilter, YFilter, Xyleme):
  § Users are interested in data relating to a particular topic, and know the schema, e.g. /politics/usa//body
  § A crawler-aggregator reads XML files from the web (or gets them from data sources) and feeds them to interested parties
Engine for XFilter [Altinel & Franklin 00]
How Does It Work?
§ Each XPath segment is basically a subset of regular expressions over element tags
§ Convert into finite state automata
§ Parse data as it comes in – use the SAX API
§ Match against finite state machines
§ Most of these systems use modified FSMs because they want to match many patterns at the same time
Path Nodes and FSMs
§ The XPath parser decomposes XPath expressions into a set of path nodes
§ These nodes act as the states of the corresponding FSM
§ A node in the Candidate List denotes the current state
§ The rest of the states are in corresponding Wait Lists
§ Simple FSM for /politics[@topic=“president”]/usa//body:
  (start) –politics→ Q1_1 –usa→ Q1_2 –body→ Q1_3
Decomposing Into Path Nodes
Q1 = /politics[@topic=“president”]/usa//body

  Node   Position  RP   Level
  Q1-1   1         0    1
  Q1-2   2         1    2
  Q1-3   3         -1   -1

Q2 = //usa/*/body/p

  Node   Position  RP   Level
  Q2-1   1         -1   -1
  Q2-2   2         2    0
  Q2-3   3         1    0

§ Relative Position (RP) in tree:
  § 0 for the root node if it’s not preceded by “//”
  § -1 for any node preceded by “//”
  § else 1 + (number of “*” nodes from the predecessor node)
§ Level:
  § If the current node has a fixed distance from the root, then 1 + distance
  § Else if RP = -1, then -1, else 0
§ Finally, NextPathNodeSet points to the next path node
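The RP/level rules above can be sketched in code. This is a simplified decomposer, not XFilter's actual parser: the tokenizer handles only child steps, `//`, `*`, and ignores predicates.

```python
import re

def decompose(query_id, xpath):
    """Split an XPath into (node id, name, RP, level) tuples per the slide's rules."""
    # Strip predicates like [@topic='president'], then tokenize into
    # (separator, name) steps, where separator is '/' or '//'.
    steps = re.findall(r"(//|/)(\*|[\w.]+)", re.sub(r"\[[^\]]*\]", "", xpath))
    nodes, wildcards, fixed_depth = [], 0, 0   # fixed_depth = -1 once // is seen
    for sep, name in steps:
        if sep == "//":
            fixed_depth = -1                   # distance from root no longer fixed
        if name == "*":
            wildcards += 1                     # wildcards fold into the next RP
            if fixed_depth >= 0:
                fixed_depth += 1
            continue
        if sep == "//":
            rp = -1
        elif not nodes:
            rp = 0                             # root node, not preceded by //
        else:
            rp = 1 + wildcards                 # 1 + number of * since predecessor
        if fixed_depth >= 0:
            fixed_depth += 1
            level = fixed_depth                # 1 + distance from root
        else:
            level = -1 if rp == -1 else 0
        nodes.append(("%s-%d" % (query_id, len(nodes) + 1), name, rp, level))
        wildcards = 0
    return nodes
```

Running it on the two slide queries reproduces the tables above.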
Query Index

  Tag       CL     WL
  politics  Q1-1   –
  usa       Q2-1   Q1-2
  body      –      Q1-3, Q2-2
  p         –      Q2-3

§ Query index entry for each XML tag
§ Two lists: Candidate List (CL) and Wait List (WL), divided across the nodes
§ “Live” queries’ states are in the CL; “pending” queries + states are in the WL
§ Events that cause state transitions are generated by the XML parser
Encountering an Element
§ Look up the element name in the Query Index and check all nodes in the associated CL
§ Validate that we actually have a match
  startElement: politics
  Entry in Query Index for politics: CL = {Q1-1}, WL = {}
  Q1-1: Query ID = Q1, Position = 1, Rel. Position = 0, Level = 1, NextPathNodeSet
Validating a Match
§ We first check that the current XML depth matches the level in the user query:
  § If the level in the CL node is less than 1, then ignore height
  § else the level in the CL node must = height
§ This ensures we’re matching at the right point in the tree!
§ Finally, we validate any predicates against attributes (e.g., [@topic=“president”])
Processing Further Elements
§ Queries that don’t meet validation are removed from the Candidate Lists
§ For other queries, we advance to the next state
  § We copy the next node of the query from the WL to the CL, and update the RP and level
§ When we reach a final state (e.g., Q1-3), we can output the document to the subscriber
§ When we encounter an end element, we must remove that element from the CL
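The matching loop of the last few slides can be sketched for a single query over SAX events. This is a deliberately simplified toy: real XFilter handles many queries at once through the query index, validates relative positions, and rolls states back on endElement, none of which this sketch does.

```python
import xml.sax

class XFilterLite(xml.sax.ContentHandler):
    """Advance one (name, RP, level) path node per matching startElement."""

    def __init__(self, path_nodes):
        super().__init__()
        self.nodes = path_nodes      # decomposed query: [(name, rp, level), ...]
        self.state = 0               # index of the next path node to match
        self.depth = 0
        self.matched = False

    def startElement(self, name, attrs):
        self.depth += 1
        if self.state < len(self.nodes):
            want, rp, level = self.nodes[self.state]
            # Validate the tag name; for fixed-level nodes (level >= 1),
            # also require the current depth to equal the level.
            if name == want and (level < 1 or level == self.depth):
                self.state += 1
                if self.state == len(self.nodes):
                    self.matched = True   # final state: deliver the document

    def endElement(self, name):
        self.depth -= 1

def matches(xml_text, path_nodes):
    handler = XFilterLite(path_nodes)
    xml.sax.parseString(xml_text.encode(), handler)
    return handler.matched

# Q1 = /politics/usa//body as (name, RP, level) path nodes:
q1 = [("politics", 0, 1), ("usa", 1, 2), ("body", -1, -1)]
hit  = matches("<politics><usa><story><body>x</body></story></usa></politics>", q1)
miss = matches("<politics><europe><body>x</body></europe></politics>", q1)
# hit is True, miss is False
```

Note how the `body` node, with level -1, matches at any depth once `usa` has been seen, mirroring the `//` in the original query.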
Publish-Subscribe Model Summarized
§ Well-suited to an XML format called RSS (Rich Site Summary or Really Simple Syndication)
  § Many news sites, web logs, mailing lists, etc. use RSS to publish daily articles
§ Seems like a perfect fit for publish-subscribe models!