XML Processing in Scala

XML Processing in Scala
- William Narmontas, Dino Fancellu
- www.scala.contractors
- XML LONDON 2014
Dino Fancellu: 35 years IT; Scala • Java • XML
William Narmontas: 10 years IT; Scala • XML • Web
What is Scala?
Scala processes XML fast
It is powerful
Modular, Concise, Functional, Type-safe, Performant, Object-oriented, Strongly-typed, Composable, Java-interoperable, Statically-typed, Unopinionated, First-class XML
Who uses Scala?
    Apple, eBay, LinkedIn, The Guardian, Bank of America, eHarmony, Morgan Stanley, Tom, Barclays, EDF, Netflix, Trafigura, BBC, FourSquare, Novell, Tumblr, BSkyB, Gawker, Rackspace, Twitter, Cisco, HSBC, Sky, UBS, Citigroup, ITV, Sony, VMware, Credit Suisse, Klout, Springer, Xerox
Projects in Scala
- Less code to write = less to maintain
- Communication clearer
- Testing easier
- Software robust
- Time to market: fast
- Happier developers
Scala language: Intro
Values
  Scala:
    val conferenceName = "XML London 2014"
  XQuery:
    let $conferenceName := "XML London 2014"
  Scala (mutable):
    var conferenceName = "XML London 2014"
    conferenceName = "XML London 2015"
Strings
    val language = "Scala"
    s"XML Processing in $language"
    | XML Processing in Scala
    s"""An introduction to:
       |The "$language" programming language""".stripMargin
    | An introduction to:
    | The "Scala" programming language
    s"$language has ${language.length} chars in its name"
    | Scala has 5 chars in its name
Functions
  Scala:
    def fun(x: Int, y: Double) = s"$x: $y"
  XQuery:
    declare function local:fun(
      $x as xs:integer,
      $y as xs:double
    ) as xs:string {
      concat($x, ": ", $y)
    };
Everything is an expression
    val trainSpeed =
      if (train.speed.mph >= 60) "Fast"
      else "Slow"
    def divide(numerator: Int, denominator: Int) =
      try { s"${numerator/denominator}" }
      catch {
        case _: java.lang.ArithmeticException =>
          s"Cannot divide $numerator by $denominator"
      }
Types: Explicit
    def withTitle(name: String, title: String): String =
      s"$title. $name"
    val x: Int = {
      val y = 1000
      100 + y
    }
    | x: Int = 1100
Functions: named parameters
  Further clarity in method calls:
    def makeLink(url: String, text: String) =
      s"""<a href="$url">$text</a>"""
    makeLink(text = "XML London 2014", url = "http://www.xmllondon.com")
    | <a href="http://www.xmllondon.com">XML London 2014</a>
Functions: default parameters
  Reduce repetition in method calls:
    def withTitle(name: String, title: String = "Mr") =
      s"$title. $name"
    withTitle("John Smith")
    | Mr. John Smith
    withTitle("Mary Smith", "Miss")
    | Miss. Mary Smith
Functional
    def incrementedByOne(x: Int) = x + 1
    (1 to 5).map(incrementedByOne)
    | Vector(2, 3, 4, 5, 6)
Lambdas
    (1 to 5).map(x => x + 1)
    | Vector(2, 3, 4, 5, 6)
    (1 to 5).map(_ + 1)
    | Vector(2, 3, 4, 5, 6)
For comprehensions
    for { x <- (1 to 5) } yield x + 1
    | Vector(2, 3, 4, 5, 6)
Implicit classes: Enrich types
    implicit class stringWrapper(str: String) {
      def wrapWithParens = s"($str)"
    }
    "Text".wrapWithParens
    | (Text)
Powerful features for scalability
- Case classes
- Traits
- Partial functions
- Pattern matching
- Implicits
- Flexible syntax
- Generics
- User-defined operators
- Call-by-name
- Macros
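A few of the features listed above can be sketched in one small, self-contained example. The domain model here (`Named`, `Attendee`, `Speaker`, `Delegate`, `badge`, `talkOf`, `whenEnabled`) is hypothetical and not from the slides; it only illustrates traits, case classes, pattern matching, partial functions and call-by-name parameters.

```scala
object FeaturesSketch {
  // Trait: shared behaviour that can be mixed into any class.
  trait Named {
    def name: String
    def greeting: String = s"Hello, $name"
  }

  // Case classes: immutable data with equality and pattern-matching support built in.
  sealed trait Attendee extends Named
  case class Speaker(name: String, talk: String) extends Attendee
  case class Delegate(name: String) extends Attendee

  // Pattern matching: because Attendee is sealed, the compiler can warn about missing cases.
  def badge(attendee: Attendee): String = attendee match {
    case Speaker(name, talk) => s"$name (speaking on $talk)"
    case Delegate(name)      => name
  }

  // Partial function: defined only for speakers, so `collect` skips delegates.
  val talkOf: PartialFunction[Attendee, String] = { case Speaker(_, talk) => talk }

  // Call-by-name: `body` is evaluated only when `enabled` is true.
  def whenEnabled[A](enabled: Boolean)(body: => A): Option[A] =
    if (enabled) Some(body) else None

  def main(args: Array[String]): Unit = {
    val attendees = List(Speaker("Dino", "XQS"), Delegate("Fred"))
    attendees.map(badge).foreach(println)
    println(attendees.collect(talkOf))
    println(whenEnabled(false)(sys.error("never evaluated")))
  }
}
```

Because `body` is passed by name, the `sys.error` call in `main` is never executed.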
Scala & XML
Values: Inline XML
    val url = "http://www.xmllondon.com"
    val title = "XML London 2014"
    val xmlTree = <div>
      <p>Welcome to <a href={url}>{title}</a>!</p>
    </div>
    | xmlTree: scala.xml.Elem =
    | <div>
    |   <p>Welcome to <a href="http://www.xmllondon.com">XML London 2014</a>!</p>
    | </div>
XML Lookups
    val listOfPeople = <people>
      <person>Fred</person>
      <person>Ron</person>
      <person>Nigel</person>
    </people>
    listOfPeople \ "person"
    | NodeSeq(<person>Fred</person>, <person>Ron</person>, <person>Nigel</person>)
    listOfPeople \ "_"
    | NodeSeq(<person>Fred</person>, <person>Ron</person>, <person>Nigel</person>)
XML Lookups
    val fact = <fact type="universal">
      <variable>A</variable> = <variable>A</variable>
    </fact>
    fact \\ "variable"
    | NodeSeq(<variable>A</variable>, <variable>A</variable>)
    fact \ "@type"
    | : scala.xml.NodeSeq = universal
    fact \@ "type"
    | : String = universal
XML Loading
    val pun = """<pun rating="extreme">
      |  <question>Why do CompSci students need glasses?</question>
      |  <answer>To C#<!-- C# is a Microsoft programming language -->.</answer>
      |</pun>""".stripMargin
    scala.xml.XML.loadString(pun)
    | <pun rating="extreme">
    |   <question>Why do CompSci students need glasses?</question>
    |   <answer>To C#.</answer>
    | </pun>
Collections: expressive
    val root = <numbers>
      {for {i <- 1 to 10} yield <number>{i}</number>}
    </numbers>
    val numbers = root \ "number"
    numbers(0)
    | <number>1</number>
    numbers.head
    | <number>1</number>
    numbers.last
    | <number>10</number>
    numbers take 3
    | NodeSeq(<number>1</number>, <number>2</number>, <number>3</number>)
Collections: expressive
    numbers filter (_.text.toInt > 6)
    | NodeSeq(<number>7</number>, <number>8</number>, <number>9</number>, <number>10</number>)
    numbers maxBy (_.text)
    | <number>9</number>
    numbers maxBy (_.text.toInt)
    | <number>10</number>
    numbers.reverse
    | NodeSeq(<number>10</number>, <number>9</number>, <number>8</number>, <number>7</number>, <number>6</number>,
    |   <number>5</number>, <number>4</number>, <number>3</number>, <number>2</number>, <number>1</number>)
    numbers.groupBy(_.text.toInt % 3)
    | Map(
    |   2 -> NodeSeq(<number>2</number>, <number>5</number>, <number>8</number>),
    |   1 -> NodeSeq(<number>1</number>, <number>4</number>, <number>7</number>, <number>10</number>),
    |   0 -> NodeSeq(<number>3</number>, <number>6</number>, <number>9</number>))
XML Methods: a rich API
    % :+ aggregate attributes combinations copyToArray diff dropWhile flatMap foreach head init isInstanceOf lastIndexOfSlice map mkString padTo prefixLength reduceRight runWith segmentLength sortWith strict_== takeRight toBuffer toSeq transpose withFilter zipAll
    ++ :\ andThen buildString companion copyToBuffer distinct endsWith flatten genericBuilder headOption inits isTraversableAgain lastIndexWhere max nameToString par product reduceRightOption sameElements seq sorted stringPrefix takeWhile toIndexedSeq toSet union xmlType zipWithIndex
    ++: \ apply canEqual compose corresponds doCollectNamespaces exists fold getNamespace indexOf intersect iterator lastOption maxBy namespace partition reduce repr scan size span sum text toIterable toStream unzip xml_!=
    +: \@ applyOrElse child contains count doTransform filter foldLeft groupBy indexOfSlice isAtom label length min nonEmpty patch reduceLeft reverse scanLeft slice splitAt tail theSeq toIterator toString unzip3 xml_==
    /: \\ asInstanceOf collect containsSlice descendant drop filterNot foldRight grouped indexWhere isDefinedAt last lengthCompare minBy nonEmptyChildren permutations reduceLeftOption reverseIterator scanRight sliding startsWith tails to toList toTraversable updated xml_sameElements
    /:\ addString attribute collectFirst copy descendant_or_self dropRight find forall hasDefiniteSize indices isEmpty lastIndexOf lift minimizeEmpty orElse prefix reduceOption reverseMap scope sortBy strict_!= take toArray toMap toVector view zip
For-comprehensions: similar to XQuery
  XQuery:
    <bib>{
      for $b in $xml/book
      let $year := $b/@year
      where $b/publisher = "Addison-Wesley" and $year > 1991
      return <book year="{ $year }">
        { $b/title }
      </book>
    }</bib>
  Scala:
    <bib>{
      for {
        b <- xml \ "book"
        year = b \@ "year"
        if b \ "publisher" === "Addison-Wesley" && year > 1991
      } yield <book year={ year }>
        { b \ "title" }
      </book>
    }</bib>
For-comprehensions: similar to XQuery
    Nice! ...yet Scala is general purpose
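The Scala side of this comparison can be made runnable with a small input document. The following is a minimal sketch, assuming the scala-xml library is on the classpath; the sample `<bib>` data and the object name `BibSketch` are hypothetical, since the slides do not show the input. The slide's `===` and bare `year > 1991` appear to rely on helper conversions that are not shown, so this sketch substitutes the plain `.text ==` and `.toInt` instead.

```scala
import scala.xml.Elem

object BibSketch {
  // Hypothetical sample data: the slides do not show the input document.
  val xml: Elem =
    <bib>
      <book year="1994"><title>First Title</title><publisher>Addison-Wesley</publisher></book>
      <book year="1992"><title>Second Title</title><publisher>Addison-Wesley</publisher></book>
      <book year="2000"><title>Third Title</title><publisher>Another Publisher</publisher></book>
      <book year="1990"><title>Fourth Title</title><publisher>Addison-Wesley</publisher></book>
    </bib>

  // Same shape as the slide: iterate, bind, filter, yield.
  val result: Elem =
    <bib>{
      for {
        b <- xml \ "book"
        year = (b \@ "year").toInt // \@ returns the attribute value as a String
        if (b \ "publisher").text == "Addison-Wesley" && year > 1991
      } yield <book year={ year.toString }>{ b \ "title" }</book>
    }</bib>

  def main(args: Array[String]): Unit =
    println(result)
}
```

Only the 1994 and 1992 Addison-Wesley books pass both conditions; the 2000 book fails the publisher test and the 1990 book fails the year test.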
Hybrid XML
- XQuery for Scala
- javax.xml.* for free
  - Look up: XPath
  - Transform: XSLT
  - Stream: StAX
XQuery for Scala (XQS)
- Wraps XQuery API for Java (javax.xml.xquery)
- Scala access to XQuery in:
  - MarkLogic, BaseX, Saxon, Sedna, eXist, …
- Converts DOM to Scala XML & vice versa
- http://github.com/fancellu/xqs
XQuery via XQS
    import scala.xml.NodeSeq
    import com.felstar.xqs.XQS._
    val widgets = <widgets>
      <widget>Menu</widget>
      <widget>Status bar</widget>
      <widget id="panel-1">Panel</widget>
      <widget id="panel-2">Panel</widget>
    </widgets>
    val conn = new net.xqj.basex.local.BaseXXQDataSource().getConnection
    val nodes: NodeSeq = conn("for $w in /widgets/widget order by $w return $w", widgets)
    | NodeSeq(<widget>Menu</widget>, <widget id="panel-1">Panel</widget>,
    |   <widget id="panel-2">Panel</widget>, <widget>Status bar</widget>)
XPath
    import javax.xml.xpath.{XPathConstants, XPathFactory}
    import org.w3c.dom.NodeList
    import scala.xml.NodeSeq
    import com.felstar.xqs.XQS._
    val widgets = <widgets>
      <widget>Menu</widget>
      <widget>Status bar</widget>
      <widget id="panel-1">Panel</widget>
      <widget id="panel-2">Panel</widget>
    </widgets>
    val xpath = XPathFactory.newInstance().newXPath()
    val nodes = xpath.evaluate("/widgets/widget[not(@id)]", toDom(widgets),
      XPathConstants.NODESET).asInstanceOf[NodeList]
    (nodes: NodeSeq)
    | NodeSeq(<widget>Menu</widget>, <widget>Status bar</widget>)
  Natively in Scala:
    (widgets \ "widget")(widget => (widget \ "@id").isEmpty)
    | NodeSeq(<widget>Menu</widget>, <widget>Status bar</widget>)
XSLT
    import javax.xml.transform.TransformerFactory
    import javax.xml.transform.stream.StreamResult
    import com.felstar.xqs.XQS._
    val stylesheet =
      <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
        <xsl:template match="john">
          <xsl:copy>Hello, John.</xsl:copy>
        </xsl:template>
        <xsl:template match="node()|@*">
          <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
          </xsl:copy>
        </xsl:template>
      </xsl:stylesheet>
    val peopleXml =
      <people>
        <john>Hello, John.</john>
        <smith>Smith is here.</smith>
        <another>Hello.</another>
      </people>
    val xmlResultResource = new java.io.StringWriter()
    val xmlTransformer = TransformerFactory.newInstance().newTransformer(stylesheet)
    xmlTransformer.transform(peopleXml, new StreamResult(xmlResultResource))
    xmlResultResource.getBuffer
    | <?xml version="1.0" encoding="UTF-8"?><people>
    |   <john>Hello, John.</john>
    |   <smith>Smith is here.</smith>
    |   <another>Hello.</another>
    | </people>
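The XSLT slide passes Scala XML values straight to the JAXP transformer, which presumably depends on conversions from the XQS import. As a hedged alternative, here is a minimal sketch that uses only the JDK: the stylesheet is generated as a string with one parameter, and both documents go through `StreamSource`. The names `XsltSketch`, `stylesheet` and `transform` are hypothetical, and the stylesheet declares version 1.0 because that is what the JDK's built-in processor implements.

```scala
import java.io.{StringReader, StringWriter}
import javax.xml.transform.TransformerFactory
import javax.xml.transform.stream.{StreamResult, StreamSource}

object XsltSketch {
  // Builds a stylesheet whose greeting is chosen at runtime.
  // Note: `greeting` is inserted verbatim, so it must already be XML-safe.
  def stylesheet(greeting: String): String =
    s"""<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
       |  <xsl:output omit-xml-declaration="yes"/>
       |  <xsl:template match="john"><xsl:copy>$greeting</xsl:copy></xsl:template>
       |  <xsl:template match="node()|@*">
       |    <xsl:copy><xsl:apply-templates select="node()|@*"/></xsl:copy>
       |  </xsl:template>
       |</xsl:stylesheet>""".stripMargin

  // Applies the stylesheet text to the input text using the JDK's default transformer.
  def transform(xslt: String, input: String): String = {
    val out = new StringWriter()
    TransformerFactory.newInstance()
      .newTransformer(new StreamSource(new StringReader(xslt)))
      .transform(new StreamSource(new StringReader(input)), new StreamResult(out))
    out.toString
  }

  def main(args: Array[String]): Unit =
    println(transform(stylesheet("Hello, John."), "<people><john/><smith>Smith is here.</smith></people>"))
}
```

The identity template copies everything except `john`, whose content is replaced by the generated greeting.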
XML Stream Processing
    import scala.io.Source
    import javax.xml.stream.{XMLEventReader, XMLInputFactory}
    import javax.xml.stream.events.XMLEvent
    // 4 GB file, comes back in a second
    val src = Source.fromURL("http://dumps.wikimedia.org/enwiki/20140402/enwiki-20140402-abstract.xml")
    val er = XMLInputFactory.newInstance().createXMLEventReader(src.reader)
    implicit class XMLEventIterator(ev: XMLEventReader) extends scala.collection.Iterator[XMLEvent] {
      def hasNext = ev.hasNext
      def next = ev.nextEvent()
    }
    er.dropWhile(!_.isStartElement).take(10).zipWithIndex.foreach {
      case (ev, idx) => println(s"${idx+1}:\t$ev")
    }
    src.close()
    | 1:  <feed>
    | 2:
    | 3:  <doc>
    | 4:
    | 5:  <title>
    | 6:  Wikipedia: Anarchism
    | 7:  </title>
    | 8:
    | 9:  <url>
    | 10: http://en.wikipedia.org/wiki/An
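The streaming example above needs network access to a very large dump. The same iterator technique can be tried offline; the following is a minimal sketch using only the JDK's StAX API, with a tiny in-memory document standing in for the dump. The names `StaxSketch`, `feed` and `firstElementNames` are hypothetical.

```scala
import java.io.StringReader
import javax.xml.stream.XMLInputFactory
import javax.xml.stream.events.XMLEvent

object StaxSketch {
  // Hypothetical stand-in for the large dump file.
  val feed: String =
    "<feed><doc><title>Wikipedia: Anarchism</title></doc><doc><title>Second</title></doc></feed>"

  // Returns the local names of the first `limit` start elements, reading events lazily.
  def firstElementNames(xml: String, limit: Int): List[String] = {
    val reader = XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml))
    try {
      // Adapt the StAX reader to a Scala Iterator so collection methods apply.
      val events = new Iterator[XMLEvent] {
        def hasNext: Boolean = reader.hasNext
        def next(): XMLEvent = reader.nextEvent()
      }
      events
        .filter(_.isStartElement)
        .map(_.asStartElement.getName.getLocalPart)
        .take(limit)
        .toList // forces evaluation before the reader is closed
    } finally reader.close()
  }

  def main(args: Array[String]): Unit =
    println(firstElementNames(feed, 3))
}
```

Because the pipeline is built on an `Iterator`, events are pulled one at a time rather than loading the whole document, which is what makes the approach practical for multi-gigabyte inputs.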
Use Cases
- Data extraction
- Serving XML via REST
- Dynamically generated XSLT
- Interfacing with XML databases
- Flexibility to choose the best tool for the job
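As an illustration of the data extraction use case, the following sketch turns the widgets document from the earlier slides into typed values. It assumes the scala-xml library is on the classpath; the `Widget` case class and the `extract` function are hypothetical names introduced for this example.

```scala
import scala.xml.Elem

object ExtractionSketch {
  // Hypothetical target type for the extracted data.
  case class Widget(id: Option[String], label: String)

  // The widgets document used on the XQuery and XPath slides.
  val widgets: Elem =
    <widgets>
      <widget>Menu</widget>
      <widget>Status bar</widget>
      <widget id="panel-1">Panel</widget>
      <widget id="panel-2">Panel</widget>
    </widgets>

  // \@ yields an empty string when the attribute is absent, so map that case to None.
  def extract(root: Elem): Seq[Widget] =
    for (w <- root \ "widget")
      yield Widget(Option(w \@ "id").filter(_.nonEmpty), w.text)

  def main(args: Array[String]): Unit =
    extract(widgets).foreach(println)
}
```

Once the data is in case classes, the rest of the program can work with ordinary typed collections instead of XML nodes.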
Excellent Ecosystem
    SBT, Akka, Spark, scalaz, Spray, Specs, shapeless, scala-xml, Scaladin, ScalaTest, scala-maven-plugin, JVM, macro-paradise
Conclusion - Practical for XML processing
Where do I start?
- atomicscala.com
- typesafe.com/activator
- scala-lang.org
- scala-ide.org
- IntelliJ
Matt Stephens, Charles Foster
Open to consulting
- www.scala.contractors
- Follow us on Twitter: @DinoFancellu @ScalaWilliam @MaffStephens