A 1 Cycle-Per-Byte XML Accelerator
Zefu Dai, Nick Ni and Jianwen Zhu
Presented by Zefu Dai
University of Toronto
2010-2-19

What is XML
• Extensible Markup Language
• A platform-independent tool for data exchange and representation
• Widely used in:
  - Web services
  - Database systems
  - Scientific applications
  - …

Performance Threat: XML Parsing
• 70 minutes to load a 3 GB XML file, 26x slower than loading plain text
• More than 1 s per bank transaction; how many transactions per day?
• On average, 175 K instructions to parse 1 KB of XML data (IBM XML4C)
• With network speeds reaching tens of Gbps, XML parsing rather than the network becomes the performance bottleneck

Previous work
• Cycles Per Byte (CPB) = average number of cycles needed to process each byte of XML data
• Multi-core acceleration
  - Requires a pre-parsing pass, done sequentially
  - 30 CPB on a 4-core processor
• SIMD acceleration
  - Without in-memory tree construction and validation
  - 6-15 CPB
• Hardware accelerators
  - Most commercial products do not reveal performance metrics or design details
  - 10-40 CPB

Our Design
• Causes of the parsing slowdown
  - Text-based data stream
  - Variable-length string comparison
  - Poor memory performance due to streaming and memory back-tracing
• An XML parsing accelerator implemented in FPGA
  - Fixed-length string operations
  - Optimized circuits for string comparison
  - Common-case-optimized stallable pipeline
  - Data structures for high-bandwidth on-chip memory
• Achieves 1 CPB processing speed and saturates a 1 Gbps Ethernet link, running at 125 MHz

Outline
• Background
• High-level architecture
• Design Details
• Evaluation

Tasks of XML Parser
• Well-formedness Checking
  - Check whether the document conforms to the XML syntax rules
• Schema Validation
  - Check whether the document conforms to the XML semantic rules specified in DTD or Schema files
• DOM Construction
  - Capture the parent-child relationships between elements and attributes and store them in memory in Document Object Model (DOM) format

Well-formed Checking example
• Has a unique root element
• Elements must be closed and nested properly
• Attributes must be unique within an element
• …
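The three rules listed above can be observed with an ordinary software parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module, not the accelerator described in these slides; the helper name `is_well_formed` is hypothetical and introduced only for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Raised for any violation of the XML syntax rules.
        return False


# One well-formed document and one violation of each rule on the slide.
samples = {
    "well-formed": "<a><b/></a>",
    "two root elements": "<a/><b/>",
    "improper nesting": "<a><b></a></b>",
    "duplicate attribute": "<a x='1' x='2'/>",
}

if __name__ == "__main__":
    for label, text in samples.items():
        print(label, is_well_formed(text))
```

A software parser reports only that a rule was broken; the slides that follow describe how each check is mapped to hardware.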

XML Schema Example
• Specify permitted child elements/attributes
• Specify type of content
• Specify occurrence limit
• …

DOM Construction
• Create an in-memory tree structure for the XML document
• Provide application access through tree operations
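As a software illustration of what "access through tree operations" means, the sketch below builds a DOM with Python's standard `xml.dom.minidom` and walks it. The function name `to_tree` and the tuple layout it returns are assumptions made for this example; they do not describe the accelerator's memory format.

```python
from xml.dom import minidom


def to_tree(node):
    """Convert a DOM element into a (name, attributes, children) tuple."""
    children = []
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            children.append(to_tree(child))
        elif child.nodeType == child.TEXT_NODE and child.data.strip():
            # Keep non-empty text content, drop whitespace-only nodes.
            children.append(child.data.strip())
    return (node.tagName, dict(node.attributes.items()), children)


if __name__ == "__main__":
    doc = minidom.parseString("<Elem attr='xyz'>content</Elem>")
    print(to_tree(doc.documentElement))
```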

Top Level Diagram
[Figure: top-level diagram of the accelerator, built up over several slides. Example input: <Elem attr='xyz'>content</Elem>. Labels recoverable from the figure: the tokens Elem, attr, xyz, content; the hashes H(Elem) and H(attr); "rule name"; "rule content".]

Recurring Idioms (Dwarfs)
• Identified 3 recurring computational idioms (referred to as Dwarfs)
  - One-to-one String Matching
  - One-to-many String Membership Test
  - One-to-many String Search
• These idioms are one of the major reasons for low parsing performance

Dwarf I: One-to-one String Matching
• Tests whether a subject string equals a reference string
• Example: correct nesting
• The string is variable-length
  - Not efficient on conventional architectures
• Solution: memory stack
  - Convert variable-length string comparison to fixed-length character comparison
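The slide does not give the stack layout, so the following is only a sketch of one way a memory stack can turn the nesting check into per-character comparisons. The class `TagNameStack`, its two lists, and the choice to pop on mismatch are all assumptions introduced for illustration.

```python
class TagNameStack:
    """Stores open element names as characters on a single stack (assumed layout)."""

    def __init__(self):
        self.chars = []   # character stack holding all open element names
        self.starts = []  # start offset of each open element's name

    def open_tag(self, name):
        # Push the name one character at a time, remembering where it starts.
        self.starts.append(len(self.chars))
        self.chars.extend(name)

    def close_tag(self, name):
        """Compare a closing name against the innermost open name, char by char."""
        if not self.starts:
            return False  # closing tag with no open element
        start = self.starts.pop()
        stored = self.chars[start:]
        del self.chars[start:]
        ok = True
        count = 0
        for ch in name:
            # One fixed-width comparison per incoming character.
            if count >= len(stored) or ch != stored[count]:
                ok = False
            count += 1
        return ok and count == len(stored)


if __name__ == "__main__":
    stack = TagNameStack()
    stack.open_tag("Elem")
    print(stack.close_tag("Elem"))
```

Each loop iteration compares exactly one character, which is the property the slide attributes to the hardware: the cost per input byte stays constant regardless of name length.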

Dwarf II: One-to-many String Membership Test
• Tests whether a subject string equals any member of a set of reference strings
• Example: unique attributes within an element
• String comparison against all previously seen attributes belonging to the same element
  - Expensive memory back-tracing
• Solution: Bloom Filter
  - Achieved in one memory lookup

Dwarf III: One-to-many String Search
• Finds a subject string among a set of reference strings (as opposed to only testing membership)
• Example: search for the corresponding schema rule
• String comparison against all candidates
  - Non-deterministic lookup time
• Solution: Balance Routing Table Scheme
  - Achieved in one memory lookup

Dwarf II: Bloom Filter
• Example: attribute name uniqueness checking
• Common case: the attribute name is unique
  - Filter out obvious cases using the Bloom Filter
  - Look up a bit array instead of comparing strings
• Uncommon case: the attribute name may already exist
  - Stall the entire design
  - Do all necessary string comparisons to confirm whether the incoming string already exists
  - Assumption: low occurrence rate (the cost is high)

Solution II: Bloom Filter
• For each attribute name:
  - Generate N independent hash codes
  - Look up the bit array
  - Update the bit array
• Lookup outcomes: Unique! or False Positive!
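The lookup-then-update steps, together with the stall on a possible match, can be sketched in software as follows. This is an illustration under stated assumptions, not the accelerator's circuit: the class name `AttributeBloomFilter`, the use of salted SHA-256 as the N hash functions, and the default sizes are all hypothetical choices.

```python
import hashlib


class AttributeBloomFilter:
    """Bloom filter over the attribute names seen so far in one element."""

    def __init__(self, num_bits=256, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits
        self.seen = []  # exact names, consulted only on the slow path

    def _positions(self, name):
        # Assumption: derive N hash codes by salting one hash with an index.
        positions = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{name}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:4], "big") % self.num_bits)
        return positions

    def add(self, name):
        """Return 'unique', 'false positive', or 'duplicate' for a new name."""
        positions = self._positions(name)
        if all(self.bits[p] for p in positions):
            # Uncommon case: the filter cannot rule the name out, so fall
            # back to exact string comparison (the "stall" on the slide).
            if name in self.seen:
                return "duplicate"
            outcome = "false positive"
        else:
            # Common case: at least one bit is clear, so the name is new.
            outcome = "unique"
        for p in positions:
            self.bits[p] = True
        self.seen.append(name)
        return outcome


if __name__ == "__main__":
    bloom = AttributeBloomFilter()
    print(bloom.add("id"))
    print(bloom.add("id"))
```

The common case touches only the bit array; the list of exact names is read only when every probed bit is already set, which mirrors the design's assumption that such cases are rare.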

Bloom Filter Implementation
• Implement the Bloom Filter algorithm in a pipeline
  - An attribute name usually has multiple characters
  - This allows multiple processing cycles for each attribute name

Experimental Setup
• Software XML parsers test
  - Hardware and software platform
    - Intel Core 2 Quad Q9300 (2.5 GHz, 6 MB L2 Cache)
    - 2 GB DDR2-800 Memory
    - Debian Linux 2.6.18-6 x86-64
    - GNU C 4.1.2
  - Tested XML parsing libraries
    - Xerces-c 2.8.0 x86-64
    - Libxml2
    - DOM4J-1.6
    - Java API for XML Processing (JAXP) 1.6.0
• XML Parsing Accelerator testbed

Benchmarks

Group              Benchmark    XML Size (KB)   XSD Size (KB)   Source
DOM Parsing        Security     3               -               Intel Corporation
DOM Parsing        Structure    12              -               codesynthesis
DOM Parsing        Tpox         15              -               tpox
DOM Parsing        Hl7          136             -               hl7-testharness
DOM Parsing        Qedeq        211             -               qedeq.org
DOM Parsing        Xmark        116,000         -               xml-benchmark
Schema Validation  CustomInfo   1               2               Intel Corporation
Schema Validation  CDCatalog    105             2               w3schools
Schema Validation  Workflow     13              10              qedeq.org

Test Results
• Metric: Raw Throughput (Gbps)

Benchmark     JAXP    DOM4J   Libxml2   Xerces-c
Security      0.199   0.059   0.294     0.100
Structure     0.274   0.110   0.202     0.091
Tpox          0.292   0.099   0.264     0.124
Hl7           0.415   0.189   0.360     0.128
Qedeq         0.481   0.221   0.338     0.133
Xmark         0.550   0.256   0.416     0.187
Average_par   0.373   0.158   0.314     0.127
CustomInfo    0.062   -       0.107     0.054
CDCatalog     0.128   -       0.232     0.113
Workflow      0.227   -       0.396     0.185
Average_vld   0.161   -       0.283     0.134
Average_all   0.267   -       0.299     0.131

XPA:    1.000  1.000
XPAmax: 1.040  1.040

Test Results
• Metric: Cycles Per Byte

Benchmark     JAXP    DOM4J   Libxml2   Xerces-c
Security      100.6   339.7   67.9      201.0
Structure     73.1    181.3   99.1      220.5
Tpox          68.5    201.3   75.9      161.0
Hl7           48.2    106.0   55.6      155.8
Qedeq         41.5    90.4    59.2      150.6
Xmark         36.4    78.0    48.0      106.7
Average_par   53.6    126.9   63.6      157.2
CustomInfo    321.8   -       186.2     373.7
CDCatalog     156.5   -       86.3      176.8
Workflow      88.3    -       50.4      108.3
Average_vld   124.4   -       70.6      148.8
Average_all   75.0    -       66.9      152.9

XPA: 1.0  1.0
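The throughput and CPB tables are two views of the same measurements, related through the clock frequency. The sketch below shows the conversion; it assumes that the software figures were computed against the 2.5 GHz clock of the Q9300 test machine and the XPA figure against its 125 MHz clock, which is consistent with the reported numbers but not stated explicitly on the slides. The function name `cycles_per_byte` is hypothetical.

```python
def cycles_per_byte(clock_hz, throughput_gbps):
    """Convert raw throughput in Gbps to cycles per byte at a given clock."""
    bytes_per_second = throughput_gbps * 1e9 / 8
    return clock_hz / bytes_per_second


if __name__ == "__main__":
    # XPA: 125 MHz clock, 1 Gbps link.
    print(cycles_per_byte(125e6, 1.0))
    # JAXP on the Security benchmark: 2.5 GHz clock, 0.199 Gbps.
    print(round(cycles_per_byte(2.5e9, 0.199), 1))
```

Small differences from the tabulated CPB values are expected, since the throughput column is rounded to three decimal places.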

Scalability Examination
• Bloom Filter efficiency
  - Test the Attribute Name Uniqueness circuit with generated test files
  - Count the number of false positives

                            Google Key Words      Wikipedia Key Words
Bloom Filter    Bit_Array   4k     8k     16k     4k     8k     16k
2 Hash Func.    64b         1      66     509     6      129    502
2 Hash Func.    256b        0      5      60      1      8      56
2 Hash Func.    1kb         0      1      6       1      2      2
2 Hash Func.    2kb         0      0      1       0      0      0
3 Hash Func.    256b        0      0      14      1      3      9
3 Hash Func.    1kb         0      0      1       0      0      0
3 Hash Func.    2kb         0      0      0

Implementation Cost
• Target Device: Xilinx Virtex-5 XC5VSX50T

Logic Utilization   Slice Register   Slice LUT     Block RAM
XPA                 4455 (13%)       6594 (20%)    13 (11%)
MC                  1960 (6%)        1683 (5%)     5 (3%)
EMAC                927 (2%)         712 (2%)      3 (2%)
UART                151 (1%)         187 (1%)      2 (1%)
TOTAL               7493 (22%)       9176 (28%)    23 (17%)
XC5VSX50T           32640            32640         132

Conclusion
• FPGA is a valid contender in XML processing
  - Low clock frequency requirement to achieve high throughput
  - Scalable to process large XML documents
  - Moderate hardware cost to achieve high performance
• Future work
  - Full conformance to the XML specification

Questions?