1 CSV-X: A Linked Data Enabled Schema Language, Model, and Processing Engine for Non-Uniform CSV. 2016 IEEE International Conference on Internet of Things (iThings), IEEE Green Computing and Communications (GreenCom), IEEE Cyber, Physical and Social Computing (CPSCom), and IEEE Smart Data (SmartData)

2 Authors: Wirawit Chaochaisit, Ken Sakamura, Masahiro Bessho

3 INTRODUCTION Linked Open Data (LOD): the merits of Open Data grow as it moves up to the highest level, which uses RDF and SPARQL. Major advantages include linking the data together using IRIs, making the data globally discoverable, and extending possible queries to the linked entities. But only a small proportion of open datasets adhere to these principles, due to the high cost of human effort in converting from original data formats and the technical or financial difficulties of preparing data as XML or RDF.

4 INTRODUCTION Sources: Open Data Institute (ODI); Open Data Barometer Global Report from the World Wide Web Foundation. CSV is still among the most popular machine-readable formats in use, but it lacks the ability to explicitly express more complex data structures such as hierarchy, or to specify data types, value restrictions, semantics, or links to other data. → A number of schema languages, tools, and standards aid in description, annotation, validation, and transformation to RDF.

5 INTRODUCTION Nevertheless, existing solutions cannot be used for, or only partially support, CSV with arbitrary structure and semantic relations between values → non-uniform CSV. Tools built to convert CSV to RDF can handle only a limited variety of CSVs. Most tools expect CSV to be strictly RFC 4180 compliant, as explained in the Related Work section.

6 INTRODUCTION CSV-X: a flexible schema language, model, and processing engine for non-uniform CSV, inspired by JSON-LD and W3C CSV on the Web (CSVW). It defines a tabular-based schema model to describe a CSV's values, structure, relations, and metadata. It offers flexible schema constructs, adaptive matching algorithms, dynamic variable declaration, and cross-reference techniques. It serializes automatically to RDF via a native mapping from schema entities and their properties to IRIs.

7 MOTIVATION: CSV REVISIT CSV lacks support for data types, validation, structural expression, etc. Its virtues: it is a simple data format with an inherently tabular nature, as in relational databases, and it is more compact than other formats such as XML, JSON, and RDF. Those formats are nevertheless often preferred for their semi-structured data model, explicit semantic definitions, and links to other data, which allow integrated analysis across multiple data sources. Goal: make CSV a smart data format that benefits the majority of the open data community.

8 PROBLEM DEFINITION: THE NON-UNIFORM CSV CSVs that deviate from the RFC 4180 memo are regarded as non-uniform CSVs. Two kinds of non-uniform CSV are defined: those with syntactic differences and those with semantic differences.

9 PROBLEM DEFINITION: THE NON-UNIFORM CSV Syntactic differences: delimiters, escape characters, structure, etc., which are referred to as "CSV dialects" in the CSVW specification. Semantic differences: a variant of CSV that is more than meets the eye because semantic relations are encoded among its values, while it still maintains the minimum syntactic requirement of having values separated by two kinds of delimiter.
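
As a minimal sketch of the syntactic kind of difference, the snippet below reads the same two rows written in two dialects by declaring the cell delimiter and quote character explicitly. It uses Python's standard csv module rather than the CSV-X engine, and the sample data and the helper name read_rows are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample data: the same two rows written in two "dialects".
COMMA_TEXT = 'id,name\n1,"Tokyo, Japan"\n'
SEMICOLON_TEXT = "id;name\n1;'Tokyo, Japan'\n"


def read_rows(text, delimiter, quotechar):
    """Parse CSV text with an explicitly declared cell delimiter and quote character."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
    return [row for row in reader]


rows_comma = read_rows(COMMA_TEXT, ",", '"')
rows_semicolon = read_rows(SEMICOLON_TEXT, ";", "'")

# Once the dialect is declared, both encodings yield the same cells.
print(rows_comma == rows_semicolon)
```

Declaring the dialect handles only syntactic variation; the semantic relations among values are not addressed by this step.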

10

11 PROBLEM DEFINITION: THE NON-UNIFORM CSV Non-uniform CSV patterns of the semantic-difference type. The key to describing a variety of encoding patterns is the ability to describe such patterns for a specifiable region. The basis is selection of CSV elements, from the most atomic level (a cell) to a range of cells and the whole table. → A schema/parser shouldn't limit how CSV data may be encoded and interpreted to just a fixed set of patterns.
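
The three levels of selection (cell, range of cells, whole table) can be sketched over a parsed grid as follows. This is an illustration only, not the CSV-X selection syntax: the function names, the zero-based inclusive indices, and the sample table are all assumptions.

```python
from typing import List

Table = List[List[str]]

# Hypothetical non-uniform CSV: a title row, a blank row, then header and data.
TABLE: Table = [
    ["Station report", "", ""],
    ["", "", ""],
    ["station", "temp", "humidity"],
    ["A", "21.5", "40"],
    ["B", "19.0", "55"],
]


def select_cell(table: Table, row: int, col: int) -> str:
    """Most atomic selection: a single cell (zero-based indices)."""
    return table[row][col]


def select_range(table: Table, first_row: int, last_row: int,
                 first_col: int, last_col: int) -> Table:
    """Rectangular region of cells, with inclusive bounds."""
    return [r[first_col:last_col + 1] for r in table[first_row:last_row + 1]]


def select_table(table: Table) -> Table:
    """Whole-table selection, returned as a copy."""
    return [list(r) for r in table]


title = select_cell(TABLE, 0, 0)
header = select_range(TABLE, 2, 2, 0, 2)[0]
readings = select_range(TABLE, 3, 4, 1, 2)
```

Because each region is addressed separately, the title row and the data block can be given different interpretations instead of forcing one fixed pattern on the whole file.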

12 CSV-X SCHEMA AND PROCESSING ENGINE A. Design Principles and Rationales: Versatility in Handling Non-Uniform CSV; Minimized Complexity; Simplify the RDFization Process. RDF is built from a generic data structure called a triple, consisting of subject, predicate, and object, where each entity is described using an IRI, a.k.a. vocabulary. Data publishers have to remodel their original data structure as triples if it is not based on RDF, think about IRI namespaces, and survey, reuse, or align with existing vocabularies that suit the domain.
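
To make the triple structure concrete, here is a small sketch that models one triple and prints it in N-Triples form. The class and function names and the example.org IRI are assumptions for illustration; the predicate is the standard rdfs:label IRI. The literal escaping is deliberately simplified and covers only backslashes and double quotes.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str                  # IRI of the thing being described
    predicate: str                # IRI of the property (vocabulary term)
    obj: str                      # IRI, or literal text when obj_is_literal is True
    obj_is_literal: bool = False


def to_ntriples(triple: Triple) -> str:
    """Render one triple as a single N-Triples line (simplified escaping)."""
    if triple.obj_is_literal:
        escaped = triple.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    else:
        obj = f"<{triple.obj}>"
    return f"<{triple.subject}> <{triple.predicate}> {obj} ."


example = Triple(
    "http://example.org/station/A",                  # hypothetical subject IRI
    "http://www.w3.org/2000/01/rdf-schema#label",    # rdfs:label
    "Station A",
    obj_is_literal=True,
)
line = to_ntriples(example)
```

Even this one line requires choosing a subject IRI and a vocabulary term, which is the modeling effort the design principles aim to reduce.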

13 CSV-X SCHEMA AND PROCESSING ENGINE A. Design Principles and Rationales: RDF Model Abstraction; Flexible Mapping to RDF Templates. The task of modeling RDF is separated from the schema model, while customization and reuse of RDF patterns remain possible in the template.

14 CSV-X SCHEMA AND PROCESSING ENGINE B. CSV-X Components and General Flow

15 CSV-X SCHEMA AND PROCESSING ENGINE C. CSV-X Schema Model

16 CSV-X SCHEMA LANGUAGE A. Schema Encoding. JSON-LD: an RDF serialization based on JSON. CSVW: bases its syntax on JSON-LD. CSV-X: provides metadata and expressions that abstract away the RDF model and let the user focus on composing the schema, extending JSON syntax with extra allowances.

17 CSV-X SCHEMA LANGUAGE B. Metadata and Expressions : CSV-X by Example

18 CSV-X SCHEMA LANGUAGE B. Metadata and Expressions : CSV-X by Example. @base: specifies the base IRI for all non-IRI or relative-IRI property keys and values of @id. @prefixes: defined as key-value maps between namespace and IRI. @targetCSVs: specifies an array of the CSV file(s) that the schema describes.
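
The role of the base IRI and the prefix map can be sketched as below. The header is written as a Python dict with invented values, the target-CSV key is spelled @targetCSVs on the assumption that this is the intended name, and the resolution order in expand_iri (prefix first, then absolute IRI, then base) is an assumption modeled on common JSON-LD-style behavior, not the documented CSV-X algorithm.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical schema header; all values are made up for illustration.
SCHEMA_HEADER = {
    "@base": "http://example.org/data/",
    "@prefixes": {"rdfs": "http://www.w3.org/2000/01/rdf-schema#"},
    "@targetCSVs": ["stations.csv"],
}


def expand_iri(term: str, header: dict) -> str:
    """Expand a prefixed name, keep an absolute IRI, or resolve against @base."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in header["@prefixes"]:
        return header["@prefixes"][prefix] + local   # prefixed name, e.g. rdfs:label
    if urlparse(term).scheme:
        return term                                  # already an absolute IRI
    return urljoin(header["@base"], term)            # relative IRI or plain name


label_iri = expand_iri("rdfs:label", SCHEMA_HEADER)
station_iri = expand_iri("station/A", SCHEMA_HEADER)
absolute_iri = expand_iri("http://example.com/x", SCHEMA_HEADER)
```

With such a header, a schema author can write short names throughout and still end up with full IRIs in the output.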

19 CSV-X SCHEMA LANGUAGE B. Metadata and Expressions : CSV-X by Example 1) Describing CSV: Schema Entity Expressions

20

21 CSV-X SCHEMA LANGUAGE B. Metadata and Expressions : CSV-X by Example 2) User’s Property Annotation

22 CSV-X SCHEMA LANGUAGE B. Metadata and Expressions : CSV-X by Example 3) Values, Datatypes, and Structural Validation

23 CSV-X SCHEMA LANGUAGE B. Metadata and Expressions : CSV-X by Example 4) Variable and Dynamic Cross-referencing

24 CSV-X SCHEMA LANGUAGE B. Metadata and Expressions : CSV-X by Example 5) Context Variable and Dynamic Declaration

25 CSV-X SCHEMA LANGUAGE B. Metadata and Expressions : CSV-X by Example 6) Empty Cell, Empty String and Missing Value

26 CSV-X SCHEMA LANGUAGE C. CSV-to-RDF Conversion and Beyond 1) Native RDF Mapping: the parsed CSV can be automatically translated into an RDF format based on the CSV-X schema model. RDF mapping starts from serializing every schema entity and its properties into locally unique strings: the Schema Entity Reference Expression (SERE).
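
The idea of giving every entity a locally unique string and then mapping it to an IRI can be sketched as follows. The actual SERE syntax is not shown on this slide, so the naming scheme below (file name plus row and column) is invented purely for illustration, as are the base IRI and the function names.

```python
from urllib.parse import quote, urljoin

BASE_IRI = "http://example.org/data/"  # assumed base IRI, not from the source


def local_name(csv_name: str, row: int, col: int) -> str:
    """Build a string that is unique within one schema for the cell at (row, col).

    The format is hypothetical and is NOT the real SERE syntax.
    """
    return f"{quote(csv_name)}/r{row}c{col}"


def entity_iri(csv_name: str, row: int, col: int) -> str:
    """Turn the locally unique string into a globally unique IRI via the base."""
    return urljoin(BASE_IRI, local_name(csv_name, row, col))


# Every cell of a 2x3 region gets a distinct local name and IRI.
names = [local_name("stations.csv", r, c) for r in range(2) for c in range(3)]
cell_iri = entity_iri("stations.csv", 3, 1)
```

Local uniqueness plus a base IRI is enough to produce distinct subjects for triples without the user designing an IRI scheme by hand.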

27

28 CSV-X SCHEMA LANGUAGE C. CSV-to-RDF Conversion and Beyond 2) Parameterizable Transformation Template

29 CSV-X PROCESSING ALGORITHMS

30 CSV-X PROCESSING ALGORITHMS

31 IMPLEMENTATION Implemented in Java, with features to support parsing, annotation, validation, cross-referencing, RDF serialization, and transformation. A live demo web interface was developed so that anyone can easily experiment with the engine. The engine uses univocity-parsers 2.2.2 as a base for both uniform and non-uniform CSV/TSV parsing and extends it for Space-Separated Values. The current serialization can be further processed with available RDF tools to convert it into other formats/syntaxes if desired.

32 EVALUATION

33 RELATED WORK UK National Archives CSV Schema Language 1.1; W3C CSV on the Web (CSVW); SCULPT; TabLinker; CSV2DataCube; csv2rdf4lod; XLWrap

34 CONCLUSION Thanks to its flexible, abstract language constructs, CSV-X enables annotation, validation, cross-referencing, and transformation of tabular data such as CSV into more advanced data formats such as RDF. The language and processing engine will help lower the difficulty of publishing high-quality data for the open data community and general users alike.