Managing Provenance & Versioning for an (Evolving) Dictionary in Linked Data Format
Frances Gillis-Webber, MPhil Student, Library and Information Studies Centre, University of Cape Town
LDL-2018 Workshop @ LREC 2018, Miyazaki, Japan, 12/05/2018

English-Xhosa Dictionary for Nurses: A Bilingual Dictionary of Medical Terms

About the dictionary
● Published in 1935
● Compiled by Neil MacVicar, in conjunction with isiXhosa-speaking nurses
● In the public domain, free from any copyright restriction

isiXhosa / English
isiXhosa:
● Nguni language group [1]
● 16% as L1 [2]
English:
● 9.6% as L1 [2]
● an ex-colonial language, lingua franca with high status [3]
[1] Doke, 1954; “Subfamily: Nguni (S.40)”, n.d.
[2] 2011 Census, Statistics South Africa
[3] Ngcobo, 2010

Digitising the dictionary
Three requirements were identified when digitising:
1. It must be human- and machine-readable
2. It did not have to remain an exact replica of the printed artefact
3. It must be encoded in a way which would allow it to “evolve”

Managing change
1. Versioning becomes important, particularly if the language resource (LR) is integrated into another LR
2. Recording provenance information for each change becomes important
3. The URI strategy should allow for versioning

The URI Strategy: Use Cases, Fragment Identifiers, the URI Pattern, and Resource Identifiers

The URI use cases
U1: A URI which identifies the resource
U2: A URI which identifies a sub-resource in relation to the parent resource
U3: A URI which identifies a version of the resource
U4: A URI which identifies a version combined with a sub-resource
U5: A URI which identifies a document describing the resource in U1
U6: A URI which identifies a document describing the resource in U3

Fragment identifiers
A fragment identifier is of the pattern: http://example.com/my-uri#something
● Widely used in vocabularies, where “the vocabulary is often served as a document and the fragment is used to address a particular term within that document” [1]
● Shows a hierarchical relationship with the parent resource
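
As a minimal sketch of the parent/fragment relationship described above, Python's standard `urllib.parse.urldefrag` separates a URI into the parent resource and its fragment identifier. The helper name `split_fragment` is an illustrative assumption, not part of the presentation.

```python
from urllib.parse import urldefrag


def split_fragment(uri: str) -> tuple[str, str]:
    """Return (parent resource URI, fragment identifier).

    The fragment is an empty string when the URI has none.
    """
    parent, fragment = urldefrag(uri)
    return parent, fragment


parent, fragment = split_fragment("http://example.com/my-uri#something")
print(parent)    # http://example.com/my-uri
print(fragment)  # something
```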

The URI pattern
● The URI patterns recommended by Archer et al. (2012), and Gracia and Vila-Suero were evaluated:
http://{domain}/{type}/{concept}/{reference}
E1: http://linguistic.linkeddata.es/id/apertium/lexiconEN/bench-n-en
● But ultimately a simplified version was adopted:
E1 revised: http://linguistic.linkeddata.es/entry/bench-n-en

U1: A URI which identifies the resource
Form: {http(s):}//{Base URI}/{Resource Path}/{Resource ID}
Where:
● {http(s):} is the http or https scheme
● {Base URI} is the host
● {Resource Path}, e.g. entry for a lexical entry, lexicon for a lexicon
● {Resource ID}, e.g. en-n-abdomen
Example: https://londisizwe.org/entry/en-n-abdomen

U2: A URI which identifies a sub-resource
Form: {http(s):}//{Base URI}/{Resource Path}/{Resource ID}#{Fragment ID}
Where:
● {Fragment ID} is the fragment identifier, e.g. sense1
Example: https://londisizwe.org/entry/en-n-abdomen#sense1

U3: A URI which identifies a version of the resource
Form: {http(s):}//{Base URI}/{Resource Path}/{Resource ID}/{Version ID}
Where:
● {Version ID} is the version identifier, e.g. 2017-09-19
Example: https://londisizwe.org/entry/en-n-abdomen/2017-09-19

U4: A URI which identifies a version combined with a sub-resource
Form: {http(s):}//{Base URI}/{Resource Path}/{Resource ID}/{Version ID}#{Fragment ID}
Example: https://londisizwe.org/entry/en-n-abdomen/2017-09-19#sense1
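
The forms U1 to U4 can also be read in the other direction. The sketch below splits such a URI back into its components using only the standard library. It assumes that version identifiers are always ISO dates (YYYY-MM-DD), as in the examples on these slides; the function name `parse_resource_uri` and the dictionary keys are illustrative, not part of the presentation.

```python
import re
from urllib.parse import urlsplit

# Assumption: version IDs look like 2017-09-19, as in the slide examples.
VERSION_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def parse_resource_uri(uri: str) -> dict:
    """Split a U1-U4 style URI into its named components."""
    parts = urlsplit(uri)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) not in (2, 3):
        raise ValueError(f"unexpected path shape: {parts.path!r}")
    version_id = None
    if len(segments) == 3:
        if not VERSION_RE.match(segments[2]):
            raise ValueError(f"not a date-style version ID: {segments[2]!r}")
        version_id = segments[2]
    return {
        "scheme": parts.scheme,
        "base_uri": parts.netloc,
        "resource_path": segments[0],
        "resource_id": segments[1],
        "version_id": version_id,
        "fragment_id": parts.fragment or None,
    }


info = parse_resource_uri(
    "https://londisizwe.org/entry/en-n-abdomen/2017-09-19#sense1"
)
print(info["resource_id"], info["version_id"], info["fragment_id"])
# en-n-abdomen 2017-09-19 sense1
```

Document URIs (U5 and U6) are deliberately out of scope here, because a leading {Document} segment cannot be told apart from a {Resource Path} by position alone.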

U5 & U6: A URI which identifies a document describing the resource
U5: identifies a document describing the resource in U1
{http(s):}//{Base URI}/{Document}/{Resource Path}/{Resource ID}
U6: identifies a document describing the resource in U3
{http(s):}//{Base URI}/{Document}/{Resource Path}/{Resource ID}/{Version ID}
Where:
● {Document} refers to the HTML page, e.g. page, or to the RDF representation, e.g. rdf
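
The six forms differ only in which optional parts are present, so they can be assembled by one small function. This is a sketch under that reading of the slides, not the site's actual implementation; the function name `build_uri` and its parameter names are assumptions chosen to mirror the placeholders above.

```python
from typing import Optional


def build_uri(
    base_uri: str,
    resource_path: str,
    resource_id: str,
    version_id: Optional[str] = None,
    fragment_id: Optional[str] = None,
    document: Optional[str] = None,
    scheme: str = "https",
) -> str:
    """Assemble a URI following the U1-U6 forms."""
    segments = [base_uri]
    if document:                      # U5 / U6: e.g. "page" or "rdf"
        segments.append(document)
    segments += [resource_path, resource_id]
    if version_id:                    # U3 / U4 / U6
        segments.append(version_id)
    uri = f"{scheme}://" + "/".join(segments)
    if fragment_id:                   # U2 / U4
        uri += f"#{fragment_id}"
    return uri


host = "londisizwe.org"
print(build_uri(host, "entry", "en-n-abdomen"))                          # U1
print(build_uri(host, "entry", "en-n-abdomen", fragment_id="sense1"))    # U2
print(build_uri(host, "entry", "en-n-abdomen", version_id="2017-09-19")) # U3
print(build_uri(host, "entry", "en-n-abdomen", "2017-09-19", "sense1"))  # U4
print(build_uri(host, "entry", "en-n-abdomen", document="page"))         # U5
print(build_uri(host, "entry", "en-n-abdomen", "2017-09-19", document="rdf"))  # U6
```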

Resource identifiers can take two forms, both used here:
● Descriptive - for modelling the lexical entries and lexicons. An adaptation of E1, the resource identifier is of the form: {Language Code}-{POS}-{Lemma}
● Opaque - for modelling the lexical concepts. Using a similar approach to BabelNet, e.g. 00001
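
Both identifier forms are simple enough to sketch in a few lines. The function names are hypothetical. Two details are assumptions inferred from examples elsewhere in the deck rather than stated rules: spaces in a multi-word lemma become underscores (suggested by the key xh-n-ulusu_lomntu), and opaque identifiers are zero-padded to five digits (suggested by 00001 and 00007).

```python
def descriptive_id(language_code: str, pos: str, lemma: str) -> str:
    """Build a {Language Code}-{POS}-{Lemma} identifier.

    Assumption: spaces in the lemma are replaced by underscores.
    """
    return f"{language_code}-{pos}-{lemma.strip().replace(' ', '_')}"


def opaque_id(sequence_number: int, width: int = 5) -> str:
    """Build a zero-padded opaque identifier (width is an assumption)."""
    return f"{sequence_number:0{width}d}"


print(descriptive_id("en", "n", "abdomen"))  # en-n-abdomen
print(descriptive_id("xh", "n", "isisu"))    # xh-n-isisu
print(opaque_id(1))                          # 00001
```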

Modelling Provenance & Versioning for Lexical Entries, Senses & Lexicons

Versioning
The following components have been identified for versioning:
● Versioned URIs for lexicons, lexical entries, and senses (and lexical concepts)
● Provenance metadata to describe the versions, with the latest version showing the previous versions
● The generation of files, one for each version of the lexical entries and lexicons
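
The second component implies that, for any resource, one version is the latest and the rest are its predecessors. A minimal sketch of that selection follows, assuming date-style version identifiers as used throughout the deck; the function name `latest_and_previous` is illustrative.

```python
from datetime import date


def latest_and_previous(version_ids: list[str]) -> tuple[str, list[str]]:
    """Return the latest version ID and the earlier ones, oldest first."""
    if not version_ids:
        raise ValueError("at least one version is required")
    ordered = sorted(version_ids, key=date.fromisoformat)
    return ordered[-1], ordered[:-1]


latest, previous = latest_and_previous(["2018-01-10", "2017-09-19"])
print(latest)    # 2018-01-10
print(previous)  # ['2017-09-19']
```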

Provenance
The W3C has defined provenance as “a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing” [1]
[1] “PROV-O”, 2013

Modelling provenance for a lexical entry

:entry/xh-n-isisu
    a ontolex:LexicalEntry , ontolex:Word , prov:Entity ;
    lexinfo:partOfSpeech lexinfo:Noun ;
    dct:language <http://id.loc.gov/vocabulary/iso639-2/xho> , <http://lexvo.org/id/iso639-1/xh> ;
    dct:identifier :entry/xh-n-isisu ;
    rdfs:label "isisu"@xh ;
    ontolex:canonicalForm :entry/xh-n-isisu#lemma ;
    ontolex:sense :entry/xh-n-isisu#sense1 , :entry/xh-n-isisu#sense2 ;
    dct:subject mesh:D000005 ;
    ontolex:denotes dbr:Abdomen , dbr:Stomach ;
    ontolex:evokes :concept/00001 ;
    dct:isPartOf :lexicon/xh ;
    dct:license <http://creativecommons.org/publicdomain/mark/1.0/> ;
    prov:hadPrimarySource "The English-Xhosa Dictionary for Nurses"@en ;
    dct:creator <https://londisizwe.org> ;
    prov:generatedAtTime "2018-01-10T05:00Z|+02:00"^^xsd:dateTime ;
    dct:modified "2018-01-10"^^xsd:date ;
    owl:versionInfo "2018-01-10"^^xsd:string ;
    owl:sameAs :entry/xh-n-isisu/2018-01-10 ;
    owl:hasVersion :entry/xh-n-isisu/2017-09-19 , :entry/xh-n-isisu/2018-01-10 .

:entry/xh-n-isisu#sense2
    a ontolex:LexicalSense , prov:Entity ;
    ontolex:isLexicalizedSenseOf :concept/00007 ;
    dct:identifier :entry/xh-n-isisu#sense2 ;
    dct:isPartOf :entry/xh-n-isisu ;
    dct:creator <https://londisizwe.org> ;
    prov:generatedAtTime "2018-01-10T05:00Z|+02:00"^^xsd:dateTime ;
    owl:versionInfo "2018-01-10"^^xsd:string ;
    owl:sameAs :entry/xh-n-isisu/2018-01-10#sense2 ;
    owl:hasVersion :entry/xh-n-isisu/2018-01-10#sense2 .
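
The versioning statements in this example follow a regular pattern: the unversioned resource is declared owl:sameAs its latest version and lists every version under owl:hasVersion. The sketch below generates those statements as text in the slide's notation. It is an illustration of the pattern only, not the tooling used for londisizwe.org; the function names are hypothetical, and the version ID is placed before any fragment, following form U4.

```python
def versioned(resource: str, version_id: str) -> str:
    """Insert the version ID before the fragment, if there is one (U3/U4)."""
    base, sep, fragment = resource.partition("#")
    return f":{base}/{version_id}{sep}{fragment}"


def version_statements(resource: str, version_ids: list[str]) -> str:
    """Emit the version-related statements for one resource."""
    ordered = sorted(version_ids)  # ISO dates sort correctly as strings
    latest = ordered[-1]
    all_versions = " , ".join(versioned(resource, v) for v in ordered)
    return "\n".join([
        f":{resource}",
        f'    owl:versionInfo "{latest}"^^xsd:string ;',
        f"    owl:sameAs {versioned(resource, latest)} ;",
        f"    owl:hasVersion {all_versions} .",
    ])


print(version_statements("entry/xh-n-isisu", ["2018-01-10", "2017-09-19"]))
print(version_statements("entry/xh-n-isisu#sense2", ["2018-01-10"]))
```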

Modelling provenance for a lexicon

:lexicon/xh
    a lime:Lexicon , void:Dataset , prov:Entity , prov:Dictionary , prov:Collection ;
    lime:language "xh" ;
    dct:language <http://id.loc.gov/vocabulary/iso639-2/xho> , <http://lexvo.org/id/iso639-1/xh> ;
    dct:identifier :lexicon/xh ;
    lime:lexicalEntries "1"^^xsd:integer ;
    lime:linguisticCatalog <http://www.lexinfo.net/ontologies/2.0/lexinfo> ;
    dct:description "Londisizwe.org - isiXhosa lexicon"@en ;
    dct:creator <https://londisizwe.org> ;
    prov:generatedAtTime "2018-01-15T06:00Z|+02:00"^^xsd:dateTime ;
    dct:modified "2018-01-15"^^xsd:date ;
    owl:versionInfo "2018-01-15"^^xsd:string ;
    owl:sameAs :lexicon/xh/2018-01-15 ;
    owl:hasVersion :lexicon/xh/2017-09-19 , :lexicon/xh/2018-01-15 ;
    dct:references :lexicon/en ;
    void:dataDump <https://londisizwe.org/data/xh-lexicon/2018-01-15> .

:lexicon/xh/2018-01-12
    a prov:Dictionary .

:lexicon/xh/2018-01-15
    a prov:Dictionary ;
    prov:derivedByRemovalFrom :lexicon/xh/2018-01-12 ;
    prov:qualifiedRemoval [
        a prov:Removal ;
        prov:dictionary :lexicon/xh/2018-01-12 ;
        prov:removedKey "xh-n-ulusu_lomntu"^^xsd:string ;
    ] .
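
The removal recorded between the two lexicon versions amounts to a set difference over entry keys. As a rough sketch, the keys removed from one version to the next can be computed and written out in the same shape as the statements above. The function names and the sample key lists are hypothetical; only the key xh-n-ulusu_lomntu and the two version dates come from the slide.

```python
def removed_keys(old_keys: list[str], new_keys: list[str]) -> list[str]:
    """Keys present in the old version but absent from the new one."""
    return sorted(set(old_keys) - set(new_keys))


def removal_statements(lexicon: str, old_version: str, new_version: str,
                       keys: list[str]) -> str:
    """Emit prov:derivedByRemovalFrom statements for the new version."""
    key_lines = "".join(
        f'        prov:removedKey "{k}"^^xsd:string ;\n' for k in keys
    )
    return (
        f":{lexicon}/{new_version}\n"
        f"    a prov:Dictionary ;\n"
        f"    prov:derivedByRemovalFrom :{lexicon}/{old_version} ;\n"
        f"    prov:qualifiedRemoval [\n"
        f"        a prov:Removal ;\n"
        f"        prov:dictionary :{lexicon}/{old_version} ;\n"
        f"{key_lines}"
        f"    ] ."
    )


# Hypothetical contents of the two lexicon versions.
old = ["xh-n-isisu", "xh-n-ulusu_lomntu"]
new = ["xh-n-isisu"]
keys = removed_keys(old, new)
print(keys)  # ['xh-n-ulusu_lomntu']
print(removal_statements("lexicon/xh", "2018-01-12", "2018-01-15", keys))
```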

Future Work

londisizwe.org
2018:
● To continue with the human-readable view
● To continue publishing the lexical entries derived from the original dataset
● To add SASL as another language, changing the resource from bilingual to multilingual
2019:
● To continue working with the lexical concepts, using machine translation, and crowdsourcing and gamification techniques to evolve the resource further

Thank you!
