Exposing legacy file-based data (interop-for-files)
• Andrew Woolf
• CCLRC Rutherford Appleton Laboratory
• A.Woolf@rl.ac.uk
AUKEGGS, Canberra, 2006-11-29

Outline
• Introduction
• The feature model as integration key
• An interoperability approach for files
• xlink review and proposed profile for legacy data
• Examples
• Issues

Introduction
• Much ‘earth-science’ data exists as large legacy file-stores
  – e.g. ECMWF: 2 PB of file-based data
  – e.g. British Atmospheric Data Centre: 40 TB of file-based data
• Interoperability demands common approaches
• BUT, a multitude of formats masks commonality
  – netCDF, HDF4, HDF5, GRIB, NASA Ames, PP, ...

Introduction
• File-centred data management focusses on the container rather than content
• File API is the fundamental point of reference
  – binary format details not always exposed or guaranteed
  – public API may be the only supported access mechanism
  – often implemented as a performant, optimised native library
• Conclusion: can’t/shouldn’t migrate

Introduction
• Want to expose information, not format...

Introduction
• Information structures may be composed across files

The feature model
• Common pattern with file-data:
  – need to integrate information structures across multiple files
  – (relational tables provide this implicitly)
• Semantics provide an integration key
  – e.g. an oceanographer and a meteorologist can share a conversation about data despite format differences

The feature model
• [Figure: the feature model]

A model for file-based interoperability
• Retain file-based persistence format
• Supplement with feature-based conceptual model
• ‘Cast’ legacy data onto conceptual model
  – interoperableData = (featureModel) legacyData
• Legacy file data + GML-encoded conceptual ‘metadata’ = ‘interoperable view’
  – may be exposed through W*S

A model for file-based interoperability
• GML provides conceptual feature ‘skeleton’
• File provides ‘flesh’
• GML ‘by-reference’ pattern for property values
  – uses simple xlink
  – “The value of a GML property that carries an xlink:href attribute is the resource returned by traversing the link”

xlink review: extended xlink
• [Figure: extended xlink connecting local resources A and D with remote resources B and C through arcs 1, 2 and 3]
• extended link: [role] [title]
• local resources (A, D): [role] [title] [label]
• remote resources (B, C): [href] [role] [title] [label]
• arcs (1, 2, 3): [arcrole] [title] [show] [actuate]

xlink review: simple xlink
• [Figure: simple xlink connecting one local resource to one remote resource through a single arc]
• simple link: [role] [title]
• local resource: [role] [title] [label]
• remote resource: [href] [role] [title] [label]
• arc: [arcrole] [title] [show] [actuate]

xlink review
• ‘role’ (URI):
  – indicates a property of the remote resource
  – must be a URI reference that “identifies some resource that describes the intended property”
• ‘arcrole’ (URI):
  – describes the “meaning of the arc’s ending resource relative to its starting resource”
  – corresponds to the RDF notion of a property
    • starting-resource HAS arc-role ending-resource

xlink patterns for files: extended xlink
• [Figure: GML feature instance]
• Aggregation semantics determined by xlink arc traversal rules

xlink patterns for files: simple xlink
• [Figure: GML feature instance]
• Aggregation semantics determined by storage descriptor

xlink proposal

<someGMLElement
    xlink:arcrole="hasRemoteContentEmbeddedAt#localXpath"
    xlink:href="storageDescriptor#portion"
    xlink:role="storageSchemaIdentifier"
    xlink:show="embed"
    xlink:actuate="onRequest | onLoad"/>

• href examples:
  – netCDF#variable
  – RDBMS#SQLQuery
  – GRIBFile#recordNumber
  – CSMLStorageDescriptor#arrayID
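
The attribute pattern above splits naturally into a storage descriptor, a portion within it, and a local insertion path. A minimal Python sketch of that split, assuming the standard-library ElementTree parser; the helper name `parse_storage_link` and the keys it returns are hypothetical, and only the attribute layout comes from the proposal:

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"  # W3C XLink namespace


def parse_storage_link(element):
    """Split the proposed xlink attributes into their parts (hypothetical helper)."""
    href = element.get(f"{{{XLINK}}}href")
    arcrole = element.get(f"{{{XLINK}}}arcrole")
    # The text before '#' names the resource; the text after '#' is the fragment.
    descriptor, _, portion = href.partition("#")
    semantics, _, local_xpath = arcrole.partition("#")
    return {
        "descriptor": descriptor,    # e.g. a file or storage descriptor
        "portion": portion,          # e.g. a variable, record number or query
        "format": element.get(f"{{{XLINK}}}role"),
        "semantics": semantics,      # what to do with the remote content
        "local_xpath": local_xpath,  # where to embed it, relative to the element
        "show": element.get(f"{{{XLINK}}}show"),
    }


# Sample element copied from the proposal, with the xlink namespace declared.
sample = ET.fromstring(
    '<someGMLElement xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xlink:arcrole="hasRemoteContentEmbeddedAt#localXpath" '
    'xlink:href="storageDescriptor#portion" '
    'xlink:role="storageSchemaIdentifier" '
    'xlink:show="embed"/>'
)
link = parse_storage_link(sample)
```

Splitting on the first `#` keeps the format-specific fragment opaque, so the same helper would apply to each of the href examples listed above.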

Example
• GML CR 06-160 – ISO 19123 CV_ReferenceableGrid

<gml:ReferenceableGrid gml:id="ID001" srsName="urn:ogc:def:crs:EPSG:6.6:4326" dimension="2">
  <gml:limits>
    <gml:GridEnvelope>
      <gml:low>0 0</gml:low>
      <gml:high>7 4</gml:high>
    </gml:GridEnvelope>
  </gml:limits>
  <gml:axisLabels>x y</gml:axisLabels>
  <gml:coordTransformTable>
    <gml:GridCoordinatesTable>
      <gml:gridOrdinate>
        <gml:GridOrdinateDescription>
          <gml:coordAxisLabel>Geodetic longitude</gml:coordAxisLabel>
          <gml:coordAxisValues>
            <gml:SpatialOrTemporalPositionList>
              <gml:coordinateList>13.5 24.9 32.4 37.7 41.5 46.8 54.4 65.7</gml:coordinateList>
            </gml:SpatialOrTemporalPositionList>
          </gml:coordAxisValues>
          <gml:gridAxesSpanned>x</gml:gridAxesSpanned>
          <gml:sequenceRule axisOrder="+1">Linear</gml:sequenceRule>
        </gml:GridOrdinateDescription>
      </gml:gridOrdinate>
      <gml:gridOrdinate>
        <gml:GridOrdinateDescription>
          <gml:coordAxisLabel>Geodetic latitude</gml:coordAxisLabel>
          <gml:coordAxisValues>
            <gml:SpatialOrTemporalPositionList>
              <gml:coordinateList>
                53.1 48.7 46.2 44.7 43.9 43.3 43.1 44.0
                46.2 43.2 41.5 40.6 40.2 40.0 40.3 41.7
                37.1 36.1 35.6 35.5 35.7 36.0 37.1 39.5
                30.4 30.2 30.4 30.7 31.1 32.0 33.8 37.2
                24.3 24.8 25.3 26.0 26.6 27.7 29.7 33.4
              </gml:coordinateList>
            </gml:SpatialOrTemporalPositionList>
          </gml:coordAxisValues>
          <gml:gridAxesSpanned>x y</gml:gridAxesSpanned>
          <gml:sequenceRule axisOrder="+1 -2">Linear</gml:sequenceRule>
        </gml:GridOrdinateDescription>
      </gml:gridOrdinate>
    </gml:GridCoordinatesTable>
  </gml:coordTransformTable>
</gml:ReferenceableGrid>

Example
• netCDF ASCII dump:

netcdf myfile {
dimensions:
    x = 8 ;
    y = 5 ;
variables:
    float lon(x) ;
        lon:long_name = "longitude" ;
        lon:units = "degrees_east" ;
    float lat(x, y) ;
        lat:long_name = "latitude" ;
        lat:units = "degrees_north" ;
    float temp(x, y) ;
        temp:coordinates = "lon lat" ;
        temp:long_name = "temperature" ;
        temp:units = "degC" ;
data:
    lon = 13.5, 24.9, 32.4, 37.7, 41.5, 46.8, 54.4, 65.7 ;
    lat = 53.1, 48.7, 46.2, 44.7, 43.9, 43.3, 43.1, 44.0, 46.2, 43.2, 41.5, ...

Example
• Values encoded inline:

<gml:gridOrdinate>
  <gml:GridOrdinateDescription>
    <gml:coordAxisLabel>Geodetic longitude</gml:coordAxisLabel>
    <gml:coordAxisValues>
      <gml:SpatialOrTemporalPositionList>
        <gml:coordinateList srsName="WGS 84">13.5 24.9 32.4 37.7 41.5 46.8 54.4 65.7</gml:coordinateList>
      </gml:SpatialOrTemporalPositionList>
    </gml:coordAxisValues>
    <gml:gridAxesSpanned>x</gml:gridAxesSpanned>
    <gml:sequenceRule axisOrder="+1">Linear</gml:sequenceRule>
  </gml:GridOrdinateDescription>
</gml:gridOrdinate>

• Same property with the values left in the netCDF file and referenced by xlink:

<gml:coordAxisValues
    xlink:arcrole="http://ndg.nerc.ac.uk/xlinkUsage/insert#SpatialOrTemporalPositionList/coordinateList"
    xlink:href="myfile.nc#lon"
    xlink:role="http://ndg.nerc.ac.uk/fileFormat/netcdf"
    xlink:show="embed">
  <gml:SpatialOrTemporalPositionList>
    <gml:coordinateList srsName="WGS 84"/>
  </gml:SpatialOrTemporalPositionList>
</gml:coordAxisValues>
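
One way a client might act on such a by-reference property is sketched below in Python. This is an illustration under stated assumptions, not the NDG implementation: the dictionary `FAKE_FILES` stands in for reading `myfile.nc` through a netCDF API, the helper `embed_remote_content` is hypothetical, and every step of the relative path in the arcrole fragment is assumed to be in the GML namespace.

```python
import xml.etree.ElementTree as ET

GML = "http://www.opengis.net/gml"
XLINK = "http://www.w3.org/1999/xlink"

# Hypothetical stand-in for the file store; real code would read the variable
# from myfile.nc through a netCDF library instead.
FAKE_FILES = {"myfile.nc": {"lon": [13.5, 24.9, 32.4, 37.7, 41.5, 46.8, 54.4, 65.7]}}


def embed_remote_content(prop, files=FAKE_FILES):
    """Write the values named by xlink:href into the element named by the arcrole fragment."""
    filename, _, variable = prop.get(f"{{{XLINK}}}href").partition("#")
    _, _, local_xpath = prop.get(f"{{{XLINK}}}arcrole").partition("#")
    # Assumption: each step of the relative path is a GML element name.
    path = "/".join(f"{{{GML}}}{step}" for step in local_xpath.split("/"))
    target = prop.find(path)
    target.text = " ".join(str(value) for value in files[filename][variable])
    return prop


doc = ET.fromstring(f"""
<gml:coordAxisValues xmlns:gml="{GML}" xmlns:xlink="{XLINK}"
    xlink:arcrole="http://ndg.nerc.ac.uk/xlinkUsage/insert#SpatialOrTemporalPositionList/coordinateList"
    xlink:href="myfile.nc#lon"
    xlink:role="http://ndg.nerc.ac.uk/fileFormat/netcdf"
    xlink:show="embed">
  <gml:SpatialOrTemporalPositionList>
    <gml:coordinateList srsName="WGS 84"/>
  </gml:SpatialOrTemporalPositionList>
</gml:coordAxisValues>
""".strip())

embedded = embed_remote_content(doc)
values = embedded.find(
    f"{{{GML}}}SpatialOrTemporalPositionList/{{{GML}}}coordinateList"
).text
```

After the call, the by-reference element carries the same coordinate list as the inline encoding, which is the ‘interoperable view’ the deck describes.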

Issues
• Need to ‘get as close as possible’ to target
  – ‘merge’ semantics consistent with GML? (Opportunity: no best practice for GML yet!)
• “If both a link and content are present in an instance of a property element, then the object found by traversing the xlink:href link shall be the normative value of the property. The object included as content shall be used by the data recipient only if the remote instance cannot be resolved; this may be considered to be a "cached" version of the object.” [GML 7.2.3.4]
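
The quoted rule amounts to ‘prefer the link, fall back to the inline copy’. A minimal Python sketch, assuming a caller-supplied `resolve` function that raises `LookupError` when a link cannot be resolved; `property_value`, `resolve` and the `store` dictionary are hypothetical names used only for illustration:

```python
def property_value(href, inline_content, resolve):
    """Return the linked object when it resolves, else the inline 'cached' content."""
    try:
        return resolve(href)   # the linked object is the normative value
    except LookupError:
        return inline_content  # used only if the remote instance cannot be resolved


# Hypothetical set of resolvable resources.
store = {"myfile.nc#lon": "13.5 24.9 32.4"}


def resolve(href):
    # Raises KeyError, a subclass of LookupError, for unknown links.
    return store[href]


live = property_value("myfile.nc#lon", "cached values", resolve)
fallback = property_value("missing.nc#lon", "cached values", resolve)
```

Here `live` comes from the link and `fallback` from the inline content, mirroring the two cases in the GML clause.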

Issues
• xlink:href (URI) for remote resource fragment (format-specific)
  – e.g. RDBMS#SQLQuery, netCDF#variable, etc.
• xlink:role (URI) for resource format
  – e.g. reference PRONOM-type format repository?
• implied conversion to GML target content type
• xlink:arcrole (URI) for ‘embed remote content’ semantics
  – ‘insert at relative XPath’ essential
• simple xlink can’t handle multiple resources
  – application-specific ‘storage descriptor’ schemas for file aggregation semantics

Conclusion
• Presented a profile for xlink with files in the absence of current best practice
• Meets key practical requirements
  – retain file-based persistence formats
  – provide interoperability ‘wrapper’
  – focus on logical content, not container (feature model)
• Semantic governance at appropriate points
• Enables a powerful, scalable mechanism for real data
  – e.g. large meteorological datasets