Geospatial Data Integration
Isabel F. Cruz
Department of Computer Science, University of Illinois at Chicago
http://www.cs.uic.edu/~advis/publications/
http://www.cs.uic.edu/~ifc/grants/DG/
http://www.cs.uic.edu/~advis/CASSIS/
ADVIS Lab – http://www.cs.uic.edu/~advis
December 3, 2004

Overview
- Introduction, motivation, and some definitions
- Semantic Integration
  - Ontology roles
  - Scenarios (geospatial and non-geospatial)
  - Ontology alignment for semantic heterogeneities
- Research Issues and Discussion

Multiplicity of Data Sources
- From “MONO- to MULTI-” environment [Backe and Edwards]:
  - multi-sources
  - multi-sensors
  - multi-producers
  - multi-representation
  - multi-answer

Data Heterogeneity [Bishr 99]
- Syntactic heterogeneity
  - Different paradigms (e.g., relational, XML, and RDF)
- Schematic heterogeneity
  - Different aggregation or generalization hierarchies for the same “real world” facts
- Semantic heterogeneity
  - Disagreement on the meaning, interpretation, or intended use of data

Syntactic Heterogeneity

XML source:
    <actors>
      <actor name="B. del Toro">
        <films>
          <film title="21 Grams"/>
          <film title="Traffic"/>
        </films>
      </actor>
    </actors>

Relational source:
    Actor        Film
    B. del Toro  21 Grams
    B. del Toro  Traffic

Data sources may use different syntax to represent the same data.

Schematic Heterogeneity

Actor-centric document:
    <actors>
      <actor name="B. del Toro">
        <films>
          <film title="21 Grams"/>
          <film title="Traffic"/>
        </films>
      </actor>
    </actors>

Film-centric document:
    <films>
      <film title="21 Grams">
        <actor name="B. del Toro"/>
      </film>
      <film title="Traffic">
        <actor name="B. del Toro"/>
      </film>
    </films>

Documents can contain the same element and attribute names but have different nested structures.

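A minimal sketch of how the two nestings above could be reconciled, assuming both documents are well-formed as shown. The function names are hypothetical and not part of the original slides; the idea is only that both structures reduce to the same set of (actor, film) pairs.

```python
import xml.etree.ElementTree as ET

# Actor-centric nesting (films listed under each actor).
ACTOR_CENTRIC = """
<actors>
  <actor name="B. del Toro">
    <films>
      <film title="21 Grams"/>
      <film title="Traffic"/>
    </films>
  </actor>
</actors>
"""

# Film-centric nesting (actors listed under each film).
FILM_CENTRIC = """
<films>
  <film title="21 Grams"><actor name="B. del Toro"/></film>
  <film title="Traffic"><actor name="B. del Toro"/></film>
</films>
"""


def pairs_from_actor_centric(xml_text):
    """Return {(actor, film)} from a document that nests films under actors."""
    root = ET.fromstring(xml_text)
    return {
        (actor.get("name"), film.get("title"))
        for actor in root.iter("actor")
        for film in actor.iter("film")
    }


def pairs_from_film_centric(xml_text):
    """Return {(actor, film)} from a document that nests actors under films."""
    root = ET.fromstring(xml_text)
    return {
        (actor.get("name"), film.get("title"))
        for film in root.iter("film")
        for actor in film.iter("actor")
    }


# Both structures carry the same facts once flattened.
same_facts = pairs_from_actor_centric(ACTOR_CENTRIC) == pairs_from_film_centric(FILM_CENTRIC)
```

Each source still needs its own extraction function, which is the cost of schematic heterogeneity that a conceptual (ontology-based) layer aims to hide.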
Semantic Heterogeneity

Astronomy document:
    <stars>
      <star name="Betelgeuse">
        <distance>"425 light years"</distance>
        <luminosity from="40000" to="100000"/>
      </star>
    </stars>

Cinema document:
    <stars>
      <star name="Ava Gardner">
        <born>"1922-12-24"</born>
        <died>"1990-01-25"</died>
      </star>
    </stars>

Documents can have the same names for elements and attributes but different meanings.

Data Integration
- Data integration: the ability to manipulate (e.g., query) data transparently across multiple heterogeneous data sources
- Semantic data integration: based on a conceptual representation of the data and their relationships, to eliminate possible syntactic, schematic, and semantic heterogeneities
- Note: Semantic data integration can be used to solve syntactic and schematic heterogeneities!

Ontology-Based Data Integration [Fonseca & Egenhofer 99]
[Figure: layered architecture with components Application, Query, Mediator, Ontology, Wrapper, Local Ontology, Source]

Ontology
- An ontology is an explicit specification of a shared conceptualization
- RDF (Resource Description Framework)
  - A directed graph of statements: (resource, property, value)
- RDF Schema: a language that is used to describe vocabularies of RDF data
  - rdfs:Class, rdf:Property, rdfs:domain, rdfs:range, etc.
- DAML+OIL and OWL
[Figure: example RDF Schema graph with nodes Flying-object, Aircraft, Combat-aircraft, Measure, Maintenance, Person, place, name, airbase, time, number, staff; legend: rdfs:Class, rdf:Property, rdfs:domain, rdfs:range, rdfs:subClassOf, rdfs:subPropertyOf]

Semantic Web: Architecture (Berners-Lee, http://www.w3.org/2000/Talks/1206-xml2k-tbl/)
[Figure: layer stack, bottom to top: Unicode and URI; XML + NS + xmlschema; RDF + rdfschema; Ontology vocabulary; Logic; Proof; Trust. Digital Signature is shown alongside the stack. Side labels: Self-described document, Data, Rules.]

GaV and LaV
- GaV (global-as-view)
  - Global schema consists of views over local schemas
  - Global querying is easy: subquery unfolding
  - Global schema maintenance is difficult
- LaV (local-as-view)
  - Each local source corresponds to a query over the global schema
  - Global querying is difficult: inference over partial answers
  - Global schema maintenance is easy
[Figure: in both cases a query is posed on the Global Schema, which is connected through a MAPPING to the Local Sources]

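To make "subquery unfolding" concrete, here is a toy GaV sketch. It is an illustration only: the global relation `aircraft`, the view functions, and the `normalize_model` helper are assumptions introduced here, and the sample rows are borrowed from Application Scenario 1 of this deck. Each global relation is defined as a set of views over local sources, so a global query is answered by substituting those views and filtering.

```python
# Local source data (values taken from the System B and System E example).
SYSTEM_B_RDYACFT = [
    {"MODEL": "F15", "AVAILTIME": "0800", "QTY": 12, "AIRBASE": "CA, Anaheim"},
    {"MODEL": "F16", "AVAILTIME": "1000", "QTY": 13, "AIRBASE": "GA, Dalton"},
]
SYSTEM_E_AIRCRAFT = [
    {"NAME": "F-15", "NUMBER": 5, "RDYTIME": "11:00 am", "AIRBASE": "Anaheim, CA"},
]


def normalize_model(name):
    """Assumed value reconciliation: treat 'F-15' and 'F15' as the same model."""
    return name.replace("-", "").strip().upper()


def view_over_system_b():
    """GaV view: global 'aircraft' tuples expressed over System B."""
    for row in SYSTEM_B_RDYACFT:
        yield {"name": normalize_model(row["MODEL"]), "number": row["QTY"], "source": "System B"}


def view_over_system_e():
    """GaV view: global 'aircraft' tuples expressed over System E."""
    for row in SYSTEM_E_AIRCRAFT:
        yield {"name": normalize_model(row["NAME"]), "number": row["NUMBER"], "source": "System E"}


# The global schema is nothing more than named views over the local sources.
GAV_VIEWS = {"aircraft": [view_over_system_b, view_over_system_e]}


def query_global(relation, predicate):
    """Unfold the global relation into its views and filter the union."""
    results = []
    for view in GAV_VIEWS[relation]:
        results.extend(row for row in view() if predicate(row))
    return results


f15_rows = query_global("aircraft", lambda row: row["name"] == "F15")
```

Adding a new source under GaV means editing `GAV_VIEWS`, which is why global schema maintenance is the hard part; under LaV the new source would instead be described as a query over the global schema, and answering would require reasoning over those descriptions.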
Role of Ontologies in Data Integration
- Role 1: Schema Annotation
- Role 2: High-level View of Sources
- Role 3: Support for High-level Queries
- Role 4: Declarative Mediation
- Role 5: Support for Inference

Application Scenario 1
- Data interoperation between legacy systems: System B and System E
- A typical query: List all the “F15” aircraft

System B (relational)

    Table: RDYACFT
    MODEL  AVAILTIME  QTY  AIRBASE      S_ID
    F15    0800       12   CA, Anaheim  1214
    F16    1000       13   GA, Dalton   1215

    Table: STAFF
    S_ID  TITLE     TEAM_LEADER  STAFF_NUM
    1214  F15_team  Johnson      6
    1215  F16_team  Michael      5

    S_ID is a foreign key linking RDYACFT to STAFF.

System E (XML)

    AIRCRAFT.DTD:
    <?xml version="1.0" encoding="UTF-8"?>
    <!ELEMENT AIRCRAFT_SCHEDULE (AIRCRAFT)*>
    <!ELEMENT AIRCRAFT (NUMBER, RDYTIME, AIRBASE, MTSTAFF)+>
    <!ATTLIST AIRCRAFT NAME CDATA #REQUIRED>
    <!ELEMENT NUMBER (#PCDATA)>
    <!ELEMENT RDYTIME (#PCDATA)>
    <!ELEMENT AIRBASE (#PCDATA)>
    <!ELEMENT MTSTAFF (#PCDATA)>

    Instance:
    <AIRCRAFT NAME="F-15">
      <NUMBER>5</NUMBER>
      <RDYTIME>11:00 am</RDYTIME>
      <AIRBASE>Anaheim, CA</AIRBASE>
      <MTSTAFF>Eagle-1</MTSTAFF>
    </AIRCRAFT>

A Layered Model [Melnik 00]
[Figure: the same layer stack is shown for a Local Source and a Remote Source]
- Application Layer: User Interface
- Semantic Layer (our focus): Languages, Domain Models, Conceptual Models
- Object Layer: Objects and relationships between objects
- Syntax Layer: Serialization, Storage

Implementation of Layers
- Application Layer
  - Provides the user interface and accepts user queries
- Semantic Layer (our focus)
  - Conceptual Models: model concepts, relationships, and constraints. RDF Schema
  - Domain Models: express the ontologies of a particular application domain. Global ontology for the domain of aircraft maintenance
  - Languages: express the queries. RDF Query Language (RQL)
- Object Layer and Syntax Layer
  - RDF Schema Specific Database (RSSDB)

Architecture [Cruz 03a]

Three Phases
- Constructing the Local Unified Schema
  - Schema transformation (Schema Integrator): relational schemas, XML schemas (DTD), and RDF schemas are transformed into the local unified schema
  - Data transformation
- Mapping Process
  - The global ontology: used by the mediator
  - Common vocabulary facilitates the mapping between the global ontology and local unified schemas in the LaV approach
- Query Processing
  - Query rewriting algorithm
  - Based on the mapping information

Ontology Role 1 – Schema Annotation
- Annotation (or abstraction) of the schema of a local relational, XML, or RDF source
- Conceptualizing the elements and relationships between elements
  - A uniform metadata representation that facilitates the mapping process
- Addition and/or preservation of the schema features, such as key information and XML document structure
  - A requirement for correct query answering

Ontology Role 1 – Example 1
- Rules for schema annotation for relational sources (translation from relational to RDFS)

    Relational schema    RDF ontology
    Attribute            Property
    Relation             Class

Local relational source (System B):

    Table: RDYACFT
    MODEL  AVAILTIME  QTY  AIRBASE      S_ID
    F15    0800       12   CA, Anaheim  1214
    F16    1000       13   GA, Dalton   1215

    Table: STAFF
    S_ID  TITLE     TEAM_LEADER  STAFF_NUM
    1214  F15_team  Johnson      6
    1215  F16_team  Michael      5

Local RDF description:
- Class RDYACFT with properties MODEL, AVAILTIME, QTY, AIRBASE, S_ID (each with range Literal)
- Class STAFF with properties S_ID, TITLE, TEAM_LEADER, STAFF_NUM (each with range Literal)

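The two annotation rules above (relation to class, attribute to property) can be sketched as a small triple generator. This is an illustration under stated assumptions, not the Schema Integrator itself: the dictionary encoding of the schema, the function name, and the choice to qualify property names with the relation name (so that S_ID in RDYACFT and S_ID in STAFF stay distinct) are all introduced here.

```python
# Hypothetical in-memory encoding of the System B relational schema.
RELATIONAL_SCHEMA = {
    "RDYACFT": ["MODEL", "AVAILTIME", "QTY", "AIRBASE", "S_ID"],
    "STAFF": ["S_ID", "TITLE", "TEAM_LEADER", "STAFF_NUM"],
}


def annotate_relational(schema):
    """Apply the rules Relation -> Class and Attribute -> Property.

    Returns RDF-style (subject, predicate, object) triples as plain strings.
    """
    triples = []
    for relation, attributes in schema.items():
        # Rule: each relation becomes an RDFS class.
        triples.append((relation, "rdf:type", "rdfs:Class"))
        for attribute in attributes:
            # Rule: each attribute becomes a property of that class
            # whose values are literals.
            prop = f"{relation}.{attribute}"  # qualified name (an assumption)
            triples.append((prop, "rdf:type", "rdf:Property"))
            triples.append((prop, "rdfs:domain", relation))
            triples.append((prop, "rdfs:range", "rdfs:Literal"))
    return triples


system_b_triples = annotate_relational(RELATIONAL_SCHEMA)
```

Key and foreign-key information is not captured by this sketch; preserving it would need additional triples, which is the "preservation of schema features" requirement stated for Role 1.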
Mapping Process and Query Processing
[Figure: a user query is posed against the global ontology (with its common vocabulary); a mapping links the global ontology to the local unified schema; schema and data transformation build the local unified schema from a relational database, an XML document source, and an RDF data source; the query answer is returned to the user]
- Mapping: mediation uses the Common Vocabulary
- Query Processing: mediation uses the Global Ontology

Mapping
[Figure: mappings between the global ontology and the local schemas.
 Global ontology: Maintenance with properties name, time, number, airbase, title.
 Schema for System E: AIRCRAFT with NAME, NUMBER, RDYTIME, AIRBASE, MTSTAFF.
 Schema for System B: RDYACFT with MODEL, AVAILTIME, QTY, AIRBASE, S_ID; STAFF with TITLE, STAFF_NUM, TEAM_LEADER.
 Annotation on the figure: "Synonym of READYTIME".]

Application Scenario 2
[Figure: Local XML Source S1: papers > paper* (@title) > author* (@name). Local XML Source S2: writer* (@fullname) > article* (@title).]
- Goal: integrating heterogeneous XML sources, enabling interoperation among them
- S1 and S2
  - Semantically equivalent: two concepts, paper (or article) and author (or writer), and a relationship between them
  - Structurally different: reverse nesting structures

Ontology Role 1 – Example 2: Schema Annotation
- Rules for schema annotation for XML sources

    XML schema              RDF ontology
    Attribute               Property
    Simple-type element     Property
    Complex-type element    Class

- A new RDFS property, rdfx:contains, is used to represent class-to-class relationships.

Local XML schema S1: Books > book* (@title) > author* (@name)
Local RDF description S1': Books rdfx:contains Book; Book has property title (Literal); Book rdfx:contains Author; Author has property name (Literal)

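The XML annotation rules can be sketched the same way. The nested-dictionary encoding of schema S1, the `annotate_xml` name, and the use of capitalized tag names as class names are assumptions made for this illustration; the triples use a shorthand in which (Class, property, rdfs:Literal) means "the class has this literal-valued property".

```python
# Hypothetical encoding of local XML schema S1: Books > book* (@title) > author* (@name).
S1_SCHEMA = {
    "tag": "books",
    "attrs": [],
    "children": [
        {
            "tag": "book",
            "attrs": ["title"],
            "children": [{"tag": "author", "attrs": ["name"], "children": []}],
        }
    ],
}


def annotate_xml(element, parent_class=None, triples=None):
    """Apply the rules: complex-type element -> Class, attribute and
    simple-type element -> Property, nesting of classes -> rdfx:contains."""
    if triples is None:
        triples = []
    cls = element["tag"].capitalize()
    triples.append((cls, "rdf:type", "rdfs:Class"))
    if parent_class is not None:
        # Nesting between two complex-type elements is kept explicitly.
        triples.append((parent_class, "rdfx:contains", cls))
    for attr in element.get("attrs", []):
        triples.append((cls, attr, "rdfs:Literal"))
    for child in element.get("children", []):
        if child.get("attrs") or child.get("children"):
            # Complex-type child: recurse and create another class.
            annotate_xml(child, cls, triples)
        else:
            # Simple-type child: becomes a literal-valued property.
            triples.append((cls, child["tag"], "rdfs:Literal"))
    return triples


s1_prime = annotate_xml(S1_SCHEMA)
```

Recording rdfx:contains is what preserves the document structure, so a query over the RDF description can still be rewritten into a path over the original XML.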
Ontology Role 2 – High-level View of Sources
- The global ontology is generated by integrating the RDF annotations S1' and S2' (of local XML sources S1 and S2) and the RDF schema S3
  - Using the GaV approach
- A high-level overview of local sources with explicit semantics
- The user does not need to formulate a query according to the document structure of a particular XML source

Ontology Role 2 – Example
[Figure: inter-schema mapping between the RDF-based global ontology and three local RDF ontologies.
 Local RDF ontology S1': Books rdfx:contains Book (title); Book rdfx:contains Author (name).
 Local RDF ontology S2': Writers rdfx:contains Writer (fullname); Writer rdfx:contains Article (title).
 Local RDF ontology S3: Book (booktitle, ISBN) publishedBy Publisher (name).
 RDF-based global ontology: labels include Books, Authors, Publisher, title, ISBN, name, publishedBy, rdfx:contains.]

Application Scenario 3
- WLIS (Wisconsin Land Information System): web-based system linking data from distributed, heterogeneous data sources
- Case study: land use codes
- Sample query: “Find all the agricultural lands in Dane and Racine counties.”
- Different authorities use different land use coding systems, leading to syntactic, schematic, and semantic heterogeneities

Heterogeneity
- Query: “Find all the agricultural lands in Dane and Racine counties.”
- Parcel-based example: each highlighted parcel has its own land use classification code

Heterogeneity in WLIS
[Figure: two maps, each labeled Land Use Code]
- There are 72 counties and hundreds of cities and towns in the state; each may have its own system for classifying land use

Classification: Semantic Issue
[Figure: differing classification hierarchies for Dane County and Racine County; labels include Commercial, Retail Sales and Services, Retail Sales, Retail Services, Intensive, Nonintensive, Land Under Development]

Land Use Codes

    Planning Authority       Attribute  Land Use Code  Description
    Dane County RPC          Lucode     91             Cropland Pasture
    Racine County (SEWRPC)   Tag        811            Cropland
                                        815            Pasture and Other Agriculture
    Eau Claire County        Lu1        AA             General Agriculture
    City of Madison          Lu_4_4     8110           Farms

Annotations highlighted on the slide: Synonyms; Value heterogeneity.

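A sketch of how the sample query could be rewritten per authority using the table above. Treating every listed code as "Agriculture" is an illustrative assumption, as are the table name `parcels`, the dictionary layout, and the `rewrite` function; none of these names come from WLIS.

```python
# Illustrative mapping from one global concept to (attribute, codes) per authority,
# using the attribute names and codes from the Land Use Codes table.
LAND_USE_MAPPINGS = {
    "Agriculture": {
        "Dane County RPC": ("Lucode", ["91"]),
        "Racine County (SEWRPC)": ("Tag", ["811", "815"]),
        "Eau Claire County": ("Lu1", ["AA"]),
        "City of Madison": ("Lu_4_4", ["8110"]),
    }
}


def rewrite(concept, authorities):
    """Rewrite a query on a global concept into one local subquery per authority."""
    subqueries = {}
    for authority in authorities:
        attribute, codes = LAND_USE_MAPPINGS[concept][authority]
        code_list = ", ".join(f"'{code}'" for code in codes)
        # 'parcels' is a hypothetical table name for the local parcel data.
        subqueries[authority] = f"SELECT * FROM parcels WHERE {attribute} IN ({code_list})"
    return subqueries


agricultural = rewrite("Agriculture", ["Dane County RPC", "Racine County (SEWRPC)"])
```

The attribute names differ per source (synonyms) and so do the code values (value heterogeneity), so both have to be part of the mapping.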
Agreement Document
- XML document that acts as a wrapper layer for the underlying local data source
- Stores information about how entities in the global ontology map to the entities in the local data source
- Uses XML to capture the hierarchical ordering of entities and their mappings
- Supports query operations using XPath/XSLT to hide details of how data is structured in the local data source
- Minimizes the need for programmer intervention and maintenance, as it is declaratively specified

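As a rough illustration of the idea, the fragment below queries a made-up agreement document with the limited XPath support in Python's standard library. The element and attribute names (`agreement`, `concept`, `mapping`, `type`, `code`) are hypothetical; the actual agreement document format is not shown in these slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical agreement document for one local source. The nesting of
# <concept> elements mirrors the hierarchy of the global ontology.
AGREEMENT = """
<agreement source="Racine County (SEWRPC)" attribute="Tag">
  <concept name="Agriculture">
    <concept name="Cropland">
      <mapping type="exact" code="811"/>
    </concept>
    <concept name="Pasture">
      <mapping type="subset" code="815"/>
    </concept>
  </concept>
</agreement>
"""


def local_codes(agreement_xml, concept_name):
    """Return (local attribute, local codes) for a global concept,
    including the codes mapped to any of its sub-concepts."""
    root = ET.fromstring(agreement_xml)
    concept = root.find(f".//concept[@name='{concept_name}']")
    if concept is None:
        return root.get("attribute"), []
    codes = [mapping.get("code") for mapping in concept.iter("mapping")]
    return root.get("attribute"), codes


attribute, codes = local_codes(AGREEMENT, "Agriculture")
```

Because the mapping lives in a declarative document, supporting a new coding system means writing a new agreement rather than new query code.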
Ontology Alignment
- Alignment is the process of mapping concepts from one ontology to concepts of another ontology
- Concepts are mapped based on how “similar” they are to each other
- Similarity takes different shapes:
  - Similarity in definition
    - For example, automobile and car have very similar definitions in any given dictionary
  - Similarity in text
    - For example, agriculture and agricultural share the 10-letter prefix “agricultur”

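Similarity in text can be approximated with standard string measures. The sketch below uses a common-prefix check and `difflib.SequenceMatcher` from the Python standard library; these are stand-ins chosen for illustration, not necessarily the measures used by the authors.

```python
import os
from difflib import SequenceMatcher


def common_prefix(a, b):
    """Longest shared prefix of two labels, ignoring case."""
    return os.path.commonprefix([a.lower(), b.lower()])


def text_similarity(a, b):
    """Ratio in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


prefix = common_prefix("agriculture", "agricultural")
close_in_text = text_similarity("agriculture", "agricultural")
close_in_meaning_only = text_similarity("automobile", "car")
```

The second pair shows why textual measures alone are not enough: automobile and car are close in definition but share almost no characters, so a definition-based measure (for example, one backed by a dictionary or thesaurus) is needed as well.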
Mapping Types
- Exact: the connected vertices are equivalent in meaning
- Subset: the vertex in the global ontology is a subset of the vertex in the local ontology, i.e., less general in meaning
- Superset: the vertex in the global ontology is a superset of the vertex in the local ontology, i.e., more general in meaning
- Approximate: the connected vertices are close in meaning (e.g., they intersect in some properties) but are not equivalent in definition
- Null: the vertex in the global ontology does not have an equivalent vertex in definition in the local ontology

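One way to read the five mapping types is through the sets of things each vertex covers. The sketch below is a set-based interpretation assumed for illustration (the slides define the types in terms of meaning, not extensions); `MappingType` and `classify` are hypothetical names.

```python
from enum import Enum


class MappingType(Enum):
    EXACT = "exact"
    SUBSET = "subset"
    SUPERSET = "superset"
    APPROXIMATE = "approximate"
    NULL = "null"


def classify(global_members, local_members):
    """Classify the mapping from a global vertex to a local vertex,
    comparing the sets of items each one covers."""
    g, l = set(global_members), set(local_members)
    if g == l:
        return MappingType.EXACT
    if g < l:
        return MappingType.SUBSET       # global is less general than local
    if g > l:
        return MappingType.SUPERSET     # global is more general than local
    if g & l:
        return MappingType.APPROXIMATE  # they overlap but neither contains the other
    return MappingType.NULL             # nothing in common


rubber_vs_rubber_and_glass = classify({"rubber"}, {"rubber", "glass"})
```

In practice the extensions are rarely available, so the type is usually asserted by a user or derived from other mappings rather than computed this way.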
Mapping Types
[Figure: two industry classification trees connected by mappings labeled Exact, Superset, and Subset; vertex labels include Industry, Mining, Manufacturing, Production, Rubber, Construction Material, Electrical Supplies, Rubber and Glass]

Agreement Maker
- Visual interface for creating agreements
- Existing mappings displayed to the user
- Displayed list of mappings updated as the user identifies more mappings

User Interface

Semi-automatic Alignment
- Framework that defines the values associated with the vertices of the ontology as functions of the:
  - values of the children vertices, or
  - user input
- User (or system) establishes some mapping types
- System propagates the mapping types along the ontologies (bottom-up) as much as possible

Full vs. Partial Mappings
[Figure: Full Mapping; tree with vertices a, b, c, d, e, f; mapping labels shown: Superset, Exact, Superset]

Full vs. Partial Mappings
[Figure: Partial Mapping; tree with vertices a, b, c, d, e, f, g; mapping labels shown: Subset, Exact]

Propagation Rules

Deduction Results Interface

Conclusions
- Ontologies carry explicit semantics with concepts and relationships between concepts in a knowledge domain
- XML or relational schema languages encode the semantics implicitly in the schema structure, e.g., the XML nested structure
- In traditional or P2P schema-based data integration, ontologies may be used to add semantics to the local schemas, so as to facilitate the interoperation between heterogeneous data sources
  - In the mapping process
  - In query answering
- Five roles of ontologies in data integration
- Considered all kinds of heterogeneities
- Looked at geospatial and non-geospatial applications

Research Questions
- Standards
  - What problems do they address and to what degree?
    - Syntax
    - Schematic
    - Semantics
- New applications (LBS), new architectures (P2P), new technologies (sensor networks)
  - How do they affect what we already know about data interoperation?
  - How to extend what we already know?

Which Techniques Solve Fundamental/Practical Issues in Data Integration?
- Intelligent Software Agent Technology for distributed GIServices [Tsou]
- Data Mining [Shekhar, Gahegan]
- Middleware [Armstrong and Wang]
- Geospatial Web Services and Grid Services [Di]
- …

Context and Ontologies [Bishr]
- Context: collection of relevant conditions and surroundings that make a situation unique and comprehensible
- No context-independent facts
- Can provide for simplification of:
  - Axiom formalization
  - Ontologies
- Theory of geospatial context must consider the role of time, location, and other spatio-temporal aspects in determining the truth value of a given set of axioms

Context and Ontologies [Bishr]
- Semantic interoperability in geospatial applications can only be achieved if we introduce context into our ontological models
- Capturing the difference between the system’s view and the user’s view and extending it to the use of the same concept in different contexts (working with vector spaces)
- Context-augmented ontology (ontology changes based on user or device profile, role, or task) and querying [Cruz]

Beyond (Traditional) Metadata [Kuhn]
- Metadata is not enough for data interoperability
- Operations need to be defined, e.g., as expressions built around service signatures (that define operations)
- Have measurable interoperability criteria
- Current efforts are not yet successful, e.g., service semantics have no sound formal basis in today’s semantic web
- Another related (?) approach: geospatial behavior [Sen]
- Need spatio-temporal concepts [Bishr, Stefanidis]

Geospatial Semantics [Kuhn]
- Geospatial data and services are not only convention: observable grounding in the physical world
  - Elaborate Measurement Ontologies are required
  - Based on human perception and social agreements
- Geographic names and other identifiers of geospatial entities
  - Name registries need better translation and geo-referencing capabilities
- Processes, not entities (see also [Sen] and verbs)
- Vagueness and different levels of granularity are fundamental (need coarser grain)

Ontology Alignment [Cruz]
- One of the cornerstones of data interoperability for bridging between concepts
- When related to a single central ontology the process becomes one-to-one (more manageable). However:
  - Many fine details, many options, many difficulties (e.g., appropriate testing, automation)
  - LaV vs. GaV
  - Standards
  - Integration in query processing
  - Multiple themes

Data Interoperability: Beyond Data Integration
- Product chains [Frew]
  - One scientist’s product (e.g., snow cover maps) becomes another scientist’s input (to a runoff forecast model)
- Spatial Data Mining Systems [Shekhar] (exchange patterns, aggregate results, without exchanging the large datasets)
- …

Minimal Interoperability [Frew]
- Earth science information providers:
  - Obtain and disseminate their products through HTTP services
  - Collectively agree on how their products and services are to be referenced by URIs
- However, no universal agreement on:
  - Formats in which Earth Science Objects can be encoded for transmission
  - Services by which common transformations of these objects can be requested
  - Nomenclature for referring to products and services
- Naming is central and therefore name discovery:
  - Search engines like Google
  - Directories like dmoz
  - Informal patterns like www.companyname.com

Why not Google for Spatial Data [Frew]
- Search qualifiers for spatial information are non-textual
- Content-based indexing (and searching) is still a research problem
- Ranking is based on human-created metadata (hyperlinks). [Notice that such metadata is non-intentional!]

Problems with Using Words [Mark]
- Geospatial information is often about objects (fields, digital elevation models, or images), not words
- Classification of landscape elements varies across languages and cultures
- Different definitions of landscape element types may lead to different subdivisions of land into objects
- Interaction of all the above

User Support
- Hide complexity from user:
  - Languages for ontologies, but allow users to add semantic descriptions of geographic information [Lutz, Klien]
  - Ontologies and reasoning [Lutz, Klien]
- Semi-automate:
  - Complex process services [Lutz, Klien]
  - Ontology mappings [Cruz]
- Graphical user interfaces for mapping between ontologies [Cruz]

Database Issues in Data and Knowledge Integration [Cohn]
- Materialized vs. virtual integration
- Data extraction, cleaning, and reconciliation
- Updates made on global schema and on sources
- Answering queries on the global schema
- Modeling of the global schema, sources, and relationships between the two

Querying
- Queries can be enriched with additional ontology concepts and then be sent to conventional catalog services [Lutz, Klien]
- Query expressiveness and logical reasoning [Lutz, Klien]
- Allowing for full-fledged database queries: keyword and XML querying (possibly GML) [Wiegand, Cruz]
- Querying of sensor networks includes aggregation and new optimization criteria [Stefanidis]
- Unlike in traditional databases, we need to first locate the geospatial data sources [Wiegand]:
  - Use of metadata
  - Template querying