Quality views capturing and exploiting the user perspective

Quality views: capturing and exploiting the user perspective on data quality
Paolo Missier, Suzanne Embury, Mark Greenwood
School of Computer Science, University of Manchester, UK
Alun Preece, Binling Jin
Department of Computing Science, University of Aberdeen, UK
http://www.qurator.org

Integration of public data (in biology)
Entrez, UniProt, EnsEMBL, GenBank, dbSNP
• Large volumes of data in many public repositories
• Increasingly creative uses for this data
• Their quality is largely unknown
Combining the strengths of UMIST and The Victoria University of Manchester

Quality of e-science data
A data consumer’s view on quality: criteria for data acceptability within a specific data processing context
Defining quality can be challenging:
• In-silico experiments express cutting-edge research
  – Experimental data liable to change rapidly
  – Definitions of quality are themselves experimental
• Scientists’ quality requirements are often just a hunch
  – Quality tests missing or based on experimental heuristics
  – Often implicit and embedded in the experiment, so not reusable

Example: protein identification
[Workflow diagram labels: “Wet lab” experiment; data output; protein identification algorithm; reference databases; protein hitlist; quality filtering; protein function prediction]
• Support evidence: provenance metadata
• Quality filtering: remove likely false positives, improve prediction accuracy
Goal: to explicitly define and automatically add the additional filtering step in a principled way

Our goals
Offer e-scientists a principled way to:
• Discover quality definitions for specific data domains
• Make them explicit using a formal model
• Implement them in their data processing environment
• Test them on their data
… in an incremental refinement cycle
Benefits:
• Automated processing
• Reusability
• “Plug-in” quality components

Approach
Research hypothesis: adding quality to data can be made cost-effective
  – By separating out generic quality processing from domain-specific definitions
• Define abstract quality views on the data
• Map quality view to an executable process
• Execute quality views
Qurator architectural framework:
  - runtime environment
  - data-specific quality services

Abstract quality view model
[Figure labels: Data; data annotation; quality metadata; evidence e1, e2, e3 (e.g. Coverage, PeptidesCount); assertions; Classification 1 over class space 1 (C11, C12, …); Classification 2 over class space 2 (C22, …)]
• Conditions: regions specification
• Actions on regions
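The idea of a condition as a region over class spaces and evidence values can be illustrated with a minimal Python sketch. The `Region` class, its fields, and the sample values are hypothetical and chosen only for illustration; they are not part of the Qurator model.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass
class Region:
    """Hypothetical region: accepted class labels plus lower bounds on evidence."""
    allowed_classes: Dict[str, FrozenSet[str]]  # classification name -> accepted labels
    min_evidence: Dict[str, float]              # evidence name -> exclusive lower bound

    def contains(self, classes: Dict[str, str], evidence: Dict[str, float]) -> bool:
        # Every named classification must fall in its accepted set of labels.
        class_ok = all(classes.get(name) in labels
                       for name, labels in self.allowed_classes.items())
        # Every named evidence value must be strictly above its bound.
        evidence_ok = all(evidence.get(name, float("-inf")) > bound
                          for name, bound in self.min_evidence.items())
        return class_ok and evidence_ok


# Assumed example region: score class is high or mid, and coverage exceeds 12.
accept = Region(
    allowed_classes={"ScoreClass": frozenset({"high", "mid"})},
    min_evidence={"Coverage": 12},
)

print(accept.contains({"ScoreClass": "mid"}, {"Coverage": 20.0}))
print(accept.contains({"ScoreClass": "low"}, {"Coverage": 20.0}))
```

An action (for example, filtering) would then be attached to the data items that fall inside or outside such a region.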

Semantic model for quality concepts
• Quality “upper ontology” (OWL)
• Quality evidence types
• Evidence annotations are class instances
• Evidence meta-data model (RDF)

Quality hypotheses discovery and testing
[Figure labels: abstract quality view; compilation; execution on test data; performance assessment; targeted compilation; target-specific quality component; deployment; quality-enhanced user environment]
Multiple target environments:
• Workflow
• Query processor

Generic quality process pattern
• Collect evidence
  - Fetch persistent annotations (persistent evidence)
  - Compute on-the-fly annotations
• Compute assertions (classifier)
• Evaluate conditions
• Execute actions

    <variables>
      <var variableName="Coverage" evidence="q:Coverage"/>
      <var variableName="PeptidesCount" evidence="q:PeptidesCount"/>
    </variables>
    <QualityAssertion serviceName="PIScoreClassifier"
                      serviceType="q:PIScoreClassifier"
                      tagSemType="q:PIScoreClassification"
                      tagName="ScoreClass"/>
    <action>
      <filter>
        <condition>ScoreClass in {"q:high", "q:mid"} and Coverage > 12</condition>
      </filter>
    </action>
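The four steps of this pattern (collect evidence, compute assertions, evaluate conditions, execute actions) can be sketched in Python as follows. This is an illustrative sketch only: the annotation values, the classifier thresholds, and all function names are invented here, and only the condition (score class high or mid, coverage above 12) is taken from the slide.

```python
from typing import Callable, Dict, Iterable, List

Evidence = Dict[str, float]


def run_quality_process(
    items: Iterable[str],
    fetch_evidence: Callable[[str], Evidence],
    classify: Callable[[Evidence], str],
    condition: Callable[[str, Evidence], bool],
) -> List[str]:
    accepted = []
    for item in items:
        evidence = fetch_evidence(item)        # 1. collect evidence
        score_class = classify(evidence)       # 2. compute the assertion
        if condition(score_class, evidence):   # 3. evaluate the condition
            accepted.append(item)              # 4. execute the action (filter: keep)
    return accepted


# Hypothetical persistent annotations for three protein hits.
ANNOTATIONS: Dict[str, Evidence] = {
    "P1": {"Coverage": 35.0, "PeptidesCount": 9},
    "P2": {"Coverage": 8.0, "PeptidesCount": 7},
    "P3": {"Coverage": 20.0, "PeptidesCount": 1},
}


def toy_classifier(evidence: Evidence) -> str:
    # Invented thresholds; a real classifier would be a deployed service.
    count = evidence["PeptidesCount"]
    if count >= 8:
        return "high"
    if count >= 3:
        return "mid"
    return "low"


def slide_condition(score_class: str, evidence: Evidence) -> bool:
    # Mirrors the condition on the slide.
    return score_class in {"high", "mid"} and evidence["Coverage"] > 12


hits = run_quality_process(ANNOTATIONS, ANNOTATIONS.__getitem__,
                           toy_classifier, slide_condition)
print(hits)
```

Passing the evidence source, classifier, and condition as parameters keeps the loop itself generic, which is the separation between generic processing and domain-specific definitions that the approach relies on.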

Bindings: assertion service (service registry)

  service class        Web service endpoint
  PIScoreClassifier    http://localhost/axis/services/PIScoreClassifierSvc

All services implement the same WSDL interface
• Makes concrete assertion functions homogeneous
• Facilitates compilation
• Uniform input / output messages
Common WSDL interface: input D = {(di, evidence(di))}; outputs {class(di)}, {score(di)}
Example services: PIScoreClassifierSvc, PI_Top_k_svc
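A local analogue of the common interface and the service registry might look like the sketch below. It uses plain Python classes in place of Web services, so nothing here reflects the actual WSDL; the class names, the scoring formula, and the registry keys are assumptions made for illustration.

```python
from typing import Dict, List, Protocol, Tuple

Evidence = Dict[str, float]
DataSet = List[Tuple[str, Evidence]]       # D = {(d_i, evidence(d_i))}
Result = Dict[str, Tuple[str, float]]      # d_i -> (class(d_i), score(d_i))


class AssertionService(Protocol):
    """Shared interface: every assertion service takes D and returns classes and scores."""

    def assert_quality(self, data: DataSet) -> Result:
        ...


class ToyScoreClassifier:
    """Stand-in for a score classifier; the formula and threshold are invented."""

    def assert_quality(self, data: DataSet) -> Result:
        result = {}
        for item_id, evidence in data:
            score = evidence["Coverage"] * evidence["PeptidesCount"]
            result[item_id] = ("high" if score >= 100 else "low", score)
        return result


class ToyTopK:
    """Stand-in for a top-k service: labels the k items with the highest coverage."""

    def __init__(self, k: int) -> None:
        self.k = k

    def assert_quality(self, data: DataSet) -> Result:
        ranked = sorted(data, key=lambda pair: pair[1]["Coverage"], reverse=True)
        top = {item_id for item_id, _ in ranked[: self.k]}
        return {item_id: ("top" if item_id in top else "rest", evidence["Coverage"])
                for item_id, evidence in data}


# Hypothetical registry: service class name -> implementation.
REGISTRY: Dict[str, AssertionService] = {
    "PIScoreClassifier": ToyScoreClassifier(),
    "PI_Top_k": ToyTopK(k=1),
}

data: DataSet = [
    ("P1", {"Coverage": 35.0, "PeptidesCount": 9}),
    ("P2", {"Coverage": 8.0, "PeptidesCount": 7}),
]
for name, service in REGISTRY.items():
    print(name, service.assert_quality(data))
```

Because both implementations accept and return the same shapes, a caller can look up a service by class name and invoke it without knowing which concrete assertion function is behind it.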

Execution model for Quality views
Binding and compilation produce an executable component:
  – Sub-flow of an existing workflow
  – Query processing interceptor
[Figure labels: abstract quality view; QV compiler; Qurator quality framework; services registry; services implementation; host workflow: D → D’; host workflow with embedded quality workflow: D → D’; quality view on D’]
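Embedding a compiled quality view as a sub-flow of a host workflow can be pictured with the following Python sketch. A workflow is modelled here simply as a list of functions, which is a strong simplification; the step names, the hitlist values, and the `compile_quality_view` and `embed` helpers are hypothetical and do not describe the real QV compiler.

```python
from typing import Callable, Dict, List, Set

Record = Dict[str, object]
Step = Callable[[List[Record]], List[Record]]


def compile_quality_view(allowed_classes: Set[str], min_coverage: float) -> Step:
    """Turn declarative filter parameters into an executable workflow step."""
    def quality_step(records: List[Record]) -> List[Record]:
        return [r for r in records
                if r["ScoreClass"] in allowed_classes and r["Coverage"] > min_coverage]
    return quality_step


def embed(host: List[Step], quality_step: Step, position: int) -> List[Step]:
    """Return a new workflow with the quality step inserted at the embedding point."""
    return host[:position] + [quality_step] + host[position:]


def run(workflow: List[Step], data: List[Record]) -> List[Record]:
    for step in workflow:
        data = step(data)
    return data


def identify(records: List[Record]) -> List[Record]:
    # Stand-in for protein identification: returns a fixed, invented hitlist.
    return [
        {"id": "P1", "ScoreClass": "high", "Coverage": 35.0},
        {"id": "P2", "ScoreClass": "mid", "Coverage": 8.0},
        {"id": "P3", "ScoreClass": "low", "Coverage": 20.0},
    ]


def predict_function(records: List[Record]) -> List[Record]:
    # Stand-in for function prediction: tags each surviving hit.
    return [dict(r, function="unknown") for r in records]


host_workflow = [identify, predict_function]
quality_step = compile_quality_view({"high", "mid"}, 12)
enhanced = embed(host_workflow, quality_step, position=1)

print(len(run(host_workflow, [])))
print([r["id"] for r in run(enhanced, [])])
```

The host workflow is left unchanged; the embedded version differs only by the extra step between identification and function prediction.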

Example: original proteomics workflow
Taverna (*): workflow language and enactment engine for e-science applications
[Figure: Taverna workflow with the quality flow embedding point marked]
(*) Part of the myGrid project, University of Manchester - taverna.sourceforge.net

Example: embedded quality workflow

Interactive conditions / actions

Quality views for queries
Actions: filtering, dump to DB / file

Qurator architecture

Summary
For complex data types, there is often no single “correct” and agreed-upon definition of data quality
• Qurator provides an environment for fast prototyping of quality hypotheses
  – Based on the notion of “evidence” supporting a quality hypothesis
  – With support for an incremental learning cycle
• Quality views offer an abstract model for making data processing environments quality-aware
  – To be compiled into executable components and embedded
  – Qurator provides an invocation framework for Quality Views
More info and papers: http://www.qurator.org
Live demos (informal) available