Websites
- http://mor.nlm.nih.gov/download/rxnav/
- http://www.stccmop.org/quarry
NEDS 2008
Charting a Dataspace: Lessons from Lewis and Clark
David Maier
Department of Computer Science, Portland State University & Microsoft Research
With Much Support
- Dataspaces: Alon Halevy, Mike Franklin
- RxSafe: Paul Gorman, Karl Ordelheide, Judy Logan, Nick Rayner (InfoSonde)
- SACO: Shannon McWeeney, Ranjani Ramakrishnan
- Quarry: Bill Howe, James Rucker
- DIESEL: Lois Delcambre, David Archer, Susan Price, Scott Fletcher, John McCall
- Funding: NSF ACI 0121475, IIS-0534762; AHRQ 1 UC1 HS014928-01; DARPA
Dataspaces*
- Deal with all the data from an enterprise, in whatever models
- Data co-existence
- Might not be fully integrated, especially early on
- Pay-as-you-go services
- I'm interested in understanding sources and their relationships
* "From Databases to Dataspaces: A New Abstraction for Information Management", Michael Franklin, Alon Halevy, David Maier, SIGMOD Record, December 2005.
Example Dataspace: RxSafe
Consolidated medication list for rural elders
Points in lifetime of a prescription
- Order (clinic, hospital)
- Dispensing (pharmacy)
- Approval (insurer)
- Administration (rehabilitation facility)
Relevant standards: NDCD, RxNorm, NDF-RT
NDCD: National Drug Code Directory
Codes for drug packages
Tables:
- listings (l_seq_no, lblcode, prodcode, tradename, f_seq_no)
- firms (f_seq_no, lblcode, firm_name)
- formulation (l_seq_no, strength, unit, ingredient_name)
- packages (l_seq_no, pkgcode, pkgsize)
Sample NDC: 62584-023-00
listings:
- l_seq_no = 172062, f_seq_no = 59064, lblcode = 62584, prodcode = 023, tradename = Vicodin tab
formulation:
- l_seq_no = 172062, strength = 5, unit = MG, ingredient_name = Hydrocodone
- l_seq_no = 172062, strength = 500, unit = MG, ingredient_name = Acetaminophen
packages:
- l_seq_no = 172062, pkgcode = 00, pkgsize = 100
firms:
- f_seq_no = 59064, lblcode = 62584, firmname = Amerisource
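The sample suggests how an NDC is assembled from the tables: labeler code, product code, and package code, joined by hyphens. A minimal sketch, assuming the rows are held as Python dicts; the dict layout and the function name `ndc_codes` are illustrative and not part of NDCD. Codes are kept as strings so that leading zeros survive.

```python
# Rows transcribed from the sample slide. Holding them as dicts is an
# assumption made for illustration, not the NDCD file format.
listings = [
    {"l_seq_no": 172062, "f_seq_no": 59064, "lblcode": "62584",
     "prodcode": "023", "tradename": "Vicodin tab"},
]
packages = [
    {"l_seq_no": 172062, "pkgcode": "00", "pkgsize": "100"},
]


def ndc_codes(listings, packages):
    """Join packages to listings on l_seq_no; build lblcode-prodcode-pkgcode."""
    by_listing = {row["l_seq_no"]: row for row in listings}
    codes = []
    for pkg in packages:
        listing = by_listing.get(pkg["l_seq_no"])
        if listing is None:
            continue  # package row with no matching listing
        codes.append("-".join(
            [listing["lblcode"], listing["prodcode"], pkg["pkgcode"]]))
    return codes


print(ndc_codes(listings, packages))
```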
RxNorm: Drug Nomenclature
RxNav from National Library of Medicine
NDF-RT: National Drug File - Reference Terminology
From Veterans Affairs
- Drug class
- Chemical class
- Effects and actions
NDF-RT (Blue)
People Who Would Benefit
- Physician: what is the patient actually getting?
- Pharmacist: interaction, duplication
- Assisted-living-facility (ALF) nurse: monthly reconciliation
- Emergency Department: what might be in the patient's body?
- Patient: what should I be taking?
Lewis and Clark Expedition*
- William Clark
- Meriwether Lewis
- Explore western US
- Corps of Volunteers for North Western Discovery ("Corps of Discovery")
- 1804-1806
*Note: Largely based on Lewis and Clark: The Bicentennial Exhibition and the accompanying book Lewis and Clark: Across the Divide by Carolyn Gilman, 2003
Their Route
Source: www.sd4history.com
Charting the Country, Charting a Dataspace
- Diversity of purposes
- Myths and legends
- Evaluating maps
- Alternative models of the world
- Translating between languages
- Surveying the countryside
- Generic description languages
- Changing landscape
Lewis & Clark: Different Purposes
Thomas Jefferson claimed different purposes to different audiences
- Congress: Customers for trade
- Cabinet: Settlement by US, keep Great Britain out
- British, French: Purely scientific
Observation vs. Evaluation vs. Diplomacy
Source: www.thecemeteryproject.com
Louisiana Purchase
US bought territory from France in 1803-4
Additional purpose: Inform people of new sovereignty
Source: NOAA
RxSafe: Different Purposes
- Grouping similar medications
- Connecting possible incarnations of same prescription
- Generic / Brand Name
- Combining medication information for a given patient
Must be error preserving
Lewis & Clark: Myths and Legends
- Northwest passage: sea, inland sea, river + short portage
- "Symmetrical geography"
- "The pyramidal height of land"
- Mammoths
- Volcano
RxSafe: Myths and Legends
- NDC and RxNorm talking about same things
  - NDC tradenames: 18,913
  - RxNorm brand names: 7,600
  - Strings in common: 418
- All RxNorm relationships have explicit inverses
Lewis & Clark: Evaluating Maps
Maps were incomplete
Alexander MacKenzie. Source: U. Virginia Library
Aaron Arrowsmith. Source: www.monticello.org
RxSafe: Incomplete Maps
- Doesn't mention atoms, attributes
- Doesn't include SY, ET, OCD, OBD
Source: National Library of Medicine
Lewis & Clark: Mapping Conventions
European maps: Distance, direction
Indian maps might be
- Measured in time
- Diagrammatic
- Non-constant direction
- Routes vs. geographic features
Can depend on primary means of travel: foot, horse, river, sea
Shehek-Shote Map (Mandan)
Clatsop Map
RxSafe: Understanding Diagrams
- RxNorm diagram is for instances
- Multi-ingredient drug case not covered
Independence of Sources
Lewis & Clark: Maps not independent
- Arrowsmith Map
- King Map
- MacKenzie Map
RxSafe: RxNorm based in part on NDCD (including errors)
Lewis & Clark: Alternative World Models
European view of North America: Britain, US, Russia, Spain
Indian Division of the Territory
Source: Library of Congress
Structural Differences
- European: Political hierarchy, central authority speaks for all
- Indian: Individual relationships, different leaders for camp, hunting, war
Different meaning of relationships
- Parent-child
  - European: patriarchal
  - Indian: formal adoption w/ responsibilities
RxSafe: Different World Models
- NDC: Product/Package
- NDF-RT: Drug/Class
- RxNorm: Drug/Component
[Diagram labels: Clinical Drug Component; Branded Drug consists_of Component* + BN; Component]
Lewis & Clark: Translating Between Languages
Lewis -(English)- François Labiche -(French)- Toussaint Charbonneau -(Hidatsa)- Sacagawea -(Shoshone)- Cameahwait
RxSafe: Translating Between Languages
Parties: Physician, Pharmacist, ALF, Patient, Manufacturer
Descriptions of the same medication:
- Hydrocodone 5 mg/Acetaminophen 500 mg PO TID
- NDC: 6258402300
- Vicodin, by mouth, 3 x day
- White oblong pill w/ meals
Surveying the Countryside
Lewis & Clark
- Dead reckoning, compass
- Celestial observation, chronometer
RxSafe: Data profiling
NDC
- 45,972 listings
- 18,913 tradenames
- 109,988 package rows
- 2,952 labeler codes
Lewis & Clark: Generic Description Languages
Chinook Wawa (Chinook Jargon)
- Small number of concepts
- Combine to get more complex descriptions and relationships
- Not very domain specific
Chinook Wawa Examples
- hyas tyee: high chief (king)
- hyas puss: high cat (cougar)
- salt chuck: salt water (ocean)
- skookum chuck: powerful water (rapids)
- mamook muckamuck: make food (cook)
- hyas muckamuck: someone who eats at the high table
You Try It
- olo moosum: hungry for sleep (sleepy)
- olo chuck: hungry for water (thirsty)
- mamook tusgh illahe: make split the land (plow)
- opitsaht yakka sikhs: the knife his friend (fork)
RxSafe: Generic Description Language
- RXNCONSO (rxcui, term_type, string_val, src_abbr)
- RXNREL (rxcui1, rxcui2, rel, src_abbr)
- RXNAT (rxcui, att_name, att_value)
RxNorm uses UMLS, not domain-specific
More complex than this: can have several atoms in each concept
Changing Landscape
Lewis & Clark
- Range of tribes: nomadic, smallpox, war
- Wouldn't find some river features today
RxSafe
- Representation convention for synonyms changes across versions in RxNorm
What I Want: Dataspace Charting Toolkit
Familiarization, Profiling, Enhancement
- Inspector for generic models
- Dataspace profiler
  - Assumption tracker and checker
  - Structure discovery techniques
- Customization to task based on discovered characteristics
"Green Field" Tools for Unfamiliar Dataspaces (Howe)
- Goal: A working, extensible application with the least possible (human) effort
- We need at least:
  - a Data Model
    - "Lowest Common Denominator"
    - minimal modeling decisions
  - an API
    - easy to use for domain experts
    - uniformly efficient
Quarry Data Model
- resource, property, value
  - (subject, predicate, object) if you prefer
- no intrinsic distinction between literal values and resource values
- no explicit types or classes
Example: RxNorm
[Figure legend: Concept, Relationship, Atom]
Rows (userkey, prop, value) for userkey 10001:
- NDC = 1
- ORIG_CODE = 123
- ingredient_of = 10004
- type = DC
Up to 23M triples describing 0.6M concepts and atoms
Example: Metadata for Scientific Data Repository
File …/anim-sal_estuary_7.gif: Variable = "Salinity", Depth = "7", Type = "Animation", Region = "Estuary"
Rows (path, prop, value) for path …/anim-sal_estuary_7.gif:
- depth = 7
- variable = salt
- region = estuary
- type = anim
7.5M triples describing 1M files
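The (resource, property, value) model can be illustrated by flattening this file's metadata into triples. A minimal sketch: the helper name `to_triples` is hypothetical, and every value is kept as a plain string, in keeping with the model's lack of distinction between literal and resource values.

```python
def to_triples(resource, record):
    """Flatten one metadata record into (resource, property, value) triples."""
    return [(resource, prop, value) for prop, value in record.items()]


# Path is kept elided exactly as it appears on the slide.
path = "…/anim-sal_estuary_7.gif"
record = {"depth": "7", "variable": "salt", "region": "estuary", "type": "anim"}

triples = to_triples(path, record)
for triple in triples:
    print(triple)
```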
Quarry API
Calls shown: Describe(key), Properties(runid=2004-001), Values(runid=2004-001, "plottype")
Resource …/2004-001/…/anim-tem_estuary_bottom.gif:
- aggregate = bottom
- animation = isotem
- day = 001
- directory = images
- plottype = isotem
- region = estuary
- runid = 2004-001
- year = 2004
Resource …/2004-001/…/amp_plume_2d.gif:
- day = 001
- directory = images
- plottype = 2d
- region = plume
- runid = 2004-001
- year = 2004
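To make the three calls concrete, here is a toy in-memory sketch. It is not the Quarry implementation. The class name `TripleStore`, the snake_case method names, and the exact semantics are assumptions: `describe` returns a resource's property/value pairs, `properties` lists property names on matching resources, and `values` lists distinct values of one property on matching resources. The sketch also assumes one value per property per resource, and it puts the property name first in `values` because Python requires positional arguments before keyword arguments.

```python
from collections import defaultdict


class TripleStore:
    """Toy in-memory store; assumes one value per (resource, property)."""

    def __init__(self, triples):
        self._props = defaultdict(dict)
        for resource, prop, value in triples:
            self._props[resource][prop] = value

    def describe(self, key):
        """All property/value pairs of one resource."""
        return dict(self._props.get(key, {}))

    def _matching(self, conditions):
        """Yield the property dict of every resource meeting all conditions."""
        for props in self._props.values():
            if all(props.get(p) == v for p, v in conditions.items()):
                yield props

    def properties(self, **conditions):
        """Property names found on resources satisfying the conditions."""
        names = set()
        for props in self._matching(conditions):
            names.update(props)
        return sorted(names)

    def values(self, prop, **conditions):
        """Distinct values of prop among resources satisfying the conditions."""
        return sorted({props[prop]
                       for props in self._matching(conditions)
                       if prop in props})


# Keys are kept elided exactly as on the slide.
BOTTOM = "…/2004-001/…/anim-tem_estuary_bottom.gif"
PLUME = "…/2004-001/…/amp_plume_2d.gif"
records = {
    BOTTOM: {"aggregate": "bottom", "animation": "isotem", "day": "001",
             "directory": "images", "plottype": "isotem", "region": "estuary",
             "runid": "2004-001", "year": "2004"},
    PLUME: {"day": "001", "directory": "images", "plottype": "2d",
            "region": "plume", "runid": "2004-001", "year": "2004"},
}
store = TripleStore(
    (r, p, v) for r, props in records.items() for p, v in props.items())

print(store.describe(BOTTOM))
print(store.properties(runid="2004-001"))
print(store.values("plottype", runid="2004-001"))
```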
API Clients
Applications use sequences of Prop and Val calls to explore the Dataspace
[Figure: drill-down menus over the properties runid, year, week, region, plottype, variable; sample values include 2003, 2004, 2005 for year and plume, far, surface, estuary for region; "show products…"]
Behind the Scenes
Signatures
- resources possessing the same properties clustered together
- Posit that |Signatures| << |Resources|
- Queries evaluated over Signature Extents
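The signature idea can be sketched as grouping resources by the set of properties they possess. This is an illustrative sketch, not Quarry's code; the function name `signature_extents` and the toy triples are hypothetical.

```python
from collections import defaultdict


def signature_extents(triples):
    """Map each signature (a frozenset of property names) to its extent,
    the set of resources that possess exactly those properties."""
    props_by_resource = defaultdict(set)
    for resource, prop, _value in triples:
        props_by_resource[resource].add(prop)

    extents = defaultdict(set)
    for resource, props in props_by_resource.items():
        extents[frozenset(props)].add(resource)
    return dict(extents)


# Hypothetical toy data: r1 and r2 share a signature, r3 has its own.
toy_triples = [
    ("r1", "year", "2004"), ("r1", "region", "estuary"),
    ("r2", "year", "2005"), ("r2", "region", "plume"),
    ("r3", "year", "2004"),
]
extents = signature_extents(toy_triples)
for signature, resources in extents.items():
    print(sorted(signature), sorted(resources))
```

With three resources and two signatures, a query that mentions `region` only needs to look at the extent of the one signature that contains it.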
Experimental Results
- Yet Another RDF Store
  - Several B-Tree indexes to support spo, po→s, os→p, etc.
- ~3M triples
- We looked at multi-term queries:
  ?s <p0> <o0>
  ?s <p1> <o1>
  ...
  ?s <pn> <on>
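A multi-term query of this form asks for every subject that has all of the given (property, object) pairs. One way to evaluate it over a po→s index is to intersect the subject sets, smallest first. The sketch below uses a Python dict as a stand-in for a B-tree index; the function names and toy data are hypothetical, and it is not a description of how any particular store implements the query.

```python
from collections import defaultdict


def build_po_index(triples):
    """Index subjects by (property, object): a dict stand-in for a po->s index."""
    index = defaultdict(set)
    for s, p, o in triples:
        index[(p, o)].add(s)
    return index


def match_all(index, terms):
    """Subjects matching every (property, object) term in the query."""
    candidate_sets = sorted((index.get(term, set()) for term in terms), key=len)
    if not candidate_sets:
        return set()
    result = set(candidate_sets[0])
    for candidates in candidate_sets[1:]:
        result &= candidates
        if not result:
            break  # no subject can satisfy the remaining terms
    return result


# Hypothetical toy data.
toy = [
    ("f1", "year", "2004"), ("f1", "region", "estuary"),
    ("f2", "year", "2004"), ("f2", "region", "plume"),
    ("f3", "year", "2005"), ("f3", "region", "estuary"),
]
po_index = build_po_index(toy)
print(match_all(po_index, [("year", "2004"), ("region", "estuary")]))
```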
Experimental Results: Queries
3.6M triples, 606k resources, 149 signatures
Hands-off Operation
Feed it triples
- Calculates signatures
- Computes signature extents
Working on incremental facility for insertions
- Resource can change signatures
- New signatures can be created
API doesn't name tables
Related Work
- RDF: Redland, Sesame, Jena, YARS, Forth, KAON
  - Primarily Indexed Triple Stores
- Path Indexes: Lorel, DataGuides
- Data Mining for Structure:
  - Ding, Wilkinson, Sayers, Kuno @ HP Labs, "Application-specific Schema Design for Storing Large RDF Datasets"
Dataspace Profiling
Commercial profilers: DataFlux, Infogix (ACR), KnowledgeDriver
- prep for cleaning, migration
- generally relational model, by table
Potter's Wheel [Raman, Hellerstein 2001]
- learning column transformation
- unfolding: data value to column labels
LearnPADS [Fisher, et al. 2007]
Cross-Source Profiles
Bellman [Dasu, et al. 2002]
- Find joinable columns (1-N, M-N)
- Is one field the composition of others?
- Part of T joins with T1, part with T2
Make the point that database schemas "devolve" with time as business processes change
Assumption Checking
NDC Examples (Rayner)
- l_seq_no is key of listings: yes
- lblcode, prodcode key of listings: no; 45,953 vs. 45,972 (19)
- firm_name → lblcode: no; 2931 vs. 2952 (21)
- each product listing should have >0 packages and >0 ingredients; counts shown: 44,972 (1180) and 45,180 (792)
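Two of these checks, "is this column set a key?" and "does every listing have at least one package?", can be sketched in a few lines. The function names and the toy rows below are hypothetical and are not taken from NDCD; the sketch only shows the shape of such checks and how exceptions can be reported rather than just a yes/no answer.

```python
from collections import Counter


def duplicate_keys(rows, columns):
    """Return key values that occur more than once; empty means `columns` is a key."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


def parents_without_children(parents, children, column):
    """Return parent key values with no matching child row."""
    child_keys = {row[column] for row in children}
    return [row[column] for row in parents if row[column] not in child_keys]


# Hypothetical toy rows, shaped like the listings and packages tables.
toy_listings = [
    {"l_seq_no": 1, "lblcode": "A", "prodcode": "001"},
    {"l_seq_no": 2, "lblcode": "A", "prodcode": "001"},
    {"l_seq_no": 3, "lblcode": "B", "prodcode": "002"},
]
toy_packages = [{"l_seq_no": 1}, {"l_seq_no": 3}]

print(duplicate_keys(toy_listings, ["l_seq_no"]))
print(duplicate_keys(toy_listings, ["lblcode", "prodcode"]))
print(parents_without_children(toy_listings, toy_packages, "l_seq_no"))
```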
Checking Across Sources
NDC vs. RxNorm Ingredients
- 2794 ingredient names in NDCD
- 5145 ingredients in RxNorm
- 1570 equal strings
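Counting equal strings across two sources amounts to a set intersection. The sketch below also accepts an optional normalizer, which is an addition for illustration: the slide reports exact string equality only. The names in the toy lists are hypothetical.

```python
def common_strings(left, right, normalize=None):
    """Strings present in both sources, optionally after normalization."""
    if normalize is None:
        return set(left) & set(right)
    return {normalize(s) for s in left} & {normalize(s) for s in right}


# Hypothetical toy ingredient lists.
ndcd_names = ["Acetaminophen", "HYDROCODONE", "Aspirin"]
rxnorm_names = ["Acetaminophen", "Hydrocodone", "Ibuprofen"]

exact = common_strings(ndcd_names, rxnorm_names)
folded = common_strings(ndcd_names, rxnorm_names, normalize=str.casefold)
print(len(exact), len(folded))
```

Comparing the exact count against a normalized count is one quick way to see how much of the mismatch is spelling convention rather than missing content.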
What to Do with Flawed Assumptions?
1. Track exceptions
2. Refine assumption: firm_name, location → lblcode
3. Refine knowledge of world: RxNorm has ingredient variants (which have the same type as ingredients)
Want to track assumptions as they evolve, results of checks
Structure Discovery
[Andritsos, et al. 2005]
- Clustering of columns, rows
- Ranking FDs by corresponding redundancy
Could we get RxNav-style interface automatically?
Customization: InfoSonde
Support customizations appropriate to discovered data characteristics
Three-part modules
- Probe: Check or discover properties
- Switch: Present applicable customizations
- Check: Test that chosen switch is still valid
Functional Dependency Module
- Probe: Test for FD
- Switch:
  - FD holds: add constraint, decompose
  - FD fails: partition, repair
- Check: Example: if using decompose, check that FD still holds
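The probe/switch/check pattern for a functional dependency can be sketched as three small functions. This is not InfoSonde's interface: the function names, the option strings, and the toy rows are assumptions made for illustration. Here `check` simply re-runs the FD test, matching the decompose example.

```python
def fd_holds(rows, lhs, rhs):
    """True if every lhs value maps to a single rhs value in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        val = tuple(row[c] for c in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True


def probe(rows, lhs, rhs):
    """Probe: test whether the FD lhs -> rhs holds on the current data."""
    return fd_holds(rows, lhs, rhs)


def switch(fd_result):
    """Switch: list the customizations applicable to the probe result."""
    if fd_result:
        return ["add constraint", "decompose"]
    return ["partition", "repair"]


def check(rows, lhs, rhs):
    """Check: after choosing decompose, confirm the FD still holds."""
    return fd_holds(rows, lhs, rhs)


# Hypothetical toy rows.
rows = [
    {"firm_name": "Acme", "lblcode": "11111"},
    {"firm_name": "Acme", "lblcode": "11111"},
    {"firm_name": "Bolt", "lblcode": "22222"},
]
options = switch(probe(rows, ["firm_name"], ["lblcode"]))
print(options)

# A later insertion violates the FD, so the check fails.
rows.append({"firm_name": "Acme", "lblcode": "33333"})
still_valid = check(rows, ["firm_name"], ["lblcode"])
print(still_valid)
```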
Linkage Extension Module
Want to extend a linkage based on a discovered functional relationship
- Probe: Join satisfies FD; here, ING → DC
- Switch: Materialize functional relationship; it can be used to extend the original relation with DC
- Check: Test that FD still holds
Questions?
Sacagawea Dollar
Thank You!
"Great joy in camp we are in View of the Ocian, this great Pacific Ocsean which we been so long anxious to see."
Lewis and Clark Nickel