Potters Wheel An Interactive Data Cleaning System Vijayshankar

  • Slides: 29
Download presentation
Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Outline Background Potter’s Wheel architecture Discrepancy detection Interactive transformation Conclusions and Future Work

Outline Background Potter’s Wheel architecture Discrepancy detection Interactive transformation Conclusions and Future Work

Motivation Dirty data common n E. g. , in content integration, e-catalogs n Inter-organizational

Motivation Dirty data common n E. g. , in content integration, e-catalogs n Inter-organizational differences in data representation n Home Depot: 60, 000 suppliers! Data often scraped off web pages, etc. E. g. in centralized systems n Data entry “errors”, poor integrity constraints Cleansing a prereq for analysis, xactions Cleansing done by “content managers” n Ease of use critical! n n Standards can help a bit (e. g. UDDI) But graphical tools are the name of the game

Current solutions Code Detect errors in data n n n “eyeball” data in a

Current solutions Code Detect errors in data n n n “eyeball” data in a spreadsheet data auditing tools domain-specific algorithms Detect Apply Code up transforms to fix errors n n n “ETL” (extract/transform/load) tools from warehousing world string together domain-specific cleansing rules scripting languages, custom code, etc. Apply transforms on data Iterate n n special cases nested discrepancies, e. g. 19997/10/31

Problems Slow, batch tasks Significant human effort! n Specification of transforms n n regular

Problems Slow, batch tasks Significant human effort! n Specification of transforms n n regular expressions, grammars, custom scripts, etc. Discrepancy detection n notion of discrepancy domain-dependent want a mix of custom and standard techniques want to apply on parts of the data values Rebecca by Daphne du Maurier (Mass Market Paperback) Sonnet 19. Craig W. J. , ed. 1914. The Oxford Shakespeare The Big Four Agatha Christie, Mass market paperback $6. 29 **** 5. 39 10% (from bartleby. com, bn. com)

Outline Background Potter’s Wheel architecture Discrepancy detection Interactive Transformation Conclusions and Future Work

Outline Background Potter’s Wheel architecture Discrepancy detection Interactive Transformation Conclusions and Future Work

Potter’s Wheel: Design Goals Eliminate wait time during each step n n n Even

Potter’s Wheel: Design Goals Eliminate wait time during each step n n n Even on big data! Use Online Reordering (VLDB ‘ 99), sampling Ensure transform results can be seen/undone instantly Compile/optimize sequence of transforms when happy Eliminate programming, but keep user “in the loop” n n n Semi-automatic, “direct manipulation” GUI Support & leverage “eyeball” detection, verification (human input) Point-and-click transformation “by example” Unify detection and transformation n n Detection always runs online in the background Detection always runs on transformed “view” of data Extensibility Limited appreciation n Domain experts (vendors) should be able to plug for this kind of in detectors/transforms systems work A mixed (“Systems!”) design challenge: n Query Processing, HCI, Learning

Potter’s Wheel UI Data read so far

Potter’s Wheel UI Data read so far

Dataflow in Potter’s Wheel scroll check for errors spe c tran ify/un d sfo

Dataflow in Potter’s Wheel scroll check for errors spe c tran ify/un d sfo rms o Spreadsheet display Discrepancy detector Data source Transformation engine get scr page ollb ar pos. n compile Optimized program Online reorderer

Outline Background Potter’s Wheel architecture Discrepancy detection Domains in Potter’s Wheel n Structure inference

Outline Background Potter’s Wheel architecture Discrepancy detection Domains in Potter’s Wheel n Structure inference n Interactive Transformation Conclusions and Future Work

Discrepancy Detection Challenge: find discrepancies in a column Structure inference: n Given: n n

Discrepancy Detection Challenge: find discrepancies in a column Structure inference: n Given: n n n A set of (possibly composite) data items, including discrepancies A set of user-defined “domains” (atomic types) Choose a “structure” for the set n n A string of domains (w/repetition) that best fits the data E. g. for “March 17, 2000”: n S* PS: Must be an n alpha* digit*, digit* online algorithm! n [Machr]* 17, int Report rows that do not fit chosen domain

Extensible Domains As in Object-Relational, keep domains opaque. class Domain { // Required inclusion

Extensible Domains As in Object-Relational, keep domains opaque. class Domain { // Required inclusion function boolean match(char *value); // Helps in structure extraction int cardinality(int length); // For probabilistic discrepancy checking float match. With. Confidence(char *value, int data. Set. Size); void update. State(char *value); // Helps in parsing boolean is. Redundant. After(Domain d); } e. g. integer, ispell word, money, standard part names

Evaluating Structure Fit Three desired characteristics n Recall n n Precision n match as

Evaluating Structure Fit Three desired characteristics n Recall n n Precision n match as many values as possible flag as many real discrepancies as possible e. g. Month day, day over alpha* digit*, digit* Conciseness n n avoid over-fitting examples, make use of the domains e. g. alpha* digit*, digit* over March 17, 2000

Evaluating Structure Fit, cont. Given structure S = d 1 d 2…dp, string vi,

Evaluating Structure Fit, cont. Given structure S = d 1 d 2…dp, string vi, how good is S? Minimum Description Length (MDL) principle n n Rissanen, ‘ 78, etc. DL(vi, S) = length of theory for S + length to encode string vi with S Computing DL(v, S) 1) 2) 3) Length of theory = p log (number of domains known) If vi doesn’t match S, encode it explicitly Else encode vi = wi, 1 wi, 2 …wi, p where wi, j dj n n Encode length of each wi, j Encode each wi, j among all dj’s of length j n n use cardinality function DL = AVGi((1) + (2) + (3)) = AVGi (Un. Conciseness + Un. Precision + Un. Recall) Choose structure with minimum DL(v, S) n Hard search problem; heuristics in paper

Potter’s Wheel UI

Potter’s Wheel UI

Outline Background Potter’s Wheel architecture Discrepancy detection Interactive Transformation transforms n split-by-example n Conclusions

Outline Background Potter’s Wheel architecture Discrepancy detection Interactive Transformation transforms n split-by-example n Conclusions and Future Work

Interactive transformation Sequence of simple visual transforms n rather than a single complex program

Interactive transformation Sequence of simple visual transforms n rather than a single complex program Each transform must be n n easy to specify immediately applicable on screen rows Must be able to undo transforms n n compensatory transforms not always possible everything REDO-oriented at display-time n no need for UNDO!

Transforms in Potter’s Wheel Value translation n Format(value) – reg. expr. substitution, arithmetic ops,

Transforms in Potter’s Wheel Value translation n Format(value) – reg. expr. substitution, arithmetic ops, … One-to-one row mappings n n n Add/Drop/Copy columns Merge, Split columns Divide column by predicate One-to-many row mappings n Fold columns n n adapted from Fold of Schema. SQL[LSS’ 96] Resolve some higher-order differences

Example (1) Format Bob Stewart '(. *), (. *)' to '2 1'Anna Davis Jerry

Example (1) Format Bob Stewart '(. *), (. *)' to '2 1'Anna Davis Jerry Dole, Jerry Joan Marsh Split at ' ' Stewart, Bob Stewart 2 Merges Bob Stewart Anna Davis Jerry Dole Joan Marsh

Example (2) Stewart, Bob Anna Davis Dole, Jerry Joan Marsh Divide (like ’. *,

Example (2) Stewart, Bob Anna Davis Dole, Jerry Joan Marsh Divide (like ’. *, . *’) Stewart, Bob Dole, Jerry Anna Davis Joan Marsh

Example (3) Name Math Bio 2 Formats Name Ann 43 78 (demotes) Ann Math:

Example (3) Name Math Bio 2 Formats Name Ann 43 78 (demotes) Ann Math: 43 Bio: 78 Bob 96 54 Bob Math: 96 Bio: 54 Fold Name Ann Math 43 Ann Math: 43 Split Ann Bio 78 Ann Bio: 78 Bob Math 96 Bob Math: 96 Bob Bio 54 Bob Bio: 54

Transforms summary Power n n all one-to-{one, many} row mappings interactive many-to-{one, many} mappings

Transforms summary Power n n all one-to-{one, many} row mappings interactive many-to-{one, many} mappings hard to do interactively n n must find/display companion rows for each row to transform higher-order transforms Specification n n click on appropriate columns and choose transform but, Split is hard n n important transform in screen-scraping/wrapping need to enter regular expressions not always unambiguous e. g. Taylor, Jane, $52, 072 Tony Smith, 1, 00, 533 n want to leverage domains

Split by Example User marks split positions on examples System infers structure, then parses

Split by Example User marks split positions on examples System infers structure, then parses rest Taylor, Jane|, $52, 072 Tony Smith|, 1, 00, 533 infer structures < * >, <‘, ’ Money> Parsing n n must identify matching substrings for structures multiple alternate parses could work n n search heuristics explored in paper Decreasing. Specificity seems good

Related Work Transformation languages -- e. g. Schema. SQL, YATL Data cleaning tools n

Related Work Transformation languages -- e. g. Schema. SQL, YATL Data cleaning tools n n commercial -- ETL and auditing tools research -- e. g. AJAX, Lee/Lu/Ling/Ko ’ 99 Custom auditing algorithms n n n de-duplication (e. g. Hernandez/Stolfo ’ 97) outlier detection (e. g. Ramaswamy/Rastogi/Shim ’ 00) dependency inference (e. g. Kivinen/Manilla ’ 95) Structure extraction techniques n e. g. XTRACT, Data. Mold, Brazma ‘ 94 Transformation tools n n text-processing tools – e. g. perl/awk/sed, LAPIS screen-scraping -- e. g. No. Do. SE, XWRAP, On. Display, Cohera Connect, Telegraph Screen Scraper (Te. SS) Middleware, schema mapping

Conclusions Interactive data cleaning n n Couple transformation and discrepancy detection Perform both interactively

Conclusions Interactive data cleaning n n Couple transformation and discrepancy detection Perform both interactively n n short, immediately applied steps specify visually, undo if needed contrast with declarative language Parse values before discrepancy detection n user-defined domains helpful Software online (http: //control. cs. berkeley. edu/abc)

Looking Ahead Generalizing transform by example Transforming nested data (XML, HTML) More complex domain-expressions

Looking Ahead Generalizing transform by example Transforming nested data (XML, HTML) More complex domain-expressions Extend to generalized query processor client in Telegraph n n specify initial query refine by specifying transforms as results stream in dynamically choose transforms to be pushed into server See Shankar’s upcoming thesis, Telegraph papers

Backup Slides

Backup Slides

Optimization of Transform Sequences In Potter’s Wheel system generates program at end n hence

Optimization of Transform Sequences In Potter’s Wheel system generates program at end n hence opportunities for optimization remove redundant operations avoid expensive memory copies/allocations/deallocations by careful pipelining materialize intermediate strings only when necessary up to 110% speedup for C programs n C programs 10 x faster than Perl scripts

Example vs

Example vs