DUPLICATE RECORD DETECTION AHMED K ELMAGARMID PURDUE UNIVERSITY

INTRODUCTION TO THE PROBLEM Databases play an important role in today’s IT based economy

DATA QUALITY Data are not carefully controlled for quality nor defined in a consistent

DATA HETEROGENEITY While integrating data from different sources into a warehouse , organizations become

DATA QUALITY Data cleaning refers to the process of resolving identification problems in the

TERMINOLOGY Record linkage Identity uncertainty Record matching Coreference resolution Data deduplication Duplicate Record Detection

DATA PREPARATION Step before the duplicate record detection. Improves the quality of the data

STEPS IN DATA PREPARATION Parsing Data transformation Standardization

PARSING Locates, identifies and isolates individual data elements Makes it easier to correct, standardize

DATA TRANSFORMATION Simple conversions of data type Field renaming Decoding field values Range checking:

DATA STANDARDIZATION Represent certain fields in a standard format � Addresses US Postal Service

LAST STEP IN DATA PREPARATION Store data in tables having comparable fields. Identify fields

FIELD MATCHING TECHNIQUES o o o Most common sources of mismatches in database entries

CHARACTER BASED SIMILARITY Works best on typographical errors Edit distance � Shortest sequence of

CHARACTER BASED SIMILARITY • Affine gap distance Strings that have been truncated John R.

TOKEN BASED SIMILARITY Works best when word (tokens) are transposed Atomic Strings Computational average

PHONETIC SIMILARITY Comparison based on how words sound

NUMERIC SIMILARITY Considers only numbers � Convert numbers to text data � Simple range

SUMMARY OF METRICS Edit Distance (Levenshtein) Numeric Affine Gap Double Metaphone Smith Waterman Jaro

DUPLICATE RECORD DETECTION The methods described till now have been describing about similarity checking

CATEGORIZING METHODS • • Probabilistic approaches and supervised machine learning techniques Approaches that rely

PROBABILISTIC MATCHINGMODELS Models derived from Bayes theorem � Use prior knowledge to make decision

SUPERVISED LEARNING Relies on the existence of trained data. The trained data is in

DISTANCE BASED TECHNIQUES o o This method can be used when there is absence

RULE BASED TECHNIQUES Relies on business rules to derive key � Must determine functional

RULE BASED TECHNIQUES This figure depicts the equation theory that dictates the logic of

UNSUPERVISED LEARNING Classify data as matched or unmatched without a training set. The comparison

TECHNIQUES TO IMPROVE EFFICIENCY Reduce the number of record comparisons Improve the efficiency of

COMPARATIVE METRICS Elementary nested loop � Compare every record in one table to another

REDUCE RECORD COMPARISONS Blocking Sorted Neighborhood Clustering and Canopies Set Joins

BLOCKING Basic: Compute a hash value for each � Only compare records in the

SORTED NEIGHBORHOOD Create composite key, sort data, merge Assumption � Duplicate records � Highly

CLUSTERING AND CANOPIES Clustering: Duplicate records are kept in a cluster and only the

SOFTWARE TOOLS Open Architecture � Freely Extensible Biomedical Record Linkage (FEBRL) Python � TAILOR

DATABASE TOOLS Commercial RDBMS � SQL Server 2005 implements “fuzzy matches” � Oracle 11

CONCLUSIONS Lack of a standardized, large-scale benchmarking data set Training data is needed to

Slides: 37

Download presentation

DUPLICATE RECORD DETECTION AHMED K. ELMAGARMID PURDUE UNIVERSITY, WEST LAFAYETTE, IN Senior member, IEEE PANAGIOTIS G. IPEIROTIS LEONARD N. STERN SCHOOL OF BUSINESS, NEW YORK, NY Member, IEEE computer security VASSILIOS S. VERYKIOS UNIVERSITY OF THESSALY, VOLOS, GREECE Member , IEEE computer security. PRESENTED BY SHILPA MURTHY

INTRODUCTION TO THE PROBLEM Databases play an important role in today’s IT based economy Many businesses and organizations depend on the quality of data(or the lack thereof) stored in the databases. Any discrepancies in the data can have significant cost implications to a system that relies on information to function.

DATA QUALITY Data are not carefully controlled for quality nor defined in a consistent way across different data sources, thus data quality is compromised due to many factors. //examples Data errors. Ex: Microsft instead of Microsoft Integrity errors. Ex: Employee. Age=567 Multiple conventions for information. Ex: 44 W. 4 th street and 44 west fourth street.

DATA HETEROGENEITY While integrating data from different sources into a warehouse , organizations become aware potential systematic differences and these problems and conflicts fall under a umbrella term called as “DATA HETEROGENEITY”. Two types of heterogeneity can be distinguished: Structural heterogeneity and lexical heterogeneity.

DATA QUALITY Data cleaning refers to the process of resolving identification problems in the data. Structural heterogeneity Different record structure Addr versus City, State, and Zip code [1] Lexical heterogeneity Identical record structure, but data is different 44 W. 4 th St. versus 44 West Fourth Street [1]

TERMINOLOGY Record linkage Identity uncertainty Record matching Coreference resolution Data deduplication Duplicate Record Detection Name matching Merge purge Database hardening Instance identification

DATA PREPARATION Step before the duplicate record detection. Improves the quality of the data Makes data more comparable and more usable. Data preparation stage includes three steps.

STEPS IN DATA PREPARATION Parsing Data transformation Standardization

PARSING Locates, identifies and isolates individual data elements Makes it easier to correct, standardize and match data Comparison of individual components rather than complex strings For example, the appropriate parsing of the name and address components into consistent packets is a very important step.

DATA TRANSFORMATION Simple conversions of data type Field renaming Decoding field values Range checking: involves examining data in a field to ensure that it falls within the expected range , usually a numeric or date range Dependency checking: is slightly more complex kind of data transformation where we check the values in a particular field to the values in another field to ensure minimal level of consistency in data

DATA STANDARDIZATION Represent certain fields in a standard format � Addresses US Postal Service Address Verification tool � Date and time formatting � Names (first, last, middle, prefix, suffix) � Titles

LAST STEP IN DATA PREPARATION Store data in tables having comparable fields. Identify fields suitable for comparison Not foolproof Data may still contain inconsistencies due to misspellings and different conventions to represent data

FIELD MATCHING TECHNIQUES o o o Most common sources of mismatches in database entries is due to typographical errors The field matching metrics that have been designed to overcome this problem are : Character –based similarity metrics Token based similarity metrics Phonetic similarity metrics Numeric similarity metrics

CHARACTER BASED SIMILARITY Works best on typographical errors Edit distance � Shortest sequence of edit commands that can transform a string s into t � Three types of edit operations. If (cost =1) this version of edit distance is referred to as the “Levenshein” distance. � Insert, delete, replace operations. � Example. S 1=“tin” s 2= “tan” � We need to replace “I” to “A” to convert string s 1 to s 2. � The edit distance here is 1. because we needed only one operation to convert s 1 to s 2.

CHARACTER BASED SIMILARITY • Affine gap distance Strings that have been truncated John R. Smith versus Jonathan Richard Smith • Smith-Waterman distance Substring matching which ignores the prefix and suffix Example: Prof. John. R. Smith and John. R. Smith, Prof • Jaro distance Compares first and last name • Q-Grams Divides string into a series of substrings of length q. E. g. : NELSON and NELSEN are phonetically similar but spelled differently. The q-grams for these words are NE LS ON and NE LS EN.

TOKEN BASED SIMILARITY Works best when word (tokens) are transposed Atomic Strings Computational average WHIRL Weights words based on frequency to determine similarity The words in the database have a weight associated with it, which is calculated using a cosine similarity metric. Example: in a database of company names the words “AT&T” and “IBM” are less frequent than the word “inc. ” Similarity of John Smith and Mr. John Smith is close to 1. But the similarity of comptr department and deprtment of computer is zero since it doesn’t take care of misspelled words. Q-Grams with weighting Extends WHIRL to handle spelling errors

PHONETIC SIMILARITY Comparison based on how words sound

NUMERIC SIMILARITY Considers only numbers � Convert numbers to text data � Simple range queries Authors provided no insight in this area

SUMMARY OF METRICS Edit Distance (Levenshtein) Numeric Affine Gap Double Metaphone Smith Waterman Jaro Distance Metaphone Comparison Metrics Oxford Name Compression Atomic Strings NYSIIS WHIRL Soundex Q-Grams

DUPLICATE RECORD DETECTION The methods described till now have been describing about similarity checking in single fields. The real life situations consist of multiple fields which have to be checked for duplicate records.

CATEGORIZING METHODS • • Probabilistic approaches and supervised machine learning techniques Approaches that rely on domain knowledge or Generic distance metrics

PROBABILISTIC MATCHINGMODELS Models derived from Bayes theorem � Use prior knowledge to make decision about current data set � A tuple pair is assigned to one of the two classes M or U. M class represents(match) same entity, and the U class represents(non-match) different entity. � This can be determined by calculating the probability distribution. Rule-based decision tree � If-then-else traversal

SUPERVISED LEARNING Relies on the existence of trained data. The trained data is in the form of record pairs. These record pairs are labeled matching or not. SVM approach out performs all the simpler approaches. The post processing step is to create a graph for all the records linking the matching records. Records are considered identical using the transitivity relation applied on the connected components.

ACTIVE LEARNING

DISTANCE BASED TECHNIQUES o o This method can be used when there is absence of training data or human effort to create matching models. Treat a record as a one long field � Use a distance metric � Best matches are ranked using a weighting algorithm � Alternatively, use a single field Must be highly discriminating

RULE BASED TECHNIQUES Relies on business rules to derive key � Must determine functional dependencies � Requires subject matter expert to build matching rules

RULE BASED TECHNIQUES This figure depicts the equation theory that dictates the logic of domain equivalence. It specifies an inference about the similarity of the records.

UNSUPERVISED LEARNING Classify data as matched or unmatched without a training set. The comparison vector generally depicts which category it belongs to. If it does not then it has to be done manually. One way to avoid manual labeling is to use the clustering algorithms. Group together similar comparison vectors. Each cluster contains vectors with similar characteristics. By knowing the real class of only few vectors we can infer the class of all the vectors.

TECHNIQUES TO IMPROVE EFFICIENCY Reduce the number of record comparisons Improve the efficiency of record comparison

COMPARATIVE METRICS Elementary nested loop � Compare every record in one table to another table � Requires A*B comparisons (Cartesian product) which is very expensive Cost required for a single comparison � Must consider number of fields/record

REDUCE RECORD COMPARISONS Blocking Sorted Neighborhood Clustering and Canopies Set Joins

BLOCKING Basic: Compute a hash value for each � Only compare records in the same bucket record Subdivide files into subsets (blocks) � Soundex, NYSIIS, or Metaphone Drawback � Increases in speed may increase number of false mismatches � Compromise is multiple runs using different blocking fields

SORTED NEIGHBORHOOD Create composite key, sort data, merge Assumption � Duplicate records � Highly dependent will be close in sorted system upon the comparison key

CLUSTERING AND CANOPIES Clustering: Duplicate records are kept in a cluster and only the representative of a cluster is kept for future comparisons. This reduces the total number of record comparisons without compromising the accuracy. Canopies: The records are grouped into overlapping clusters called as “canopies” and then the records are compared which lead to better qualitative results.

SOFTWARE TOOLS Open Architecture � Freely Extensible Biomedical Record Linkage (FEBRL) Python � TAILOR – MLC++, DBGen � WHIRL – C++ � Flamingo Project - C � Big. Match - C

DATABASE TOOLS Commercial RDBMS � SQL Server 2005 implements “fuzzy matches” � Oracle 11 g implements these techniques in its utl_match package Levenshtein Distance Soundex Jaro Winkler

CONCLUSIONS Lack of a standardized, large-scale benchmarking data set Training data is needed to produce matching models Research diversion Databases emphasize simple, fast, and efficient techniques � Machine learning and statistics rely on sophisticated techniques and probabilistic models � More synergy is needed among various communities � Detection systems need to be adaptive over time