Make the Most of Your Data Using CDC’s Link Plus

Make the Most of Your Data Using CDC’s Link Plus
Free, Fast, and Efficient Probabilistic Record Linkage Program
Kathleen K. Thoburn, CDC/NPCR Contractor
Joe Rogers, Team Lead, Data Analysis and Support Team, NPCR, CDC
NCRA 2011 Annual Conference, NPCR QC Track
Orlando, Florida
June 24, 2010
National Center for Chronic Disease Prevention and Health Promotion
Division of Cancer Prevention and Control

Overview of Record Linkage
• Can be accomplished manually, by visually comparing records from two separate sources or reviewing a single data source for duplicate records
• Approach becomes time consuming, tedious, inefficient, and impractical as the number of records in the files increases
• Technological advances in computer systems and programming techniques
  - Economically feasible to perform computerized record linkage on large files
  - Efficient and relatively accurate

Central Cancer Registry Record Linkage
• Case Finding
• Linking New Reports
• Follow Up
• Special Studies Consolidation
• Duplicate Detection

Duplicate Detection
• Fundamental requirement for accuracy and validity of counts in any disease registry
• National Program of Cancer Registries and North American Association of Central Cancer Registries standard
  - Maintain <= 0.1% (<= 1 per 1,000) duplicates
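
As a rough illustration of the arithmetic behind the standard above, the following Python sketch checks a duplicate count against the 1-per-1,000 ceiling. The function name and the example counts are hypothetical and are not part of Link Plus.

```python
def meets_duplicate_standard(duplicates: int, total_records: int,
                             max_per_thousand: int = 1) -> bool:
    """True when duplicates are at most `max_per_thousand` per 1,000 records."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return duplicates * 1000 <= max_per_thousand * total_records


# Hypothetical registry of 250,000 records.
print(meets_duplicate_standard(200, 250_000))  # 0.08% duplicates
print(meets_duplicate_standard(300, 250_000))  # 0.12% duplicates
```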

Deterministic Matching
• Computerized comparison where EVERYTHING needs to match EXACTLY:
  Last Name  First Name  Site  SSN        DOB       Sex  DateDx
  SMITH      JOHN        C619  123654789  02011934  1    06152004

Deterministic Matching
• Often slight variations exist in the data between the two files for the same variables:
  Last Name  First Name  Site  SSN        DOB       Sex  DateDx
  SMITH      JOHN        C619  123654789  02011934  1    06152004
  SMYTH      JOHN        C619  123654786  02081934  1    06102004
• Or variables are missing from one of the files:
  Last Name  First Name  Site  SSN        DOB       Sex  DateDx
  SMITH      JOHN        C619  123654789  02011934  1    06152004
  SMITH      JOHN        C619             02011934  1    06152004
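
A minimal Python sketch of the exact-match rule, using the example records from the slides. The field names and the `deterministic_match` helper are illustrative assumptions, not Link Plus code. It shows how a single differing or missing value is enough to reject a pair.

```python
FIELDS = ("last_name", "first_name", "site", "ssn", "dob", "sex", "date_dx")


def deterministic_match(rec_a: dict, rec_b: dict, fields=FIELDS) -> bool:
    """Return True only if every compared field is identical in both records."""
    return all(rec_a.get(f) == rec_b.get(f) for f in fields)


registry = {"last_name": "SMITH", "first_name": "JOHN", "site": "C619",
            "ssn": "123654789", "dob": "02011934", "sex": "1",
            "date_dx": "06152004"}

# Same person with slight variations in name, SSN, DOB, and diagnosis date.
variant = {**registry, "last_name": "SMYTH", "ssn": "123654786",
           "dob": "02081934", "date_dx": "06102004"}

# Same person with the SSN missing from the second file.
missing_ssn = {**registry, "ssn": ""}

print(deterministic_match(registry, dict(registry)))  # identical copy
print(deterministic_match(registry, variant))         # slight variations
print(deterministic_match(registry, missing_ssn))     # missing value
```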

Deterministic Matching Manual Review
• When we manually review, we use intuition to help us identify positive matches for records containing slight variations in, or missing information for, the same variables in the two files
  Last Name  First Name  Site  SSN        DOB       Sex  DateDx
  SMITH      JOHN        C619  123654789  02011934  1    06152004
  SMITH      JOHN        C619  123654786  02101934  1    06152004
• Typo in SSN and transposition of digits in the day component of DOB, but we would still deem this a match

Probabilistic Matching
• What do humans know?
• How can we translate intuition into formal decision rules to be used by a computer?
• Use the concept of PROBABILITY and perform PROBABILISTIC matching
• Recommended over traditional deterministic (exact matching) methods when the data contain:
  - Coding errors, reporting variations, missing data, or duplicate records
• Estimate the probability/likelihood that two records are for the same person versus not

Probabilistic Matching
• Find the records in File 2 that seem to match records in File 1
• Calculate a score that indicates, for any pair of records, how likely it is that they both refer to the same person
• Sort the likely and possible matched pairs in order of their scores
• Define a threshold (Cut Off value) for automatically accepting and rejecting a potential link
  - Discard unlikely matched pairs (scores below Cut Off)
  - Gray area: range of scores considered as uncertain matches
• Manually review uncertain matches
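
The sort-and-threshold workflow above can be sketched in a few lines of Python. The scores, record IDs, and the two cut-off values below are made-up examples, and `triage` is a hypothetical helper rather than anything exposed by Link Plus.

```python
from typing import List, Tuple

Pair = Tuple[str, str, float]  # (File 1 record ID, File 2 record ID, score)


def triage(pairs: List[Pair], lower: float, upper: float):
    """Sort pairs by descending score and label each one.

    score >= upper          -> "accept" (likely match)
    lower <= score < upper  -> "review" (gray area, manual review)
    score < lower           -> "reject" (unlikely match, discarded)
    """
    if lower > upper:
        raise ValueError("lower cut-off must not exceed upper cut-off")
    ranked = sorted(pairs, key=lambda p: p[2], reverse=True)
    labelled = []
    for id1, id2, score in ranked:
        if score >= upper:
            status = "accept"
        elif score >= lower:
            status = "review"
        else:
            status = "reject"
        labelled.append((id1, id2, score, status))
    return labelled


scored = [("A1", "B7", 4.2), ("A2", "B3", 21.5), ("A3", "B9", 11.0)]
for row in triage(scored, lower=7.0, upper=15.0):
    print(row)
```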

Probabilistic Matching
• Agreement argues for linkage (higher score)
• Disagreement argues against linkage (lower score)
• Full agreement argues more strongly for linkage than partial agreement
• Some types of partial agreements are stronger than others; probabilistic scores are
  - Field-specific: Birth date versus Sex
  - Value-specific: “Jane” versus “Janiqua”
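
One common way to turn these ideas into numbers is the Fellegi-Sunter style log-odds weight; the slides do not state which formula Link Plus uses, so treat the sketch below as an assumption. Here `m` is the probability that a field agrees when two records are a true match and `u` is the probability that it agrees by chance; all of the `m` and `u` values are invented for illustration.

```python
import math


def agreement_weight(m: float, u: float) -> float:
    """Weight added to the score when a field agrees: log2(m / u)."""
    if not (0 < m < 1 and 0 < u < 1):
        raise ValueError("m and u must be strictly between 0 and 1")
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Weight (normally negative) added when a field disagrees."""
    if not (0 < m < 1 and 0 < u < 1):
        raise ValueError("m and u must be strictly between 0 and 1")
    return math.log2((1 - m) / (1 - u))


# Field-specific: two random people share a sex far more often than a
# birth date, so agreement on birth date earns a much larger weight.
sex_agree = agreement_weight(m=0.98, u=0.5)
dob_agree = agreement_weight(m=0.98, u=1 / 20000)

# Value-specific: u follows how common the value is in the file, so a
# rare first name is stronger evidence than a common one.
common_name_agree = agreement_weight(m=0.95, u=0.01)
rare_name_agree = agreement_weight(m=0.95, u=0.0001)

print(round(sex_agree, 2), round(dob_agree, 2))
print(round(common_name_agree, 2), round(rare_name_agree, 2))
print(round(disagreement_weight(m=0.98, u=0.5), 2))
```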

Phonetic Systems
• Phonetic coding involves coding a string based on how it is pronounced
• Soundex (120+ years old)
  - Code for a name consisting of a letter followed by three numbers: the letter is the first letter of the name, and the numbers encode the remaining consonants
  - Zeroes are added at the end if necessary to produce a four-character code. Additional letters are disregarded.
  - Washington is coded W-252 (W, 2 for the S, 5 for the N, 2 for the G, remaining letters disregarded)
  - Reduces matching problems due to different spellings
  - Simple and fast
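
The Soundex rules above can be sketched as follows. This follows the widely documented American Soundex convention (including the rule that H and W do not separate letters with the same code); it is not taken from Link Plus, whose exact variant the slides do not specify. The code is returned without the hyphen, so Washington comes out as W252.

```python
# Consonant groups and their Soundex digits; vowels, H, W, and Y have no code.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the four-character Soundex code for a name ('' if no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        # Skip uncoded letters and repeats of the previous code.
        if code and code != prev:
            digits.append(code)
        # H and W are transparent: they do not reset the previous code.
        if ch not in "HW":
            prev = code
    # Pad with zeroes and truncate to one letter plus three digits.
    return (first + "".join(digits) + "000")[:4]


print(soundex("Washington"))
print(soundex("Smith"), soundex("Smyth"))
```

Because SMITH and SMYTH share a code, blocking or matching on the Soundex value would keep the two example records from the deterministic-matching slides together.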

Phonetic Systems
New York State Identification and Intelligence System (NYSIIS; 1970+)
• Maps similar phonemes to the same letter; maintains relative vowel positioning
• String can be pronounced by the reader without decoding
  - Deborah Walker = DABARA WALCAR
• Improvement to the Soundex algorithm
  - More distinctive; people are more likely to have the same Soundex than the same NYSIIS
  - Reported accuracy increase of 2.7% over Soundex
  - Studies suggest NYSIIS performs better than Soundex when Spanish names are used
• Soundex may bring more pairs for comparison when used for blocking

Concept of Blocking
• With so many comparisons, large files can make impossible resource demands
• Blocking is an initial probabilistic linkage step that reduces the number of record comparisons between files
• Sort and match the two files by one or more identifying (“blocking”) variables
• Comparisons subsequently made only within blocks
  - Discard very unlikely record-pairings from the start
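
To make the saving concrete, here is a small Python sketch that groups File 2 by a blocking key and compares each File 1 record only with records in the same block. The records, the choice of exact last name as the key, and the `blocked_pairs` helper are all hypothetical.

```python
from collections import defaultdict


def blocked_pairs(file1, file2, block_key):
    """Return only the record pairs whose blocking keys are equal."""
    blocks = defaultdict(list)
    for rec in file2:
        blocks[block_key(rec)].append(rec)
    pairs = []
    for rec1 in file1:
        # .get() avoids creating empty blocks for keys absent from File 2.
        for rec2 in blocks.get(block_key(rec1), []):
            pairs.append((rec1, rec2))
    return pairs


file1 = [{"id": "A1", "last_name": "SMITH"},
         {"id": "A2", "last_name": "JONES"},
         {"id": "A3", "last_name": "GARCIA"}]
file2 = [{"id": "B1", "last_name": "SMITH"},
         {"id": "B2", "last_name": "SMITH"},
         {"id": "B3", "last_name": "JONES"},
         {"id": "B4", "last_name": "LEE"}]

pairs = blocked_pairs(file1, file2, lambda r: r["last_name"])
print(len(file1) * len(file2), "comparisons without blocking")
print(len(pairs), "comparisons with blocking")
```

With these seven records the full cross product is 12 comparisons, while blocking on last name leaves 3. A phonetic key in place of the exact last name would trade a few extra comparisons for tolerance of spelling variants.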

Blocking Variables
• Exact matches
• Blocks of data to compare variables within
• Common blocking variables are:
  - Last Name
  - Social Security Number
  - Date of Birth

Matching Variables
• Probabilistic matching algorithms
• Comparing variables within blocks
• Common matching variables:
  - Name--Last
  - Name--First
  - Name--Middle
  - Sex
  - Race
  - Birth Date
  - Social Security Number

Blocking
Sock Pattern:
• 7 of 13 socks fall outside pattern block
• 6 of 13 socks within pattern block

Matching Within Blocks
Blocking: Sock Pattern
Matching: Sock Color & Size
High Score / Gray Area / Low Score

Link Plus Software
• Stand-alone probabilistic record linkage program
• Combines ease of use and statistical sophistication
• Detects duplicates within a cancer registry, or links cancer registry files to external files
• Supports North American Association of Central Cancer Registries files, fixed width files, delimited files, and CRS Plus database
• Provides powerful support for manual review of uncertain matches

CDC–NPCR Link Plus Contacts
Kathleen K. Thoburn, CDC/NPCR Contractor, E-mail: kthoburn@cdc.gov
David Gu, CDC/NPCR Contractor, E-mail: dgu@cdc.gov
Tom Rawson, CDC Computer Programmer

Link Plus Is Free
$0.00

Link Plus Is Easy To Use
Link Plus gets you from HERE:
Cancer Registry data for John Smith:
  Last Name  First Name  Site  SSN        DOB       Sex  DateDx
  SMITH      JOHN        C619  123654789  02111934  1    06152004
Vital Statistics data for John Smith:
  Last Name  First Name  DOB       Death Date  COD   SSN        Dth Cert #
  SMITH      JOHN        02011934  03202006    C100  123654789  01234
To HERE:
Linked data for John Smith:
  Last Name  First Name  Site  SSN        DOB       Sex  DateDx    Death Date  COD   Dth Cert #
  SMITH      JOHN        C619  123654789  02011934  1    06152004  03202006    C100  01234

Link Plus Is Easy To Use Without having to go HERE:

Link Plus Is Easy To Use
• Designed especially for cancer registry work
  - HOWEVER, can be used with any data
• Mathematics largely hidden from user
• Practical default values supplied for many tasks
• Familiar Windows interface
• Includes Help and test examples

Link Plus Deduplication Linkage Overview

Link Plus Linkage Overview
Deduplication Linkage Steps:
1. Select Data Type for File
2. Locate/Identify File
3. Data Import for File
4. Select Blocking Variables & Phonetic System
5. Select Matching Variables & Matching Methods
6. Select ID Variables
   Define Missing Values
10. Enter Cut-off Value
11. Select Direct/EM Method
12. Specify Linkage File Name and Location
13. Perform Manual Review of Uncertain Matches
14. Export Merged File

Blocking Variables
• Exact matches
• Blocks of data to compare variables within
• Up to 10 fields may be selected for blocking
• Common blocking variables are:
  - Last Name
  - Social Security Number
  - Date of Birth

Matching Variables
• Up to 10 fields may be selected for matching
• Recommended variables (Matching Methods):
  - Name--Last (LastName)
  - Name--First (FirstName)
  - Name--Middle (MiddleName)
  - Sex (Exact)
  - Race (Value-Specific)
  - Birth Date (Date)
  - Social Security Number (SSN)

Matching Methods
• Exact
• Generic String
• Last Name/First Name/Middle
• SSN (Social Security Number)
• Zip Code
• Date
• Generic ID
• Confirmation
• Value-Specific (Frequency-Based)

Missing Values
• Specify date format on the missing value grid

Cut Off Value
• The score value above which comparison pairs are accepted as potential links and presented for review
• Value should always be positive
• Initial value of around 7-10 recommended when using the recommended Matching Variables
• Run linkage, and quickly review potential matches to identify lower and upper cut off scores
  - At what score do perfect matches end and uncertain matches (gray area) begin?
  - At what score do false matches begin?
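
One way to answer the two questions above from a quick review is sketched below. It assumes the reviewer has marked a sample of scored pairs as true or false matches; the `suggest_cutoffs` helper and the sample scores are hypothetical, and Link Plus itself leaves this judgment to the user.

```python
def suggest_cutoffs(reviewed):
    """Suggest (lower, upper) cut-off scores from reviewed pairs.

    `reviewed` is a list of (score, is_true_match) tuples.
    lower: lowest score of any true match (below it, only false matches seen).
    upper: highest score of any false match (above it, only true matches seen).
    Scores between the two form the gray area for manual review.
    """
    match_scores = [s for s, ok in reviewed if ok]
    false_scores = [s for s, ok in reviewed if not ok]
    if not match_scores or not false_scores:
        raise ValueError("need at least one true and one false match")
    return min(match_scores), max(false_scores)


sample = [(25.1, True), (18.4, True), (12.0, False),
          (11.2, True), (9.5, False), (7.3, False)]
lower, upper = suggest_cutoffs(sample)
print("gray area from", lower, "to", upper)
```

If the returned lower value is greater than the upper value, the reviewed sample separated cleanly and no gray area was observed.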

Running Linkage & Linkage Process Progress Window
• Linkage Process Progress window appears and provides the user with feedback about the linkage process as it is run
• The progress window provides feedback regarding the preparation of the configuration, the reading in of the data files, the blocking of the files, and the calculation of the linkage scores

Manual Review of Uncertain Matches

Link Plus Deduplication Linkage Delimited File Export

Enhancements New to Version 3.0
Data Link:
  - Removes the limitation on the number of records included in File 2; the program works for any number of records in File 2 as long as the computer has sufficient memory to read in data from File 1
  - Users can choose whether to write all potential matches (many-to-many linkages) or only the matches with the highest score to the linkage report (1-to-many linkages)
  - Provides a confirmation-like matching method for variables such as address that contributes positive weight to the linkage score with agreement but 0 weight with disagreement
  - Provides an SSN-like matching method for a generic ID
  - Provides a new name matching method that is more robust against the frequency of names or outlier names
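
The confirmation-like behaviour described above can be contrasted with an ordinary agree/disagree weight in a short Python sketch. The function names, the weights, and the treatment of missing values as contributing 0 are assumptions made for illustration; they are not Link Plus internals.

```python
def standard_weight(a: str, b: str, agree_w: float, disagree_w: float) -> float:
    """Ordinary field weight: reward agreement, penalize disagreement."""
    if not a or not b:
        return 0.0  # assumption: a missing value adds no evidence either way
    return agree_w if a == b else disagree_w


def confirmation_weight(a: str, b: str, agree_w: float) -> float:
    """Confirmation-like weight: reward agreement, never penalize."""
    if not a or not b:
        return 0.0
    return agree_w if a == b else 0.0


old_addr, new_addr = "12 OAK ST", "48 ELM AVE"
print(standard_weight(old_addr, new_addr, agree_w=3.0, disagree_w=-2.0))
print(confirmation_weight(old_addr, new_addr, agree_w=3.0))
print(confirmation_weight(old_addr, old_addr, agree_w=3.0))
```

The design point is that a field such as address changes legitimately over time, so a mismatch says little about whether two records belong to the same person, while a match is still useful supporting evidence.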

Enhancements New to Version 3.0
Manual Review:
  - Users can use “Assign Set ID” (de-duplication linkages only) to group matches into mutually exclusive match sets
  - Removes the limitation of the maximum size of 30,000 pairs on the manual review window; the new maximum is 300,000 pairs
  - Provides option to allow users to assign match status by linkage score without overwriting any existing assigned match status
Export:
  - Users can export the results of manual review to a NAACCR formatted file or any other fixed width file format
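
Grouping accepted duplicate pairs into mutually exclusive sets is a connected-components problem. The sketch below uses a small union-find structure to show the idea; it is a hypothetical illustration of what a Set ID assignment could look like, not a description of how Link Plus implements “Assign Set ID”.

```python
def assign_set_ids(matched_pairs):
    """Map each record ID to a set ID so that linked records share one ID."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Union the two records of every accepted pair.
    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Number the sets 1, 2, 3, ... in order of first appearance.
    set_ids = {}
    result = {}
    for rec in list(parent):
        root = find(rec)
        result[rec] = set_ids.setdefault(root, len(set_ids) + 1)
    return result


pairs = [("R1", "R2"), ("R2", "R3"), ("R7", "R8")]
print(assign_set_ids(pairs))
```

Here R1, R2, and R3 end up in one set even though R1 and R3 were never paired directly, and R7 and R8 form a second, separate set.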

Thank You!
Be sure to stop by the Registry Plus booth with questions or further demonstrations!
Kathleen K. Thoburn, kthoburn@cdc.gov
Joe Rogers, jrogers@cdc.gov
For more information please contact Centers for Disease Control and Prevention
1600 Clifton Road NE, Atlanta, GA 30333
Telephone: 1-800-CDC-INFO (232-4636)/TTY: 1-888-232-6348
E-mail: cdcinfo@cdc.gov
Web: www.cdc.gov
The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.
National Center for Chronic Disease Prevention and Health Promotion
Division of Cancer Prevention and Control