Implementation of Probabilistic Matching in NYC Chronic Hepatitis B and NYC A1C Registries, and Implications Towards an MPI
Maushumi Mavinkurve
Director, Center for Data Matching
NYC Department of Health and Mental Hygiene
October 17th, 2008
Integrated Surveillance Seminar

Overview
• Describe data quality challenges in disease surveillance
• Describe probabilistic matching techniques
• Implementation of probabilistic matching
  – NYC Chronic Hepatitis B Registry (LVR)
  – NYC Hemoglobin A1C Registry (NYCAR)
• NYC proposed challenges and benefits of an MPI

Public Health Surveillance
• Public health surveillance process includes:
  – Collection of data on a specific disease or condition via standardized information systems
  – Analysis and interpretation of the data
  – Dissemination of information to individuals who can act on it
  – Utilization of information to facilitate the necessary response that will effectively deal with the public health issue

Surveillance Data Quality Issues
• Accuracy
• Non-standardized across different data sources
  – Multiple laboratory systems
• De-duplication of reports
  – Exact duplicates
  – Multiple events linked to a unique person
• Non-relevant information

Impact of Data Quality Issues in Surveillance
• Impacts on surveillance reporting
  – Over- or underestimates of true cases
  – Geographical misrepresentation (missing address)
• Increases costs
  – Additional staff required to address data quality issues
• Increases inefficiencies
  – Timeliness for patient or provider follow-up

Addressing Data Quality Challenges
Modern disease surveillance information systems:
• Validate data at time of collection
  – Minimize inaccurate or incomplete data
• Standardize different data to a uniform structure
• Integrate matching technology to create
  – Patient indexes (person-centric systems vs event-centric systems)
  – Provider indexes
  – Facility indexes

What is Probabilistic Matching?
• Rule-based match algorithms
• Standardizes data
• Parses data into smaller tokens
• Creates fields that enhance matching
• Adapts to specific data: incorporates the uniqueness or frequency of data values when comparing records
• Processes data in blocks, which makes it viable on large-volume data sets
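
Illustrative sketch (not the registry's actual algorithm): one common way to realize the steps above is a Fellegi-Sunter style match weight. The Python below standardizes records, groups them into blocks, and scores only pairs within a block, giving rarer agreeing values more weight. The field names (first, last, dob), the blocking key, and the probabilities M_PROB and U_DEFAULT are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

M_PROB = 0.95     # assumed chance that a field agrees when two records are the same person
U_DEFAULT = 0.05  # assumed chance that a field agrees by coincidence (used for disagreement)

def standardize(record):
    """Upper-case, drop punctuation, and collapse spaces so equivalent values compare equal."""
    cleaned = {}
    for field, value in record.items():
        text = "".join(ch for ch in str(value).upper() if ch.isalnum() or ch == " ")
        cleaned[field] = " ".join(text.split())
    return cleaned

def block_key(record):
    """Hypothetical blocking key: surname initial plus year of birth."""
    return (record["last"][:1], record["dob"][:4])

def field_weight(a, b, freq, total):
    """Agreement on a rare value earns more weight than agreement on a common one."""
    if a and a == b:
        u = freq[a] / total  # observed share of records carrying this value
        return math.log2(M_PROB / u)
    return math.log2((1 - M_PROB) / (1 - U_DEFAULT))

def score_pairs(raw_records, fields=("first", "last", "dob")):
    """Return {(i, j): weight} for every pair of records that share a block."""
    records = [standardize(r) for r in raw_records]
    freqs = {f: Counter(r[f] for r in records) for f in fields}
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)
    scores = {}
    for members in blocks.values():
        for pos, i in enumerate(members):
            for j in members[pos + 1:]:
                scores[(i, j)] = sum(
                    field_weight(records[i][f], records[j][f], freqs[f], len(records))
                    for f in fields
                )
    return scores

# Fictitious records for demonstration only.
records = [
    {"first": "Maria", "last": "O'Neil", "dob": "1970-01-02"},
    {"first": "MARIA ", "last": "ONeil", "dob": "1970-01-02"},
    {"first": "John", "last": "Ortiz", "dob": "1970-05-09"},
    {"first": "Ann", "last": "Lee", "dob": "1985-03-03"},
]
scores = score_pairs(records)
```

Records 0 and 1 standardize to the same values and receive a positive weight, while the record in a different block (index 3) is never compared, which is what keeps the approach tractable on large data sets.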

Evaluating Match Algorithm
• Outcome of a potential match is a weight or likelihood that 2 records are the same entity
• Surveillance programs identify thresholds for the match algorithm
• Prior to reviewing results of the match algorithm:
  – Identify implications for precision (PPV) vs negative predictive value (NPV)
    • Evaluation of health code mandate
    • Practical issues
    • Surveillance reporting
  – Identify guidelines or criteria to review matches
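
Illustrative sketch: given a set of manually reviewed pairs, PPV and NPV at a candidate threshold can be computed as below. The function name ppv_npv and the reviewed weights are hypothetical, made-up values used only to show the trade-off.

```python
def ppv_npv(reviewed_pairs, threshold):
    """reviewed_pairs: iterable of (weight, is_true_match) taken from manual review.

    Pairs at or above the threshold are treated as predicted matches.
    """
    tp = fp = tn = fn = 0
    for weight, is_match in reviewed_pairs:
        predicted_match = weight >= threshold
        if predicted_match and is_match:
            tp += 1
        elif predicted_match and not is_match:
            fp += 1
        elif not predicted_match and is_match:
            fn += 1
        else:
            tn += 1
    ppv = tp / (tp + fp) if (tp + fp) else None  # share of predicted matches that are real
    npv = tn / (tn + fn) if (tn + fn) else None  # share of predicted non-matches that are real
    return ppv, npv

# Fictitious review results: (match weight, reviewer says same person?)
reviewed = [(18.0, True), (15.5, True), (12.0, False),
            (11.0, True), (6.0, False), (2.0, False)]

low_cut = ppv_npv(reviewed, threshold=10)
high_cut = ppv_npv(reviewed, threshold=15)
```

With these made-up values, raising the threshold from 10 to 15 lifts PPV from 0.75 to 1.0 while NPV falls from 1.0 to 0.75.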

Identifying Thresholds
• Goal: maximize precision or PPV
• Sacrifice some negative predictive value (NPV)
• Surveillance programs can decide to review ambiguous matches
Therefore: set high thresholds
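
Illustrative sketch: a high linking threshold with an optional review band could be expressed as a three-way decision. The function name, the labels, and the cut-off values are assumptions, not values used by either registry.

```python
def classify(weight, link_threshold, review_floor):
    """Three-way decision on a match weight.

    link_threshold: high cut-off chosen to protect precision (PPV).
    review_floor:   lower bound of the ambiguous band sent to manual review.
    """
    if review_floor > link_threshold:
        raise ValueError("review_floor must not exceed link_threshold")
    if weight >= link_threshold:
        return "link"
    if weight >= review_floor:
        return "review"
    return "non-link"

# Hypothetical cut-offs for demonstration only.
decisions = [classify(w, link_threshold=15.0, review_floor=10.0)
             for w in (16.0, 12.0, 3.0)]
```

A program that chooses not to review ambiguous matches can set review_floor equal to link_threshold, which leaves only the "link" and "non-link" outcomes.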

Outcome of Probabilistic Matching Entity-centric, relational registry system

Background of Hepatitis B in NYC
• Decline in acute Hepatitis B incident case rates (per 100,000 persons) from 11.5 in 1985 to 1.6 in 2006
• In NYC, burden of chronic Hepatitis B infection is as much as 2x higher within specific populations
  – MSM
  – IDU
  – Persons born in regions where HBsAg prevalence >2%
• Need for continued surveillance and monitoring
Source: Recommendations for identification and public health management of persons with chronic Hepatitis B infection, http://www.cdc.gov/mmwr/preview/mmwrhtml/rr5708a1.htm

Hepatitis B Surveillance Activities
• Monitor disease trends
• Aggregate descriptive reporting aimed to guide prevention and intervention efforts
• Outreach with newly infected
  – Educational materials to new cases reported to the registry

NYC Hepatitis B Registry
• Legacy application, built in-house in 1999
• Automatic weekly batch uploads of laboratory reports
• Data entry of provider reports
• System did not index on patients (event-based), so it could not link 2 reports for the same person
• Program utilized staff to build and apply deterministic match algorithms
  – Resource intensive
  – Version control

NYC Liver Virus Registry (LVR)
• Implemented in October 2008, built in-house
• Migrated all legacy data
• Web-based application
• Person-centric: integrates probabilistic matching
• Consolidated views of all information for a person
• Ability to conduct longitudinal analysis

LVR Probabilistic Matching
• Created a match algorithm based on fields unique to the patient from laboratory and provider reports
• Processed all legacy data (~380,000 records)
• Program evaluated algorithm and identified thresholds
• Results: the match algorithm linked the ~380,000 reports to ~111,000 unique persons
• Probabilistic matching improved de-duplication by 1% compared with the legacy deterministic algorithm

LVR Challenges & Successes
• Challenges:
  – Iterative review process is time and resource intensive
  – Evaluation against legacy deterministic match
  – Identifying target PPV and NPV
• Successes:
  – Long-term savings on time and resources
  – Streamlined system
  – Longitudinal analysis
  – More accurate case counting
  – Enhanced data quality

Implementing Probabilistic Matching with NYC Hemoglobin A1C Registry (NYCAR)

What is Diabetes?
• Diabetes is a chronic disease caused by inadequate insulin levels or sensitivity, leading to elevated blood sugar levels
• Blood sugar levels can be measured by
  – Plasma glucose
  – Fingerstick glucose
  – Glycosylated hemoglobin or A1C (goal is <7%)
• Persistently high blood sugar levels can cause
  – Heart disease and stroke
  – Kidney failure
  – Blindness
  – Nerve damage and amputation

Diabetes Burden in NYC
• Diabetes is epidemic in NYC
• Prevalence has more than doubled over the past 10 years
• Approximately 500,000 New Yorkers have diabetes
• An additional ~200,000 New Yorkers have diabetes, but have not yet been diagnosed
• Approximately 1 in 8 adults have diabetes
• In 2006, diabetes was the 4th leading cause of death in NYC

Use of Traditional Public Health Surveillance for Chronic Disease
Reporting to public health agency to:
  – Monitor trends: describe glycemic control in NYC
  – Identify special populations: target individuals with poor control
  – Communicate with provider community: feedback to providers and their patients
  – Control epidemics: decrease complications/improve quality of life

Hemoglobin A1C Tests
• A1C is a measure of average blood sugar control in the preceding 3 months (goal <7%)
• A1C is used to:
  – Monitor an individual's blood sugar control
  – Guide changes in medication therapy
  – Impart risk of diabetes complications
• Most people who get A1Cs have diabetes, so it is a marker for diabetes status
THEREFORE, AN A1C REGISTRY WILL PROVIDE A MECHANISM FOR TRACKING INDIVIDUALS WITH DIABETES

Implementation of NYCAR
• Based on existing NY State / NYC laboratory reporting system
• Amendment to NYC health code, Article 13, which mandates communicable disease reporting, to include A1C
  – Public hearing Summer 2005
  – Approval of amendment December 2005
  – Went into effect January 15, 2006
• Laboratories submitting data to NY State and NYC subject to mandate
  – Report information on patient, ordering provider and facility, testing facility and result
  – Submit via secure network
• Receive ~5,000 new lab reports daily
  – High volume

Objectives of New York City A1C Registry (NYCAR)
• Surveillance and epidemiology
  – Track trends on the population level
• Provider feedback and communication
  – Quarterly provider reports in comparison to peers
  – Quarterly rosters of patients stratified by A1C level
• Patient feedback (via provider)
  – Letters with A1C information
  – Local resources
• Deliver resources to providers/patients
All of the above require matching and data linkages

Components of A1C Registry
• Information collected by laboratory reports includes:
  – Individual name, address, date of birth, sex
  – Name and address of ordering provider, ordering facility and testing facility
  – A1C test collection date and result

NYCAR Probabilistic Methodology
• Created 3 separate matching models:
  – Patient
  – Provider
  – Ordering facility
• Obtained a representative sample of data
• For each model, created a match algorithm utilizing fields that uniquely identify each entity
  – Name (patient, provider, ordering facility), patient DOB, gender, address, provider ID, telephone number, etc.
• Provided match results to program to review and identify thresholds

Program Threshold Evaluation
• Due to volume of reports, it is impractical for staff to review all ambiguous matches, so thresholds need to be set
• Method to identify thresholds using a sample
  – 2 reviewers and 1 tie-breaker scored matches referencing guidelines
  – Utilized a sampling method within weight ranges
  – Identified the specific weight or threshold at which target precision rates were met, based on review
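
Illustrative sketch: the review procedure above could be approximated as follows. The band width, the per-band sample size, the target precision, and all function names are assumptions; the slides do not specify them.

```python
import random
from collections import Counter, defaultdict

def sample_by_weight_band(scored_pairs, band_width=2.0, per_band=50, seed=0):
    """Draw up to per_band candidate pairs from each weight range for manual review."""
    rng = random.Random(seed)
    bands = defaultdict(list)
    for pair_id, weight in scored_pairs:
        bands[int(weight // band_width)].append((pair_id, weight))
    sample = []
    for band in sorted(bands):
        members = bands[band]
        sample.extend(rng.sample(members, min(per_band, len(members))))
    return sample

def consensus(reviewer_a, reviewer_b, tie_breaker):
    """Two reviewers decide; the tie-breaker's call is used only when they disagree."""
    return reviewer_a if reviewer_a == reviewer_b else tie_breaker

def lowest_threshold_meeting(reviewed, target_ppv):
    """reviewed: list of (weight, is_match). Lowest cut-off whose PPV meets the target."""
    for cut in sorted({weight for weight, _ in reviewed}):
        above = [is_match for weight, is_match in reviewed if weight >= cut]
        if sum(above) / len(above) >= target_ppv:
            return cut
    return None  # no cut-off reaches the target precision

# Fictitious data for demonstration only.
candidates = [(i, float(i % 10)) for i in range(100)]
to_review = sample_by_weight_band(candidates, band_width=2.0, per_band=5)
reviewed = [(18.0, True), (15.5, True), (12.0, False),
            (11.0, True), (6.0, False), (2.0, False)]
threshold = lowest_threshold_meeting(reviewed, target_ppv=0.95)
```

Because the sample is drawn per weight band rather than in proportion to band size, a production estimate of precision would probably need to weight each band by the number of pairs it represents; this sketch omits that step.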

Deploying Probabilistic Matching
• All new incoming A1C lab reports are parsed into 3 staging entities:
  – Patient, provider and facility
• Each entity is matched against the existing respective entities in the registry
  – If matched above thresholds, it is linked to an existing record
  – If below thresholds, a new entity is created (patient, provider or facility)
• Provider Reports and Rosters and Patient Letters are generated using an in-house developed application which reads from the registry
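
Illustrative sketch: the link-or-create step described above could look like this for a single entity type. The in-memory dictionary registry, the integer ids, and the toy scoring function are assumptions standing in for the real registry and match algorithm.

```python
def link_or_create(incoming, registry, score, threshold):
    """Link the incoming entity to its best match, or add it as a new entity.

    registry:  dict mapping integer entity id -> stored entity record
    score:     function(incoming, existing) -> match weight
    threshold: program-defined cut-off for automatic linkage
    Returns the id the incoming record ends up attached to.
    """
    best_id, best_weight = None, float("-inf")
    for entity_id, existing in registry.items():
        weight = score(incoming, existing)
        if weight > best_weight:
            best_id, best_weight = entity_id, weight
    if best_id is not None and best_weight >= threshold:
        return best_id                    # matched above threshold: link
    new_id = max(registry, default=0) + 1
    registry[new_id] = incoming           # below threshold: create a new entity
    return new_id

def toy_score(a, b):
    """Placeholder weight: number of fields that agree exactly."""
    return sum(1 for field in a if a[field] == b.get(field))

# Fictitious patient entities for demonstration only.
patients = {1: {"last": "LEE", "dob": "19850303"}}
linked_id = link_or_create({"last": "LEE", "dob": "19850303"}, patients, toy_score, 2)
created_id = link_or_create({"last": "ORTIZ", "dob": "19700509"}, patients, toy_score, 2)
```

The same function would be called once per staging entity (patient, provider, facility), each with its own registry and threshold.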

Facility Report (Page 1 and Page 2) Note: All information in this slide is fictitious

Provider Report Note: All information in this slide is fictitious

Patient Letter

Challenges and Successes
• Challenges
  – Quality of record linkage
    • Need sufficient information for successful linkage of multiple tests per individual as well as master provider and facility indexing
    • Maintaining accurate facility-provider linkage
  – Effect of laboratory variation: availability of data
  – Review of thresholds: time and resource intensive
• Successes
  – Entire process is seamless, electronic and automated
  – High volume of data
  – Ability to conduct longitudinal analysis

Is NYC ready for an MPI?

NYC Current Status
• Modernizing several disease registries:
  – Chronic Hepatitis B: completed
  – NYCAR: completed
  – STD: requirements completed
  – TB: requirements completed
  – HIV: planning
Is this an opportune time to develop an MPI?

Planning an MPI: Challenges
• Each registry program has requirements for matching based on:
  – Patient population
  – Data quality and volume
  – Dissemination/use of surveillance data
• Foster consensus among disease programs
  – Breach of security: higher risk
  – Legal barriers to creating an MPI
    • Analysis of health code by reportable disease
  – Political barriers to creating an MPI

Planning an MPI: Benefits
• Pooling data from different sources could enhance PPV and NPV of the match
• Streamline IT resources
  – Support staff
  – Infrastructure
• Ability to conduct syndemic surveillance and investigation
• More efficient use of limited resources

Acknowledgements
Diabetes Prevention and Control Program
• Lynn Silver
• Shadi Chamany
• Angela Merges
• Charlotte Neuhaus
• Bahman Tabei
• Cindy Driver
• Leslie Korenda
Bureau of Chronic Disease Control
• Katherine Bornschlegl
• Magdalena Berger
• Emily Lumeng
Division of Epidemiology
• Lorna Thorpe
• Bonnie Kerker
• Jenna Mandel-Ricci
• Ram Koppaka
Division of Informatics and Information Technology
• Don Weiner
• Stephen Giannotti
• Namrata Kumar
• Jisen Ho
• Laura Goodman

Questions?
Maushumi Mavinkurve
Director, Center for Data Matching
NYC Department of Health and Mental Hygiene
mmavinku@health.nyc.gov
(P) 212 515 5182