Sharing Seminar: Address Matching
GSS Best Practice and Impact (BPI)
Sli.do #T935
Welcome
• About BPI
• What are sharing seminars?
• What is address matching?
• Agenda
• Questions and technical difficulties via Sli.do #T935
Really basic address matching
Ross Bowen – Valuation Office Agency
Purpose – personalisation to increase response rates
• The Occupier, 1 Business Building, London, SW11: “Dear Occupier, …”
• Mr Bowen, 1 Business Building, London, SW11: “Dear Mr Bowen, …”
Our method is primitive
• Written in SAS - the code is dreadful to interpret,
• Painfully slow (> 24 hrs),
• Lots of manual “edge-casing”: REPLACE “FST” with “FIRST”; REPLACE “1ST” with “FIRST”; etc.
• Luck more than skill?
• But it gets us somewhere (85% matches)…
R/O GND FLR Flat 1, Fownes ST, LONDON, SW11 2TJ
-- Capitalise string and extract the postcode using regular expression
R/O GND FLR FLAT 1, FOWNES ST, LONDON [SW112TJ]
-- Lots of replacement of common terms and abbreviations
REAR 0TH FLOOR FLAT 1, FOWNES STREET, LONDON [SW112TJ]
-- Remove punctuation and whitespace
REAR0THFLOORFLAT1FOWNESSTREETLONDON [SW112TJ]
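The three cleaning steps above can be sketched in Python roughly as follows. This is only an illustration, not the SAS code described in the talk: the `REPLACEMENTS` table holds just the abbreviations needed for the slide's example, and `POSTCODE_RE` is a simplified postcode pattern assumed for the sketch.

```python
import re

# Hypothetical abbreviation table: only the entries needed for the
# slide's example. The real method uses many more replacements.
REPLACEMENTS = {
    "R/O": "REAR",
    "GND": "0TH",
    "FLR": "FLOOR",
    "ST": "STREET",
}

# Simplified UK postcode pattern at the end of the string
# (an assumption for this sketch, not a full validator).
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})\s*$")


def normalise(address: str) -> tuple[str, str]:
    """Return (squashed address, postcode) for a free-text address."""
    # Step 1: capitalise and pull the postcode off the end.
    text = address.upper().strip()
    postcode = ""
    match = POSTCODE_RE.search(text)
    if match:
        postcode = match.group(1) + match.group(2)
        text = text[: match.start()]
    # Step 2: replace whole tokens only, so "ST" never rewrites
    # part of a longer word such as "STREET".
    tokens = [tok for tok in re.split(r"[\s,]+", text) if tok]
    tokens = [REPLACEMENTS.get(tok, tok) for tok in tokens]
    # Step 3: remove punctuation and whitespace.
    squashed = re.sub(r"[^A-Z0-9]", "", "".join(tokens))
    return squashed, postcode
```

Calling `normalise("R/O GND FLR Flat 1, Fownes ST, LONDON, SW11 2TJ")` reproduces the final line of the slide, with the postcode returned separately so it can be used as a blocking key.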
-- If you get a perfect match, keep it.
-- Otherwise, match on postcode and use Levenshtein distance:
0THFLOORFLAT1FOWNESSTREETLONDON [SW112TJ]
REAR0THFLOORFLAT1FOWNESSTREETLONDON [SW112TJ]
Levenshtein distance is the minimum number of single-character edits (insertions, deletions or substitutions) needed to turn one string into another. In this case, L = 4 (the four characters of “REAR”).
We look at this as a proportion of the length of the longer string:
4 / MAX(LENGTH(0THFLOORFLAT1FOWNESSTREETLONDON), LENGTH(REAR0THFLOORFLAT1FOWNESSTREETLONDON)) = 4 / 35 = 0.1142857
-- Ignore matches with distance proportions over a tolerance = 0.4,
-- Match everything you can, only using records once.
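A minimal Python sketch of this matching rule is given below. The Levenshtein routine is the standard dynamic-programming version. The slide does not say in which order fuzzy matches are assigned, so `match_records` assumes a greedy "closest pair first" order. All function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def distance_proportion(a: str, b: str) -> float:
    """Levenshtein distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def match_records(inputs, references, tolerance=0.4):
    """Match (address, postcode) pairs one-to-one.

    Returns {input index: reference index}. Exact matches are kept first;
    the rest are matched within the same postcode, closest pair first
    (the ordering is an assumption of this sketch).
    """
    unused = set(range(len(references)))
    matches = {}
    for i, record in enumerate(inputs):
        for j in sorted(unused):
            if references[j] == record:
                matches[i] = j
                unused.discard(j)
                break
    candidates = []
    for i, (address, postcode) in enumerate(inputs):
        if i in matches:
            continue
        for j in unused:
            ref_address, ref_postcode = references[j]
            if postcode == ref_postcode:
                proportion = distance_proportion(address, ref_address)
                if proportion <= tolerance:
                    candidates.append((proportion, i, j))
    for proportion, i, j in sorted(candidates):
        if i not in matches and j in unused:  # use each record only once
            matches[i] = j
            unused.discard(j)
    return matches
```

For the two strings on the slide, `levenshtein` gives 4 and `distance_proportion` gives 4/35, which is under the 0.4 tolerance, so the pair is accepted.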
Questions
Data science for address matching
Iva Spakulova, ONS Methodology
14 May 2018
What is the address index?
Service that matches an input address string to a validated address and Unique Property Reference Number (UPRN) from AddressBase (AB)
Input address string:
• Unstructured text
• Messy, containing typos …
• Incomplete
• Range from historic to very recent addresses, including businesses
Reference data (AB):
• Structured, tokenized
• Complete & correct (more or less)
• Snapshot of addresses at a given time
• Organisation / business names are not always part of the address
Address Index matching process
Ø 1. Parsing: Before attempting field-by-field linking, the free-text input must be parsed into tokens such as building name and number, street name, town, locality and postcode. For this purpose we train a classification algorithm (Conditional Random Fields), and we also implement strategies to handle cases that are not perfectly parsed.
Ø 2. Candidate address retrieval: A combination of structured and unstructured search is then deployed using Elasticsearch to quickly compare the parsed input against 26 million addresses.
Ø 3. Ranking and scoring: The service returns a short ordered list of candidate addresses (and UPRNs), including a measure of confidence in the presented results.
1. Parsing method
“10 CHURCH HOUSE PARK ST BRIDES WENTLOOGE NEWPORT GWENT NP10 8SP”
Rules based
Ø Could use regular expressions or look-ups (for town names, for example)
Machine learning
Ø Conditional Random Fields algorithm (Discriminative Undirected Probabilistic Graphical Model)
Features (those in red on the slide are relevant to the example)
1. Digits - all, some, none
2. Word - word unless digits then false
3. Length - digits length / word length
4. Ends in Punctuation
5. Directional (e.g. South, N, NW)
6. Outcode/Incode (e.g. RH1)
7. Post Town (e.g. Newport)
8. Flat (e.g. appt, flat)
9. Company (e.g. CIC, CIO, LTD)
10. Road (e.g. road, rd, street, park, ffordd)
11. Residential (e.g. house, lodge, cottage, mews)
12. Business (e.g. office, hospital, care, bank)
13. Locational (e.g. basement, ground, top, lower)
14. Ordinal (e.g. first, 2nd)
15. Number of Hyphenations
16. Has Vowels
17. Word is at the start / end of the string
“10 CHURCH HOUSE PARK ST BRIDES WENTLOOGE NEWPORT GWENT NP10 8SP”
Labels assigned to the tokens: House number, Building name, Street name, Locality, Town name, County, Postcode
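To make the feature list more concrete, here is a rough Python sketch that turns each token of the example address into a feature dictionary, a shape that several CRF toolkits accept as input. Only a handful of the 17 features are shown. The lookup sets contain just the examples quoted on the slide, and every name in the code is an assumption of this sketch rather than part of the Address Index code.

```python
import re

# Tiny illustrative lookups taken from the slide's examples; a real
# parser would load much larger lists.
ROAD_WORDS = {"ROAD", "RD", "STREET", "PARK", "FFORDD"}
RESIDENTIAL_WORDS = {"HOUSE", "LODGE", "COTTAGE", "MEWS"}
POST_TOWNS = {"NEWPORT"}
OUTCODE_RE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]?$")  # e.g. RH1, NP10
INCODE_RE = re.compile(r"^\d[A-Z]{2}$")              # e.g. 8SP


def digits_feature(token: str) -> str:
    """Feature 1: are all, some or none of the characters digits?"""
    count = sum(ch.isdigit() for ch in token)
    if count == 0:
        return "none"
    return "all" if count == len(token) else "some"


def token_features(tokens: list, i: int) -> dict:
    """Features for the i-th token of a whitespace-split address."""
    token = tokens[i].upper()
    has_digit = any(ch.isdigit() for ch in token)
    return {
        "digits": digits_feature(token),
        "word": False if has_digit else token,  # feature 2
        "length": len(token),
        "outcode": bool(OUTCODE_RE.match(token)),
        "incode": bool(INCODE_RE.match(token)),
        "post_town": token in POST_TOWNS,
        "road": token in ROAD_WORDS,
        "residential": token in RESIDENTIAL_WORDS,
        "has_vowels": any(ch in "AEIOU" for ch in token),
        "start": i == 0,
        "end": i == len(tokens) - 1,
    }


def address_features(address: str) -> list:
    """One feature dictionary per token, in order."""
    tokens = address.split()
    return [token_features(tokens, i) for i in range(len(tokens))]
```

A CRF is trained on sequences of such dictionaries paired with the label sequence (house number, building name, and so on), so the label of one token can depend on the features and labels of its neighbours.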
2. Candidate address retrieval
Elasticsearch is a fast, highly scalable open-source full-text search and analytics engine. It allows for complex search features and requirements.
Ø Each parsed token of the input address is compared against relevant AB address fields
Ø Matches allow for synonyms and fuzziness
Ø Scores from individual matched fields are combined using custom query logic and boosting
Ø Fall-back query on full address and bigrams
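As a hedged illustration of how such a query might be assembled, the Python sketch below builds an Elasticsearch query body with one fuzzy, boosted `match` clause per parsed field, plus a lower-weighted fall-back clause on the full address. The field names, boost values and `size` are invented for the example and are not taken from the Address Index code. Synonyms and bigrams would normally be configured in the index analyzers, which is not shown here.

```python
import json

# Hypothetical field names and boosts, chosen only for illustration.
FIELD_BOOSTS = {
    "building_number": 3.0,
    "street_name": 2.0,
    "town_name": 1.0,
    "postcode": 4.0,
}


def build_query(parsed: dict, full_address: str, size: int = 5) -> dict:
    """Build a bool/should query body from parsed address tokens."""
    should = []
    for field, value in parsed.items():
        should.append({
            "match": {
                field: {
                    "query": value,
                    "fuzziness": "AUTO",  # tolerate small typos
                    "boost": FIELD_BOOSTS.get(field, 1.0),
                }
            }
        })
    # Fall-back: match the whole input string against a full-address field.
    should.append({
        "match": {"full_address": {"query": full_address, "boost": 0.5}}
    })
    return {
        "size": size,
        "query": {"bool": {"should": should, "minimum_should_match": 1}},
    }


if __name__ == "__main__":
    body = build_query(
        {"building_number": "10", "postcode": "NP10 8SP"},
        "10 CHURCH HOUSE PARK ST BRIDES WENTLOOGE NEWPORT GWENT NP10 8SP",
    )
    print(json.dumps(body, indent=2))
```

Because every clause sits under `should`, a candidate that matches more fields, or matches the heavily boosted ones, receives a higher relevance score, which is one simple way to combine per-field scores.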
3. Ranking and scoring
Ø The last step in the process is to evaluate the quality of the match between the input and the retrieved candidate address (its UPRN) and convey it to users.
Ø A single confidence score is calculated from currently available information such as the Elasticsearch score, a bespoke rule-based score, parsing properties and the difference/ratio of scores between candidates.
Ø We report the confidence score as a percentage, because it combines an intuitive measure (people understand how good 65% is) with something that allows automatic filtering to cut off the end of a results set.
Ø The threshold value varies depending on the use case. For example, if an individual match is requested and reviewed by a human, a good threshold would be 5%, because it allows more candidates to be displayed if there is any ambiguity. For automated matching, only very good matches should be returned, so the recommended threshold is 60%-80%.
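The thresholding described above amounts to a simple filter over the ranked candidate list. The sketch below assumes candidates arrive as `(uprn, confidence_percent)` pairs. It does not attempt to reproduce how the confidence score itself is calculated, and the sample UPRNs and scores are made up.

```python
HUMAN_REVIEW_THRESHOLD = 5.0   # show more candidates when a person checks them
AUTOMATED_THRESHOLD = 60.0     # lower end of the recommended 60%-80% range


def filter_candidates(candidates, threshold_pct):
    """Return candidates at or above the threshold, best first.

    candidates: iterable of (uprn, confidence_percent) pairs.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return [c for c in ranked if c[1] >= threshold_pct]


# Made-up candidates for illustration only.
sample = [("100000000001", 92.0), ("100000000002", 41.5), ("100000000003", 3.2)]
for_review = filter_candidates(sample, HUMAN_REVIEW_THRESHOLD)
for_automation = filter_candidates(sample, AUTOMATED_THRESHOLD)
```

With these sample values the human-review list keeps the first two candidates, while the automated list keeps only the 92% match.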
Address Index performance
Ø Control over the input is limited, and performance depends strongly on the input quality
Ø Goal: maximise the number of addresses for which the UPRN returned by Address Index matches the baseline reference, while keeping the false positives (wrong matches) acceptably low
Ø On baseline datasets created by the subject expert or provided by users, a correct match rate of 97.5% has been achieved
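One way to express this goal in code is to compare the returned UPRNs with a baseline and report both the correct match rate and the wrong match rate. The sketch below is an assumption about how such an evaluation could look, not the evaluation used for the 97.5% figure. The identifiers and UPRNs in the example are invented.

```python
def evaluate(returned: dict, baseline: dict) -> tuple:
    """Return (correct match rate, false positive rate) against a baseline.

    returned: {address id: UPRN or None} as produced by the matcher.
    baseline: {address id: expected UPRN}.
    """
    total = len(baseline)
    if total == 0:
        return 0.0, 0.0
    correct = 0
    wrong = 0
    for address_id, expected in baseline.items():
        got = returned.get(address_id)
        if got == expected:
            correct += 1
        elif got is not None:
            wrong += 1  # a UPRN was returned, but not the expected one
    return correct / total, wrong / total


# Invented example: three correct, one wrong, one unmatched.
baseline = {"a": "1", "b": "2", "c": "3", "d": "4", "e": "5"}
returned = {"a": "1", "b": "2", "c": "3", "d": "9", "e": None}
match_rate, false_positive_rate = evaluate(returned, baseline)
```

Raising the confidence threshold typically moves records from the "wrong" group into the "unmatched" group, at the cost of also losing some correct matches.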
Address Index services
• RESTful API
• User web interface
• Bulk matching service
The code is publicly available on GitHub: https://github.com/ONSdigital/address-index-data and https://github.com/ONSdigital/address-index-api
Questions
Address matching: Connecting data sources with fuzzy addresses
Peter Hufton, Data Science
Anna Carlsson-Hyslop, Statistics & English Housing Survey
MHCLG single departmental plan
“Fixing our broken housing market” – Housing white paper, Feb 2017
“Our objectives: 1. Deliver the homes the country needs. 2. Make the vision of a place you call home a reality.” – MHCLG Single Departmental Plan, May 2018
Objective: Connecting data sources
Connecting different data sources
Data sources identified by address: Zoopla listing data, Land Registry, English Housing Survey, Energy Performance Certificate
Address matching project: address → Unique Property Reference Number (UPRN)
We have better data available to:
• Move closer to understanding how the housing market operates in real time.
• Improve predictive modelling of policy.
• Better monitor the direct consequences of changes.
Challenge: Addresses are fallible
Addresses are fallible
Addresses are fallible
Common problems include:
• Spelling mistakes
• Abbreviations
• Over-specification
• Incomplete information
• Outdated information
• Errors
“Flat 2, Green House, New Road, Neath, SA10 9XX”
“Fflat 2, Ty Wyrdd, Heol Newydd, Castell Nedd, SA10 9XX”
An address typically arrives as an unstructured string.
– How do we determine which parts are most important?
Address Matching outside of MHCLG
Unique considerations:
• Data sources: the solution must be tailored to our data sources and adaptable as our needs change.
• Scalability: the Zoopla monthly update has a size of 120 GB.
• Licensing/sensitivity: the solution must be developed in-house.
Solution: An innovative approach to the problem of address matching
A solution to address matching by
How do we connect fuzzy addresses to the appropriate reference address?
Machine-learning methods
• What makes a well-matched address?
“4 Kings Court, 123a Lordship Lane, East Dulwich SE22 1AB”
vs
“Flat 4 King’s Court, 123A Lordship Lane, London SE22 1AB”
Machine-learning programs × 1000s
Okay: So how well do we do?
How well do we do?
Consider a sample of addresses from NG18, drawn from the sources in the address matching project (Zoopla listing data, Land Registry, English Housing Survey, Energy Performance Certificate).
• Sample sizes shown: 3600 and 12000 addresses
• 98% with a single match (UPRN)
• 97% with a single match (UPRN)
• 88% of Zoopla records can be connected to EPC
Use cases for address matching for the English Housing Survey
Conclusions:
• Although addresses are fallible, we now have a robust solution for connecting ‘fuzzy’ addresses to their UPRN.
• We can now align data from disparate sources.
• Address matching has specific uses in the English Housing Survey for:
  • Matching to council tax band
  • Matching to energy efficiency and usage
  • Reducing bias in the leasehold estimate
Questions
Next Sharing Seminar: Power BI
Please contact Elizabeth.Brankley@ons.gov.uk with any questions