Bellman A Data Quality Browser Tamraparni Dasu Theodore

Bellman: A Data Quality Browser Tamraparni Dasu Theodore Johnson S. Muthukrishnan Vladislav Shkapenyuk Amit Marathe Contact: marathe@research. att. com 1

Exploring Large and Complex Databases • AT&T has a lot of large databases. – Many types of offerings (business/consumer/hosting, voice: LD/local, data: frame/ip/vpn/voip, etc). – Many aspects of providing a service (sales/provisioning/billing/network maintenance/customer care). – Many databases, each with 100’s to 1000’s of tables. • These databases are complex – Legacy systems and procedures (AT&T is an old company). – AT&T’s scale requires local autonomy. • Different conventions and encodings in different domains – Complex interdependencies. • A user’s order touches many databases at some point or other. • A customer may order many types of services, etc. 2

Our Problem • We provide special consulting services to AT&T Business units. – Data mining studies. – Targeted data cleanup. – Help with large scale database integration and cleanup efforts. • “Special services” means that we are always looking at new data sets. – ODBC access to databases. – Web scraping – Delimited ascii dumps. • We need help … 3

What is Bellman? • A “data quality browser” • A platform for integrating both simple and advanced data browsing tools. • A platform for integrating data browsing and data cleanup tools. • A platform for browsing and correlating multiple disparate databases and data sets. • Current status: the platform and the tools are in good shape, the UI and integration could use some work. 4

Database Profiling • Collect summaries of the tables, fields, and data in the databases. • Store in a supplementary database. • Use these summaries to – Supplement database browsing. – Answer sophisticated queries about database structure. – Support data mining on the database structure. 5

Bellman History • Java-Bellman (first version) – Java/JDBC client software. – Collect profiles using C++/ODBC • Oracle, Sybase, Teradata, Informix, SQL Server – Problem : installation is difficult, especially to configure the ODBC drivers. • Web-Bellman (current version) – ASP. NET/C# front end, C++ profiling. – No client software installation required. – Centralized management of ODBC drivers and some advanced tools. 6

Bellman Screen Shots • Bellman is best shown through some examples. • Problem: – Its only interesting with real databases. – But the data is AT&T Proprietary • We’ll do our best to show Bellman without getting ourselves fired. 7

Development site Bellman is the browser Spider is an application of our approximate string matching tools. 8

Database Connections Ted’s account is configured to access these databases. Profile queries are restricted to them. 9

Tables 10

Fields 11

Find Similar • What other fields in the database contain similar data? – Many values in common (exact match) : resemblance – “Textually similar” • Many q-grams in common • Distribution of q-grams is similar. – The fields have a similar name. • Uses – Finding join paths which require field transforms – Finding other fields with names/IDs/etc. 12

Finding Related Fields 13

A Comparison 14

Another example 15

Comparison 16

Keys and Functional Dependencies 17

Profile tables 18

Algorithms • Field similarity by exact match – Min Hash signatures to compute resemblance • Field similarity by substring similarity – Resemblance of q-grams – Distribution of q-grams • Efficient key finding 19

Set Resemblance • The resemblance of two sets A, B is , |A • After some algebra, we find that |A , (|A|+|B|)/(1+ , ) • There are fast algorithms to compute … • Application to Bellman – The sets in question are the unique values in the fields – The resemblance of two fields is a good measure of their similarity. 20

Min Hash Signatures • h(a) is a hash function. • m(A) = min(h(a), a in A). • Observation: Pr[m(A) = m(B)] = , • Why? – Universe restricted to A B – Equality only in A B • Repeat the experiment many times. • Signature: (m 1(A), m 2(A), …, mk(A)) A m(A)<m(B) ignored m(A)=m(B) m(A)>m(B) B 21

Min Hash Implementation • (Schema, Table, Field) maps to unique ID • 256 hash functions • Bellman_Field_Sig table has the schema (ID, Hash. Num, Min. Hash) • Min. Hash is a 32 -bit integer (collisions become an issue for large sets) 22

Min Hash Implementation (cont. ) • SQL to find fields similar to field w/ ID 23 select S 2. ID, count(*) from Bellman_Field_Sig S 1, Bellman_Field_Sig S 2 where S 1. ID=23 and S 1. Hash. Num = S 2. Hash. Num and S 1. Min. Hash=S 2. Min. Hash group by S 2. ID order by count(*) desc 23

Textual similarity • Q-gram : collection of consecutive q-letter substrings. • Example: 3 -grams of “Bellman” – Bel, ell, llm, lma, man • Q-gram resemblance : compute signatures of the set of q-grams of a field instead of field values. • Q-gram similarity : L 2 distance of the frequency distribution of the q-grams of a field. – Use sketches to store a compact approximate summary of the q-gram distribution. 24

Q-gram Sketches • V is a d dimensional vector • Sk(V) is the k dimensional vector (V. X 1, V. X 2, …, V. Xk) where Xi are random d dimensional vectors • Typically, k << d • L 2(V 1, V 2) is approximated by L 2(Sk(V 1), Sk(V 2)) 25

Q-gram Sketch Implementation • 256 random sketch vectors • Bellman_Field_QSketch (ID, Sk. Num, Sk. Val) • Tuples correspnding to a particular ID constitute the normalized sketch vector for that field • Sk. Val has datatype float 26

Q-gram Sketch Implementation • SQL to find fields similar to field w/ ID 23 select S 2. ID, sum((S 1. Sk. Val – S 2. Sk. Val) ** 2) from Bellman_Field_QSketch S 1, Bellman_Field_QSketch S 2 where S 1. ID = 23 and S 1. Sk. Val = S 2. Sk. Val group by S 2. ID order by sum((S 1. Sk. Val – S 2. Sk. Val) ** 2) desc 27

Finding Keys • Key a (minimal) set of fields whose composite value is unique in every record. • Problem: finding all keys of length up to k in a table with F fields requires O(Fk) expensive count distinct queries. • Solution: – Eliminate “bad” fields : floats, mostly NULL, etc. – Collect an in-memory sample • Can store hashes of long strings. – Compute count distinct queries on the sample • Verify keys by query against database – Use Tane-style level-wise pruning. 28

Finding Keys (cont. ) • Upto 4 -way keys found by Bellman • Candidate-1 keys are all the “good” fields • In iteration i+1, consider the cross product of candidate-i keys and candidate-1 keys • Classify each set of i+1 fields so obtained as an approximate key, candidate-(i+1) key or a functional dependency. • Thresholds: key_threshold, min_improvement 29

Reference • Mining Database Structure: Or, How to Build a Data Quality Browser – Dasu, Johnson, Muthukrishnan, Shkapenyuk, ACM SIGMOD 2002. 30