Mining databases with different schema: Integrating incompatible classifiers

Andreas L. Prodromidis, Salvatore Stolfo
Dept. of Computer Science, Columbia University
andreas@cs.columbia.edu, sol@cs.columbia.edu

Reference material
• http://www.cs.columbia.edu/~sal/JAM/PROJECT/recent-project-papers.htm
• http://www.cs.columbia.edu/~andreas/publications/publications.html
• Information has been updated in Andreas' Ph.D. thesis: "Management of Intelligent Learning Agents in Distributed Data Mining Systems," October 1999.

Data Mining and Data Schema Mismatch
• Introduction
• Database Compatibility
• Meta-learning
• Bridging Methods
• Experiments and Evaluations
• Conclusion

Introduction
• The myth of an entirely local database
• Can one algorithm give you everything?
• There is distributed DM (JAM), and then there is distributed DM with different schemas
• Prediction, machine learning, and meta-learning

Compatibility
• Data about the same topic, for example credit card transactions
• Different banks record and store information differently
• The same bank's database will change over time
• The "incompatible-schema" problem

Compatibility
• Similar but different data yield different classifiers
• Classifiers depend on the structure of the data
• This has recently surfaced as a problem hampering company mergers

Meta-learning
• Why? A way to deal with the scaling problem of distributed data sources.
• What? Deriving a higher level of information from already-learned classifiers.
  – Meta-classifiers are defined recursively as collections of classifiers structured in multilevel trees; determining the optimal set of classifiers is a combinatorial problem.
  – The tree must be pruned to be efficient.
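
The recursive, tree-shaped definition of a meta-classifier can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the combination rule here is a plain majority vote, whereas the slides describe meta-classifiers that are themselves learned, and the attribute names (amount, hour) and the three base classifiers are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Union

Record = Dict[str, float]
BaseClassifier = Callable[[Record], str]


class MetaClassifier:
    """A node in a multilevel tree that combines its children by majority vote.

    Majority voting is an assumption made for this sketch; any learned
    combiner could take its place.
    """

    def __init__(self, children: List[Union["MetaClassifier", BaseClassifier]]):
        self.children = children

    def __call__(self, record: Record) -> str:
        # A child is either a base classifier or another meta-classifier.
        # Both are callable, so the recursion needs no special casing.
        votes = Counter(child(record) for child in self.children)
        return votes.most_common(1)[0][0]

    def size(self) -> int:
        """Count the base classifiers in the tree (the quantity pruning reduces)."""
        return sum(
            c.size() if isinstance(c, MetaClassifier) else 1
            for c in self.children
        )


# Hypothetical base classifiers over made-up attributes.
def by_amount(r: Record) -> str:
    return "fraud" if r["amount"] > 1000 else "legit"


def by_hour(r: Record) -> str:
    return "fraud" if r["hour"] < 5 else "legit"


def always_legit(r: Record) -> str:
    return "legit"


site_a = MetaClassifier([by_amount, by_hour, always_legit])  # lower level
root = MetaClassifier([site_a, by_amount, by_hour])          # upper level
```

Calling root(record) evaluates every base classifier in the tree, which is why pruning matters: the cost of one prediction grows with root.size().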

Meta-learning

Meta-learning
• Methods:
  – Voting
  – Stacking
  – SCANN (stacking, correspondence analysis, and nearest neighbor)
• Other methods:
  – Bagging
  – Boosting
  – Refereeing
  – Arbitrating
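
To make the difference between voting and stacking concrete, here is a minimal sketch. It assumes a deliberately simple meta-learner: a lookup table from each tuple of base predictions to the most frequent true label seen on a validation set. Real stacking would train a proper learning algorithm at the meta level. The base classifiers, attribute names, and validation records are all made up for the illustration.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Voting: every base classifier gets one equal vote."""
    return Counter(predictions).most_common(1)[0][0]


def train_stacker(base_classifiers, validation_set):
    """Stacking: learn how base predictions relate to the true label.

    The meta-level model here is a lookup table (an assumption of this
    sketch), keyed by the tuple of base predictions.
    """
    table = defaultdict(Counter)
    for record, label in validation_set:
        key = tuple(clf(record) for clf in base_classifiers)
        table[key][label] += 1

    def meta_classifier(record):
        key = tuple(clf(record) for clf in base_classifiers)
        if key in table:
            return table[key].most_common(1)[0][0]
        return majority_vote(key)  # unseen combination: fall back to voting

    return meta_classifier


# Hypothetical base classifiers; the last one is deliberately poor.
base = [
    lambda r: "fraud" if r["amount"] > 1000 else "legit",
    lambda r: "fraud" if r["hour"] < 5 else "legit",
    lambda r: "fraud",
]

# Made-up validation data in which the amount-based classifier is reliable.
validation = [
    ({"amount": 2000, "hour": 2}, "fraud"),
    ({"amount": 50, "hour": 3}, "legit"),
    ({"amount": 60, "hour": 1}, "legit"),
    ({"amount": 3000, "hour": 14}, "fraud"),
    ({"amount": 20, "hour": 13}, "legit"),
]

stacked = train_stacker(base, validation)
record = {"amount": 40, "hour": 2}
voted_label = majority_vote([clf(record) for clf in base])
stacked_label = stacked(record)
```

On this record the two weaker classifiers outvote the reliable one, so voting and stacking disagree: the stacker has learned from the validation set which pattern of base predictions to trust.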

Bridging Methods
• Databases with the same schema
  – Schema(DBa) = {A1, A2, A3, …, An, C}
  – Schema(DBb) = {B1, B2, B3, …, Bn, C}
• Databases with one more attribute than the other, where An+1 does not relate to Bn+1 (An+1 ≠ Bn+1)

Bridging Methods
• Databases with one more attribute than the other, where An+1 does not relate to Bn+1 (An+1 ≠ Bn+1)
  – Missing data and attribute-value prediction
  – Schema(DBa) = {A1, A2, A3, …, An+1, C}
  – Schema(DBb) = {B1, B2, B3, …, Bn+1, C}
  – If the value cannot be predicted: use null, the column average, or the most likely value based on recurring attribute combinations in other columns
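
The fallback options on this slide can be sketched as a small bridging function. This is an illustration under stated assumptions, not the authors' bridging agent: the extra attribute is numeric, the "most likely" value is taken to be the most frequent value seen with the same combination of other attributes, and the attribute names (merchant, country, risk_score) are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean


def fit_bridge(rows, target, keys):
    """Learn to fill in `target` from a database that records it.

    Returns a function that estimates the value for a row lacking it:
    first from recurring combinations of the `keys` attributes, then
    from the column average.
    """
    by_combo = defaultdict(list)
    for row in rows:
        by_combo[tuple(row[k] for k in keys)].append(row[target])
    column_mean = mean([row[target] for row in rows])

    def predict(row):
        combo = tuple(row.get(k) for k in keys)
        seen = by_combo.get(combo)
        if seen:
            # Most likely value for this recurring attribute combination.
            return Counter(seen).most_common(1)[0][0]
        return column_mean  # no matching combination: column average

    return predict


# Made-up rows from the database that has the extra attribute.
db_a = [
    {"merchant": "grocery", "country": "US", "risk_score": 2},
    {"merchant": "grocery", "country": "US", "risk_score": 2},
    {"merchant": "grocery", "country": "US", "risk_score": 4},
    {"merchant": "jewelry", "country": "US", "risk_score": 8},
]

bridge = fit_bridge(db_a, target="risk_score", keys=["merchant", "country"])
known_combo = bridge({"merchant": "grocery", "country": "US"})
unknown_combo = bridge({"merchant": "travel", "country": "FR"})
```

The third option, null, corresponds to skipping the bridge entirely and leaving the attribute empty, which only works when the downstream classifier tolerates unknown values.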

Bridging Methods
• Databases with similar but different attributes
  – Differing sizes and ranges of data values (for example, normalization of time increments)
  – An+1 ≈ Bn+1
  – Bridging methods must translate values based on input from data experts, or on other normalization efforts and probabilities
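
As a sketch of what such a translation might look like, assume one database stores the time of a transaction in seconds since midnight and a score on a 0 to 100 scale, while the other expects an hour bucket and a score between 0 and 1. The attribute names and both encodings are hypothetical; in practice the mapping would come from a data expert.

```python
def seconds_to_hour_bucket(seconds):
    """Normalize a time increment: seconds since midnight -> hour of day."""
    return (seconds // 3600) % 24


def rescale(value, src_range, dst_range):
    """Linearly map a value from one numeric range onto another."""
    (s_lo, s_hi), (d_lo, d_hi) = src_range, dst_range
    return d_lo + (value - s_lo) * (d_hi - d_lo) / (s_hi - s_lo)


def bridge_record(record_b):
    """Translate a DB_B record into DB_A's attribute semantics.

    Field names on both sides are assumptions made for this example.
    """
    return {
        "hour": seconds_to_hour_bucket(record_b["time_sec"]),
        "score": rescale(record_b["score_0_100"], (0, 100), (0.0, 1.0)),
    }


translated = bridge_record({"time_sec": 45000, "score_0_100": 25})
```

A classifier trained on DB_A can then be applied to translated DB_B records without retraining.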

Bridging Methods

Experiments and Evaluations
• Working with credit card (CC) data from two banking institutions: First Union and Chase
• Using five mining algorithms to derive classifiers
  – Decision trees (DT): CART, ID3, and C4.5
  – Naive Bayes (NB): Bayes
  – Rule induction: Ripper (based on IREP)
• Using predictive algorithms to fill in missing data with regression methods
  – CART, MARS, locally weighted regression, and linear regression

About the data
• Chase credit card
  – 500,000 records spanning one year
  – Evenly distributed
  – 20% fraud, 80% non-fraud
• First Union credit card
  – 500,000 records spanning one year
  – Unevenly distributed
  – 15% fraud, 85% non-fraud

About the differences
• Chase includes two attributes not present in the First Union data
  – Add two fictitious fields to the First Union data
  – Classifier agents support unknown values
• Chase and First Union define an attribute with different semantics
  – Project Chase values onto First Union semantics
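
The first workaround, padding the narrower schema with fictitious fields and relying on classifiers that tolerate unknown values, might look like the following sketch. The schema, the attribute names (extra_1, extra_2), and the toy rule set are hypothetical; the actual Chase and First Union attributes are not given in these slides.

```python
UNKNOWN = None

# Hypothetical wider schema; extra_1 and extra_2 stand in for the two
# attributes the narrower database lacks.
WIDE_SCHEMA = ["amount", "hour", "extra_1", "extra_2"]


def pad_to_schema(record, schema):
    """Add fictitious fields so the record matches the wider schema."""
    return {attr: record.get(attr, UNKNOWN) for attr in schema}


def classify(record):
    """Toy rule-based classifier that supports unknown values.

    A test on an attribute whose value is unknown is skipped rather
    than treated as an error.
    """
    tests = [
        ("amount", lambda v: v > 1000),
        ("extra_1", lambda v: v == "flagged"),
    ]
    for attr, is_suspicious in tests:
        value = record[attr]
        if value is not UNKNOWN and is_suspicious(value):
            return "fraud"
    return "legit"


narrow_record = {"amount": 2500, "hour": 3}
padded = pad_to_schema(narrow_record, WIDE_SCHEMA)
label = classify(padded)
```

The classifier trained on the wider schema still runs on the padded record; it simply loses whatever evidence the missing attributes would have contributed.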

Charts of Results
• Starting point, with base-level classifiers:
  – Estimated savings of $325K to $550K
  – 86% to 90% total accuracy

Other results
• Meta-classifiers applied to First Union data, composed of Chase base classifiers and bridging classifiers
  – Accuracy improved to 95%
  – Estimated savings rose to $800K

Other results
• Meta-classifiers composed of base classifiers from both Chase and First Union
  – Accuracy improved to almost 98%
  – Estimated savings rose to $900K

Conclusion
• There is a lot of ground to cover
• Distributed DM is a viable option for addressing scalability and performance issues
• This paper investigated the idea of using databases with differing schemas and bridging those differences so that classifiers could be built and combined into meta-classifiers

Conclusion
• The authors conducted experiments showing that meta-classifiers can be built from real credit card transaction data with different schemas
• These meta-classifiers proved to be reasonably accurate in testing

Questions?