Will Data Mining Change the Functions of DBMS?

Jiawei Han
DAIS (Data And Information Systems) Lab
University of Illinois at Urbana-Champaign

Will DM Be Integrated with DB Functions?
- DM: Already a functional component of DBMS
  - Microsoft/SQL Server: Analysis Manager
  - IBM/DB2 & Intelligent Miner
  - Oracle: Data Mining Package
- But will DM be “intruding” into DBMS, i.e., be integrated with essential DBMS functions?
  - Indexing
  - Data integration
  - Data cleaning
  - Query processing

Indexing by Data Mining
- Indexing graphs? # of subgraphs: exponential!
  - Chemical informatics/bioinformatics …
  - Discriminative frequent graph patterns (SIGMOD’04)
- Indexing subsequences?
  - Shopping sequence, DNA/protein sequence (SDM’05)
- When is discriminative frequent pattern indexing useful?
  - Complex objects, big (object) queries
[Figure: sample database graphs (a), (b), (c) and a query graph]
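The filter-then-verify idea behind pattern-based indexing can be sketched on plain strings instead of graphs. This is a minimal illustration, not the SIGMOD’04 or SDM’05 algorithm: the function names, the `min_support` and `gamma` parameters, and the sample sequences are all assumptions made for the example. A pattern is indexed only if it is frequent and if its posting list is clearly smaller than what its already-indexed sub-patterns imply, which is a simplified reading of "discriminative".

```python
def substrings(s, max_len):
    """All distinct contiguous substrings of s with length 1..max_len."""
    return {s[i:i + k] for k in range(1, max_len + 1)
            for i in range(len(s) - k + 1)}

def build_index(db, max_len=3, min_support=2, gamma=1.5):
    """Map selected patterns to the set of sequence ids containing them.

    min_support and gamma are illustrative knobs (assumptions), not values
    taken from the slides.
    """
    postings = {}
    for sid, seq in enumerate(db):
        for pat in substrings(seq, max_len):
            postings.setdefault(pat, set()).add(sid)

    index = {}
    # Visit short patterns first so sub-patterns are decided before super-patterns.
    for pat in sorted(postings, key=lambda p: (len(p), p)):
        ids = postings[pat]
        if len(ids) < min_support:
            continue  # infrequent pattern: not worth indexing
        # Candidate set already implied by indexed sub-patterns of pat.
        implied = set(range(len(db)))
        for sub in substrings(pat, len(pat) - 1):
            if sub in index:
                implied &= index[sub]
        # Keep pat only if it prunes noticeably more than its sub-patterns do.
        if len(pat) == 1 or len(implied) >= gamma * len(ids):
            index[pat] = ids
    return index

def search(db, index, query, max_len=3):
    """Filter with the index, then verify each surviving candidate."""
    candidates = set(range(len(db)))
    for pat in substrings(query, max_len):
        if pat in index:
            candidates &= index[pat]
    return sorted(sid for sid in candidates if query in db[sid])

# Hypothetical sample database of short DNA-like sequences.
db = ["ACGTAC", "ACGGA", "TTACG", "GGGTT"]
index = build_index(db)
matches = search(db, index, "ACG")
```

The filter never loses an answer: any sequence containing the query also contains every indexed pattern that occurs in the query, so the intersection of posting lists is a superset of the true result, and the final containment check removes false positives.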

Data Cleaning by Data Mining
- Load messy data into a structured database?
  - Inconsistent data: age = “1946”?
  - Field misalignments
  - Glitches of data: completely messed-up inputs
  - Missing/unmatched delimiters: XML, HTML data
  - Big fields: BLOB, CLOB, multimedia and text
- Data mining
  - Data cleaning by distribution/outlier analysis
  - Dependency/correlation analysis
  - Schema-directed cleaning or schema “discovery”
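As one possible reading of "data cleaning by distribution/outlier analysis", the sketch below flags values such as age = 1946 using a robust score based on the median and the median absolute deviation (MAD). The function name, the 3.5 threshold, and the sample column are assumptions for illustration; a real cleaning pipeline would pick the statistic to suit the field.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return one boolean per value: True if it looks like an outlier.

    Uses the modified z-score 0.6745 * |x - median| / MAD, which is less
    distorted by the outliers themselves than a mean/stddev z-score.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the values are identical; anything else stands out.
        return [v != med for v in values]
    return [0.6745 * abs(v - med) / mad > threshold for v in values]

# Hypothetical "age" column where a birth year was entered by mistake.
ages = [34, 51, 27, 1946, 45, 62, 38]
flags = flag_outliers(ages)
suspect = [a for a, bad in zip(ages, flags) if bad]
```

Flagged values are candidates for review or repair (here, 1946 is probably a birth year), not automatic deletions.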

Data Integration by Data Mining
- Linking and mining across multiple data relations
  - CrossMine (classification across multiple data relations: ICDE’04)
- Search across heterogeneous databases
  - Object identification/merge, reference reconciliation (Alon’s group)
  - Mining across heterogeneous DBs
  - Personalizing data from heterogeneous sources

Query Processing by Data Mining
- Query plan refinement based on query execution history
- Better query planning by investigating additional data statistics
  - Current optimizer: key/foreign key, cardinality, # distinct values
  - Additional information:
    - Strong dependency/correlation
    - Histogram, dense vs. sparse regions, etc.
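To see why dependency/correlation statistics could matter to an optimizer, the sketch below compares a cardinality estimate made under the usual attribute-independence assumption with the true count on a strongly correlated pair of columns. The table, column names, and function are hypothetical and only illustrate the effect.

```python
def estimate_vs_actual(rows, col_a, val_a, col_b, val_b):
    """Compare an independence-based row estimate with the true row count
    for the predicate col_a = val_a AND col_b = val_b."""
    n = len(rows)
    sel_a = sum(r[col_a] == val_a for r in rows) / n
    sel_b = sum(r[col_b] == val_b for r in rows) / n
    # Independence assumption: multiply per-column selectivities.
    estimated = sel_a * sel_b * n
    actual = sum(r[col_a] == val_a and r[col_b] == val_b for r in rows)
    return estimated, actual

# Hypothetical table in which "model" functionally determines "make".
cars = (
    [{"make": "Honda", "model": "Civic"}] * 25
    + [{"make": "Toyota", "model": "Corolla"}] * 25
    + [{"make": "Ford", "model": "Focus"}] * 50
)

estimated, actual = estimate_vs_actual(cars, "make", "Honda", "model", "Civic")
```

With these rows the independence estimate is 6.25 rows while 25 rows actually qualify, a fourfold underestimate; a mined dependency between the two columns would let the optimizer correct for it.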

Conclusions
- DBers have been “invading” DM and have made great contributions
- It is time to consider that DM may invade DBMS to enhance its functionality
- General philosophy
  - Invisible data mining
    - Google is doing this successfully for page ranking
    - Can we do it to enhance DBMS?
  - You can do better if you know your data better!